postgresql

mirror of https://github.com/postgres/postgres.git synced 2026-05-28 04:35:45 -04:00

Author	SHA1	Message	Date
Álvaro Herrera	77fb3959a4	Fix typo	2025-11-18 19:31:23 +01:00
Tom Lane	35b5c62c3a	Don't allow CTEs to determine semantic levels of aggregates. The fix for bug #19055 (commit `b0cc0a71e`) allowed CTE references in sub-selects within aggregate functions to affect the semantic levels assigned to such aggregates. It turns out this broke some related cases, leading to assertion failures or strange planner errors such as "unexpected outer reference in CTE query". After experimenting with some alternative rules for assigning the semantic level in such cases, we've come to the conclusion that changing the level is more likely to break things than be helpful. Therefore, this patch undoes what `b0cc0a71e` changed, and instead installs logic to throw an error if there is any reference to a CTE that's below the semantic level that standard SQL rules would assign to the aggregate based on its contained Var and Aggref nodes. (The SQL standard disallows sub-selects within aggregate functions, so it can't reach the troublesome case and hence has no rule for what to do.) Perhaps someone will come along with a legitimate query that this logic rejects, and if so probably the example will help us craft a level-adjustment rule that works better than what `b0cc0a71e` did. I'm not holding my breath for that though, because the previous logic had been there for a very long time before bug #19055 without complaints, and that bug report sure looks to have originated from fuzzing not from real usage. Like `b0cc0a71e`, back-patch to all supported branches, though sadly that no longer includes v13. Bug: #19106 Reported-by: Kamil Monicz <kamil@monicz.dev> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19106-9dd3668a0734cd72@postgresql.org Backpatch-through: 14	2025-11-18 12:56:55 -05:00
Nathan Bossart	f63ae72bbc	Switch from tabs to spaces in postgresql.conf.sample. This file is written for 8-space tabs, since we expect that most users who edit their configuration files use 8-space tabs. However, most of PostgreSQL is written for 4-space tabs, and at least one popular web interface defaults to 4-space tabs. Rather than trying to standardize on a particular tab width for this file, let's just switch to spaces. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aReNUKdMgKxLqmq7%40nathan	2025-11-18 10:28:36 -06:00
Alexander Korotkov	75e82b2f5a	Optimize shared memory usage for WaitLSNProcInfo We need separate pairing heaps for different WaitLSNType's, because there might be waiters for different LSN's at the same time. However, one process can wait only for one type of LSN at a time. So, no need for inHeap and heapNode fields to be arrays. Discussion: https://postgr.es/m/CAPpHfdsBR-7sDtXFJ1qpJtKiohfGoj%3DvqzKVjWxtWsWidx7G_A%40mail.gmail.com Author: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>	2025-11-18 09:50:12 +02:00
Amit Kapila	3edaf29fa5	Rename two columns in pg_stat_subscription_stats. This patch renames the sync_error_count column to sync_table_error_count in the pg_stat_subscription_stats view. The new name makes the purpose explicit now that a separate column exists to track sequence synchronization errors. Additionally, the column seq_sync_error_count is renamed to sync_seq_error_count to maintain a consistent naming pattern, making it easier for users to group, and query synchronization related counters. Author: Vignesh C <vignesh21@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CALDaNm3WwJmz=-4ybTkhniB-Nf3qmFG9Zx1uKjyLLoPF5NYYXA@mail.gmail.com	2025-11-18 03:58:55 +00:00
Masahiko Sawada	a6eac2273e	Use streaming read I/O in BRIN vacuum scan. This commit implements streaming read I/O for BRIN vacuum scans. Although BRIN indexes tend to be relatively small by design, performance tests have shown performance improvements. Author: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> Discussion: https://postgr.es/m/CAE7r3ML01aiq9Th_1OSz7U7Aq2pWbhMLoz5T%2BPXcg8J9ZAPFFA%40mail.gmail.com	2025-11-17 13:22:20 -08:00
Tom Lane	ed931377ab	Clean up match_orclause_to_indexcol(). Remove bogus stripping of RelabelTypes: that can result in building an output SAOP tree with incorrect exposed exprType for the operands, which might confuse polymorphic operators. Moreover it demonstrably prevents folding some OR-trees to SAOPs when the RHS expressions have different base types that were coerced to the same type by RelabelTypes. Reduce prohibition on type_is_rowtype to just disallow type RECORD. We need that because otherwise we would happily fold multiple RECORD Consts into a RECORDARRAY Const even if they aren't the same record type. (We could allow that perhaps, if we checked that they all have the same typmod, but the case doesn't seem worth that much effort.) However, there is no reason at all to disallow the transformation for named composite types, nor domains over them: as long as we can find a suitable array type we're good. Remove some assertions that seem rather out of place (it's not this code's duty to verify that the RestrictInfo structure is sane). Rewrite some comments. The issues with RelabelType stripping seem severe enough to back-patch this into v18 where the code was introduced. Author: Tender Wang <tndrwang@gmail.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAHewXN=aH7GQBk4fXU-WaEeVmQWUmBAeNyBfJ3VKzPphyPKUkQ@mail.gmail.com Backpatch-through: 18	2025-11-17 13:54:52 -05:00
Daniel Gustafsson	ab805989b2	Fix typos in logical replication code comments Author: Chao Li <lic@highgo.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CAEoWx2kt8m7wV39_zOBds5SNXx9EAkDqb5cPshk7Bxw6Js4Zpg@mail.gmail.com	2025-11-17 13:37:25 +01:00
Daniel Gustafsson	721bf9ce18	Mention md5 deprecation in postgresql.conf.sample PostgreSQL 18 deprecated password_encryption='md5', but the comments for this GUC in the sample configuration file did not mention the deprecation. Update comments with a notice to make as many users as possible aware of it. Also add a comment to the related md5_password_warnings GUC while there. Author: Michael Banck <mbanck@gmx.net> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Backpatch-through: 18	2025-11-17 12:18:18 +01:00
Michael Paquier	e76defbcf0	Rework output format of pg_dependencies The existing format of pg_dependencies uses a single-object JSON structure, with each key value embedding all the knowledge about the set attributes tracked, like: {"1 => 5": 1.000000, "5 => 1": 0.423130} While this is a very compact format, it is confusing to read and it is difficult to manipulate the values within the object, particularly when tracking multiple attributes. The new output format introduced in this commit is a JSON array of objects, with: - A key named "degree", with a float value. - A key named "attributes", with an array of attribute numbers. - A key named "dependency", with an attribute number. The values use the same underlying type as previously when printed, with a new output format that shows now as follows: [{"degree": 1.000000, "attributes": [1], "dependency": 5}, {"degree": 0.423130, "attributes": [5], "dependency": 1}] This new format will become handy for a follow-up set of changes, so as it becomes possible to inject extended statistics rather than require an ANALYZE, like in a dump/restore sequence or after pg_upgrade on a new cluster. This format has been suggested by Tomas Vondra. The key names are defined in the header introduced by `1f927cce44`, to ease the integration of frontend-specific changes that are still under discussion. (Again a personal note: if anybody comes up with better name for the keys, of course feel free.) The bulk of the changes come from the regression tests, where jsonb_pretty() is now used to make the outputs generated easier to parse. Author: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com	2025-11-17 10:44:26 +09:00
Michael Paquier	1f927cce44	Rework output format of pg_ndistinct The existing format of pg_ndistinct uses a single-object JSON structure where each key is itself a comma-separated list of attnums, like: {"3, 4": 11, "3, 6": 11, "4, 6": 11, "3, 4, 6": 11} While this is a very compact format, it is confusing to read and it is difficult to manipulate the values within the object. The new output format introduced in this commit is an array of objects, with: - A key named "attributes", that contains an array of attribute numbers. - A key named "ndistinct", represented as an integer. The values use the same underlying type as previously when printed, with a new output format that shows now as follows: [{"ndistinct": 11, "attributes": [3,4]}, {"ndistinct": 11, "attributes": [3,6]}, {"ndistinct": 11, "attributes": [4,6]}, {"ndistinct": 11, "attributes": [3,4,6]}] This new format will become handy for a follow-up set of changes, so as it becomes possible to inject extended statistics rather than require an ANALYZE, like in a dump/restore sequence or after pg_upgrade on a new cluster. This format has been suggested by Tomas Vondra. The key names are defined in a new header, to ease with the integration of frontend-specific changes that are still under discussion. (Personal note: I am not specifically wedded to these key names, but if there are better name suggestions for this release, feel free.) The bulk of the changes come from the regression tests, where jsonb_pretty() is now used to make the outputs generated easier to parse. Author: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com	2025-11-17 09:52:20 +09:00
Thomas Munro	32b236644d	Define PS_USE_CLOBBER_ARGV on GNU/Hurd. Until `d2ea2d310d`, the PS_USE_PS_STRINGS option was used on the GNU/Hurd. As this option got removed and PS_USE_CLOBBER_ARGV appears to work fine nowadays on the Hurd, define this one to re-enable process title changes on this platform. In the 14 and 15 branches, the existing test for __hurd__ (added 25 years ago by commit `209aa77d`, removed in 16 by the above commit) is left unchanged for now as it was activating slightly different code paths and would need investigation by a Hurd user. Author: Michael Banck <mbanck@debian.org> Discussion: https://postgr.es/m/CA%2BhUKGJMNGUAqf27WbckYFrM-Mavy0RKJvocfJU%3DJ2XcAZyv%2Bw%40mail.gmail.com Backpatch-through: 16	2025-11-17 12:48:55 +13:00
David Rowley	9c047da51f	Get rid of long datatype in CATCACHE_STATS enabled builds "long" is 32 bits on Windows 64-bit. Switch to a datatype that's 64-bit on all platforms. While we're there, use an unsigned type as these fields count things that have occurred, of which it's not possible to have negative numbers of. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/CAApHDvoGFjSA3aNyVQ3ivbyc4ST=CC5L-_VjEUQ92HbE2Cxovg@mail.gmail.com	2025-11-17 12:26:41 +13:00
Dean Rasheed	1b92fe7bb9	Fix Assert failure in EXPLAIN ANALYZE MERGE with a concurrent update. When instrumenting a MERGE command containing both WHEN NOT MATCHED BY SOURCE and WHEN NOT MATCHED BY TARGET actions using EXPLAIN ANALYZE, a concurrent update of the target relation could lead to an Assert failure in show_modifytable_info(). In a non-assert build, this would lead to an incorrect value for "skipped" tuples in the EXPLAIN output, rather than a crash. This could happen if the concurrent update caused a matched row to no longer match, in which case ExecMerge() treats the single originally matched row as a pair of not matched rows, and potentially executes 2 not-matched actions for the single source row. This could then lead to a state where the number of rows processed by the ModifyTable node exceeds the number of rows produced by its source node, causing "skipped_path" in show_modifytable_info() to be negative, triggering the Assert. Fix this in ExecMergeMatched() by incrementing the instrumentation tuple count on the source node whenever a concurrent update of this kind is detected, if both kinds of merge actions exist, so that the number of source rows matches the number of actions potentially executed, and the "skipped" tuple count is correct. Back-patch to v17, where support for WHEN NOT MATCHED BY SOURCE actions was introduced. Bug: #19111 Reported-by: Dilip Kumar <dilipbalaut@gmail.com> Author: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Discussion: https://postgr.es/m/19111-5b06624513d301b3@postgresql.org Backpatch-through: 17	2025-11-16 22:14:06 +00:00
Alexander Korotkov	23792d7381	Fix incorrect function name in comments Update comments to reference WaitForLSN() instead of the outdated WaitForLSNReplay() function name. Discussion: https://postgr.es/m/CABPTF7UieOYbOgH3EnQCasaqcT1T4N6V2wammwrWCohQTnD_Lw%40mail.gmail.com Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-11-15 12:27:42 +02:00
Alexander Korotkov	ede6acef49	Fix WaitLSNWakeup() fast-path check for InvalidXLogRecPtr WaitLSNWakeup() incorrectly returned early when called with InvalidXLogRecPtr (meaning "wake all waiters"), because the fast-path check compared minWaitedLSN > 0 without validating currentLSN first. This caused WAIT FOR LSN commands to wait indefinitely during standby promotion until random signals woke them. Add an XLogRecPtrIsValid() check before the comparison so that InvalidXLogRecPtr bypasses the fast-path and wakes all waiters immediately. Discussion: https://postgr.es/m/CABPTF7UieOYbOgH3EnQCasaqcT1T4N6V2wammwrWCohQTnD_Lw%40mail.gmail.com Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-11-15 12:27:42 +02:00
Nathan Bossart	478c4814a0	Comment out autovacuum_worker_slots in postgresql.conf.sample. All settings in this file should be commented out. In addition to fixing that, also fix the indentation for this line. Oversight in commit `c758119e5b`. Reported-by: Daniel Gustafsson <daniel@yesql.se> Author: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/19727040-3EE4-4719-AF4F-2548544113D7%40yesql.se Backpatch-through: 18	2025-11-14 13:45:04 -06:00
Nathan Bossart	7506bdbbf4	Add note about CreateStatistics()'s selective use of check_rights. Commit `5e4fcbe531` added a check_rights parameter to this function for use by ALTER TABLE commands that re-create statistics objects. However, we intentionally ignore check_rights when verifying relation ownership because this function's lookup could return a different answer than the caller's. This commit adds a note to this effect so that we remember it down the road. Reviewed-by: Noah Misch <noah@leadboat.com> Backpatch-through: 14	2025-11-14 13:20:09 -06:00
Bruce Momjian	43e6929bb2	doc: double-quote use of %f, %p, and %r in literal commands. Path expansion might expose characters like spaces which would cause command failure, so double-quote the examples. While %f doesn't need quoting since it uses a fixed character set, it is best to be consistent. Discussion: https://postgr.es/m/aROPCQCfvKp9Htk4@momjian.us Backpatch-through: master	2025-11-14 09:08:53 -05:00
Michael Paquier	910690415b	Revert "Drop unnamed portal immediately after execution to completion" This reverts commit `1fd981f053`, based on concerns that the logging improvements do not justify the protocol breakage of dropping an unnamed portal once its execution has completed. It seems unlikely that one would try to send an execute or describe message after the portal has been used, but if they do such post-completion messages would not be able to process as the previous versions. Let's revert this change for now so as we keep compatibility and consider a different solution. The tests added by `76bba03312` track the pre-1fd981f05369 behavior, and are still valid. Discussion: https://postgr.es/m/CA+TgmoYFJyJNQw3RT7veO3M2BWRE9Aw4hprC5rOcawHZti-f8g@mail.gmail.com	2025-11-14 14:37:10 +09:00
Thomas Munro	017249b828	Add some missing #include <limits.h>. These files relied on transitive inclusion via port/atomics.h for constants CHAR_BIT and INT_MAX. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/536409d2-c9df-4ef3-808d-1ffc3182868c@iki.fi	2025-11-13 22:56:08 +13:00
Michael Paquier	84fb27511d	Replace off_t by pgoff_t in I/O routines PostgreSQL's Windows port has never been able to handle files larger than 2GB due to the use of off_t for file offsets, only 32-bit on Windows. This causes signed integer overflow at exactly 2^31 bytes when trying to handle files larger than 2GB, for the routines touched by this commit. Note that large files are forbidden by ./configure (`3c6248a828`) and meson (recent change, see `79cd66f28c`). This restriction also exists in v16 and older versions for the now-dead MSVC scripts. The code base already defines pgoff_t as __int64 (64-bit) on Windows for this purpose, and some function declarations in headers use it, but many internals still rely on off_t. This commit switches more routines to use pgoff_t, offering more portability, for areas mainly related to file extensions and storage. These are not critical for WAL segments yet, which have currently a maximum size allowed of 1GB (well, this opens the door at allowing a larger size for them). This matters more for segment files if we want to lift the large file restriction in ./configure and meson in the future, which would make sense to remove once/if all traces of off_t are gone from the tree. This can additionally matter for out-of-core code that may want files larger than 2GB in places where off_t is four bytes in size. Note that off_t is still used in other parts of the tree like buffile.c, WAL sender/receiver, base backup, pg_combinebackup, etc. These other code paths can be addressed separately, and their update will be required if we want to remove the large file restriction in the future. This commit is a good first cut in itself towards more portability, hopefully. On Unix-like systems, pgoff_t is defined as off_t, so this change only affects Windows behavior. Author: Bryan Green <dbryan.green@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/0f238ff4-c442-42f5-adb8-01b762c94ca1@gmail.com	2025-11-13 12:41:40 +09:00
Fujii Masao	705601c5ae	Fix incorrect assignment of InvalidXLogRecPtr to a non-LSN variable. pg_logical_slot_get_changes_guts() previously assigned InvalidXLogRecPtr to the local variable upto_nchanges, which is of type int32, not XLogRecPtr. While this caused no functional issue since InvalidXLogRecPtr is defined as 0, it was semantically incorrect. This commit fixes the issue by updating pg_logical_slot_get_changes_guts() to set upto_nchanges to 0 instead of InvalidXLogRecPtr. No backpatch is needed, as the previous behavior was harmless. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Steven Niu <niushiji@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/CAHGQGwHKHuR5NGnGxU3+ebz7cbC1ZAR=AgG4Bueq==Lj6iX8Sw@mail.gmail.com	2025-11-13 08:44:33 +09:00
Nathan Bossart	180e7abe68	Remove obsolete autovacuum comment. This comment seems to refer to some stuff that was removed during development in 2005. Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/aRJFDxKJLFE_1Iai%40nathan	2025-11-12 15:13:08 -06:00
Nathan Bossart	1165a933aa	Teach DSM registry to ERROR if attaching to an uninitialized entry. If DSM entry initialization fails, backends could try to use an uninitialized DSM segment, DSA, or dshash table (since the entry is still added to the registry). To fix, keep track of whether initialization completed, and ERROR if a backend tries to attach to an uninitialized entry. We could instead retry initialization as needed, but that seemed complicated, error prone, and unlikely to help most cases. Furthermore, such problems probably indicate a coding error. Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/dd36d384-55df-4fc2-825c-5bc56c950fa9%40gmail.com Backpatch-through: 17	2025-11-12 14:30:11 -06:00
Heikki Linnakangas	0bdc777e80	Clear 'xid' in dummy async notify entries written to fill up pages Before we started to freeze async notify entries (commit `8eeb4a0f7c`), no one looked at the 'xid' on an entry with invalid 'dboid'. But now we might actually need to freeze it later. Initialize them with InvalidTransactionId to begin with, to avoid that work later. Álvaro pointed this out in review of commit `8eeb4a0f7c`, but I forgot to include this change there. Author: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://www.postgresql.org/message-id/202511071410.52ll56eyixx7@alvherre.pgsql Backpatch-through: 14	2025-11-12 21:19:03 +02:00
Heikki Linnakangas	797e9ea6e5	Fix remaining race condition with CLOG truncation and LISTEN/NOTIFY Previous commit fixed a bug where VACUUM would truncate the CLOG that's still needed to check the commit status of XIDs in the async notify queue, but as mentioned in the commit message, it wasn't a full fix. If a backend is executing asyncQueueReadAllNotifications() and has just made a local copy of an async SLRU page which contains old XIDs, vacuum can concurrently truncate the CLOG covering those XIDs, and the backend still gets an error when it calls TransactionIdDidCommit() on those XIDs in the local copy. This commit fixes that race condition. To fix, hold the SLRU bank lock across the TransactionIdDidCommit() calls in NOTIFY processing. Per Tom Lane's idea. Backpatch to all supported versions. Reviewed-by: Joel Jacobson <joel@compiler.org> Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> Discussion: https://www.postgresql.org/message-id/2759499.1761756503@sss.pgh.pa.us Backpatch-through: 14	2025-11-12 20:59:44 +02:00
Heikki Linnakangas	8eeb4a0f7c	Fix bug where we truncated CLOG that was still needed by LISTEN/NOTIFY The async notification queue contains the XID of the sender, and when processing notifications we call TransactionIdDidCommit() on the XID. But we had no safeguards to prevent the CLOG segments containing those XIDs from being truncated away. As a result, if a backend didn't for some reason process its notifications for a long time, or when a new backend issued LISTEN, you could get an error like: test=# listen c21; ERROR: 58P01: could not access status of transaction 14279685 DETAIL: Could not open file "pg_xact/000D": No such file or directory. LOCATION: SlruReportIOError, slru.c:1087 To fix, make VACUUM "freeze" the XIDs in the async notification queue before truncating the CLOG. Old XIDs are replaced with FrozenTransactionId or InvalidTransactionId. Note: This commit is not a full fix. A race condition remains, where a backend is executing asyncQueueReadAllNotifications() and has just made a local copy of an async SLRU page which contains old XIDs, while vacuum concurrently truncates the CLOG covering those XIDs. When the backend then calls TransactionIdDidCommit() on those XIDs from the local copy, you still get the error. The next commit will fix that remaining race condition. This was first reported by Sergey Zhuravlev in 2021, with many other people hitting the same issue later. Thanks to: - Alexandra Wang, Daniil Davydov, Andrei Varashen and Jacques Combrink for investigating and providing reproducible test cases, - Matheus Alcantara and Arseniy Mukhin for review and earlier proposed patches to fix this, - Álvaro Herrera and Masahiko Sawada for reviews, - Yura Sokolov aka funny-falcon for the idea of marking transactions as committed in the notification queue, and - Joel Jacobson for the final patch version. I hope I didn't forget anyone. Backpatch to all supported versions. I believe the bug goes back all the way to commit `d1e027221d`, which introduced the SLRU-based async notification queue. Discussion: https://www.postgresql.org/message-id/16961-25f29f95b3604a8a@postgresql.org Discussion: https://www.postgresql.org/message-id/18804-bccbbde5e77a68c2@postgresql.org Discussion: https://www.postgresql.org/message-id/CAK98qZ3wZLE-RZJN_Y%2BTFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com Backpatch-through: 14	2025-11-12 20:59:36 +02:00
Heikki Linnakangas	1b4699090e	Escalate ERRORs during async notify processing to FATAL Previously, if async notify processing encountered an error, we would report the error to the client and advance our read position past the offending entry to prevent trying to process it over and over again. Trying to continue after an error has a few problems however: - We have no way of telling the client that a notification was lost. They get an ERROR, but that doesn't tell you much. As such, it's not clear if keeping the connection alive after losing a notification is a good thing. Depending on the application logic, missing a notification could cause the application to get stuck waiting, for example. - If the connection is idle, PqCommReadingMsg is set and any ERROR is turned into FATAL anyway. - We bailed out of the notification processing loop on first error without processing any subsequent notifications. The subsequent notifications would not be processed until another notify interrupt arrives. For example, if there were two notifications pending, and processing the first one caused an ERROR, the second notification would not be processed until someone sent a new NOTIFY. This commit changes the behavior so that any ERROR while processing async notifications is turned into FATAL, causing the client connection to be terminated. That makes the behavior more consistent as that's what happened in idle state already, and terminating the connection is a clear signal to the application that it might've missed some notifications. The reason to do this now is that the next commits will change the notification processing code in a way that would make it harder to skip over just the offending notification entry on error. Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> Discussion: https://www.postgresql.org/message-id/fedbd908-4571-4bbe-b48e-63bfdcc38f64@iki.fi Backpatch-through: 14	2025-11-12 20:59:28 +02:00
Álvaro Herrera	877a024902	Split out innards of pg_tablespace_location() This creates a src/backend/catalog/pg_tablespace.c supporting file containing a new function get_tablespace_location(), which lets the code underlying pg_tablespace_location() be reused for other purposes. Author: Manni Wood <manni.wood@enterprisedb.com> Author: Nishant Sharma <nishant.sharma@enterprisedb.com> Reviewed-by: Vaibhav Dalvi <vaibhav.dalvi@enterprisedb.com> Reviewed-by: Ian Lawrence Barwick <barwick@gmail.com> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/CAKWEB6rmnmGKUA87Zmq-s=b3Scsnj02C0kObQjnbL2ajfPWGEw@mail.gmail.com	2025-11-12 16:39:55 +01:00
Daniel Gustafsson	b4e32a076c	Fix range for commit_siblings in sample conf The range for commit_siblings was incorrectly listed as starting on 1 instead of 0 in the sample configuration file. Backpatch down to all supported branches. Author: Man Zeng <zengman@halodbtech.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/tencent_53B70BA72303AE9C6889E78E@qq.com Backpatch-through: 14	2025-11-12 13:51:53 +01:00
Daniel Gustafsson	9122ff65a1	libpq: threadsafety for SSL certificate callback In order to make the error handling code in backend libpq thread-safe, the global variable used by the certificate verification callback needs to be replaced with passing private data. This moves the threadsafety needle a little bit forward; the call to strerror_r also needs to be replaced with the error buffer made thread local. This is left as future work for when we add the thread primitives required for this to the tree. Author: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/353226C7-97A1-4507-A380-36AA92983AE6@yesql.se	2025-11-12 12:37:40 +01:00
Michael Paquier	040a39ed25	Fix comments of output routines for pg_ndistinct and pg_dependencies Oversights in `7b504eb282` (for pg_ndistinct) and `2686ee1b7c` (for pg_dependencies). Reported-by: Man Zeng <zengman@halodbtech.com> Discussion: https://postgr.es/m/176293711658.2081918.12019224686811870203.pgcf@coridan.postgresql.org	2025-11-12 20:24:10 +09:00
Michael Paquier	2ddc8d9e9b	Move code specific to pg_dependencies to new file This new file is named pg_dependencies.c and includes all the code directly related to the data type pg_dependencies, extracted from the extended statistics code. Some patches are under discussion to change its input and output functions, and this separation makes the follow-up changes cleaner by separating the logic related to the data type and the functional dependencies statistics core logic in dependencies.c. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aQ2k8--a0FfwSwX9@paquier.xyz	2025-11-12 16:53:19 +09:00
Michael Paquier	a552312343	Move code specific to pg_ndistinct to new file This new file is named pg_ndistinct.c and includes all the code directly related to the data type pg_ndistinct, extracted from the extended statistics code. Some patches are under discussion to change its input and output functions, and this separation makes the follow-up changes cleaner by separating the logic related to the data type and the multivariate ndistinct coefficient core logic in mvdistinct.c. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aQ2k8--a0FfwSwX9@paquier.xyz	2025-11-12 16:34:52 +09:00
Amit Kapila	bfb7419b0b	Remove unused assignment in CREATE PUBLICATION grammar. Commit `96b3784973` extended the grammar for CREATE PUBLICATION to support the ALL SEQUENCES variant. However, it unnecessarily prepared publication objects for this variant, which is not required. This was a copy-paste oversight in that commit. Additionally, rename pub_obj_type_list to pub_all_obj_type_list to better reflect its purpose. Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Vignesh C <vignesh21@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CANhcyEWbjkFvk3mSy5LFs9+0z4K1gDwQeFj7GUjOe+L4vxs4AA@mail.gmail.com Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com	2025-11-12 03:28:17 +00:00
Thomas Munro	2421ade663	Prefer spelling "cacheable" over "cachable". Previously we had both in code and comments. Keep the more common and accepted variant. Author: Chao Li <lic@highgo.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/5EBF1771-0566-4D08-9F9B-CDCDEF4BDC98@gmail.com	2025-11-12 14:35:16 +13:00
Michael Paquier	6e1535308c	Report better object limits in error messages for injection points Previously, error messages for oversized injection point names, libraries, and functions showed buffer sizes (64, 128, 128) instead of the usable character limits (63, 127, 127), as they did not account for the zero-terminated byte, which was confusing. These messages are adjusted to better reflect reality. The limit enforced for the private area was also too strict by one byte, as specifying a zone worth exactly INJ_PRIVATE_MAXLEN should be able to work because there is no zero-terminated byte in this case. This is a stylistic change (well, mostly, a private_area size of exactly 1024 bytes can be defined with this change, something that nobody seems to care about based on the lack of complaints). However, as this is a testing facility, let's keep the logic consistent across all the branches where this code exists, as there is an argument in favor of out-of-core extensions that use injection points. Author: Xuneng Zhou <xunengzhou@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CABPTF7VxYp4Hny1h+7ejURY-P4O5-K8WZg79Q3GUx13cQ6B2kg@mail.gmail.com Backpatch-through: 17	2025-11-12 10:18:50 +09:00
Peter Eisentraut	d2f24df19b	Clean up qsort comparison function for GUC entries guc_var_compare() is invoked from qsort() on an array of struct config_generic, but the function accesses these directly as strings (char *). This relies on the name being the first field, so this works. But we can write this more clearly by using the struct and then accessing the field through the struct. Before the reorganization of the GUC structs (commit `a13833c35f`), the old code was probably more convenient, but now we can write this more clearly and correctly. After this change, it is no longer required that the name is the first field in struct config_generic, so remove that comment. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/2c961fa1-14f6-44a2-985c-e30b95654e8d%40eisentraut.org	2025-11-11 07:55:10 +01:00
Nathan Bossart	5e4fcbe531	Check for CREATE privilege on the schema in CREATE STATISTICS. This omission allowed table owners to create statistics in any schema, potentially leading to unexpected naming conflicts. For ALTER TABLE commands that require re-creating statistics objects, skip this check in case the user has since lost CREATE on the schema. The addition of a second parameter to CreateStatistics() breaks ABI compatibility, but we are unaware of any impacted third-party code. Reported-by: Jelte Fennema-Nio <postgres@jeltef.nl> Author: Jelte Fennema-Nio <postgres@jeltef.nl> Co-authored-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Security: CVE-2025-12817 Backpatch-through: 13	2025-11-10 09:00:00 -06:00
Heikki Linnakangas	3e0ae46d90	Move SLRU_PAGES_PER_SEGMENT to pg_config_manual.h It seems plausible that someone might want to experiment with different values. The pressing reason though is that I'm reviewing a patch that requires pg_upgrade to manipulate SLRU files. That patch needs to access SLRU_PAGES_PER_SEGMENT from pg_upgrade code, and slru.h, where SLRU_PAGES_PER_SEGMENT is currently defined, cannot be included from frontend code. Moving it to pg_config_manual.h makes it accessible. Now that it's a little more likely that someone might change SLRU_PAGES_PER_SEGMENT, add a cluster compatibility check for it. Bump catalog version because of the new field in the control file. Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://www.postgresql.org/message-id/c7a4ea90-9f7b-4953-81be-b3fcb47db057@iki.fi	2025-11-10 16:11:41 +02:00
Daniel Gustafsson	3a872ddd64	Fix typos in nodeWindowAgg comments One of them was submitted by the author, with another one spotted during review, so this fixes both. Author: Tender Wang <tndrwang@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CAHewXN=eNx2oJ_hzxJrkSvy-1A5Qf45SM8pxERWXE+6RoZyFrw@mail.gmail.com	2025-11-10 12:51:47 +01:00
Michael Paquier	9d7e851a21	Fix comment in copyto.c Author: Tatsuya Kawata <kawatatatsuya0913@gmail.com> Discussion: https://postgr.es/m/CAHza6qeNbqgMfgDi15Dv6E6GWx+8maRAqe97OwzYz3qpEFouJQ@mail.gmail.com	2025-11-09 08:17:31 +09:00
Bruce Momjian	6204d07ad6	Remove blank line in C code. Was added in commit `5e89985928`. Reported-by: Ashutosh Bapat Author: Ashutosh Bapat Discussion: https://postgr.es/m/CAExHW5tba_biyuMrd_iPVzq-+XvsMdPcEnjQ+d+__V=cjYj8Pg@mail.gmail.com Backpatch-through: master	2025-11-07 21:54:25 -05:00
Alexander Korotkov	7742f99a02	Fix checking for recovery state in WaitForLSN() We only need to do it for WAIT_LSN_TYPE_REPLAY. WAIT_LSN_TYPE_FLUSH can work for both primary and follower.	2025-11-07 23:34:50 +02:00
Peter Eisentraut	a3ea5330fc	Fix "inconsistent DLL linkage" warning on Windows MSVC This warning was disabled in meson.build (warning 4273). If you enable it, it looks like this: ../src/backend/utils/misc/ps_status.c(27): warning C4273: '__p__environ': inconsistent dll linkage C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h(1158): note: see previous definition of '__p__environ' The declaration in ps_status.c was: #if !defined(WIN32) || defined(_MSC_VER) extern char **environ; #endif The declaration in the OS header file is: _DCRTIMP char*** __cdecl __p__environ (void); #define _environ (*__p__environ()) So it is evident that this could be problematic. The old declaration was required by the old MSVCRT library, but we don't support that anymore with MSVC. To fix, disable the re-declaration in ps_status.c, and also in some other places that use the same code pattern but didn't trigger the warning. Then we can also re-enable the warning (delete the disablement in meson.build). Reviewed-by: Bryan Green <dbryan.green@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/flat/bf060644-47ff-441b-97cf-c685d0827757@eisentraut.org	2025-11-07 10:14:25 +01:00
Amit Kapila	f6a4c498dc	Add seq_sync_error_count to subscription statistics. This commit adds a new column, seq_sync_error_count, to the pg_stat_subscription_stats view. This counter tracks the number of errors encountered by the sequence synchronization worker during operation. Since a single worker handles the synchronization of all sequences, this value may reflect errors from multiple sequences. This addition improves observability of sequence synchronization behavior and helps monitor potential issues during replication. Author: Vignesh C <vignesh21@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com	2025-11-07 08:05:08 +00:00
Andres Freund	5310fac6e0	bufmgr: Use atomic sub for unpinning buffers The prior commit made it legal to modify BufferDesc.state while the buffer header spinlock is held. This allows us to replace the CAS loop in UnpinBufferNoOwner() with an atomic sub. This improves scalability significantly. See the prior commits for more background. Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2025-11-06 16:43:16 -05:00
Andres Freund	c75ebc657f	bufmgr: Allow some buffer state modifications while holding header lock Until now BufferDesc.state was not allowed to be modified while the buffer header spinlock was held. This meant that operations like unpinning buffers needed to use a CAS loop, waiting for the buffer header spinlock to be released before updating. The benefit of that restriction is that it allowed us to unlock the buffer header spinlock with just a write barrier and an unlocked write (instead of a full atomic operation). That was important to avoid regressions in `48354581a4`. However, since then the hottest buffer header spinlock uses have been replaced with atomic operations (in particular, the most common use of PinBuffer_Locked(), in GetVictimBuffer() (formerly in BufferAlloc()), has been removed in `5e89985928`). This change will allow, in a subsequent commit, to release buffer pins with a single atomic-sub operation. This previously was not possible while such operations were not allowed while the buffer header spinlock was held, as an atomic-sub would not have allowed a race-free check for the buffer header lock being held. Using atomic-sub to unpin buffers is a nice scalability win, however it is not the primary motivation for this change (although it would be sufficient). The primary motivation is that we would like to merge the buffer content lock into BufferDesc.state, which will result in more frequent changes of the state variable, which in some situations can cause a performance regression, due to an increased CAS failure rate when unpinning buffers. The regression entirely vanishes when using atomic-sub. Naively implementing this would require putting CAS loops in every place modifying the buffer state while holding the buffer header lock. To avoid that, introduce UnlockBufHdrExt(), which can set/add flags as well as the refcount, together with releasing the lock. 
Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2025-11-06 16:42:10 -05:00
David Rowley	448b6a4173	Tidyup WARNING ereports in subscriptioncmds.c A couple of ereports were making use of StringInfos as temporary storage for the portions of the WARNING message. One was doing this to avoid having 2 separate ereports. This was all fairly unnecessary and resulted in more code rather than less code. Refactor out the additional StringInfos and make check_publications_origin_tables() use 2 ereports. In passing, adjust pubnames to become a stack-allocated StringInfoData to avoid having to palloc the temporary StringInfoData. This follows on from the efforts made in `6d0eba662`. Author: Mats Kindahl <mats.kindahl@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/0b381b02-cab9-41f9-a900-ad6c8d26c1fc%40gmail.com	2025-11-07 09:50:02 +13:00
Álvaro Herrera	a2b02293bc	Use XLogRecPtrIsValid() in various places Now that commit `06edbed478` has introduced XLogRecPtrIsValid(), we can use that instead of: - XLogRecPtrIsInvalid() - direct comparisons with InvalidXLogRecPtr - direct comparisons with literal 0 This makes the code more consistent. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aQB7EvGqrbZXrMlg@ip-10-97-1-34.eu-west-3.compute.internal	2025-11-06 20:33:57 +01:00
Peter Eisentraut	aa606b9316	Disallow generated columns in COPY WHERE clause Stored generated columns are not yet computed when the filtering happens, so we need to prohibit them to avoid incorrect behavior. Virtual generated columns currently error out ("unexpected virtual generated column reference"). They could probably work if we expand them in the right place, but for now let's keep them consistent with the stored variant. This doesn't change the behavior, it only gives a nicer error message. Co-authored-by: jian he <jian.universality@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CACJufxHb8YPQ095R_pYDr77W9XKNaXg5Rzy-WP525mkq+hRM3g@mail.gmail.com	2025-11-06 13:54:42 +01:00
Heikki Linnakangas	aa9c5fd3e3	Refactor shared memory allocation for semaphores Before commit `e25626677f`, spinlocks were implemented using semaphores on some platforms (--disable-spinlocks). That made it necessary to initialize semaphores early, before any spinlocks could be used. Now that we don't support --disable-spinlocks anymore, we can allocate the shared memory needed for semaphores the same way as other shared memory structures. Since the semaphores are used only in the PGPROC array, move the semaphore shmem size estimation and initialization calls to ProcGlobalShmemSize() and InitProcGlobal(). Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/CAExHW5seSZpPx-znjidVZNzdagGHOk06F+Ds88MpPUbxd1kTaA@mail.gmail.com	2025-11-06 14:45:00 +02:00
Heikki Linnakangas	daf3d99d2b	Add comment to explain why PGReserveSemaphores() is called early Before commit `e25626677f`, PGReserveSemaphores() had to be called before SpinlockSemaInit() because spinlocks were implemented using semaphores on some platforms (--disable-spinlocks). Add a comment explaining that. Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/CAExHW5seSZpPx-znjidVZNzdagGHOk06F+Ds88MpPUbxd1kTaA@mail.gmail.com Backpatch-to: 18	2025-11-06 14:20:48 +02:00
John Naylor	07b3df5d00	Cosmetic fixes in GiST README Fix a typo, add some missing conjunctions, and make a sentence flow more smoothly. Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Discussion: https://postgr.es/m/CA%2BrenyXZgwzegmO5t%3DUSU%3D9Wo5bc-YqNf-6E7Nv7e577DCmYXA%40mail.gmail.com	2025-11-06 16:35:40 +07:00
Amit Kapila	5a4eba558a	Fix few issues in commit `5509055d69`. Test failure on buildfarm member prion: The test failed due to an unexpected LOCATION: line appearing between the WARNING and ERROR messages. This occurred because the prion machine uses log_error_verbosity = verbose, which includes additional context in error messages. The test was originally checking for both WARNING and ERROR messages in sequence sync, but the extra LOCATION: line disrupted this pattern. To make the test robust across different verbosity settings, it now only checks for the presence of the WARNING message after the test, which is sufficient to validate the intended behavior. Failure to sync sequences with quoted names: The previous implementation did not correctly quote sequence names when querying remote information, leading to failures when quoted sequence names were used. This fix ensures that sequence names are properly quoted during remote queries, allowing sequences with quoted identifiers to be synced correctly. Author: Vignesh C <vignesh21@gmail.com> Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CALDaNm0WcdSCoNPiE-5ek4J2dMJ5o111GPTzKCYj9G5i=ONYtQ@mail.gmail.com Discussion: https://postgr.es/m/CAOzEurQOSN=Zcp9uVnatNbAy=2WgMTJn_DYszYjv0KUeQX_e_A@mail.gmail.com	2025-11-06 08:52:31 +00:00
Michael Paquier	d6c132d83b	Document some structures in attribute_stats.c Like relation_stats.c, these structures are used to track the argument number, names and types of pg_restore_attribute_stats() and pg_clear_attribute_stats(). Extracted from a larger patch by the same author, reworded by me for consistency with relation_stats.c. Author: Corey Huinker <corey.huinker@gmail.com> Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com	2025-11-06 16:22:12 +09:00
Peter Eisentraut	05b9edcb71	Update code comment Should have been part of commit `a13833c35f`.	2025-11-06 07:16:30 +01:00
David Rowley	eaa159632d	Fix UNION planner estimate_num_groups with varno==0 `03d40e4b5` added code to provide better row estimates for when a UNION query ended up only with a single child due to other children being found to be dummy rels. In that case, ordinarily it would be ok to call estimate_num_groups() on the targetlist of the only child path, however that's not safe to do if the UNION child is the result of some other set operation as we generate targetlists containing Vars with varno==0 for those, which estimate_num_groups() can't handle. This could lead to: ERROR: XX000: no relation entry for relid 0 Fix this by avoiding doing this when the only child is the result of another set operation. In that case we'll fall back on the assume-all-rows-are-unique method. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/cfbc99e5-9d44-4806-ba3c-f36b57a85e21@gmail.com	2025-11-06 16:34:55 +13:00
Etsuro Fujita	a3ebec4e4c	Update obsolete comment in ExecScanReScan(). Commit `27cc7cd2b` removed the epqScanDone flag from the EState struct, and instead added an equivalent flag named relsubs_done to the EPQState struct; but it failed to update this comment. Author: Etsuro Fujita <etsuro.fujita@gmail.com> Discussion: https://postgr.es/m/CAPmGK152zJ3fU5avDT5udfL0namrDeVfMTL3dxdOXw28SOrycg%40mail.gmail.com Backpatch-through: 13	2025-11-06 12:25:00 +09:00
David Rowley	6d0eba6627	Use stack allocated StringInfoDatas, where possible Various places that were using StringInfo but didn't need that StringInfo to exist beyond the scope of the function were using makeStringInfo(), which allocates both a StringInfoData and the buffer it uses as two separate allocations. It's more efficient for these cases to use a StringInfoData on the stack and initialize it with initStringInfo(), which only allocates the string buffer. This also simplifies the cleanup, in a few cases. Author: Mats Kindahl <mats.kindahl@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/4379aac8-26f1-42f2-a356-ff0e886228d3@gmail.com	2025-11-06 14:59:48 +13:00
Tom Lane	d4baa327a1	Avoid possible crash within libsanitizer. We've successfully used libsanitizer for a while with the undefined and alignment sanitizers, but with some other sanitizers (at least thread and hwaddress) it crashes due to internal recursion before it's fully initialized itself. It turns out that that's due to the "__ubsan_default_options" hack installed by commit `f686ae82f`, and we can fix it by ensuring that __ubsan_default_options is built without any sanitizer instrumentation hooks. Reported-by: Emmanuel Sibi <emmanuelsibi.mec@gmail.com> Reported-by: Alexander Lakhin <exclusion@gmail.com> Diagnosed-by: Emmanuel Sibi <emmanuelsibi.mec@gmail.com> Fix-suggested-by: Jacob Champion <jacob.champion@enterprisedb.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/F7543B04-E56C-4D68-A040-B14CCBAD38F1@gmail.com Discussion: https://postgr.es/m/dbf77bf7-6e54-ed8a-c4ae-d196eeb664ce@gmail.com Backpatch-through: 16	2025-11-05 11:09:45 -05:00
Alexander Korotkov	447aae13b0	Implement WAIT FOR command WAIT FOR is to be used on standby and specifies waiting for the specific WAL location to be replayed. This option is useful when the user makes some data changes on primary and needs a guarantee that these changes are visible on standby. WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot could prevent the replay of WAL records, implying a kind of self-deadlock. This is why a separate utility command appears to be the most robust way to implement this functionality. It's not possible to implement this as a function. Previous experience shows that stored procedures also have limitations in this aspect. Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru> Author: Alexander Korotkov <aekorotkov@gmail.com> Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: jian he <jian.universality@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>	2025-11-05 11:44:13 +02:00
Alexander Korotkov	3b4e53a075	Add infrastructure for efficient LSN waiting Implement a new facility that allows processes to wait for WAL to reach specific LSNs, both on primary (waiting for flush) and standby (waiting for replay) servers. The implementation uses shared memory with per-backend information organized into pairing heaps, allowing O(1) access to the minimum waited LSN. This enables fast-path checks: after replaying or flushing WAL, the startup process or WAL writer can quickly determine if any waiters need to be awakened. Key components: - New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush() - Separate pairing heaps for replay and flush waiters - WaitLSN lightweight lock for coordinating shared state - Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring This infrastructure can be used by features that need to wait for WAL operations to complete. Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru> Author: Alexander Korotkov <aekorotkov@gmail.com> Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>	2025-11-05 11:44:13 +02:00
Alexander Korotkov	8af3ae0d4b	Add pairingheap_initialize() for shared memory usage The existing pairingheap_allocate() uses palloc(), which allocates from process-local memory. For shared memory use cases, the pairingheap structure must be allocated via ShmemAlloc() or embedded in a shared memory struct. Add pairingheap_initialize() to initialize an already-allocated pairingheap structure in-place, enabling shared memory usage. Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru> Author: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>	2025-11-05 11:44:13 +02:00
Richard Guo	0ea5eee376	Avoid creating duplicate ordered append paths In generate_orderedappend_paths(), the function does not handle the case where the paths in total_subpaths and fractional_subpaths are identical. This situation is not uncommon, and as a result, it may generate two exactly identical ordered append paths. Fix by checking whether total_subpaths and fractional_subpaths contain the same paths, and skipping creation of the ordered append path for the fractional case when they are identical. Given the lack of field complaints about this, I'm a bit hesitant to back-patch, but let's clean it up in HEAD. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-OYsgA75tGGiBARt87G0y_z_GBTSLrzudcJxAzndYkYw@mail.gmail.com	2025-11-05 18:10:54 +09:00
Richard Guo	c1777f2d6d	Fix assertion failure in generate_orderedappend_paths() In generate_orderedappend_paths(), there is an assumption that a child relation's row estimate is always greater than zero. There is an Assert verifying this assumption, and the estimate is also used to convert an absolute tuple count into a fraction. However, this assumption is not always valid -- for example, upper relations can have their row estimates unset, resulting in a value of zero. This can cause an assertion failure in debug builds or lead to the tuple fraction being computed as infinity in production builds. To fix, use the row estimate from the cheapest_total path to compute the tuple fraction. The row estimate in this path should already have been forced to a valid value. In passing, update the comment for generate_orderedappend_paths() to note that the function also considers the cheapest-fractional case when not all tuples need to be retrieved. That is, it collects all the cheapest fractional paths and builds an ordered append path for each interesting ordering. Backpatch to v18, where this issue was introduced. Bug: #19102 Reported-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/19102-93480667e1200169@postgresql.org Backpatch-through: 18	2025-11-05 18:09:21 +09:00
Amit Kapila	5509055d69	Add sequence synchronization for logical replication. This patch introduces sequence synchronization. Sequences that are synced will have 2 states: - INIT (needs [re]synchronizing) - READY (is already synchronized) A new sequencesync worker is launched as needed to synchronize sequences. A single sequencesync worker is responsible for synchronizing all sequences. It begins by retrieving the list of sequences that are flagged for synchronization, i.e., those in the INIT state. These sequences are then processed in batches, allowing multiple entries to be synchronized within a single transaction. The worker fetches the current sequence values and page LSNs from the remote publisher, updates the corresponding sequences on the local subscriber, and finally marks each sequence as READY upon successful synchronization. Sequence synchronization occurs in 3 places: 1) CREATE SUBSCRIPTION - The command syntax remains unchanged. - The subscriber retrieves sequences associated with publications. - Published sequences are added to pg_subscription_rel with INIT state. - Initiate the sequencesync worker to synchronize all sequences. 2) ALTER SUBSCRIPTION ... REFRESH PUBLICATION - The command syntax remains unchanged. - Dropped published sequences are removed from pg_subscription_rel. - Newly published sequences are added to pg_subscription_rel with INIT state. - Initiate the sequencesync worker to synchronize only newly added sequences. 3) ALTER SUBSCRIPTION ... REFRESH SEQUENCES - A new command introduced for PG19 by `f0b3573c3a`. - All sequences in pg_subscription_rel are reset to INIT state. - Initiate the sequencesync worker to synchronize all sequences. - Unlike "ALTER SUBSCRIPTION ... REFRESH PUBLICATION" command, addition and removal of missing sequences will not be done in this case. 
Author: Vignesh C <vignesh21@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com	2025-11-05 05:59:58 +00:00
Michael Paquier	1fd981f053	Drop unnamed portal immediately after execution to completion Previously, unnamed portals were kept until the next Bind message or the end of the transaction. This could cause temporary files to persist longer than expected and make logging not reflect the actual SQL responsible for the temporary file. This patch changes exec_execute_message() to drop unnamed portals immediately after execution to completion at the end of an Execute message, making their removal more aggressive. This forces temporary file cleanups to happen at the same time as the completion of the portal execution, with statement logging correctly reflecting which statements these temporary files were attached to (see the diffs in the TAP test updated by this commit for an idea). The documentation is updated to describe the lifetime of unnamed portals, and test cases are updated to verify temporary file removal and proper statement logging after unnamed portal execution. This changes how unnamed portals are handled in the protocol, hence no backpatch is done. Author: Frédéric Yhuel <frederic.yhuel@dalibo.com> Co-Authored-by: Sami Imseih <samimseih@gmail.com> Co-Authored-by: Mircea Cadariu <cadariu.mircea@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0tTrTUoEr3kDXCuKsvqYGq8OOHiBwoD-dyJocq95uEOTQ%40mail.gmail.com	2025-11-05 14:35:16 +09:00
Richard Guo	59dec6c0b0	Fix comments for ChangeVarNodes() and related functions The comment for ChangeVarNodes() refers to a parameter named change_RangeTblRef, which does not exist in the code. The comment for ChangeVarNodesExtended() contains an extra space, while the comment for replace_relid_callback() has an awkward line break and a typo. This patch fixes these issues and revises some sentences for smoother wording. Oversights in commits `ab42d643c` and `fc069a3a6`. Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs480j16HC1JtjKCgj5WshivT8ZJYkOfTyZAM0POjFomJkg@mail.gmail.com Backpatch-through: 18	2025-11-05 12:29:31 +09:00
Michael Paquier	2fc3107962	Add assertions checking for the startup process in WAL replay routines These assertions may prove to become useful to make sure that no process other than the startup process calls the routines where these checks are added, as we expect that these do not interfere with a WAL receiver switched to a "stopping" state by a startup process. The assumption that only the startup process can use this code has existed for many years, without a check enforcing it. Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/aQmGeVLYl51y1m_0@paquier.xyz	2025-11-05 10:41:50 +09:00
Andres Freund	dae00f333b	aio: Improve assertions related to io_method First, the assertions in assign_io_method() were the wrong way round. Second, the lengthof() assertion checked the length of io_method_options, which is the wrong array to check and is always longer than pgaio_method_ops_table. While at it, add a static assert to ensure pgaio_method_ops_table and io_method_options stay in sync. Per coverity and Tom Lane. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Backpatch-through: 18	2025-11-04 20:03:53 -05:00
Andres Freund	2d83d729d5	jit: Fix accidentally-harmless type confusion In `2a0faed9d7`, which added JIT compilation support for expressions, I accidentally used sizeof(LLVMBasicBlockRef *) instead of sizeof(LLVMBasicBlockRef) as part of computing the size of an allocation. That turns out to have no real negative consequences due to LLVMBasicBlockRef being a pointer itself (and thus having the same size). It still is wrong and confusing, so fix it. Reported by coverity. Backpatch-through: 13	2025-11-04 20:03:53 -05:00
Jeff Davis	d115de9d89	Special case C_COLLATION_OID in pg_newlocale_from_collation(). Allow pg_newlocale_from_collation(C_COLLATION_OID) to work even if there's no catalog access, which some extensions expect. Not known to be a bug without extensions involved, but backport to 18. Also corrects an issue in master with dummy_c_locale (introduced in commit `5a38104b36`) where deterministic was not set. That wasn't a bug, but could have been if that structure was used more widely. Reported-by: Alexander Kukushkin <cyberdemn@gmail.com> Reviewed-by: Alexander Kukushkin <cyberdemn@gmail.com> Discussion: https://postgr.es/m/CAFh8B=nj966ECv5vi_u3RYij12v0j-7NPZCXLYzNwOQp9AcPWQ@mail.gmail.com Backpatch-through: 18	2025-11-04 16:48:16 -08:00
Masahiko Sawada	8ae0f6a0c3	Add CHECK_FOR_INTERRUPTS in Evict{Rel,All}UnpinnedBuffers. This commit adds CHECK_FOR_INTERRUPTS to the shared buffer iteration loops in EvictRelUnpinnedBuffers and EvictAllUnpinnedBuffers. These functions, used by pg_buffercache's pg_buffercache_evict_relation and pg_buffercache_evict_all, can now be interrupted during long-running operations. Backpatch to version 18, where these functions and their corresponding pg_buffercache functions were introduced. Author: Yuhang Qiu <iamqyh@gmail.com> Discussion: https://postgr.es/m/8DC280D4-94A2-4E7B-BAB9-C345891D0B78%40gmail.com Backpatch-through: 18	2025-11-04 15:47:25 -08:00
David Rowley	fdda78e361	Fix possible usage of incorrect UPPERREL_SETOP RelOptInfo `03d40e4b5` allowed dummy UNION [ALL] children to be removed from the plan by checking for is_dummy_rel(). That commit neglected to still account for the relids from the dummy rel so that the correct UPPERREL_SETOP RelOptInfo could be found and used for adding the Paths to. Not doing this could result in processing of subsequent UNIONs using the same RelOptInfo as a previously processed UNION, which could result in add_path() freeing old Paths that are needed by the previous UNION. The same fix was independently submitted (2 mins later) by Richard Guo. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/bee34aec-659c-46f1-9ab7-7bbae0b7616c@gmail.com	2025-11-05 11:48:09 +13:00
Álvaro Herrera	0a3d27bfe0	Fix snapshot handling bug in recent BRIN fix Commit `a95e3d84c0` added ActiveSnapshot push+pop when processing work-items (BRIN autosummarization), but forgot to handle the case of a transaction failing during the run, which drops the snapshot untimely. Fix by making the pop conditional on an element being actually there. Author: Álvaro Herrera <alvherre@kurilemu.de> Backpatch-through: 13 Discussion: https://postgr.es/m/202511041648.nofajnuddmwk@alvherre.pgsql	2025-11-04 20:31:43 +01:00
Tomas Vondra	1213cb4753	Trim TIDs during parallel GIN builds more eagerly The parallel GIN builds perform "freezing" of TID lists when merging chunks built earlier. This means determining what part of the list can no longer change, depending on the last received chunk. The frozen part can be evicted from memory and written out. The code attempted to freeze items right before merging the old and new TID list, after already attempting to trim the current buffer. That means part of the data may get frozen based on the new TID list, but will be trimmed later (on next loop). This increases memory usage. This inverts the order, so that we freeze data first (before trimming). The benefits are likely relatively small, but it's also virtually free with no other downsides. Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com	2025-11-04 20:06:01 +01:00
Tomas Vondra	c98dffcb7c	Limit the size of TID lists during parallel GIN build When building intermediate TID lists during parallel GIN builds, split the sorted lists into smaller chunks, to limit the amount of memory needed when merging the chunks later. The leader may need to keep in memory up to one chunk per worker, and possibly one extra chunk (before evicting some of the data). The code processing item pointers uses regular palloc/repalloc calls, which means it's subject to the MaxAllocSize (1GB) limit. We could fix this by allowing huge allocations, but that'd require changes in many places without much benefit. Larger chunks do not actually improve performance, so the memory usage would be wasted. Fixed by limiting the chunk size to not hit MaxAllocSize. Each worker gets a fair share. This requires remembering the number of participating workers, in a place that can be accessed from the callback. Luckily, the bs_worker_id field in GinBuildState was unused, so repurpose that. Report by Greg Smith, investigation and fix by me. Backpatched to 18, where parallel GIN builds were introduced. Reported-by: Gregory Smith <gregsmithpgsql@gmail.com> Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com Backpatch-through: 18	2025-11-04 18:51:17 +01:00
Jeff Davis	4bfaea11d2	Remove redundant memset() introduced by `a0942f4`. Reported-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAEoWx2kAkNaDa01O0nKsQmkfEmxsDvm09SU=f1T0CV8ew3qJEA@mail.gmail.com	2025-11-04 09:46:00 -08:00
Tom Lane	ff4597acd4	Allow "SET list_guc TO NULL" to specify setting the GUC to empty. We have never had a SET syntax that allows setting a GUC_LIST_INPUT parameter to be an empty list. A locution such as SET search_path = ''; doesn't mean that; it means setting the GUC to contain a single item that is an empty string. (For search_path the net effect is much the same, because search_path ignores invalid schema names and '' must be invalid.) This is confusing, not least because configuration-file entries and the set_config() function can easily produce empty-list values. We considered making the empty-string syntax do this, but that would foreclose ever allowing empty-string items to be valid in list GUCs. While there isn't any obvious use-case for that today, it feels like the kind of restriction that might hurt someday. Instead, let's accept the forbidden-up-to-now value NULL and treat that as meaning an empty list. (An objection to this could be "what if we someday want to allow NULL as a GUC value?". That seems unlikely though, and even if we did allow it for scalar GUCs, we could continue to treat it as meaning an empty list for list GUCs.) Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrei Klychkov <andrew.a.klychkov@gmail.com> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Discussion: https://postgr.es/m/CA+mfrmwsBmYsJayWjc8bJmicxc3phZcHHY=yW5aYe=P-1d_4bg@mail.gmail.com	2025-11-04 12:37:40 -05:00
Peter Eisentraut	040cc5f3c7	Tighten check for generated column in partition key expression A generated column may end up being part of the partition key expression, if it's specified as an expression e.g. "(<generated column name>)" or if the partition key expression contains a whole-row reference, even though we do not allow a generated column to be part of partition key expression. Fix this hole. Co-authored-by: jian he <jian.universality@gmail.com> Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Discussion: https://www.postgresql.org/message-id/flat/CACJufxF%3DWDGthXSAQr9thYUsfx_1_t9E6N8tE3B8EqXcVoVfQw%40mail.gmail.com	2025-11-04 14:46:58 +01:00
Álvaro Herrera	a95e3d84c0	BRIN autosummarization may need a snapshot It's possible to define BRIN indexes on functions that require a snapshot to run, but the autosummarization feature introduced by commit `7526e10224` fails to provide one. This causes autovacuum to leave a BRIN placeholder tuple behind after a failed work-item execution, making such indexes less efficient. Repair by obtaining a snapshot prior to running the task, and add a test to verify this behavior. Author: Álvaro Herrera <alvherre@kurilemu.de> Reported-by: Giovanni Fabris <giovanni.fabris@icon.it> Reported-by: Arthur Nascimento <tureba@gmail.com> Backpatch-through: 13 Discussion: https://postgr.es/m/202511031106.h4fwyuyui6fz@alvherre.pgsql	2025-11-04 13:23:26 +01:00
Peter Eisentraut	c09a06918d	Error message stylistic correction Fixup for commit `ef5e60a9d3`: The inconsistent use of articles was a bit awkward.	2025-11-04 12:25:04 +01:00
Michael Paquier	65f4976189	Add assertion check for WAL receiver state during stream-archive transition When the startup process switches from streaming to archive as WAL source, we avoid calling ShutdownWalRcv() if the WAL receiver is not streaming, based on WalRcvStreaming(). WALRCV_STOPPING is a state set by ShutdownWalRcv(), called only by the startup process, meaning that it should not be possible to reach this state while in WaitForWALToBecomeAvailable(). This commit adds an assertion to make sure that a WAL receiver is never in a WALRCV_STOPPING state should the startup process attempt to reset InstallXLogFileSegmentActive. Idea suggested by Noah Misch. Author: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org	2025-11-04 13:14:46 +09:00
Michael Paquier	e0ca61e7c4	Add WalRcvGetState() to retrieve the state of a WAL receiver This has come up as useful as an alternative of WalRcvStreaming(), to be able to do sanity checks based on the state of a WAL receiver. This will be used in a follow-up commit. Author: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org	2025-11-04 12:57:36 +09:00
Michael Paquier	17b2d5ec75	Fix unconditional WAL receiver shutdown during stream-archive transition Commit `b4f584f9d2` (affecting v15~, later backpatched down to 13 as of `3635a0a35a`) introduced an unconditional WAL receiver shutdown when switching from streaming to archive WAL sources. This causes problems during a timeline switch, when a WAL receiver enters WALRCV_WAITING state but remains alive, waiting for instructions. The unconditional shutdown can break some monitoring scenarios as the WAL receiver gets repeatedly terminated and re-spawned, causing pg_stat_wal_receiver.status to show a "streaming" instead of "waiting" status, masking the fact that the WAL receiver is waiting for a new TLI and a new LSN to be able to continue streaming. This commit changes the WAL receiver behavior so as the shutdown becomes conditional, with InstallXLogFileSegmentActive being always reset to prevent the regression fixed by `b4f584f9d2`: only terminate the WAL receiver when it is actively streaming (WALRCV_STREAMING, WALRCV_STARTING, or WALRCV_RESTARTING). When in WALRCV_WAITING state, just reset InstallXLogFileSegmentActive flag to allow archive restoration without killing the process. WALRCV_STOPPED and WALRCV_STOPPING are not reachable states in this code path. For the latter, the startup process is the one in charge of setting WALRCV_STOPPING via ShutdownWalRcv(), waiting for the WAL receiver to reach a WALRCV_STOPPED state after switching walRcvState, so WaitForWALToBecomeAvailable() cannot be reached while a WAL receiver is in a WALRCV_STOPPING state. A regression test is added to check that a WAL receiver is not stopped on timeline jump, that fails when the fix of this commit is reverted. 
Reported-by: Ryan Bird <ryanzxg@gmail.com> Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org Backpatch-through: 13	2025-11-04 10:47:38 +09:00
Noah Misch	8b18ed6dfb	Doc: cover index CONCURRENTLY causing errors in INSERT ... ON CONFLICT. Author: Mikhail Nikalayeu <mihailnikalayeu@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/CANtu0ojXmqjmEzp-=aJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg@mail.gmail.com Backpatch-through: 13	2025-11-03 12:57:09 -08:00
Masahiko Sawada	e7ccb247b3	Fix outdated comment of COPY in gram.y. Author: ChangAo Chen <cca5507@qq.com> Discussion: https://postgr.es/m/tencent_392C0E92EC52432D0A336B9D52E66426F009@qq.com	2025-11-03 10:34:49 -08:00
Álvaro Herrera	cf8be02253	Prevent setting a column as identity if its not-null constraint is invalid We don't allow null values to appear in identity-generated columns in other ways, so we shouldn't let unvalidated not-null constraints do it either. Oversight in commit `a379061a22`. Author: jian he <jian.universality@gmail.com> Backpatch-through: 18 Discussion: https://postgr.es/m/CACJufxGQM_+vZoYJMaRoZfNyV=L2jxosjv_0TLAScbuLJXWRfQ@mail.gmail.com	2025-11-03 15:58:19 +01:00
Michael Paquier	ad25744f43	Add wal_fpi_bytes to VACUUM and ANALYZE logs The new wal_fpi_bytes counter calculates the total amount of full page images inserted in WAL records, in bytes. This commit adds this information to VACUUM and ANALYZE logs alongside the existing counters, building upon `f9a09aa295`. Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aQMMSSlFXy4Evxn3@paquier.xyz	2025-11-03 19:42:03 +09:00
Peter Eisentraut	fce7c73fba	Sort guc_parameters.dat alphabetically by name The order in this list was previously pretty random and had grown organically over time. This made it unnecessarily cumbersome to maintain these lists, as there were no clear guidelines about where to put new entries. Also, after the merger of the type-specific GUC structs, the list still reflected the previous type-specific super-order. By using alphabetical order, the place for new entries becomes clear, and often related entries will be listed close together. This patch reorders the existing entries in guc_parameters.dat, and it also augments the generation script to error if an entry is found at the wrong place. Note: The order is actually checked after lower-casing, to handle the likes of "DateStyle". Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org	2025-11-03 10:04:14 +01:00
Tom Lane	8f29467c57	Change "long" numGroups fields to be Cardinality (i.e., double). We've been nibbling away at removing uses of "long" for a long time, since its width is platform-dependent. Here's one more: change the remaining "long" fields in Plan nodes to Cardinality, since the three surviving examples all represent group-count estimates. The upstream planner code was converted to Cardinality some time ago; for example the corresponding fields in Path nodes are type Cardinality, as are the arguments of the make_foo_path functions. Downstream in the executor, it turns out that these all feed to the table-size argument of BuildTupleHashTable. Change that to "double" as well, and fix it so that it safely clamps out-of-range values to the uint32 limit of simplehash.h, as was not being done before. Essentially, this is removing all the artificial datatype-dependent limitations on these values from upstream processing, and applying just one clamp at the moment where we're forced to do so by the datatype choices of simplehash.h. Also, remove BuildTupleHashTable's misguided attempt to enforce work_mem/hash_mem_limit. It doesn't have enough information (particularly not the expected tuple width) to do that accurately, and it has no real business second-guessing the caller's choice. For all these plan types, it's really the planner's responsibility to not choose a hashed implementation if the hashtable is expected to exceed hash_mem_limit. The previous patch improved the accuracy of those estimates, and even if BuildTupleHashTable had more information it should arrive at the same conclusions. Reported-by: Jeff Janes <jeff.janes@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com	2025-11-02 16:57:43 -05:00
Tom Lane	1ea5bdb00b	Improve planner's estimates of tuple hash table sizes. For several types of plan nodes that use TupleHashTables, the planner estimated the expected size of the table as basically numEntries * (MAXALIGN(dataWidth) + MAXALIGN(SizeofHeapTupleHeader)). This is pretty far off, especially for small data widths, because it doesn't account for the overhead of the simplehash.h hash table nor for any per-tuple "additional space" the plan node may request. Jeff Janes noted a case where the estimate was off by about a factor of three, even though the obvious hazards such as inaccurate estimates of numEntries or dataWidth didn't apply. To improve matters, create functions provided by the relevant executor modules that can estimate the required sizes with reasonable accuracy. (We're still not accounting for effects like allocator padding, but this at least gets the first-order effects correct.) I added functions that can estimate the tuple table sizes for nodeSetOp and nodeSubplan; these rely on an estimator for TupleHashTables in general, and that in turn relies on one for simplehash.h hash tables. That feels like kind of a lot of mechanism, but if we take any short-cuts we're violating modularity boundaries. The other places that use TupleHashTables are nodeAgg, which took pains to get its numbers right already, and nodeRecursiveunion. I did not try to improve the situation for nodeRecursiveunion because there's nothing to improve: we are not making an estimate of the hash table size, and it wouldn't help us to do so because we have no non-hashed alternative implementation. On top of that, our estimate of the number of entries to be hashed in that module is so suspect that we'd likely often choose the wrong implementation if we did have two ways to do it. 
Reported-by: Jeff Janes <jeff.janes@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com	2025-11-02 16:57:26 -05:00
Peter Geoghegan	b8f1c62807	Document nbtree row comparison design. Add comments explaining when and where it is safe for nbtree to treat row compare keys as if they were simple scalar inequality keys on the row's most significant column. This is particularly important within _bt_advance_array_keys, which deals with required inequality keys in a general and uniform way, without any special handling for row compares. Also spell out the implications of _bt_check_rowcompare's approach of _conditionally_ evaluating lower-order row compare subkeys, particularly when one of its lower-order subkeys might see NULL index tuple values (these may or may not affect whether the qual as a whole is satisfied). The behavior in this area isn't particularly intuitive, so these issues seem worth going into. In passing, add a few more defensive/documenting row comparison related assertions to _bt_first and _bt_check_rowcompare. Follow-up to commits `bd3f59fd` and `ec986020`. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Victor Yegorov <vyegorov@gmail.com> Reviewed-By: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAH2-Wznwkak_K7pcAdv9uH8ZfNo8QO7+tHXOaCUddMeTfaCCFw@mail.gmail.com Backpatch-through: 18	2025-11-02 15:27:05 -05:00
Peter Geoghegan	4f08586c7a	Remove obsolete nbtree equality key comments. _bt_first reliably uses the same equality key (on each index column) for initial positioning purposes as the one that _bt_checkkeys can use to end the scan following commit `f09816a0`. _bt_first no longer applies its own independent rules to determine which initial positioning key to use on each column (for equality and inequality keys alike). Preprocessing is now fully in control of determining which keys start and end each scan, ensuring that _bt_first and _bt_checkkeys have symmetric behavior. Remove obsolete comments that described why _bt_first was expected to use at least one of the available required equality keys for initial positioning purposes. The rules in this area are now maximally strict and uniform, so there's no reason to draw attention to equality keys. Any column with a required equality key cannot have a redundant required inequality key (nor can it have a redundant required equality key). Oversight in commit `f09816a0`, which removed similar comments from _bt_first, but missed these comments. Author: Peter Geoghegan <pg@bowt.ie> Backpatch-through: 18	2025-11-02 13:34:18 -05:00
Peter Eisentraut	8a27d418f8	Mark function arguments of type "Datum *" as "const Datum *" where possible Several functions in the codebase accept "Datum *" parameters but do not modify the pointed-to data. These have been updated to take "const Datum *" instead, improving type safety and making the interfaces clearer about their intent. This change helps the compiler catch accidental modifications and better documents immutability of arguments. Most of "Datum *" parameters have a pairing "bool *isnull" parameter, they are constified as well. No functional behavior is changed by this patch. Author: Chao Li <lic@highgo.com> Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2msfT0knvzUa72ZBwu9LR_RLY4on85w2a9YpE-o2By5HQ@mail.gmail.com	2025-10-31 10:47:25 +01:00
Peter Eisentraut	aa4535307e	formatting.c cleanup: Change fill_str() return type to void The return value is not used anywhere. In passing, add a comment explaining the function's arguments. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-31 09:55:12 +01:00
Peter Eisentraut	da2052ab9a	formatting.c cleanup: Rename DCH_S_* to DCH_SUFFIX_* For clarity. Also rename several related macros and turn them into inline functions. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-31 08:06:46 +01:00
Peter Eisentraut	378212c68a	formatting.c cleanup: Change several int fields to enums This makes their purpose more self-documenting. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-31 08:06:46 +01:00
Peter Eisentraut	ce5f6817e4	formatting.c cleanup: Change TmFromChar.clock field to bool This makes the purpose clearer and avoids having two extra symbols, one of which (CLOCK_24_HOUR) was unused. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-31 08:06:46 +01:00
Tom Lane	c106ef0807	Use BumpContext contexts in TupleHashTables, and do some code cleanup. For all extant uses of TupleHashTables, execGrouping.c itself does nothing with the "tablecxt" except to allocate new hash entries in it, and the callers do nothing with it except to reset the whole context. So this is an ideal use-case for a BumpContext, and the hash tables are frequently big enough for the savings to be significant. (Commit `cc721c459` already taught nodeAgg.c this idea, but neglected the other callers of BuildTupleHashTable.) While at it, let's clean up some ill-advised leftovers from rebasing TupleHashTables on simplehash.h: * Many comments and variable names were based on the idea that the tablecxt holds the whole TupleHashTable, whereas now it only holds the hashed tuples (plus any caller-defined "additional storage"). Rename to names like tuplescxt and tuplesContext, and adjust the comments. Also adjust the memory context names to be like "<Foo> hashed tuples". * Make ResetTupleHashTable() reset the tuplescxt rather than relying on the caller to do so; that was fairly bizarre and seems like a recipe for leaks. This is less efficient in the case where nodeAgg.c uses the same tuplescxt for several different hashtables, but only microscopically so because mcxt.c will short-circuit the extra resets via its isReset flag. I judge the extra safety and intellectual cleanliness well worth those few cycles. * Remove the long-obsolete "allow_jit" check added by ac88807f9; instead, just Assert that metacxt and tuplescxt are different. We need that anyway for this definition of ResetTupleHashTable() to be safe. There is a side issue of the extent to which this change invalidates the planner's estimates of hashtable memory consumption. However, those estimates are already pretty bad, so improving them seems like it can be a separate project. This change is useful to do first to establish consistent executor behavior that the planner can expect. 
A loose end not addressed here is that the "entrysize" calculation in BuildTupleHashTable seems wrong: "sizeof(TupleHashEntryData) + additionalsize" corresponds neither to the size of the simplehash entries nor to the total space needed per tuple. It's questionable why BuildTupleHashTable is second-guessing its caller's nbuckets choice at all, since the original source of the number should have had more information. But that all seems wrapped up with the planner's estimation logic, so let's leave it for the planned followup patch. Reported-by: Jeff Janes <jeff.janes@gmail.com> Reported-by: David Rowley <dgrowleyml@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com Discussion: https://postgr.es/m/2268409.1761512111@sss.pgh.pa.us	2025-10-30 11:21:22 -04:00
Peter Eisentraut	e1ac846f3d	Mark ItemPointer arguments as const throughout This is a follow-up to `991295f`. I searched over src/ and made ItemPointer arguments const wherever possible. Note: We cut out from the original patch the pieces that would have created incompatibilities in the index or table AM APIs. Those could be considered separately. Author: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/CAEoWx2nBaypg16Z5ciHuKw66pk850RFWw9ACS2DqqJ_AkKeRsw%40mail.gmail.com	2025-10-30 14:12:06 +01:00
Álvaro Herrera	a27c40bfe8	Simplify coding in ProcessQuery The original is pretty baroque for no apparent reason; arguably, commit `2f9661311b` should have done this. Noted while reviewing related code for bug #18984. This is cosmetic (though I'm surprised that my compiler generates shorter assembly this way), so no backpatch. Discussion: https://postgr.es/m/18984-0f4778a6599ac3ae@postgresql.org	2025-10-30 11:26:35 +01:00
Peter Eisentraut	8ce795fcb7	Fix some confusing uses of const There are a few places where we have typedef struct FooData { ... } FooData; typedef FooData *Foo; and then function declarations with bar(const Foo x) which isn't incorrect but probably meant bar(const FooData *x) meaning that the thing x points to is immutable, not x itself. This patch makes those changes where appropriate. In one case (execGrouping.c), the thing being pointed to was not immutable, so in that case remove the const altogether, to avoid further confusion. Co-authored-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/CAEoWx2m2E0xE8Kvbkv31ULh_E%2B5zph-WA_bEdv3UR9CLhw%2B3vg%40mail.gmail.com Discussion: https://www.postgresql.org/message-id/CAEoWx2kTDz%3Db6T2xHX78vy_B_osDeCC5dcTCi9eG0vXHp5QpdQ%40mail.gmail.com	2025-10-30 11:20:04 +01:00
Peter Eisentraut	3479a0f823	const-qualify ItemPointer comparison functions Add const qualifiers to ItemPointerEquals() and ItemPointerCompare(). This will allow further changes up the stack. It also complements commit `aeb767ca0b`, as we now have all of itemptr.h appropriately const-qualified. Author: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2nBaypg16Z5ciHuKw66pk850RFWw9ACS2DqqJ_AkKeRsw@mail.gmail.com	2025-10-30 10:13:47 +01:00
Peter Eisentraut	e2cf524e4a	formatting.c cleanup: Improve formatting of some struct declarations This makes future editing easier. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-30 08:35:33 +01:00
Peter Eisentraut	9a1a5dfee8	formatting.c cleanup: Remove unnecessary zeroize macros Replace with initializer or memset(). Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-30 08:35:28 +01:00
Peter Eisentraut	38506f55fd	formatting.c cleanup: Remove unnecessary extra line breaks in error message literals Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-30 08:35:18 +01:00
Michael Paquier	5ab0b6a248	Expose wal_fpi_bytes in EXPLAIN (WAL) The new wal_fpi_bytes counter calculates the total amount of full page images inserted in WAL records, in bytes. This commit exposes this information in EXPLAIN (ANALYZE, WAL) alongside the existing counters, for both the text and JSON/YAML outputs, building upon `f9a09aa295`. Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAOzEurQtZEAfg6P0kU3Wa-f9BWQOi0RzJEMPN56wNTOmJLmfaQ@mail.gmail.com	2025-10-30 15:34:01 +09:00
Michael Paquier	d432094689	Fix regression with slot invalidation checks This commit reverts `818fefd8fd`, which was introduced to address an instability in some of the TAP tests due to the presence of random standby snapshot WAL records, when slots are invalidated by InvalidatePossiblyObsoleteSlot(). However, that commit also had the consequence of introducing a behavior regression. After `818fefd8fd`, the code may determine that a slot needs to be invalidated while it may not require one: the slot may have moved from a conflicting state to a non-conflicting state between the moment when the mutex is released and the moment when we recheck the slot, in InvalidatePossiblyObsoleteSlot(). Hence, the invalidations may be more aggressive than they actually have to be. `105b2cb336` has tackled the test instability in a way that should be hopefully sufficient for the buildfarm, even for slow members: - In v18, the test relies on an injection point that bypasses the creation of the random records generated for standby snapshots, eliminating the random factor that impacted the test. This option was not available when `818fefd8fd` was discussed. - In v16 and v17, the problem was bypassed by disallowing a slot to become active in some of the scenarios tested. While on it, this commit adds a comment to document that it is fine for a recheck to use xmin and LSN values stored in the slot, without storing and reusing them across multiple checks. Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/f492465f-657e-49af-8317-987460cb68b0.mengjuan.cmj@alibaba-inc.com Backpatch-through: 16	2025-10-30 13:13:28 +09:00
Richard Guo	257ee78341	Disable parallel plans for RIGHT_SEMI joins RIGHT_SEMI joins rely on the HEAP_TUPLE_HAS_MATCH flag to guarantee that only the first match for each inner tuple is considered. However, in a parallel hash join, the inner relation is stored in a shared global hash table that can be probed by multiple workers concurrently. This allows different workers to inspect and set the match flags of the same inner tuples at the same time. If two workers probe the same inner tuple concurrently, both may see the match flag as unset and emit the same tuple, leading to duplicate output rows and violating RIGHT_SEMI join semantics. For now, we disable parallel plans for RIGHT_SEMI joins. In the long term, it may be possible to support parallel execution by performing atomic operations on the match flag, for example using a CAS or similar mechanism. Backpatch to v18, where RIGHT_SEMI join was introduced. Bug: #19094 Reported-by: Lori Corbani <Lori.Corbani@jax.org> Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19094-6ed410eb5b256abd@postgresql.org Backpatch-through: 18	2025-10-30 11:58:45 +09:00
David Rowley	50eb4e1181	Fix bogus use of "long" in AllocSetCheck() Because long is 32-bit on 64-bit Windows, it isn't a good datatype to store the difference between 2 pointers. The under-sized type could overflow and lead to scary warnings in MEMORY_CONTEXT_CHECKING builds, such as: WARNING: problem in alloc set ExecutorState: bad single-chunk %p in block %p However, the problem lies only in the code running the check, not from an actual memory accounting bug. Fix by using "Size" instead of "long". This means using an unsigned type rather than the previous signed type. If the block's freeptr was corrupted, we'd still catch that if the unsigned type wrapped. Unsigned allows us to avoid further needless complexities around comparing signed and unsigned types. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Backpatch-through: 13 Discussion: https://postgr.es/m/CAApHDvo-RmiT4s33J=aC9C_-wPZjOXQ232V-EZFgKftSsNRi4w@mail.gmail.com	2025-10-30 14:48:10 +13:00
Jeff Davis	3853a6956c	Use C11 char16_t and char32_t for Unicode code points. Reviewed-by: Tatsuo Ishii <ishii@postgresql.org> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/bedcc93d06203dfd89815b10f815ca2de8626e85.camel%40j-davis.com	2025-10-29 14:17:13 -07:00
Álvaro Herrera	94f95d91b0	CheckNNConstraintFetch: Fill all of ConstrCheck in a single pass Previously, we'd fill all fields except ccbin, and only later obtain and detoast ccbin, with hypothetical failures being possible. If ccbin is null (rare catalog corruption I have never witnessed) or it's a corrupted toast entry, we leak a tiny bit of memory in CacheMemoryContext from having strdup'd the constraint name. Repair these by only attempting to fill the struct once ccbin has been detoasted. Author: Ranier Vilela <ranier.vf@gmail.com> Discussion: https://postgr.es/m/CAEudQAr=i3_Z4GvmediX900+sSySTeMkvuytYShhQqEwoGyvhA@mail.gmail.com	2025-10-29 11:41:39 +01:00
Peter Eisentraut	a13833c35f	Reorganize GUC structs Instead of having five separate GUC structs, one for each type, with the generic part contained in each of them, flip it around and have one common struct, with the type-specific part as a subfield. The very original GUC design had type-specific structs and type-specific lists, and the membership in one of the lists defined the type. But now the structs themselves know the type (from the .vartype field), and they are all loaded into a common hash table at run time, and so this original separation no longer makes sense. It creates a bunch of inconsistencies in the code about whether the type-specific or the generic struct is the primary struct, and a lot of casting in between, which makes certain assumptions about the struct layouts. After the change, all these casts are gone and all the data is accessed via normal field references. Also, various code is simplified because only one kind of struct needs to be processed. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org	2025-10-29 09:52:29 +01:00
Peter Eisentraut	2724830929	formatting.c cleanup: Remove unnecessary extra parentheses Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-29 09:29:00 +01:00
Peter Eisentraut	6271d9922e	formatting.c cleanup: Use array syntax instead of pointer arithmetic for easier readability Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-29 09:28:55 +01:00
Peter Eisentraut	b9def57a3c	formatting.c cleanup: Add some const pointer qualifiers Co-authored-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-29 09:28:50 +01:00
Peter Eisentraut	d98b3cdbaf	formatting.c cleanup: Use size_t for string length variables and arguments Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-29 09:28:43 +01:00
Michael Paquier	d3111cb753	Fix correctness issue with computation of FPI size in WAL stats XLogRecordAssemble() may be called multiple times before inserting a record in XLogInsertRecord(), and the amount of FPIs generated inside a record whose insertion is attempted multiple times may vary. The logic added in `f9a09aa295` touched directly pgWalUsage in XLogRecordAssemble(), meaning that it could be possible for pgWalUsage to be incremented multiple times for a single record. This commit changes the code to use the same logic as the number of FPIs added to a record, where XLogRecordAssemble() returns this information and feeds it to XLogInsertRecord(), updating pgWalUsage only when a record is inserted. Reported-by: Shinya Kato <shinya11.kato@gmail.com> Discussion: https://postgr.es/m/CAOzEurSiSr+rusd0GzVy8Bt30QwLTK=ugVMnF6=5WhsSrukvvw@mail.gmail.com	2025-10-29 09:13:31 +09:00
Peter Eisentraut	03fbb0814c	formatting.c cleanup: Move loop variables definitions into for statement Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-28 19:20:17 +01:00
Peter Eisentraut	95924672d5	formatting.c cleanup: Remove dashes in comments This saves some vertical space and makes the comments style more consistent with the rest of the code. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org	2025-10-28 19:20:02 +01:00
Álvaro Herrera	d5845aa8ad	Don't error out when dropping constraint if relchecks is already zero I have never seen this be a problem in practice, but it came up when purposely corrupting catalog contents to study the fix for a nearby bug: we'd try to decrement relchecks, but since it's zero we error out and fail to drop the constraint. The fix is to downgrade the error to warning, skip decrementing the counter, and otherwise proceed normally. Given lack of field complaints, no backpatch. Author: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/202508291058.q2zscdcs64fj@alvherre.pgsql	2025-10-28 19:13:32 +01:00
Jeff Davis	4da12e9e2e	Move comment about casts from pg_wchar. Suggested-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/CA+hUKGLXQUYK7Cq5KbLGgTWo7pORs7yhBWO1AEnZt7xTYbLRhg@mail.gmail.com	2025-10-28 10:49:20 -07:00
Peter Eisentraut	35e53b6841	Check that index can return in get_actual_variable_range() Some recent changes were made to remove the explicit dependency on btree indexes in some parts of the code. One of these changes was made in commit `9ef1851685`, which allows non-btree indexes to be used in get_actual_variable_range(). A follow-up commit `ee1ae8b99f` fixes the cases where an index doesn’t have a sortopfamily as this is a prerequisite to be used in get_actual_variable_range(). However, it was found that indexes that have amcanorder = true but do not allow index-only-scans (amcanreturn returns false or is NULL) will pass all of the conditions, while they should be rejected since get_actual_variable_range() uses the index-only-scan machinery in get_actual_variable_endpoint(). Such an index might cause errors like ERROR: no data returned for index-only scan during query planning. The fix is to add a check in get_actual_variable_range() to reject indexes that do not allow index-only scans. Author: Maxime Schoemans <maxime.schoemans@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/20ED852A-C2D9-41EB-8671-8C8B9D418BE9%40enterprisedb.com	2025-10-28 10:07:29 +01:00
Michael Paquier	f9a09aa295	Add wal_fpi_bytes to pg_stat_wal and pg_stat_get_backend_wal() This new counter, called "wal_fpi_bytes", tracks the total amount in bytes of full page images (FPIs) generated in WAL. This data becomes available globally via pg_stat_wal, and for backend statistics via pg_stat_get_backend_wal(). Previously, this information could only be retrieved with pg_waldump or pg_walinspect, which may not be available depending on the environment, and are expensive to execute. It offers hints about how much FPIs impact the WAL generated, which could be a large percentage for some workloads, as well as the effects of wal_compression or page holes. Bump catalog version. Bump PGSTAT_FILE_FORMAT_ID, due to the addition of wal_fpi_bytes in PgStat_WalCounters. Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAOzEurQtZEAfg6P0kU3Wa-f9BWQOi0RzJEMPN56wNTOmJLmfaQ@mail.gmail.com	2025-10-28 16:21:51 +09:00
Amit Kapila	3e8e05596a	Add worker type argument to logical replication worker functions. Extend logicalrep_worker_stop, logicalrep_worker_wakeup, and logicalrep_worker_find to accept a worker type argument. This change enables differentiation between logical replication worker types, such as apply workers and table sync workers. While preserving existing behavior, it lays the groundwork for an upcoming patch to add sequence synchronization workers. Author: Vignesh C <vignesh21@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com	2025-10-28 05:47:50 +00:00
Nathan Bossart	123661427b	Fix a couple of comments. These were discovered while reviewing Aleksander Alekseev's proposed changes to pgindent. Oversights in commits `393e0d2314` and `25a30bbd42`. Discussion: https://postgr.es/m/aP-H6kSsGOxaB21k%40nathan	2025-10-27 10:30:05 -05:00
Peter Eisentraut	10b5bb3bff	Add some const qualifications Add some const qualifications afforded by the previous change that added a const qualification to PageAddItemExtended(). Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://www.postgresql.org/message-id/flat/c75cccf5-5709-407b-a36a-2ae6570be766@eisentraut.org	2025-10-27 09:55:59 +01:00
Peter Eisentraut	76acf4b722	Remove Item type This type is just char * underneath, it provides no real value, no type safety, and just makes the code one level more mysterious. It is more idiomatic to refer to blobs of memory by a combination of void * and size_t, so change it to that. Also, since this type hides the pointerness, we can't apply qualifiers to what is pointed to, which requires some unconstify nonsense. This change allows fixing that. Extension code that uses the Item type can change its code to use void * to be backward compatible. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://www.postgresql.org/message-id/flat/c75cccf5-5709-407b-a36a-2ae6570be766@eisentraut.org	2025-10-27 09:55:59 +01:00
Amit Kapila	e0dc4bbfb8	Fix GUC check_hook validation for synchronized_standby_slots. Previously, the check_hook for synchronized_standby_slots attempted to validate that each specified slot existed and was physical. However, these checks were not performed during server startup. As a result, if users configured non-existent slots before startup, the misconfiguration would go undetected initially. This could later cause parallel query failures, as newly launched workers would detect the issue and raise an ERROR. This patch improves the check_hook by validating the syntax and format of slot names. Validation of slot existence and type is deferred to the WAL sender process, aligning with the behavior of the check_hook for primary_slot_name. Reported-by: Fabrice Chapuis <fabrice636861@gmail.com> Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Backpatch-through: 17, where it was introduced Discussion: https://postgr.es/m/CAA5-nLCeO4MQzWipCXH58qf0arruiw0OeUc1+Q=Z=4GM+=v1NQ@mail.gmail.com	2025-10-27 06:48:32 +00:00
Jeff Davis	371a302eec	Comment typo fixes: pg_wchar_t should be pg_wchar. Reported-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/CA+hUKGJ5Xh0KxLYXDZuPvw1_fHX=yuzb4xxtam1Cr6TPZZ1o+w@mail.gmail.com	2025-10-26 12:31:50 -07:00
David Rowley	39dcfda2d2	Fix incorrect logic for caching ResultRelInfos for triggers When dealing with ResultRelInfos for partitions, there are cases where there are mixed requirements for the ri_RootResultRelInfo. There are cases when the partition itself requires a NULL ri_RootResultRelInfo and in the same query, the same partition may require a ResultRelInfo with its parent set in ri_RootResultRelInfo. This could cause the column mapping between the partitioned table and the partition not to be done which could result in crashes if the column attnums didn't match exactly. The fix is simple. We now check that the ri_RootResultRelInfo matches what the caller passed to ExecGetTriggerResultRel() and only return a cached ResultRelInfo when the ri_RootResultRelInfo matches what the caller wants, otherwise we'll make a new one. Author: David Rowley <dgrowleyml@gmail.com> Author: Amit Langote <amitlangote09@gmail.com> Reported-by: Dmitry Fomin <fomin.list@gmail.com> Discussion: https://postgr.es/m/7DCE78D7-0520-4207-822B-92F60AEA14B4@gmail.com Backpatch-through: 15	2025-10-26 10:59:50 +13:00
Tom Lane	9f9a04368f	Fix off-by-one Asserts in FreePageBtreeInsertInternal/Leaf. These two functions expect there to be room to insert another item in the FreePageBtree's array, but their assertions were too weak to guarantee that. This has little practical effect granting that the callers are not buggy, but it seems to be misleading late-model Coverity into complaining about possible array overrun. Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/799984.1761150474@sss.pgh.pa.us Backpatch-through: 13	2025-10-23 12:32:06 -04:00
Amit Kapila	f0b3573c3a	Introduce "REFRESH SEQUENCES" for subscriptions. This patch adds support for a new SQL command: ALTER SUBSCRIPTION ... REFRESH SEQUENCES This command updates the sequence entries present in the pg_subscription_rel catalog table with the INIT state to trigger resynchronization. In addition to the new command, the following subscription commands have been enhanced to automatically refresh sequence mappings: ALTER SUBSCRIPTION ... REFRESH PUBLICATION ALTER SUBSCRIPTION ... ADD PUBLICATION ALTER SUBSCRIPTION ... DROP PUBLICATION ALTER SUBSCRIPTION ... SET PUBLICATION These commands will perform the following actions: Add newly published sequences that are not yet part of the subscription. Remove sequences that are no longer included in the publication. This ensures that sequence replication remains aligned with the current state of the publication on the publisher side. Note that the actual synchronization of sequence data/values will be handled in a subsequent patch that introduces a dedicated sequence sync worker. Author: Vignesh C <vignesh21@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com> Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com	2025-10-23 08:30:27 +00:00
Fujii Masao	abc2b71383	Add comments explaining overflow entries in the replication lag tracker. Commit `883a95646a` introduced overflow entries in the replication lag tracker to fix an issue where lag columns in pg_stat_replication could stall when the replay LSN stopped advancing. This commit adds comments clarifying the purpose and behavior of overflow entries to improve code readability and understanding. Since commit `883a95646a` was recently applied and backpatched to all supported branches, this follow-up commit is also backpatched accordingly. Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CABPTF7VxqQA_DePxyZ7Y8V+ErYyXkmwJ1P6NC+YC+cvxMipWKw@mail.gmail.com Backpatch-through: 13	2025-10-23 13:24:56 +09:00
Tatsuo Ishii	20628b62e4	Fix coding style with "else". The "else" code block having single statement with comments on a separate line should have been surrounded by braces. Reported-by: Chao Li <lic@highgo.com> Suggested-by: David Rowley <dgrowleyml@gmail.com> Author: Tatsuo Ishii <ishii@postgresql.org> Discussion: https://postgr.es/m/20251020.125847.997839131426057290.ishii%40postgresql.org	2025-10-23 10:58:41 +09:00
David Rowley	6911f80379	Fix incorrect zero extension of Datum in JIT tuple deform code When JIT deformed tuples (controlled via the jit_tuple_deforming GUC), types narrower than sizeof(Datum) would be zero-extended up to Datum width. This wasn't the same as what fetch_att() does in the standard tuple deforming code. Logically the values are the same when fetching via the DatumGet*() macros, but negative numbers are not the same in binary form. In the report, the problem was manifesting itself with: ERROR: could not find memoization table entry in a query which had a "Cache Mode: binary" Memoize node. However, it's currently unclear what else is affected. Anything that uses datum_image_eq() or datum_image_hash() on a Datum from a tuple deformed by JIT could be affected, but it may not be limited to that. The fix for this is simple: use signed extension instead of zero extension. Many thanks to Emmanuel Touzery for reporting this issue and providing steps and backup which allowed the problem to easily be recreated. Reported-by: Emmanuel Touzery <emmanuel.touzery@plandela.si> Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/DB8P194MB08532256D5BAF894F241C06393F3A@DB8P194MB0853.EURP194.PROD.OUTLOOK.COM Backpatch-through: 13	2025-10-23 13:11:02 +13:00
Tom Lane	fe9c051fd3	Avoid assuming that time_t can fit in an int. We had several places that used cast-to-unsigned-int as a substitute for properly checking for overflow. Coverity has started objecting to that practice as likely introducing Y2038 bugs. An extra comparison is surely not much compared to the cost of time(NULL), nor is this coding practice particularly readable. Let's do it honestly, with explicit logic covering the cases of first-time-through and clock-went-backwards. I don't feel a need to back-patch though: our released versions will be out of support long before 2038, and besides which I think the code would accidentally work anyway for another 70 years or so.	2025-10-22 17:50:11 -04:00
Nathan Bossart	4c5e1d0785	Remove make_temptable_name_n(). This small function is only used in one place, and it fails to handle quoted table names (although the table name portion of the input should never be quoted in current usage). In addition to removing make_temptable_name_n() in favor of open-coding it where needed, this commit ensures the "diff" table name is properly quoted in order to future-proof this area a bit. Author: Aleksander Alekseev <aleksander@tigerdata.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Shinya Kato <shinya11.kato@gmail.com> Discussion: https://postgr.es/m/CAJ7c6TO3a5q2NKRsjdJ6sLf8isVe4aMaaX1-Hj2TdHdhFw8zRA%40mail.gmail.com	2025-10-22 12:31:55 -05:00
Fujii Masao	f33e60a53a	Make invalid primary_slot_name follow standard GUC error reporting. Previously, if primary_slot_name was set to an invalid slot name and the configuration file was reloaded, both the postmaster and all other backend processes reported a WARNING. With many processes running, this could produce a flood of duplicate messages. The problem was that the GUC check hook for primary_slot_name reported errors at WARNING level via ereport(). This commit changes the check hook to use GUC_check_errdetail() and GUC_check_errhint() for error reporting. As with other GUC parameters, this causes non-postmaster processes to log the message at DEBUG3, so by default, only the postmaster's message appears in the log file. Backpatch to all supported versions. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Chao Li <lic@highgo.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Discussion: https://postgr.es/m/CAHGQGwFud-cvthCTfusBfKHBS6Jj6kdAPTdLWKvP2qjUX6L_wA@mail.gmail.com Backpatch-through: 13	2025-10-22 20:09:43 +09:00
Tatsuo Ishii	2d7b247cb4	Fix multi WinGetFuncArgInFrame/Partition calls with IGNORE NULLS. Previously it was mistakenly assumed that there's only one window function argument which needs to be processed by WinGetFuncArgInFrame or WinGetFuncArgInPartition when IGNORE NULLS option is specified. To eliminate the limitation, WindowObject->notnull_info is modified from "uint8 *" to "uint8 **" so that WindowObject->notnull_info could store pointers to "uint8 *" which holds NOT NULL info corresponding to each window function argument. Moreover, WindowObject->num_notnull_info is changed from "int" to "int64 *" so that WindowObject->num_notnull_info could store the number of NOT NULL info corresponding to each function argument. Memories for these data structures will be allocated when WinGetFuncArgInFrame or WinGetFuncArgInPartition is called. Thus no memory except the pointers is allocated for function arguments which do not call these functions. Also fix the set mark position logic in WinGetFuncArgInPartition to not raise a "cannot fetch row before WindowObject's mark position" error in IGNORE NULLS case. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Author: Tatsuo Ishii <ishii@postgresql.org> Discussion: https://postgr.es/m/2952409.1760023154%40sss.pgh.pa.us	2025-10-22 12:06:33 +09:00
Fujii Masao	883a95646a	Fix stalled lag columns in pg_stat_replication when replay LSN stops advancing. Previously, when the replay LSN reported in feedback messages from a standby stopped advancing, for example, due to a recovery conflict, the write_lag and flush_lag columns in pg_stat_replication would initially update but then stop progressing. This prevented users from correctly monitoring replication lag. The problem occurred because when any LSN stopped updating, the lag tracker's cyclic buffer became full (the write head reached the slowest read head). In that state, the lag tracker could no longer compute round-trip lag values correctly. This commit fixes the issue by handling the slowest read entry (the one causing the buffer to fill) as a separate overflow entry and freeing space so the write and other read heads can continue advancing in the buffer. As a result, write_lag and flush_lag now continue updating even if the reported replay LSN remains stalled. Backpatch to all supported versions. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Chao Li <lic@highgo.com> Reviewed-by: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/CAHGQGwGdGQ=1-X-71Caee-LREBUXSzyohkoQJd4yZZCMt24C0g@mail.gmail.com Backpatch-through: 13	2025-10-22 11:27:15 +09:00
Michael Paquier	2b75c38b70	Add error_on_null(), checking if the input is the null value This polymorphic function produces an error if the input value is detected as being the null value; otherwise it returns the input value unchanged. This function can for example become handy in SQL function bodies, to enforce that exactly one row was returned. Author: Joel Jacobson <joel@compiler.org> Reviewed-by: Vik Fearing <vik@postgresfriends.org> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/ece8c6d1-2ab1-45d5-ba12-8dec96fc8886@app.fastmail.com Discussion: https://postgr.es/m/de94808d-ed58-4536-9e28-e79b09a534c7@app.fastmail.com	2025-10-22 09:55:17 +09:00
David Rowley	2470ca435c	Use CompactAttribute more often, when possible `5983a4cff` added CompactAttribute for storing commonly used fields from FormData_pg_attribute. `5983a4cff` didn't go to the trouble of adjusting every location where we can use CompactAttribute rather than FormData_pg_attribute, so here we change the remaining ones. There are some locations where I've left the code using FormData_pg_attribute. These are mostly in the ALTER TABLE code. Using CompactAttribute here seems more risky as often the TupleDesc is being changed and those changes may not have been flushed to the CompactAttribute yet. I've also left record_recv(), record_send(), record_cmp(), record_eq() and record_image_eq() alone as it's not clear to me that accessing the CompactAttribute is a win here due to the FormData_pg_attribute still having to be accessed for most cases. Switching the relevant parts to use CompactAttribute would result in having to access both for common cases. Careful benchmarking may reveal that something can be done to make this better, but in absence of that, the safer option is to leave these alone. In ReorderBufferToastReplace(), there was a check to skip attnums < 0 while looping over the TupleDesc. Doing this is redundant since TupleDescs don't store < 0 attnums. Removing that code allows us to move to using CompactAttribute. The change in validateDomainCheckConstraint() just moves fetching the FormData_pg_attribute into the ERROR path, which is cold due to calling errstart_cold() and results in code being moved out of the common path. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAApHDvrMy90o1Lgkt31F82tcSuwRFHq3vyGewSRN=-QuSEEvyQ@mail.gmail.com	2025-10-22 11:36:26 +13:00
Jeff Davis	ff53907c35	Make char2wchar() static. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com	2025-10-21 09:32:12 -07:00
Jeff Davis	844385d12e	Remove obsolete global database_ctype_is_c. Now that tsearch uses the database default locale, there's no need to track the database CTYPE separately. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com	2025-10-21 09:32:04 -07:00
Jeff Davis	e113f9c102	tsearch: use database default collation for parsing. Previously, tsearch used the database's CTYPE setting, which only matches the database default collation if the locale provider is libc. Note that tsearch types (tsvector and tsquery) are not collatable types. The locale affects parsing the original text, which is a lossy process, so a COLLATE clause on the already-parsed value would not make sense. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com	2025-10-21 09:31:49 -07:00
Nathan Bossart	e94a7afe44	Re-pgindent brin.c. Backpatch-through: 13	2025-10-21 09:56:26 -05:00
Álvaro Herrera	b7cc6474e9	Make smgr access for a BufferManagerRelation safer in relcache inval Currently there's no bug, because we have no code path where we invalidate relcache entries where it'd cause a problem. But it's more robust to do it this way in case we introduce such a path later, as some Postgres forks reportedly already have. Author: Daniil Davydov <3danissimo@gmail.com> Reviewed-by: Stepan Neretin <slpmcf@gmail.com> Discussion: https://postgr.es/m/CAJDiXgj3FNzAhV+jjPqxMs3jz=OgPohsoXFj_fh-L+nS+13CKQ@mail.gmail.com	2025-10-21 10:51:55 +03:00
David Rowley	9fd29d7ff4	Fix BRIN 32-bit counter wrap issue with huge tables A BlockNumber (32-bit) might not be large enough to add bo_pagesPerRange to when the table contains close to 2^32 pages. At worst, this could result in a cancellable infinite loop during the BRIN index scan with power-of-2 pagesPerRange, and slow (inefficient) BRIN index scans and scanning of unneeded heap blocks for non power-of-2 pagesPerRange. Backpatch to all supported versions. Author: sunil s <sunilfeb26@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAOG6S4-tGksTQhVzJM19NzLYAHusXsK2HmADPZzGQcfZABsvpA@mail.gmail.com Backpatch-through: 13	2025-10-21 20:46:14 +13:00
Michael Paquier	e4e496e88c	Fix comment in pg_get_shmem_allocations_numa() The comment fixed in this commit described the function as dealing with database blocks, but in reality it processes shared memory allocations. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aH4DDhdiG9Gi0rG7@ip-10-97-1-34.eu-west-3.compute.internal Backpatch-through: 18	2025-10-21 16:12:30 +09:00
Richard Guo	18d2614093	Fix pushdown of degenerate HAVING clauses `67a54b9e8` taught the planner to push down HAVING clauses even when grouping sets are present, as long as the clause does not reference any columns that are nullable by the grouping sets. However, there was an oversight: if any empty grouping sets are present, the aggregation node can produce a row that did not come from the input, and pushing down a HAVING clause in this case may cause us to fail to filter out that row. Currently, non-degenerate HAVING clauses are not pushed down when empty grouping sets are present, since the empty grouping sets would nullify the vars they reference. However, degenerate (variable-free) HAVING clauses are not subject to this restriction and may be incorrectly pushed down. To fix, explicitly check for the presence of empty grouping sets and retain degenerate clauses in HAVING when they are present. This ensures that we don't emit a bogus aggregated row. A copy of each such clause is also put in WHERE so that query_planner() can use it in a gating Result node. To facilitate this check, this patch expands the groupingSets tree of the query to a flat list of grouping sets before applying the HAVING pushdown optimization. This does not add any additional planning overhead, since we need to do this expansion anyway. In passing, make a small tweak to preprocess_grouping_sets() by reordering its initial operations a bit. Backpatch to v18, where this issue was introduced. Reported-by: Yuhang Qiu <iamqyh@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/0879D9C9-7FE2-4A20-9593-B23F7A0B5290@gmail.com Backpatch-through: 18	2025-10-21 12:35:36 +09:00
Masahiko Sawada	4bea91f21f	Support COPY TO for partitioned tables. Previously, COPY TO command didn't support directly specifying partitioned tables so users had to use COPY (SELECT ...) TO variant. This commit adds direct COPY TO support for partitioned tables, improving both usability and performance. Performance tests show it's faster than the COPY (SELECT ...) TO variant as it avoids the overheads of query processing and sending results to the COPY TO command. When used with partitioned tables, COPY TO copies the same rows as SELECT * FROM table. Row-level security policies of the partitioned table are applied in the same way as when executing COPY TO on a plain table. Author: jian he <jian.universality@gmail.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Melih Mutlu <m.melihmutlu@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CACJufxEZt%2BG19Ors3bQUq-42-61__C%3Dy5k2wk%3DsHEFRusu7%3DiQ%40mail.gmail.com	2025-10-20 10:38:52 -07:00
Tom Lane	92cf557ffa	Add static assertion that RELSEG_SIZE fits in an int. Our configure script intended to ensure this, but it supposed that expr(1) would report an error for integer overflow. Maybe that was true when the code was written (commit `3c6248a82` of 2008-05-02), but all the modern expr's I tried will deliver bigger-than-int32 results without complaint. Moreover, if you use --with-segsize-blocks then there's no check at all. Ideally we'd add a test in configure itself to check that the value fits in int, but to do that we'd need to suppose that test(1) handles bigger-than-int32 numbers correctly. Probably modern ones do, but that's an assumption I could do without; and I'm not too trusting about meson either. Instead, let's install a static assertion, so that even people who ignore all the compiler warnings you get from such values will be forced to confront the fact that it won't work. This has been hazardous for awhile, but given that we hadn't heard a complaint about it till now, I don't feel a need to back-patch. Reported-by: Casey Shobe <casey.allen.shobe@icloud.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/C5DC82D6-C76D-4E8F-BC2E-DF03EFC4FA24@icloud.com	2025-10-19 18:28:46 -04:00
Tatsuo Ishii	dd766a441d	Fix Coverity issue reported in commit `2273fa32bc`. Coverity complains that the return value from gettuple_eval_partition (stored in variable "datum") in a do..while loop in WinGetFuncArgInPartition is overwritten when exiting the while loop. This commit tries to fix the issue by changing the gettuple_eval_partition call to: (void) gettuple_eval_partition() explicitly stating that we discard the return value. We are just interested in whether we are inside or outside of partition, NULL or NOT NULL here. Also enhance some comments for easier code reading. Reported-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aPCOabSE4VfJLaky%40paquier.xyz	2025-10-19 09:29:26 +09:00
Jeff Davis	e533524b23	Add pg_database_locale() to retrieve database default locale. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com	2025-10-18 16:25:23 -07:00
Jeff Davis	67a8b49e96	Add pg_iswxdigit(), useful for tsearch. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com	2025-10-18 16:25:11 -07:00
David Rowley	e3b9e44689	Tidyup truncate_useless_pathkeys() function This removes a few static functions and replaces them with 2 functions which aim to be more reusable. The upper planner's pathkey requirements can be simplified down to operations which require pathkeys in the same order as the pathkeys for the given operation, and operations which can make use of a Path's pathkeys in any order. Here we also add some short-circuiting to truncate_useless_pathkeys(). At any point we discover that all pathkeys are useful to a single operation, we can stop checking the remaining operations as we're not going to be able to find any further useful pathkeys - they're all possibly useful already. Adjusting this seems to warrant trying to put the checks roughly in order of least-expensive-first so that the short-circuits have the most chance of skipping the more expensive checks. In passing clean up has_useful_pathkeys() as it seems to have grown a redundant check for group_pathkeys. This isn't needed as standard_qp_callback will set query_pathkeys if there's any requirement to have group_pathkeys. All this code does is waste run-time effort and take up needless space. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAApHDvpbsEoTksvW5901MMoZo-hHf78E5up3uDOfkJnxDe_WAw@mail.gmail.com	2025-10-19 10:13:13 +13:00
David Rowley	5c0a20003b	Fix reset of incorrect hash iterator in GROUPING SETS queries This fixes an unlikely issue when fetching GROUPING SET results from their internally stored hash tables. It was possible in rare cases that the hash iterator would be set up incorrectly which could result in a crash. This was introduced in `4d143509c`, so backpatch to v18. Many thanks to Yuri Zamyatin for reporting and helping to debug this issue. Bug: #19078 Reported-by: Yuri Zamyatin <yuri@yrz.am> Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/19078-dfd62f840a2c0766@postgresql.org Backpatch-through: 18	2025-10-18 16:07:04 +13:00
Tomas Vondra	b85c4700fc	Fix hashjoin memory balancing logic Commit `a1b4f289be` improved the hashjoin sizing to also consider the memory used by BufFiles for batches. The code however had multiple issues, making it ineffective or not working as expected in some cases. * The amount of memory needed by buffers was calculated using uint32, so it would overflow for nbatch >= 262144. If this happened the loop would exit prematurely and the memory usage would not be reduced. The nbatch overflow is fixed by reworking the condition to not use a multiplication at all, so there's no risk of overflow. An explicit cast was added to a similar calculation in ExecHashIncreaseBatchSize. * The loop adjusting the nbatch value used hash_table_bytes to calculate the old/new size, but then updated only space_allowed. The consequence is the total memory usage was not reduced, but all the memory saved by reducing the number of batches was used for the internal hash table. This was fixed by using only space_allowed. This is also more correct, because hash_table_bytes does not account for skew buckets. * The code was also doubling multiple parameters (e.g. the number of buckets for hash table), but was missing overflow protections. The loop now checks for overflow, and terminates if needed. It'd be possible to cap the value and continue the loop, but it's not worth the complexity. And the overflow implies the in-memory hash table is already very large anyway. While at it, rework the comment explaining how the memory balancing works, to make it more concise and easier to understand. The initial nbatch overflow issue was reported by Vaibhav Jain. The other issues were noticed by me and Melanie Plageman. Fix by me, with a lot of review and feedback by Melanie. Backpatch to 18, where the hashjoin memory balancing was introduced. Reported-by: Vaibhav Jain <jainva@google.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Backpatch-through: 18 Discussion: https://postgr.es/m/CABa-Az174YvfFq7rLS+VNKaQyg7inA2exvPWmPWqnEn6Ditr_Q@mail.gmail.com	2025-10-17 22:21:50 +02:00
Peter Eisentraut	e1a912c86d	Change config_generic.vartype to be initialized at compile time Previously, this was initialized at run time so that it did not have to be maintained by hand in guc_tables.c. But since that table is now generated anyway, we might as well generate this bit as well. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org	2025-10-17 10:33:54 +02:00
Peter Eisentraut	0a7bde4610	Use designated initializers for guc_tables This makes the generating script simpler and the output easier to read. In the future, it will make it easier to reorder and rearrange the underlying C structures. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org	2025-10-17 10:29:42 +02:00
Daniel Gustafsson	6aa184c80f	Replace defunct URL with stable archive.org URL in rbtree.c The URL for "Sorting and Searching Algorithms: A Cookbook" by Thomas Niemann has started returning 404, and since we refer to the page for license terms this replaces the now defunct link with one to the copy on archive.org. Author: Chao Li <lic@highgo.com> Discussion: https://postgr.es/m/6DED3DEF-875E-4D1D-8F8F-7353D5AF7B79@gmail.com	2025-10-17 09:38:49 +02:00
Nathan Bossart	812221b204	Remove partColsUpdated. This information appears to have been unused since commit `c5b7ba4e67`. We could not find any references in third-party code, either. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aO_CyFRpbVMtgJWM%40nathan	2025-10-16 11:31:38 -05:00
Amit Kapila	41c674d2e3	Refactor logical worker synchronization code into a separate file. To support the upcoming addition of a sequence synchronization worker, this patch extracts common synchronization logic shared by table sync workers and the new sequence sync worker into a dedicated file. This modularization improves code reuse, maintainability, and clarity in the logical workers framework. Author: vignesh C <vignesh21@gmail.com> Author: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com	2025-10-16 05:10:50 +00:00
Amit Langote	905e932f09	Fix EPQ crash from missing partition directory in EState EvalPlanQualStart() failed to propagate es_partition_directory into the child EState used for EPQ rechecks. When execution time partition pruning ran during the EPQ scan, executor code dereferenced a NULL partition directory and crashed. Previously, propagating es_partition_directory into the EPQ EState was unnecessary because CreatePartitionPruneState(), which sets it on demand, also initialized the exec-pruning context. After commit `d47cbf474`, CreatePartitionPruneState() now initializes only the init- time pruning context, leaving exec-pruning context initialization to ExecInitNode(). Since EvalPlanQualStart() runs only ExecInitNode() and not CreatePartitionPruneState(), it can encounter a NULL es_partition_directory. Other executor fields initialized during CreatePartitionPruneState() are already copied into the child EState thanks to commit `8741e48e5d`, but es_partition_directory was missed. Fix by borrowing the parent estate's es_partition_directory in EvalPlanQualStart(), and by clearing that field in EvalPlanQualEnd() so the parent remains responsible for freeing the directory. Add an isolation test permutation that triggers EPQ with execution- time partition pruning, the case that reproduces this crash. Bug: #19078 Reported-by: Yuri Zamyatin <yuri@yrz.am> Diagnosed-by: David Rowley <dgrowleyml@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Co-authored-by: Amit Langote <amitlangote09@gmail.com> Discussion: https://postgr.es/m/19078-dfd62f840a2c0766@postgresql.org Backpatch-through: 18	2025-10-16 14:01:44 +09:00
Nathan Bossart	079480dc20	Fix lookup code for REINDEX INDEX. This commit adjusts RangeVarCallbackForReindexIndex() to handle an extremely unlikely race condition involving concurrent OID reuse. In short, if REINDEX INDEX is executed at the same time that the index is re-created with the same name and OID but a different parent table OID, we might lock the wrong parent table. To fix, simply detect when this happens and emit an ERROR. Unfortunately, we can't gracefully handle this situation because we will have already locked the index, and we must lock the parent table before the index to avoid deadlocks. While at it, I've replaced all but one early return in this callback function with ERRORs that should be unreachable. While I haven't verified the presence of a live bug, the checks in question appear to be unnecessary, and the early returns seem prone to breaking the parent table locking code in subtle ways. If nothing else, this simplifies the code a bit. This is a bug fix and could be back-patched, but given the presumed rarity of the race condition and the lack of reports, I'm not going to bother. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/Z8zwVmGzXyDdkAXj%40nathan	2025-10-15 16:32:40 -05:00
Jeff Davis	af164f31b9	Add pg_iswalpha() and related functions. Per-character pg_locale_t APIs. Useful for tsearch parsing and potentially other places. Significant overlap with the regc_wc_isalpha() and related functions in regc_pg_locale.c, but this change leaves those intact for now. Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com	2025-10-15 12:54:01 -07:00
Nathan Bossart	688dc6299a	Fix lookups in pg_{clear,restore}_{attribute,relation}_stats(). Presently, these functions look up the relation's OID, lock it, and then check privileges. Not only does this approach provide no guarantee that the locked relation matches the arguments of the lookup, but it also allows users to briefly lock relations for which they do not have privileges, which might enable denial-of-service attacks. This commit adjusts these functions to use RangeVarGetRelidExtended(), which is purpose-built to avoid both of these issues. The new RangeVarGetRelidCallback function is somewhat complicated because it must handle both tables and indexes, and for indexes, we must check privileges on the parent table and lock it first. Also, it needs to handle a couple of extremely unlikely race conditions involving concurrent OID reuse. A downside of this change is that the coding doesn't allow for locking indexes in AccessShare mode anymore; everything is locked in ShareUpdateExclusive mode. Per discussion, the original choice of lock levels was intended for a now defunct implementation that used in-place updates, so we believe this change is okay. Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/Z8zwVmGzXyDdkAXj%40nathan Backpatch-through: 18	2025-10-15 12:47:33 -05:00
Peter Eisentraut	5f4c3b33a9	Change reset_extra into a config_generic common field This is not specific to the GUC parameter type, so it can be part of the generic struct rather than the type-specific struct (like the related "extra" field). This allows for some code simplifications. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org	2025-10-15 15:20:28 +02:00
Peter Eisentraut	dd3ae37830	Add log_autoanalyze_min_duration The log output functionality of log_autovacuum_min_duration applies to both VACUUM and ANALYZE, so it is not possible to separate the VACUUM and ANALYZE log output thresholds. Logs are likely to be output only for VACUUM and not for ANALYZE. Therefore, we decided to separate the threshold for log output of VACUUM by autovacuum (log_autovacuum_min_duration) and the threshold for log output of ANALYZE by autovacuum (log_autoanalyze_min_duration). Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Kasahara Tatsuhito <kasaharatt@oss.nttdata.com> Discussion: https://www.postgresql.org/message-id/flat/CAOzEurQtfV4MxJiWT-XDnimEeZAY+rgzVSLe8YsyEKhZcajzSA@mail.gmail.com	2025-10-15 14:31:12 +02:00
Peter Eisentraut	29dc7a6687	Add some const qualifiers in guc-related source files, in anticipation of some further restructuring. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org	2025-10-15 10:05:53 +02:00
Peter Eisentraut	1a79518888	Modernize some for loops in guc-related source files, in anticipation of some further restructuring. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org	2025-10-15 10:00:37 +02:00
Amit Kapila	2436b8c047	Standardize use of REFRESH PUBLICATION in code and messages. This patch replaces ALTER SUBSCRIPTION REFRESH with ALTER SUBSCRIPTION REFRESH PUBLICATION in comments and error messages to improve clarity and support future extensibility. The change aligns with the upcoming addition of REFRESH SEQUENCES for sequence synchronization. Author: vignesh C <vignesh21@gmail.com> Author: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com	2025-10-15 03:42:27 +00:00
Melanie Plageman	3e4705484e	Make heap_page_is_all_visible independent of LVRelState This function only requires a few fields from LVRelState, so pass them in individually. This change allows calling heap_page_is_all_visible() from code such as pruneheap.c, which does not have access to an LVRelState. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek	2025-10-14 17:43:41 -04:00
Melanie Plageman	43b05b38ea	Inline TransactionIdFollows/Precedes[OrEquals]() These functions appeared prominently in a profile of a patch that sets the visibility map on-access. Inline them to remove call overhead and make them cheaper to use in hot paths. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek	2025-10-14 17:03:48 -04:00
Melanie Plageman	c8dd6542ba	Add helper for freeze determination to heap_page_prune_and_freeze After scanning the line pointers on a heap page during the first phase of vacuum, we use the information collected to decide whether to use the assembled freeze plans. Move this decision logic into a helper function to improve readability. While here, rename a PruneState member and disambiguate some local variables in heap_page_prune_and_freeze(). Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek	2025-10-14 15:08:50 -04:00
Jeff Davis	8efe982fe2	pg_regc_locale.c: rename some static functions. Use the more specific prefix "regc_" rather than the generic prefix "pg_". A subsequent commit will create generic versions of some of these functions that can be called from other modules. Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com	2025-10-14 11:04:04 -07:00
Tatsuo Ishii	5f3808646f	Use ereport rather than elog in WinCheckAndInitializeNullTreatment. Previously WinCheckAndInitializeNullTreatment() used elog() to emit an error message. ereport() should be used instead because it's a user-facing error. Also use the existing get_func_name() to get a function's name, rather than its own implementation. Moreover, add an assertion to validate the winobj parameter, just like the other window function APIs. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Author: Tatsuo Ishii <ishii@postgresql.org> Reviewed-by: Chao Li <lic@highgo.com> Discussion: https://postgr.es/m/2952409.1760023154%40sss.pgh.pa.us	2025-10-14 19:15:24 +09:00
Richard Guo	1206df04c2	Rename apply_at to apply_agg_at for clarity The field name "apply_at" in RelAggInfo was a bit ambiguous. Rename it to "apply_agg_at" to improve clarity and make its purpose clearer. Per complaint from David Rowley, Robert Haas. Suggested-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CA+TgmoZ0KR2_XCWHy17=HHcQ3p2Mamc9c6Dnnhf1J6wPYFD9ng@mail.gmail.com	2025-10-14 16:35:22 +09:00
Melanie Plageman	add323da40	Eliminate XLOG_HEAP2_VISIBLE from vacuum phase III Instead of emitting a separate XLOG_HEAP2_VISIBLE WAL record for each page that becomes all-visible in vacuum's third phase, specify the VM changes in the already emitted XLOG_HEAP2_PRUNE_VACUUM_CLEANUP record. Visibility checks are now performed before marking dead items unused. This is safe because the heap page is held under exclusive lock for the entire operation. This reduces the number of WAL records generated by VACUUM phase III by up to 50%. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com	2025-10-13 18:01:06 -04:00
Tom Lane	03bf7a12c5	Fix incorrect message-printing in win32security.c. log_error() would probably fail completely if used, and would certainly print garbage for anything that needed to be interpolated into the message, because it was failing to use the correct printing subroutine for a va_list argument. This bug likely went undetected because the error cases this code is used for are rarely exercised - they only occur when Windows security API calls fail catastrophically (out of memory, security subsystem corruption, etc). The FRONTEND variant can be fixed just by calling vfprintf() instead of fprintf(). However, there was no va_list variant of write_stderr(), so create one by refactoring that function. Following the usual naming convention for such things, call it vwrite_stderr(). Author: Bryan Green <dbryan.green@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAF+pBj8goe4fRmZ0V3Cs6eyWzYLvK+HvFLYEYWG=TzaM+tWPnw@mail.gmail.com Backpatch-through: 13	2025-10-13 17:56:45 -04:00
Peter Geoghegan	7a662a46eb	Remove unused nbtree array advancement variable. Remove a variable that is no longer in use following commit `9a2e2a28`. It's not immediately clear why there were no compiler warnings about this oversight. Author: Peter Geoghegan <pg@bowt.ie> Backpatch-through: 18	2025-10-12 14:04:08 -04:00
Álvaro Herrera	3231fd0455	Stop creating constraints during DETACH CONCURRENTLY Commit `71f4c8c6f7` (which implemented DETACH CONCURRENTLY) added code to create a separate table constraint when a table is detached concurrently, identical to the partition constraint, on the theory that such a constraint was needed in case the optimizer had constructed any query plans that depended on the constraint being there. However, that theory was apparently bogus because any such plans would be invalidated. For hash partitioning, those constraints are problematic, because their expressions reference the OID of the parent partitioned table, to which the detached table is no longer related; this causes all sorts of problems (such as the inability to restore a pg_dump of that table, and the table no longer working properly if the partitioned table is later dropped). We'd like to get rid of all those constraints. In fact, for branch master, do that -- no longer create any substitute constraints. However, out of fear that some users might somehow depend on these constraints for other partitioning strategies, for stable branches (back to 14, which added DETACH CONCURRENTLY), only do it for hash partitioning. (If you repeatedly DETACH CONCURRENTLY and then ATTACH a partition, then with this constraint addition you don't need to scan the table in the ATTACH step, which presumably is good. But if users really valued this feature, they would have requested that it worked for non-concurrent DETACH also.) Author: Haiyang Li <mohen.lhy@alibaba-inc.com> Reported-by: Fei Changhong <feichanghong@qq.com> Reported-by: Haiyang Li <mohen.lhy@alibaba-inc.com> Backpatch-through: 14 Discussion: https://postgr.es/m/18371-7fef49f63de13f02@postgresql.org Discussion: https://postgr.es/m/19070-781326347ade7c57@postgresql.org	2025-10-11 20:30:12 +02:00
Álvaro Herrera	ff47f9c16c	dbase_redo: Fix Valgrind-reported memory leak Introduced by my (Álvaro's) commit `9e4f914b5e`, which was itself backpatched to pg10, though only pg15 and up contain the problem because of commit `9c08aea6a3`. This isn't a particularly significant leak, but given the fix is trivial, we might as well backpatch to all branches where it applies, so do that. Author: Nathan Bossart <nathandbossart@gmail.com> Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/x4odfdlrwvsjawscnqsqjpofvauxslw7b4oyvxgt5owoyf4ysn@heafjusodrz7	2025-10-11 16:39:22 +02:00
Peter Geoghegan	843e50208a	Remove overzealous _bt_killitems assertion. An assertion in _bt_killitems expected the scan's currPos state to contain a valid LSN, saved from when currPos's page was initially read. The assertion failed to account for the fact that even logged relations can have leaf pages with an invalid LSN when built with wal_level set to "minimal". Remove the faulty assertion. Oversight in commit `e6eed40e` (though note that the assertion was backpatched to stable branches before 18 by commit `7c319f54`). Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Matthijs van der Vleuten <postgresql@zr40.nl> Bug: #19082 Discussion: https://postgr.es/m/19082-628e62160dbbc1c1@postgresql.org Backpatch-through: 13	2025-10-10 14:52:25 -04:00
Michael Paquier	3a36543d7d	Fix two typos in xlogstats.h and xlogstats.c Issue found while browsing this area of the code, introduced and copy-pasted around by `2258e76f90`. Backpatch-through: 15	2025-10-10 11:51:45 +09:00
Michael Paquier	912af1c7e9	Remove state.tmp when failing to save a replication slot An error happening while a slot's data is saved to disk in SaveSlotToPath() could cause a state.tmp file (temporary file holding the slot state data, renamed to its permanent name at the end of the function) to remain around after it has been created. This temporary file is created with O_EXCL, meaning that if an existing state.tmp is found, its creation would fail. This would prevent the slot data from being saved, requiring manual intervention to remove state.tmp before the slot could be saved again. One possible scenario where this temporary file could remain on disk is an ENOSPC case (no disk space) while writing, syncing or renaming it. The bug reports point to a write failure as the principal cause of the problems. Using O_TRUNC was proposed back in 2019 as a potential solution to discard any temporary file that could exist. This solution was rejected as O_EXCL can also act as a safety measure when saving the slot state, crash recovery offering cleanup guarantees post-crash. This commit uses the alternative approach that has been suggested by Andres Freund back in 2019. When the temporary state file cannot be written, synced, closed or renamed (note: not when created!), an unlink() is used to remove the temporary state file while holding the in-progress I/O LWLock, so that any follow-up attempts to save a slot's data would not choke on an existing file that remained around because of a previous failure. This problem has been reported a few times across the years, going back to 2019, but for some reason I have never come back to do something about it and it has been forgotten. A recent report has reminded me that this was still a problem. Reported-by: Kevin K Biju <kevinkbiju@gmail.com> Reported-by: Sergei Kornilov <sk@zsrv.org> Reported-by: Grigory Smolkin <g.smolkin@postgrespro.ru> Discussion: https://postgr.es/m/CAM45KeHa32soKL_G8Vk38CWvTBeOOXcsxAPAs7Jt7yPRf2mbVA@mail.gmail.com Discussion: https://postgr.es/m/3559061693910326@qy4q4a6esb2lebnz.sas.yp-c.yandex.net Discussion: https://postgr.es/m/08bbfab1-a61d-3750-fc18-4ab2c1aa7f09@postgrespro.ru Backpatch-through: 13	2025-10-10 09:23:59 +09:00
Andres Freund	c819d1017d	bufmgr: Fix valgrind checking for buffers pinned in StrategyGetBuffer() In `5e89985928` I made StrategyGetBuffer() pin buffers with a single CAS, instead of using PinBuffer_Locked(). Unfortunately I missed that PinBuffer_Locked() marked the page as defined for valgrind. Fix this oversight by centralizing the valgrind initialization into TrackNewBufferPin(), which also allows us to reduce the number of places doing VALGRIND_MAKE_MEM_DEFINED. Per buildfarm animal skink and Amit Langote. Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff Discussion: https://postgr.es/m/CA+HiwqGKJ6nEXEPQW7EpykVsEtzxp5-up_xhtcUAkWFtATVQvQ@mail.gmail.com	2025-10-09 19:17:13 -04:00
Melanie Plageman	d96f87332b	Eliminate COPY FREEZE use of XLOG_HEAP2_VISIBLE Instead of emitting a separate WAL XLOG_HEAP2_VISIBLE record for setting bits in the VM, specify the VM block changes in the XLOG_HEAP2_MULTI_INSERT record. This halves the number of WAL records emitted by COPY FREEZE. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com	2025-10-09 16:29:01 -04:00
David Rowley	1b073cba49	Cleanup VACUUM option processing error messages The processing of the PARALLEL option for VACUUM was not quite following what the DefElem code had intended. defGetInt32() already has code to handle missing parameters and returns a perfectly good error message for when that happens. Here we get rid of the ExecVacuum() error: ERROR: parallel option requires a value between 0 and N and let defGetInt32() handle it, which will give: ERROR: parallel requires an integer value defGetInt32() was already handling the non-integer parameter case, so it may as well handle the missing parameter case too. Additionally, parameterize the option name to make translator work easier, and also use errhint_internal() rather than errhint() for the BUFFER_USAGE_LIMIT option since there isn't any work for a translator to do for "%s". Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAApHDvovH14tNWB+WvP6TSbfi7-=TysQ9h5tQ5AgavwyWRWKHA@mail.gmail.com	2025-10-10 09:25:23 +13:00
Tom Lane	89d57c1fb3	Clean up memory leakage that occurs in context callback functions. An error context callback function might leak some memory into ErrorContext, since those functions are run with ErrorContext as current context. In the case where the elevel is ERROR, this is no problem since the code level that catches the error should do FlushErrorState to clean up, and that will reset ErrorContext. However, if the elevel is less than ERROR then no such cleanup occurs. In principle, repeated leaks while emitting log messages or client notices could accumulate arbitrarily much leaked data, if no ERROR occurs in the session. To fix, let errfinish() perform an ErrorContext reset if it is at the outermost error nesting level. (If it isn't, we'll delay cleanup until the outermost nesting level is exited.) The only actual leakage of this sort that I've been able to observe within our regression tests was recently introduced by commit `f727b63e8`. While it seems plausible that there are other such leaks not reached in the regression tests, the lack of field reports suggests that they're not a big problem. Accordingly, I won't take the risk of back-patching this now. We can always back-patch later if we get field reports of leaks. Reported-by: Andres Freund <andres@anarazel.de> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/jngsjonyfscoont4tnwi2qoikatpd5hifsg373vmmjvugwiu6g@m6opxh7uisgd	2025-10-09 15:37:42 -04:00
Masahiko Sawada	b46efe9048	Fix access-to-already-freed-memory issue in pgoutput. While pgoutput caches relation synchronization information in RelationSyncCache that resides in CacheMemoryContext, each entry's information (such as row filter expressions and column lists) is stored in the entry's private memory context (entry_cxt in RelationSyncEntry), which is a descendant memory context of the decoding context. If a logical decoding invoked via SQL functions like pg_logical_slot_get_binary_changes fails with an error, subsequent logical decoding executions could access already-freed memory of the entry's cache, resulting in a crash. With this change, it's ensured that RelationSyncCache is cleaned up even in error cases by using a memory context reset callback function. Backpatch to 15, where entry_cxt was introduced for column filtering and row filtering. While the backbranches v13 and v14 have a similar issue where RelationSyncCache persists even after an error when pgoutput is used via SQL API, we decided not to backport this fix. This decision was made because v13 is approaching its final minor release, and we won't have a chance to fix any new issues that might arise. Additionally, since using pgoutput via SQL API is not a common use case, the risk outweighs the benefit. If we receive bug reports, we can consider backporting the fixes then. Author: vignesh C <vignesh21@gmail.com> Co-authored-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Discussion: https://postgr.es/m/CALDaNm0x-aCehgt8Bevs2cm=uhmwS28MvbYq1=s2Ekf0aDPkOA@mail.gmail.com Backpatch-through: 15	2025-10-09 10:59:27 -07:00
Tom Lane	71540dcdcb	Avoid uninitialized-variable warnings from older compilers. Some of the buildfarm is still unhappy with WinGetFuncArgInPartition even after `2273fa32b`. While it seems to be just very old compilers, we can suppress the warnings and arguably make the code more readable by not initializing these variables till closer to where they are used. While at it, make a couple of cosmetic comment improvements.	2025-10-09 10:33:55 -04:00
Richard Guo	f997d777ad	Remove unnecessary include of "utils/fmgroids.h" In initsplan.c, no macros for built-in function OIDs are used, so this include is unnecessary and can be removed. This was my oversight in commit `8e1185910`. Discussion: https://postgr.es/m/CAMbWs4_-sag-cAKrLJ+X+5njL1=oudk=+KfLmsLZ5a2jckn=kg@mail.gmail.com	2025-10-09 17:49:20 +09:00
Amit Kapila	96b3784973	Add "ALL SEQUENCES" support to publications. This patch adds support for the ALL SEQUENCES clause in publications, enabling synchronization/replication of all sequences that is useful for upgrades. Publications can now include all sequences via FOR ALL SEQUENCES. psql enhancements: \d shows publications for a given sequence. \dRp indicates if a publication includes all sequences. ALL SEQUENCES can be combined with ALL TABLES, but not with other options like TABLE or TABLES IN SCHEMA. We can extend support for more granular clauses in future. The view pg_publication_sequences provides information about the mapping between publications and sequences. This patch enables publishing of sequences; subscriber-side support will be added in upcoming patches. Author: vignesh C <vignesh21@gmail.com> Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com	2025-10-09 03:48:54 +00:00
Amit Langote	ef5e60a9d3	Fix internal error from CollateExpr in SQL/JSON DEFAULT expressions SQL/JSON functions such as JSON_VALUE could fail with "unrecognized node type" errors when a DEFAULT clause contained an explicit COLLATE expression. That happened because assign_collations_walker() could invoke exprSetCollation() on a JsonBehavior expression whose DEFAULT still contained a CollateExpr, which exprSetCollation() does not handle. For example: SELECT JSON_VALUE('{"a":1}', '$.c' RETURNING text DEFAULT 'A' COLLATE "C" ON EMPTY); Fix by validating in transformJsonBehavior() that the DEFAULT expression's collation matches the enclosing JSON expression’s collation. In exprSetCollation(), replace the recursive call on the JsonBehavior expression with an assertion that its collation already matches the target, since the parser now enforces that condition. Reported-by: Jian He <jian.universality@gmail.com> Author: Jian He <jian.universality@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Discussion: https://postgr.es/m/CACJufxHVwYYSyiVQ6o+PsRX6zQ7rAFinh_fv1kCfTsT1xG4Zeg@mail.gmail.com Backpatch-through: 17	2025-10-09 01:07:59 -04:00
David Rowley	a5a68dd6d5	Make truncate_useless_pathkeys() consider WindowFuncs truncate_useless_pathkeys() seems to have neglected to account for PathKeys that might be useful for WindowClause evaluation. Modify it so that it properly accounts for that. Making this work required adjusting two things: 1. Change from checking query_pathkeys to check sort_pathkeys instead. 2. Add explicit check for window_pathkeys For #1, query_pathkeys gets set in standard_qp_callback() according to the sort order requirements for the first operation to be applied after the join planner is finished, so this changes depending on which upper planner operations a particular query needs. If the query has window functions and no GROUP BY, then query_pathkeys gets set to window_pathkeys. Before this change, this meant PathKeys useful for the ORDER BY were not accounted for in queries with window functions. Because of #1, #2 is now required so that we explicitly check to ensure we don't truncate away PathKeys useful for window functions. Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvrj3HTKmXoLMbUjTO=_MNMxM=cnuCSyBKidAVibmYPnrg@mail.gmail.com	2025-10-09 12:38:33 +13:00
Andres Freund	5e89985928	bufmgr: Don't lock buffer header in StrategyGetBuffer() Previously StrategyGetBuffer() acquired the buffer header spinlock for every buffer, whether it was reusable or not. If reusable, it'd be returned, with the lock held, to GetVictimBuffer(), which then would pin the buffer with PinBuffer_Locked(). That's somewhat violating the spirit of the guidelines for holding spinlocks (i.e. that they are only held for a few lines of consecutive code) and necessitates using PinBuffer_Locked(), which scales worse than PinBuffer() due to holding the spinlock. This alone makes it worth changing the code. However, the main reason to change this is that a future commit will make PinBuffer_Locked() slower (due to making UnlockBufHdr() slower), to gain scalability for the much more common case of pinning a pre-existing buffer. By pinning the buffer with a single atomic operation, iff the buffer is reusable, we avoid any potential regression for miss-heavy workloads. There strictly are fewer atomic operations for each potential buffer after this change. The price for this improvement is that freelist.c needs two CAS loops and needs to be able to set up the resource accounting for pinned buffers. The latter is achieved by exposing a new function for that purpose from bufmgr.c, that seems better than exposing the entire private refcount infrastructure. The improvement seems worth the complexity. Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2025-10-08 17:04:07 -04:00
Andres Freund	3baae90013	bufmgr: fewer calls to BufferDescriptorGetContentLock We're planning to merge buffer content locks into BufferDesc.state. To reduce the size of that patch, centralize calls to BufferDescriptorGetContentLock(). The biggest part of the change is in assertions, by introducing BufferIsLockedByMe[InMode]() (and removing BufferIsExclusiveLocked()). This seems like an improvement even without aforementioned plans. Additionally replace some direct calls to LWLockAcquire() with calls to LockBuffer(). Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2025-10-08 16:06:19 -04:00
Andres Freund	2a2e1b470b	bufmgr: Fix signedness of mask variable in BufferSync() BM_PERMANENT is defined as 1U<<31, which is a negative number when interpreted as a signed integer. Unfortunately the mask variable in BufferSync() was signed. This has been wrong for a long time, but failed to fail, due to integer conversion rules. However, in an upcoming patch the width of the state variable will be increased, with the wrong signedness leading to never flushing permanent buffers - luckily caught in a test. It seems better to fix this separately, instead of doing so as part of a large, otherwise mechanical, patch. Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2025-10-08 14:34:30 -04:00
Andres Freund	3c2b97b29e	bufmgr: Introduce FlushUnlockedBuffer There were several copies of code locking a buffer, flushing its contents, and unlocking the buffer. It seems worth centralizing that into a helper function. Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2025-10-08 14:34:30 -04:00
Andres Freund	819dc118c0	Improve ReadRecentBuffer() scalability While testing a new potential use for ReadRecentBuffer(), Andres reported that it scales badly when called concurrently for the same buffer by many backends. Instead of a naive (but wrong) coding with PinBuffer(), it used the spinlock, so that it could be careful to pin only if the buffer was valid and holding the expected block, to avoid breaking invariants in eg GetVictimBuffer(). Unfortunately that made it less scalable than PinBuffer(), which uses compare-exchange instead. We can fix that by giving PinBuffer() a new skip_if_not_valid mode that doesn't pin invalid buffers. It might occasionally skip when it shouldn't due to the unlocked read of the header flags, but that's unlikely and perfectly acceptable for an opportunistic optimisation routine, and it can only succeed when it really should due to the compare-exchange loop. Note that this fixes ReadRecentBuffer()'s failure to bump the usage count. While this could be seen as a bug, there currently aren't cases affected by this in core, so it doesn't seem worth backpatching that portion. Author: Thomas Munro <thomas.munro@gmail.com> Reported-by: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/20230627020546.t6z4tntmj7wmjrfh%40awork3.anarazel.de Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2025-10-08 13:10:40 -04:00
Masahiko Sawada	d3b6183dd9	Add mem_exceeded_count column to pg_stat_replication_slots. This commit introduces a new column mem_exceeded_count to the pg_stat_replication_slots view. This counter tracks how often the memory used by logical decoding exceeds the logical_decoding_work_mem limit. The new statistic helps users determine whether exceeding the logical_decoding_work_mem limit is a rare occurrence or a frequent issue, information that wasn't available through existing statistics. Bumps catversion. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/978D21E8-9D3B-40EA-A4B1-F87BABE7868C@yesql.se	2025-10-08 10:05:04 -07:00
Robert Haas	94f3ad3961	Add planner_setup_hook and planner_shutdown_hook. These hooks allow plugins to get control at the earliest point at which the PlannerGlobal object is fully initialized, and then just before it gets destroyed. This is useful in combination with the extendable plan state facilities (see extendplan.h) and perhaps for other purposes as well. Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com	2025-10-08 09:05:38 -04:00
Robert Haas	c83ac02ec7	Add ExplainState argument to pg_plan_query() and planner(). This allows extensions to have access to any data they've stored in the ExplainState during planning. Unfortunately, it won't help when EXPLAIN EXECUTE is used, but since that case is less common, this still seems like an improvement. Since planner() has quite a few arguments now, also add some documentation of those arguments and the return value. Author: Robert Haas <rhaas@postgresql.org> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com	2025-10-08 08:33:29 -04:00
Richard Guo	8e11859102	Implement Eager Aggregation Eager aggregation is a query optimization technique that partially pushes aggregation past a join, and finalizes it once all the relations are joined. Eager aggregation may reduce the number of input rows to the join and thus could result in a better overall plan. In the current planner architecture, the separation between the scan/join planning phase and the post-scan/join phase means that aggregation steps are not visible when constructing the join tree, limiting the planner's ability to exploit aggregation-aware optimizations. To implement eager aggregation, we collect information about aggregate functions in the targetlist and HAVING clause, along with grouping expressions from the GROUP BY clause, and store it in the PlannerInfo node. During the scan/join planning phase, this information is used to evaluate each base or join relation to determine whether eager aggregation can be applied. If applicable, we create a separate RelOptInfo, referred to as a grouped relation, to represent the partially-aggregated version of the relation and generate grouped paths for it. Grouped relation paths can be generated in two ways. The first method involves adding sorted and hashed partial aggregation paths on top of the non-grouped paths. To limit planning time, we only consider the cheapest or suitably-sorted non-grouped paths in this step. Alternatively, grouped paths can be generated by joining a grouped relation with a non-grouped relation. Joining two grouped relations is currently not supported. To further limit planning time, we currently adopt a strategy where partial aggregation is pushed only to the lowest feasible level in the join tree where it provides a significant reduction in row count. This strategy also helps ensure that all grouped paths for the same grouped relation produce the same set of rows, which is important to support a fundamental assumption of the planner. 
For the partial aggregation that is pushed down to a non-aggregated relation, we need to consider all expressions from this relation that are involved in upper join clauses and include them in the grouping keys, using compatible operators. This is essential to ensure that an aggregated row from the partial aggregation matches the other side of the join if and only if each row in the partial group does. This ensures that all rows within the same partial group share the same "destiny", which is crucial for maintaining correctness. One restriction is that we cannot push partial aggregation down to a relation that is in the nullable side of an outer join, because the NULL-extended rows produced by the outer join would not be available when we perform the partial aggregation, while with a non-eager-aggregation plan these rows are available for the top-level aggregation. Pushing partial aggregation in this case may result in the rows being grouped differently than expected, or produce incorrect values from the aggregate functions. If we have generated a grouped relation for the topmost join relation, we finalize its paths at the end. The final paths will compete in the usual way with paths built from regular planning. The patch was originally proposed by Antonin Houska in 2017. This commit reworks various important aspects and rewrites most of the current code. However, the original patch and reviews were very useful. 
Author: Richard Guo <guofenglinux@gmail.com> Author: Antonin Houska <ah@cybertec.at> (in an older version) Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> (in an older version) Reviewed-by: Andy Fan <zhihuifan1213@163.com> (in an older version) Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> (in an older version) Discussion: https://postgr.es/m/CAMbWs48jzLrPt1J_00ZcPZXWUQKawQOFE8ROc-ADiYqsqrpBNw@mail.gmail.com	2025-10-08 17:04:23 +09:00
Michael Paquier	138da727a1	Improve description of some WAL records for GIN The following information is added in the description of some GIN records: - In INSERT_LISTPAGE, the number of tuples and the right link block. - In UPDATE_META_PAGE, the number of tuples, the previous tail block, and the right link block. - In SPLIT, the left and right children blocks. Author: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/CALdSSPgnAt5L=D_xGXRXLYO5FK1H31_eYEESxdU1n-r4g+6GqA@mail.gmail.com	2025-10-08 14:02:26 +09:00
Michael Paquier	b71bae41a0	Add stats_reset to pg_stat_user_functions It is possible to call pg_stat_reset_single_function_counters() for a single function, but the reset time was missing from the system view showing its statistics. Like all the fields of pg_stat_user_functions, the GUC track_functions needs to be enabled to show the statistics about function executions. Bump catalog version. Bump PGSTAT_FILE_FORMAT_ID, as a result of the new field added to PgStat_StatFuncEntry. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aONjnsaJSx-nEdfU@paquier.xyz	2025-10-08 12:43:40 +09:00
Amit Kapila	035b09131d	Fix typo in function header comment. Reported-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CA+TgmoZYh_nw-2j_Fi9y6ZAvrpN+W1aSOFNM7Rus2Q-zTkCsQw@mail.gmail.com	2025-10-08 03:17:05 +00:00
Tatsuo Ishii	2273fa32bc	Fix Coverity issues reported in commit `25a30bbd42`. Fix several issues pointed out by Coverity (reported by Tom Lane). - In row_is_in_frame(), return value of window_gettupleslot() was not checked. - WinGetFuncArgInPartition() tried to dereference "isout" pointer even if it could be NULL in some places. Besides the issues, I also fixed a compiler warning reported by Álvaro Herrera. Moreover, in WinGetFuncArgInPartition refactor the do...while loop so that the code inside the loop is simpler. Also simplify the case when abs_pos < 0. Author: Tatsuo Ishii <ishii@postgresql.org> Reviewed-by: Paul Ramsey <pramsey@cleverelephant.ca> Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Reported-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/1686755.1759679957%40sss.pgh.pa.us Discussion: https://postgr.es/m/202510051612.gw67jlc2iqpw%40alvherre.pgsql	2025-10-08 09:26:49 +09:00
Robert Haas	64095d1574	Remove PlannerInfo's join_search_private method. Instead, use the new mechanism that allows planner extensions to store private state inside a PlannerInfo, treating GEQO as an in-core planner extension. This is a useful test of the new facility, and also buys back a few bytes of storage. To make this work, we must remove innerrel_is_unique_ext's hack of testing whether join_search_private is set as a proxy for whether the join search might be retried. Add a flag that extensions can use to explicitly signal their intentions instead. Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com	2025-10-07 12:43:45 -04:00
Robert Haas	0132dddab3	Allow private state in certain planner data structures. Extensions that make extensive use of planner hooks may want to coordinate their efforts, for example to avoid duplicate computation, but that's currently difficult because there's no really good way to pass data between different hooks. To make that easier, allow for storage of extension-managed private state in PlannerGlobal, PlannerInfo, and RelOptInfo, along very similar lines to what we have permitted for ExplainState since commit `c65bc2e1d1`. Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com	2025-10-07 12:09:30 -04:00
Robert Haas	8c49a484e8	Assign each subquery a unique name prior to planning it. Previously, subqueries were given names only after they were planned, which makes it difficult to use information from a previous execution of the query to guide future planning. If, for example, you knew something about how you want "InitPlan 2" to be planned, you won't know whether the subquery you're currently planning will end up being "InitPlan 2" until after you've finished planning it, by which point it's too late to use the information that you had. To fix this, assign each subplan a unique name before we begin planning it. To improve consistency, use textual names for all subplans, rather than, as we did previously, a mix of numbers (such as "InitPlan 1") and names (such as "CTE foo"), and make sure that the same name is never assigned more than once. We adopt the somewhat arbitrary convention of using the type of sublink to set the plan name; for example, a query that previously had two expression sublinks shown as InitPlan 2 and InitPlan 1 will now end up named expr_1 and expr_2. Because names are assigned before rather than after planning, some of the regression test outputs show the numerical part of the name switching positions: what was previously SubPlan 2 was actually the first one encountered, but we finished planning it later. We assign names even to subqueries that aren't shown as such within the EXPLAIN output. These include subqueries that are a FROM clause item or a branch of a set operation, rather than something that will be turned into an InitPlan or SubPlan. The purpose of this is to make sure that, below the topmost query level, there's always a name for each subquery that is stable from one planning cycle to the next (assuming no changes to the query or the database schema). 
Author: Robert Haas <rhaas@postgresql.org> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: http://postgr.es/m/3641043.1758751399@sss.pgh.pa.us	2025-10-07 09:18:54 -04:00
David Rowley	9c9d41af4d	Teach planner to short-circuit EXCEPT/INTERSECT with dummy inputs When either input of an INTERSECT [ALL] operator is proven not to return any results (a dummy rel), then mark the entire INTERSECT operation as dummy. Likewise, if an EXCEPT [ALL] operation's left input is proven empty, then mark the entire operation as dummy. With EXCEPT ALL, we can easily handle the right input being dummy as we can return the left input without any processing. That can lead to significant performance gains during query execution. We can't easily handle dummy right inputs for EXCEPT (without ALL), as that would require deduplication of the left input. Wiring up those Paths is likely more complex than it's worth as the gains during execution aren't that great, so let's leave that one to be handled by the normal Path generation code. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAApHDvri53PPF76c3M94_QNWbJfXjyCnjXuj_2=LYM-0m8WZtw@mail.gmail.com	2025-10-07 17:17:52 +13:00
David Rowley	928df067d1	Fix incorrect targetlist in dummy UNIONs The prior code, added in `03d40e4b5` attempted to use the targetlist of the first UNION child when all UNION children were proven as dummy rels. That's not going to work when some operation atop of the Result node must find target entries within the Result's targetlist. This could have been something as simple as trying to sort the results of the UNION operation, which would lead to: ERROR: could not find pathkey item to sort Instead, use the top-level UNION's targetlist and fix the varnos in setrefs.c. Because set operation targetlists always use varno==0, we can rewrite those to become varno==1, i.e. use the Vars from the first UNION child. This does result in showing Vars from relations that are not present in the final plan, but that's no different to what we see when normal base relations are proven dummy. Without this fix it would be possible to see the following error in EXPLAIN VERBOSE when all UNION inputs were proven empty. ERROR: bogus varno: 0 Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvrUASy9sfULMEsM2udvZJP6AoBRCZvHYXYxZTy2tX9FYw@mail.gmail.com	2025-10-07 14:15:04 +13:00
Masahiko Sawada	771cfe22a0	Avoid unnecessary GinFormTuple() calls for incompressible posting lists. Previously, we attempted to form a posting list tuple even when ginCompressPostingList() failed to compress the posting list due to its size. While there was no functional failure, it always wasted one GinFormTuple() call when item pointers didn't fit in a posting list tuple. This commit ensures that a GIN index tuple is formed only when all item pointers in the posting list are successfully compressed. Author: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAE7r3M+C=jcpTD93f_RBHrQp3C+=TAXFs+k4tTuZuuxboK8AvA@mail.gmail.com	2025-10-06 14:02:01 -07:00
Nathan Bossart	ec8719ccbf	Optimize hex_encode() and hex_decode() using SIMD. The hex_encode() and hex_decode() functions serve as the workhorses for hexadecimal data for bytea's text format conversion functions, and some workloads are sensitive to their performance. This commit adds new implementations that use routines from port/simd.h, which testing indicates are much faster for larger inputs. For small or invalid inputs, we fall back on the existing scalar versions. Since we are using port/simd.h, these optimizations apply to both x86-64 and AArch64. Author: Nathan Bossart <nathandbossart@gmail.com> Co-authored-by: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com> Co-authored-by: Susmitha Devanga <devanga.susmitha@fujitsu.com> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Discussion: https://postgr.es/m/aLhVWTRy0QPbW2tl%40nathan	2025-10-06 12:28:50 -05:00
Amit Kapila	b93172ca59	Expose sequence page LSN via pg_get_sequence_data. This patch enhances the pg_get_sequence_data function to include the page-level LSN (Log Sequence Number) of the sequence. This additional metadata will be used by upcoming patches to support synchronization of sequences during logical replication. By exposing the LSN, we enable more accurate tracking of sequence changes, which is essential for maintaining consistency across replicated nodes. Author: vignesh C <vignesh21@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://www.postgresql.org/message-id/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com	2025-10-06 08:30:16 +00:00
Michael Paquier	7072a8855e	Remove block information from description of some WAL records for GIN The WAL records XLOG_GIN_INSERT and XLOG_GIN_VACUUM_DATA_LEAF_PAGE included some information about the blocks added to the record. This information is already provided by XLogRecGetBlockRefInfo() with much more details about the blocks included in each record, like the compression information, for example. This commit removes the block information that existed in the record descriptions specific to GIN. Author: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/CALdSSPgk=9WRoXhZy5fdk+T1hiau7qbL_vn94w_L1N=gtEdbsg@mail.gmail.com	2025-10-06 16:14:59 +09:00
Michael Paquier	a5b543258a	Add stats_reset to pg_stat_all_{tables,indexes} and related views It is possible to call pg_stat_reset_single_table_counters() on a relation (index or table) but the reset time was missing from the system views showing their statistics. This commit adds the reset time as an attribute of pg_stat_all_tables, pg_stat_all_indexes, and other relations related to them. Bump catalog version. Bump PGSTAT_FILE_FORMAT_ID, as a result of the new field added to PgStat_StatTabEntry. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aN8l182jKxEq1h9f@paquier.xyz	2025-10-06 15:31:21 +09:00
Michael Paquier	0c7f103028	Fix two comments in numeric.c The comments at the top of numeric_int4_safe() and numeric_int8_safe() mentioned respectively int4_numeric() and int8_numeric(). The intention is to refer to numeric_int4() and numeric_int8(). Oversights in `4246a977ba`. Reported-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/CACJufxFfVt7Jx9_j=juxXyP-6tznN8OcvS9E-QSgp0BrD8KUgA@mail.gmail.com	2025-10-06 11:18:30 +09:00
Álvaro Herrera	1a8b5b11e4	Don't include access/htup_details.h in executor/tuptable.h This is not at all needed; I suspect it was a simple mistake in commit `5408e233f0`. It causes htup_details.h to bleed into a huge number of places via execnodes.h. Remove it and fix fallout. Discussion: https://postgr.es/m/202510021240.ptc2zl5cvwen@alvherre.pgsql	2025-10-05 18:00:38 +02:00
Álvaro Herrera	1b6f61bd89	Don't include execnodes.h in brin.h or gin.h These headers don't need execnodes.h for anything. I think they never have. Discussion: https://postgr.es/m/202510021240.ptc2zl5cvwen@alvherre.pgsql	2025-10-05 17:35:25 +02:00
David Rowley	03d40e4b52	Teach UNION planner to remove dummy inputs This adjusts UNION planning so that the planner produces more optimal plans when one or more of the UNION's subqueries have been proven to be empty (a dummy rel). If any of the inputs are empty, then that input can be removed from the Append / MergeAppend. Previously, a const-false "Result" node would appear to represent this. Removing empty inputs has a few extra benefits when only 1 union child remains as it means the Append or MergeAppend can be removed in setrefs.c, making the plan slightly faster to execute. Also, we can provide better n_distinct estimates by looking at the sole remaining input rel's statistics. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAApHDvri53PPF76c3M94_QNWbJfXjyCnjXuj_2=LYM-0m8WZtw@mail.gmail.com	2025-10-04 14:30:03 +13:00
David Rowley	5092aae431	Use bms_add_members() instead of bms_union() when possible bms_union() causes a new set to be allocated. What this caller needs is members added to an existing set. bms_add_members() is the tool for that job. This is just a matter of fixing an inefficiency due to surplus memory allocations. No bugs being fixed. The only other place I found that might be valid to apply this change is in markNullableIfNeeded(), but I opted not to do that due to the risk to reward ratio not looking favorable. The risk being that there could be another pointer pointing to the Bitmapset. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Greg Burd <greg@burd.me> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAApHDvoCcoS-p5tZNJLTxFOKTYNjqVh7Dwf+5ikDUBwnvWftRw@mail.gmail.com	2025-10-04 12:19:31 +13:00
Nathan Bossart	74b41f5a77	Make some use of anonymous unions [DSM registry]. Make some use of anonymous unions, which are allowed as of C11, as examples and encouragement for future code, and to test compilers. This commit changes the DSMRegistryEntry struct. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/aNKsDg0fJwqhZdXX%40nathan	2025-10-03 10:14:33 -05:00
John Naylor	54ab748651	Fix reuse-after-free hazard in dead_items_reset In similar vein to commit `ccc8194e42`, a reset instance of a shared memory TID store happened to occupy the same private memory as the old one for the entry point, since the chunk freed after the last round of index vacuuming was put on the context's freelist. The failure to update the vacrel->dead_items pointer was evident by nudging the system to allocate memory in a different area. This was not discovered at the time of the earlier commit since our regression tests didn't cover multiple index passes with parallel vacuum. Backpatch to v17, when TidStore came in. Author: Kevin Oommen Anish <kevin.o@zohocorp.com> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Tested-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/199a07cbdfc.7a1c4aac25838.1675074408277594551%40zohocorp.com Backpatch-through: 17	2025-10-03 16:05:02 +07:00
Richard Guo	605bfb7dbe	Fix incorrect function reference in comment The comment incorrectly references the defunct function BufFileOpenShared(), which was replaced in commit `dcac5e7ac`. This patch updates the comment to refer to the current function BufFileOpenFileSet(). Author: Zhang Mingli <zmlpostgres@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/1cb48b4c-54ab-40cc-b355-0b3c2af6d3f7@Spark	2025-10-03 16:34:42 +09:00
Tatsuo Ishii	25a30bbd42	Add IGNORE NULLS/RESPECT NULLS option to Window functions. Add IGNORE NULLS/RESPECT NULLS option (null treatment clause) to lead, lag, first_value, last_value and nth_value window functions. If unspecified, the default is RESPECT NULLS which includes NULL values in any result calculation. IGNORE NULLS ignores NULL values. Built-in window functions are modified to call new API WinCheckAndInitializeNullTreatment() to indicate whether they accept IGNORE NULLS/RESPECT NULLS option or not (the API can be called by user defined window functions as well). If WinGetFuncArgInPartition's allowNullTreatment argument is true and IGNORE NULLS option is given, WinGetFuncArgInPartition() or WinGetFuncArgInFrame() will return evaluated function's argument expression on specified non NULL row (if it exists) in the partition or the frame. When IGNORE NULLS option is given, window functions need to visit and evaluate same rows over and over again to look for non null rows. To mitigate the issue, 2-bit not null information array is created while executing window functions to remember whether the row has been already evaluated to NULL or NOT NULL. If already evaluated, we could skip the evaluation work, thus we could get better performance. Author: Oliver Ford <ojford@gmail.com> Co-authored-by: Tatsuo Ishii <ishii@postgresql.org> Reviewed-by: Krasiyan Andreev <krasiyan@gmail.com> Reviewed-by: Andrew Gierth <andrew@tao11.riddles.org.uk> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Fetter <david@fetter.org> Reviewed-by: Vik Fearing <vik@postgresfriends.org> Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com> Reviewed-by: Chao Li <lic@highgo.com> Discussion: https://postgr.es/m/flat/CAGMVOdsbtRwE_4+v8zjH1d9xfovDeQAGLkP_B6k69_VoFEgX-A@mail.gmail.com	2025-10-03 09:47:36 +09:00
Michael Paquier	3f431109dc	Remove useless pointer update in ginxlog.c Oversight in `2c03216d83`, when the redo code of GIN got refactored for the new WAL format where block information has been standardized, as the payload data got tracked for each block after the change, and not in the whole record. This is just a cleanup. Author: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/CALdSSPgnAt5L=D_xGXRXLYO5FK1H31_eYEESxdU1n-r4g+6GqA@mail.gmail.com	2025-10-02 17:16:20 +09:00
John Naylor	48566180ef	Generate EUC_CN mappings from gb18030-2022.ucm In the wake of `cfa6cd292`, EUC_CN was the only encoding that used gb-18030-2000.xml to generate the .map files. Since EUC_CN is a subset of GB18030, we can easily use the same UCM file. This allows deleting the XML file from our repository. Author: Chao Li <lic@highgo.com> Discussion: https://postgr.es/m/CANWCAZaNRXZ-5NuXmsaMA2mKvMZnCGHZqQusLkpE%2B8YX%2Bi5OYg%40mail.gmail.com	2025-10-02 12:36:24 +07:00
David Rowley	91df0465a6	Fix typo in pgstat_relation.c header comment Looks like a copy and paste error from pgstat_function.c Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aNuaVMAdTGbgBgqh@ip-10-97-1-34.eu-west-3.compute.internal	2025-10-01 00:23:38 +13:00
Peter Eisentraut	57d46dff9b	Make some use of anonymous unions [reorderbuffer xact_time] Make some use of anonymous unions, which are allowed as of C11, as examples and encouragement for future code, and to test compilers. This commit changes the ReorderBufferTXN struct. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/f00a9968-388e-4f8c-b5ef-5102e962d997%40eisentraut.org	2025-09-30 12:35:50 +02:00
Peter Eisentraut	4b7e6c73b0	Make some use of anonymous unions [pg_locale_t] Make some use of anonymous unions, which are allowed as of C11, as examples and encouragement for future code, and to test compilers. This commit changes the pg_locale_t type. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/f00a9968-388e-4f8c-b5ef-5102e962d997%40eisentraut.org	2025-09-30 12:35:50 +02:00
Álvaro Herrera	3bf31dd243	Do a tiny bit of header file maintenance Stop including utils/relcache.h in access/genam.h, and stop including htup_details.h in nodes/tidbitmap.h. Both these files (genam.h and tidbitmap.h) are widely used in other header files, so it's in our best interest that they remain as lean as reasonable. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/202509291356.o5t6ny2hoa3q@alvherre.pgsql	2025-09-30 12:28:29 +02:00
Michael Paquier	bb68cde413	Reorder XLogNeedsFlush() checks to be more consistent During recovery, XLogNeedsFlush() checks the minimum recovery LSN point instead of the flush LSN point. The same condition checks are used when updating the minimum recovery point in UpdateMinRecoveryPoint(), but are written in reverse order. This commit makes the order of the checks consistent between XLogNeedsFlush() and UpdateMinRecoveryPoint(), improving the code clarity. Note that the second check (as ordered by this commit) relies on InRecovery, which is true only in the startup process. So this makes XLogNeedsFlush() cheaper in the startup process with the first check acting as a shortcut while doing crash recovery, where LocalMinRecoveryPoint is an invalid LSN. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Discussion: https://postgr.es/m/aMIHNRTP6Wj6vw1s%40paquier.xyz	2025-09-30 09:38:32 +09:00
Tom Lane	ef38a4d975	Add GROUP BY ALL. GROUP BY ALL is a form of GROUP BY that adds any TargetExpr that does not contain an aggregate or window function into the groupClause of the query, making it exactly equivalent to specifying those same expressions in an explicit GROUP BY list. This feature is useful for certain kinds of data exploration. It's already present in some other DBMSes, and the SQL committee recently accepted it into the standard, so we can be reasonably confident in the syntax being stable. We do have to invent part of the semantics, as the standard doesn't allow for expressions in GROUP BY, so they haven't specified what to do with window functions. We assume that those should be treated like aggregates, i.e., left out of the constructed GROUP BY list. In passing, wordsmith some existing documentation about GROUP BY, and update some neglected synopsis entries in select_into.sgml. Author: David Christensen <david@pgguru.net> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAHM0NXjz0kDwtzoe-fnHAqPB1qA8_VJN0XAmCgUZ+iPnvP5LbA@mail.gmail.com	2025-09-29 16:55:17 -04:00
David Rowley	b91067c899	Remove unused parameter from find_window_run_conditions() ... and check_and_push_window_quals(). Similar to `4be9024d5`, but it seems there was yet another unused parameter. Author: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/DD5BEKORUG34.2M8492NMB9DB8@gmail.com	2025-09-30 08:37:42 +13:00
Noah Misch	a95393ecdb	Fix StatisticsObjIsVisibleExt() for pg_temp. Neighbor get_statistics_object_oid() ignores objects in pg_temp, as has been the standard for non-relation, non-type namespace searches since CVE-2007-2138. Hence, most operations that name a statistics object correctly decline to map an unqualified name to a statistics object in pg_temp. StatisticsObjIsVisibleExt() did not. Consequently, pg_statistics_obj_is_visible() wrongly returned true for such objects, psql \dX wrongly listed them, and getObjectDescription()-based ereport() and pg_describe_object() wrongly omitted namespace qualification. Any malfunction beyond that would depend on how a human or application acts on those wrong indications. Commit `d99d58cdc8` introduced this. Back-patch to v13 (all supported versions). Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/20250920162116.2e.nmisch@google.com Backpatch-through: 13	2025-09-29 11:15:44 -07:00
David Rowley	2cb49c609b	Improve planner's width estimates for set operations For UNION, EXCEPT and INTERSECT, we were not very good at estimating the PathTarget.width for the set operation. Since the targetlist of the set operation is made up of Vars with varno==0, this would result in get_expr_width() applying a default estimate based on the Var's type rather than taking width estimates from any relation's statistics. Here we attempt to improve the situation by looking at the width estimates for the set operation child paths and calculating the average width of the relevant child paths weighted over the estimated number of rows. For UNION and INTERSECT, the relevant paths to look at are all child paths. For EXCEPT, since we don't return rows from the right-hand child (only possibly remove left-hand rows matching those), we use only the left-hand child for width estimates. This also adjusts the hashed-UNION Path's PathTarget to use the same PathTarget as its Append subpath. Both PathTargets will be the same and are void of any resjunk columns, per generate_append_tlist(). Making the AggPath use the same PathTarget saves having to adjust the "width" of the AggPath's PathTarget too. This was reported as a bug by sunw.fnst, but it's not something we ever claimed to do properly. Plus, if we were to adjust this in back branches, plans could change as the estimated input sizes to Sorts and Hash Aggregates could go up or down. Plan choices aren't something we want to destabilize in stable versions. Reported-by: sunw.fnst <936739278@qq.com> Author: David Rowley <drowleyml@gmail.com> Discussion: https://postgr.es/m/tencent_34CF8017AB81944A4C08DD089D410AB6C306@qq.com	2025-09-29 14:36:39 +13:00
Michael Paquier	7bd2975fa9	Add support for tracking of entry count in pgstats Stats kinds can set a new option called "track_entry_count" (disabled by default, available for variable-numbered stats) that will make pgstats track the number of entries that exist in its shared hashtable. As there is only one code path where a new entry is added, and one code path where entries are freed, the count tracking is straight-forward in its implementation. Reads of these counters are optimistic, and may change across two calls. The counter is incremented when an entry is created (not when reused), and is decremented when an entry is freed from the hashtable (marked for drop with its refcount reaching 0), which is something that pgstats decides internally. A first use case of this facility would be pg_stat_statements, where we need to be able to cap the number of entries that would be stored in the shared hashtable, based on its "max" GUC. The module currently relies on hash_get_num_entries(), which offers a cheap way to count how many entries are in its hash table, but we cannot do that in pgstats for variable-sized stats kinds as a single hashtable is used for all the stats kinds. Independently of PGSS, this is useful for other custom stats kinds that want to cap, control, or track the number of entries they have, without depending on a potentially expensive sequential scan to know the number of entries while holding an extra exclusive lock. Author: Michael Paquier <michael@paquier.xyz> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Keisuke Kuroda <keisuke.kuroda.3862@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aMPKWR81KT5UXvEr@paquier.xyz	2025-09-29 08:57:57 +09:00
Tom Lane	b0fb2c6aa5	Refactor to avoid code duplication in transformPLAssignStmt. transformPLAssignStmt contained many lines cribbed directly from transformSelectStmt. I had supposed that we could manage to keep the two copies in sync, but the bug just fixed in `7504d2be9` shows that that hope was foolish. Let's refactor so there's just one copy. The main stumbling block to doing this is that transformPLAssignStmt has a chunk of custom code that has to run after transformTargetList but before we potentially modify the tlist further during analysis of ORDER BY and GROUP BY. Rather than make transformSelectStmt fully aware of PLAssignStmt processing, I put that code into a callback function. It still feels a little bit ugly, but it's not too awful, and surely it's better than a hundred lines of duplicated code. The steps involved in processing a PLAssignStmt remain exactly the same as before, just in different places. Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/31027.1758919078@sss.pgh.pa.us	2025-09-27 17:17:51 -04:00
Tom Lane	7504d2be9e	Fix missed copying of groupDistinct in transformPLAssignStmt. Because we failed to do this, DISTINCT in GROUP BY DISTINCT would be ignored in PL/pgSQL assignment statements. It's not surprising that no one noticed, since such statements will throw an error if the query produces more than one row. That eliminates most scenarios where advanced forms of GROUP BY could be useful, and indeed makes it hard even to find a simple test case. Nonetheless it's wrong. This is directly the fault of `be45be9c3` which added the groupDistinct field, but I think much of the blame has to fall on `c9d529848`, in which I incautiously supposed that we'd manage to keep two copies of a big chunk of parse-analysis logic in sync. As a follow-up, I plan to refactor so that there's only one copy. But that seems useful only in master, so let's use this one-line fix for the back branches. Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/31027.1758919078@sss.pgh.pa.us Backpatch-through: 14	2025-09-27 14:29:41 -04:00
Masahiko Sawada	66cdef4425	Remove unused for_all_tables field from AlterPublicationStmt. No backpatch as AlterPublicationStmt struct is exposed. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/CAD21AoC6B6AuxWOST-TkxUbDgp8FwX=BLEJZmKLG_VrH-hfxpA@mail.gmail.com	2025-09-26 09:23:00 -07:00
Álvaro Herrera	dbf8cfb4f0	Create a separate file listing backend types Use our established coding pattern to reduce maintenance pain when adding other per-process-type characteristics. Like PG_KEYWORD, PG_CMDTAG, PG_RMGR. To keep the strings translatable, the relevant makefile now also scans src/include for this specific file. I didn't want to have it scan all .h files, as then gettext would have to scan all header files. I didn't find any way to affect the meson behavior in this respect though. Author: Álvaro Herrera <alvherre@kurilemu.de> Co-authored-by: Jonathan Gonzalez V. <jonathan.abdiel@gmail.com> Discussion: https://postgr.es/m/202507151830.dwgz5nmmqtdy@alvherre.pgsql	2025-09-26 15:21:49 +02:00
Michael Paquier	85e0ff62b6	Improve stability of btree page split on ERRORs This improves the stability of VACUUM when processing btree indexes, which was previously able to trigger an assertion failure in _bt_lock_subtree_parent() when an error was previously thrown outside the scope of _bt_split() when splitting a btree page. VACUUM would consider the index as in a corrupted state as the right page would not be zeroed for the error thrown (allocation failure is one pattern). In a non-assert build, VACUUM is able to succeed, reporting what it sees as a corruption while attempting to fix the index. This would manifest as a LOG message, as of: LOG: failed to re-find parent key in index "idx" for deletion target page N CONTEXT: while vacuuming index "idx" of relation "public.tab" This commit improves the code to rely on two PGAlignedBlocks that are used as a temporary space for the left and right pages. The main change concerns the right page, whose contents are now copied into the "temporary" PGAlignedBlock page while its original space is zeroed. Its contents are moved from the PGAlignedBlock page back to the page once we enter in the critical section used for the split. This simplifies the split logic, as it is not necessary to zero the right page before throwing an error anymore. Hence errors can now be thrown outside the split code. For the left page, this shaves one allocation, with PageGetTempPage() being previously used. The previous logic originates from commit `8fa30f906b`, at a point where PGAlignedBlock did not exist yet. This could be argued as something that should be backpatched, but the lack of complaints indicates that it may not be necessary. Author: Konstantin Knizhnik <knizhnik@garret.ru> Discussion: https://postgr.es/m/566dacaf-5751-47e4-abc6-73de17a5d42a@garret.ru	2025-09-26 08:41:06 +09:00
David Rowley	3760d278dc	Fix misleading comment in pg_get_statisticsobjdef_string() The comment claimed that a TABLESPACE reference was added to the resulting string, but that's not true. Looks like the comment was copied from pg_get_indexdef_string() without being adjusted correctly. Reported-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/CACJufxHwVPgeu8o9D8oUeDQYEHTAZGt-J5uaJNgYMzkAW7MiCA@mail.gmail.com	2025-09-26 11:04:15 +12:00
David Rowley	4be9024d57	Remove unused parameter from check_and_push_window_quals ... and find_window_run_conditions. This seems to have been around and unused ever since the Run Condition feature was added in `9d9c02ccd`. Let's remove it to clean things up a bit. Author: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/DD26NJ0Y34ZS.2ZOJPHSY12PFI@gmail.com	2025-09-26 10:21:30 +12:00
Tom Lane	02c4bc8830	Try to avoid floating-point roundoff error in pg_sleep(). I noticed the surprising behavior that pg_sleep(0.001) will sleep for 2ms not the expected 1ms. Apparently the float8 calculation of time-to-sleep is managing to produce something a hair over 1, which ceil() rounds up to 2, and then WaitLatch() faithfully waits 2ms. It could be that this works as-expected for some ranges of current timestamp but not others, which would account for not having seen it before. In any case, let's try to avoid it by removing the float arithmetic in the delay calculation. We're stuck with the declared input type being float8, but we can convert that to integer microseconds right away, and then work strictly with integral values. There might still be roundoff surprises for certain input values, but at least the behavior won't be time-varying. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/3879137.1758825752@sss.pgh.pa.us	2025-09-25 17:02:15 -04:00
Robert Haas	803ef0ed49	Fix array allocation bugs in SetExplainExtensionState. If we already have an extension_state array but see a new extension_id much larger than the highest extension_id we've previously seen, the old code might have failed to expand the array to a large enough size, leading to disaster. Also, if we don't have an extension array at all and need to create one, we should make sure that it's big enough that we don't have to resize it instantly. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: http://postgr.es/m/2949591.1758570711@sss.pgh.pa.us Backpatch-through: 18	2025-09-25 11:43:52 -04:00
Daniel Gustafsson	0b3ce7878a	Remove preprocessor guards from injection points When defining an injection point there is no need to wrap the definition with USE_INJECTION_POINT guards, the INJECTION_POINT macro is available in all builds. Remove to make the code consistent. Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/OSCPR01MB14966C8015DEB05ABEF2CE077F51FA@OSCPR01MB14966.jpnprd01.prod.outlook.com Backpatch-through: 17	2025-09-25 15:27:33 +02:00
Álvaro Herrera	7e638d7f50	Don't include execnodes.h in replication/conflict.h ... which silently propagates a lot of headers into many places via pgstat.h, as evidenced by the variety of headers that this patch needs to add to seemingly random places. Add a minimum of typedefs to conflict.h to be able to remove execnodes.h, and fix the fallout. Backpatch to 18, where conflict.h first appeared. Discussion: https://postgr.es/m/202509191927.uj2ijwmho7nv@alvherre.pgsql	2025-09-25 14:52:41 +02:00
Álvaro Herrera	81fc3e28e3	Update some more forward declarations to use typedef As commit `d4d1fc527b`. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/202509191025.22agk3fvpilc@alvherre.pgsql	2025-09-25 14:33:19 +02:00
Melanie Plageman	ae8ea7278c	Correct prune WAL record opcode name in comment `f83d709760` incorrectly refers to a XLOG_HEAP2_PRUNE_FREEZE WAL record opcode. No such code exists. The relevant opcodes are XLOG_HEAP2_PRUNE_ON_ACCESS, XLOG_HEAP2_PRUNE_VACUUM_SCAN, and XLOG_HEAP2_PRUNE_VACUUM_CLEANUP. Correct it. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/yn4zp35kkdsjx6wf47zcfmxgexxt4h2og47pvnw2x5ifyrs3qc%407uw6jyyxuyf7	2025-09-24 12:29:56 -04:00
Tom Lane	aadbcc40bc	Ensure guc_tables.o's dependency on guc_tables.inc.c is known. Without this, rebuilds can malfunction unless --enable-depend is used. Historically we've expected that you can get away without --enable-depend as long as you manually clean after changing *.h files; the makefiles are supposed to handle other sorts of dependencies. So add this one. Follow-on to `635998965`, so no need for back-patch. Discussion: https://postgr.es/m/3121329.1758650878@sss.pgh.pa.us	2025-09-24 12:28:20 -04:00
Fujii Masao	7fcb32ad02	Fix incorrect and inconsistent comments in tableam.h and heapam.c. This commit corrects several issues in function comments: * The parameter "rel" was incorrectly referred to as "relation" in the comments for table_tuple_delete(), table_tuple_update(), and table_tuple_lock(). * In table_tuple_delete(), "changingPart" was listed as an output parameter in the comments but is actually input. * In table_tuple_update(), "slot" was listed as an input parameter in the comments but is actually output. * The comment for "update_indexes" in table_tuple_update() was mis-indented. * The comments for heap_lock_tuple() incorrectly referenced a non-existent "tid" parameter. Author: Chao Li <lic@highgo.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAEoWx2nB6Ay8g=KEn7L3qbYX_4+sLk9XOMkV0XZqHR4cTY8ZvQ@mail.gmail.com	2025-09-25 00:51:59 +09:00
Peter Eisentraut	a5b35fcedb	Remove PointerIsValid() This doesn't provide any value over the standard style of checking the pointer directly or comparing against NULL. Also remove related: - AllocPointerIsValid() [unused] - IndexScanIsValid() [had one user] - HeapScanIsValid() [unused] - InvalidRelation [unused] Leaving HeapTupleIsValid(), ItemIdIsValid(), PortalIsValid(), RelationIsValid() for now, to reduce code churn. Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/ad50ab6b-6f74-4603-b099-1cd6382fb13d%40eisentraut.org Discussion: https://www.postgresql.org/message-id/CA+hUKG+NFKnr=K4oybwDvT35dW=VAjAAfiuLxp+5JeZSOV3nBg@mail.gmail.com Discussion: https://www.postgresql.org/message-id/bccf2803-5252-47c2-9ff0-340502d5bd1c@iki.fi	2025-09-24 15:17:20 +02:00
Daniel Gustafsson	0fba25eb72	Fix incorrect option name in usage screen The usage screen incorrectly referred to the --docs option as --sgml. Backpatch down to v17 where this script was introduced. Author: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/20250729.135638.1148639539103758555.horikyota.ntt@gmail.com Backpatch-through: 17	2025-09-24 14:58:18 +02:00
Daniel Gustafsson	711ccce38f	Consistently handle tab delimiters for wait event names Format validation and element extraction for intermediate line strings were inconsistent in their handling of tab delimiters, which resulted in an unclear error when multiple tab characters were used as a delimiter. This fixes it by using captures from the validation regex instead of a separate split() to avoid the inconsistency. Also, it ensures that \t+ is used consistently when inspecting the strings. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/20250729.135638.1148639539103758555.horikyota.ntt@gmail.com	2025-09-24 14:57:26 +02:00
John Naylor	5334620eef	Update GB18030 encoding from version 2000 to 2022 Mappings for 18 characters have changed, affecting 36 code points. This is a break in compatibility, but these characters are rarely used. U+E5E5 (Private Use Area) was previously mapped to \xA3A0. This code point now maps to \x65356535. Attempting to convert \xA3A0 will now raise an error. Separate from the 2022 update, the following mappings were previously swapped, and subsequently corrected in 2000 and later versions: * U+E7C7 (Private Use Area) now maps to \x8135F437 * U+1E3F (Latin Small Letter M with Acute) now maps to \xA8BC The 2022 standard mentions the following policy changes, but they have no effect in our implementation: 66 new ideographs are now required, but these are mapped algorithmically so were already handled by utf8_and_gb18030.c. Nine CJK compatibility ideographs are no longer required, but implementations may retain them, as does the source we use from the Unicode Consortium. Release notes: Compatibility section For further details, see: https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132 Author: Chao Li <lic@highgo.com> Author: Zheng Tao <taoz@highgo.com> Discussion: https://postgr.es/m/966d9fc.169.198741fe60b.Coremail.jiaoshuntian%40highgo.com	2025-09-24 13:26:05 +07:00
Amit Kapila	e41d954da6	Fix LOCK_TIMEOUT handling during parallel apply. Previously, the parallel apply worker used SIGINT to receive a graceful shutdown signal from the leader apply worker. However, SIGINT is also used by the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. This overlap caused the parallel apply worker to miss LOCK_TIMEOUT signals, leading to incorrect behavior during lock wait/contention. This patch resolves the conflict by switching the graceful shutdown signal from SIGINT to SIGUSR2. Reported-by: Zane Duffield <duffieldzane@gmail.com> Diagnosed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 16, where it was introduced Discussion: https://postgr.es/m/CACMiCkXyC4au74kvE2g6Y=mCEF8X6r-Ne_ty4r7qWkUjRE4+oQ@mail.gmail.com	2025-09-24 04:11:53 +00:00
Robert Haas	f2bae51dfd	Keep track of what RTIs a Result node is scanning. Result nodes now include an RTI set, which is only non-NULL when they have no subplan, and is taken from the relid set of the RelOptInfo that the Result is generating. ExplainPreScanNode now takes notice of these RTIs, which means that a few things get schema-qualified in the regression tests that previously did not. This makes the output more consistent between cases where some part of the plan tree is replaced by a Result node and those where this does not happen. Likewise, pg_overexplain's EXPLAIN (RANGE_TABLE) now displays the RTIs stored in a Result node just as it already does for other RTI-bearing node types. Result nodes also now include a result_reason, which tells us something about why the Result node was inserted. Using that information, EXPLAIN now emits, where relevant, a "Replaces" line describing the origin of a Result node. The purpose of these changes is to allow code that inspects a Plan tree to understand the origin of Result nodes that appear therein. Discussion: http://postgr.es/m/CA+TgmoYeUZePZWLsSO+1FAN7UPePT_RMEZBKkqYBJVCF1s60=w@mail.gmail.com Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>	2025-09-23 09:07:55 -04:00
David Rowley	9fc7f6ab72	Fix various incorrect filename references Author: Chao Li <li.evan.chao@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAEoWx2=hOBCPm-Z=F15twr_23XjHeoXSbifP5GdEdtWona97wQ@mail.gmail.com	2025-09-22 13:33:17 +12:00
Daniel Gustafsson	e1d917182c	Add support for base64url encoding and decoding This adds support for base64url encoding and decoding, a base64 variant which is safe to use in filenames and URLs. base64url replaces '+' in the base64 alphabet with '-' and '/' with '_', thus making it safe for URL addresses and file systems. Support for base64url was originally suggested by Przemysław Sztoch. Author: Florents Tselai <florents.tselai@gmail.com> Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Reviewed-by: David E. Wheeler <david@justatheory.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Chao Li (Evan) <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/70f2b6a8-486a-4fdb-a951-84cef35e22ab@sztoch.pl	2025-09-20 23:19:32 +02:00
Tom Lane	261f89a976	Track the maximum possible frequency of non-MCE array elements. The lossy-counting algorithm that ANALYZE uses to identify most-common array elements has a notion of cutoff frequency: elements with frequency greater than that are guaranteed to be collected, elements with smaller frequencies are not. In cases where we find fewer MCEs than the stats target would permit us to store, the cutoff frequency provides valuable additional information, to wit that there are no non-MCEs with frequency greater than that. What the selectivity estimation functions actually use the "minfreq" entry for is as a ceiling on the possible frequency of non-MCEs, so using the cutoff rather than the lowest stored MCE frequency provides a tighter bound and more accurate estimates. Therefore, instead of redundantly storing the minimum observed MCE frequency, store the cutoff frequency when there are fewer tracked values than we want. (When there are more, then of course we cannot assert that no non-stored elements are above the cutoff frequency, since we're throwing away some that are; so we still use the minimum stored frequency in that case.) Notably, this works even when none of the values are common enough to be called MCEs. In such cases we previously stored nothing in the STATISTIC_KIND_MCELEM pg_statistic slot, which resulted in the selectivity functions falling back to default estimates. So in that case we want to construct a STATISTIC_KIND_MCELEM entry that contains no "values" but does have "numbers", to wit the three extra numbers that the MCELEM entry type defines. A small obstacle is that update_attstats() has traditionally stored a null, not an empty array, when passed zero "values" for a slot. That gives rise to an MCELEM entry that get_attstatsslot() will spit up on. The least risky solution seems to be to adjust update_attstats() so that it will emit a non-null (but possibly empty) array when the passed stavalues array pointer isn't NULL, rather than conditioning that on numvalues > 0. In other existing cases I don't believe that that changes anything. For consistency, handle the stanumbers array the same way. In passing, improve the comments in routines that use STATISTIC_KIND_MCELEM data. Particularly, explain why we use minfreq / 2 not minfreq as the estimate for non-MCE values. Thanks to Matt Long for the suggestion that we could apply this idea even when there are more than zero MCEs. Reported-by: Mark Frost <FROSTMAR@uk.ibm.com> Reported-by: Matt Long <matt@mattlong.org> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/PH3PPF1C905D6E6F24A5C1A1A1D8345B593E16FA@PH3PPF1C905D6E6.namprd15.prod.outlook.com	2025-09-20 14:48:16 -04:00
Tom Lane	1eccb93150	Re-allow using statistics for bool-valued functions in WHERE. Commit `a391ff3c3`, which added the ability for a function's support function to provide a custom selectivity estimate for "WHERE f(...)", unintentionally removed the possibility of applying expression statistics after finding there's no applicable support function. That happened because we no longer fell through to boolvarsel() as before. Refactor to do so again, putting the 0.3333333 default back into boolvarsel() where it had been (cf. commit `39df0f150`). I surely wouldn't have made this error if `39df0f150` had included a test case, so add one now. At the time we did not have the "extended statistics" infrastructure, but we do now, and it is also unable to work in this scenario because of this error. So make use of that for the test case. This is very clearly a bug fix, but I'm afraid to put it into released branches because of the likelihood of altering plan choices, which we avoid doing in minor releases. So, master only. Reported-by: Frédéric Yhuel <frederic.yhuel@dalibo.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/a8b99dce-1bfb-4d97-af73-54a32b85c916@dalibo.com	2025-09-20 12:44:52 -04:00
David Rowley	ac7c8e412c	Improve wording in a few comments Initially this was to fix the "catched" typo, but I (David) wasn't quite clear on what the previous comment meant about being "effective". I expect this means efficiency, so I've reworded the comment to indicate that. While this is only a comment fixup, for the sake of possibly minimizing possible future backpatching pain, I've opted to backpatch to 18 since this code is new to that version and the release isn't out the door yet. Author: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/CAHewXNmSYWPud1sfBvpKbCJeRkWeZYuqatxtV9U9LvAFXBEiBw@mail.gmail.com Backpatch-through: 18	2025-09-19 23:35:23 +12:00
Amit Kapila	5b148706c5	Add optional pid parameter to pg_replication_origin_session_setup(). Commit `216a784829` introduced parallel apply workers, allowing multiple processes to share a replication origin. To support this, replorigin_session_setup() was extended to accept a pid argument identifying the process using the origin. This commit exposes that capability through the SQL interface function pg_replication_origin_session_setup() by adding an optional pid parameter. This enables multiple processes to coordinate replication using the same origin when using SQL-level replication functions. This change allows the non-builtin logical replication solutions to implement parallel apply for large transactions. Additionally, an existing internal error was made user-facing, as it can now be triggered via the exposed SQL API. Author: Doruk Yilmaz <doruk@mixrank.com> Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Discussion: https://postgr.es/m/CAMPB6wfe4zLjJL8jiZV5kjjpwBM2=rTRme0UCL7Ra4L8MTVdOg@mail.gmail.com Discussion: https://postgr.es/m/CAE2gYzyTSNvHY1+iWUwykaLETSuAZsCWyryokjP6rG46ZvRgQA@mail.gmail.com	2025-09-19 05:38:40 +00:00
Amit Kapila	8aac5923a3	Improve a few errdetail messages introduced in commit `0d48d393d4`. Based on suggestions by Tom Lane Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/20250916.114644.275726106301941878.horikyota.ntt@gmail.com	2025-09-19 04:52:59 +00:00
Michael Paquier	deb208df45	Make XLogFlush() and XLogNeedsFlush() decision-making more consistent When deciding which code path to use depending on the state of recovery, XLogFlush() and XLogNeedsFlush() have been relying on different criteria: - XLogFlush() relied on XLogInsertAllowed(). - XLogNeedsFlush() relied on RecoveryInProgress(). Currently, the checkpointer is allowed to insert WAL records while RecoveryInProgress() returns true for an end-of-recovery checkpoint, where XLogInsertAllowed() matters. Using RecoveryInProgress() in XLogNeedsFlush() did not really matter for its existing callers, as the checkpointer only called XLogFlush(). However, a feature under discussion, by Melanie Plageman, needs XLogNeedsFlush() to be able to work in more contexts, the end-of-recovery checkpoint being one. This commit changes XLogNeedsFlush() to use XLogInsertAllowed() instead of RecoveryInProgress(), making the checks in both routines more consistent. While on it, an assertion based on XLogNeedsFlush() is added at the end of XLogFlush(), triggered when flushing a physical position (not for the normal recovery path that checks for updates of the minimum recovery point). This assertion would fail for example in the recovery test 015_promotion_pages if XLogNeedsFlush() is changed to use RecoveryInProgress(). This should hopefully be enough to ensure that the checks done in both routines remain consistent. Author: Melanie Plageman <melanieplageman@gmail.com> Co-authored-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g@mail.gmail.com	2025-09-19 13:47:28 +09:00
Amit Langote	8741e48e5d	Fix EPQ crash from missing partition pruning state in EState Commit `bb3ec16e14` moved partition pruning metadata into PlannedStmt. At executor startup this metadata is used to initialize the EState fields es_part_prune_infos, es_part_prune_states, and es_part_prune_results. EvalPlanQualStart() failed to copy those fields into the child EState, causing NULL dereference when Append ran partition pruning during a recheck. This can occur with DELETE or UPDATE on partitioned tables that use runtime pruning, e.g. with generic plans. Fix by copying all partition pruning state into the EPQ estate. Add an isolation test that reproduces the crash with concurrent UPDATE and DELETE on a partitioned table, where the DELETE session hits the crash during its EPQ recheck after the UPDATE commits. Bug: #19056 Reported-by: Fei Changhong <feichanghong@qq.com> Diagnosed-by: Fei Changhong <feichanghong@qq.com> Author: David Rowley <dgrowleyml@gmail.com> Co-authored-by: Amit Langote <amitlangote09@gmail.com> Discussion: https://postgr.es/m/19056-a677cef9b54d76a0%40postgresql.org	2025-09-19 11:38:29 +09:00
Michael Paquier	3cd3a039da	Document and check that PgStat_HashKey has no padding This change is a tighter rework of `7d85d87f4d`, which tried to improve the code so as it would work should PgStat_HashKey gain new fields that create padding bytes. However, the previous change is proving to not be enough as some code paths of pgstats do not pass PgStat_HashKey by reference (valgrind would warn when padding is added to the structure, through a new field). Per discussion, let's document and check that PgStat_HashKey has no padding rather than try to complicate the code of pgstats so as it is able to work around that. This removes a couple of memset(0) calls that should not be required. While on it, this commit adds a static assertion checking that no padding is introduced in the structure, by checking that the size of PgStat_HashKey matches with the sum of the size of all its fields. The object ID part of the hash key is already 8 bytes, which should be plenty enough already. A comment is added to discourage the addition of new fields. Author: Michael Paquier <michael@paquier.xyz> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0t9omat+HVSakJXwTMWvhpYFcAZb41RPWKwrKFUgmAFBQ@mail.gmail.com	2025-09-19 09:54:05 +09:00
Nathan Bossart	c3cc2ab87d	Fix re-initialization of LWLock-related shared memory. When shared memory is re-initialized after a crash, the named LWLock tranche request array that was copied to shared memory will no longer be accessible. To fix, save the pointer to the original array in postmaster's local memory, and switch to it when re-initializing the LWLock-related shared memory. Oversight in commit `ed1aad15e0`. Per buildfarm member batta. Reported-by: Michael Paquier <michael@paquier.xyz> Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aMoejB3iTWy1SxfF%40paquier.xyz Discussion: https://postgr.es/m/f8ca018f-3479-49f6-a92c-e31db9f849d7%40gmail.com	2025-09-18 09:55:39 -05:00
Andres Freund	0110e2ec5c	Mark shared buffer lookup table HASH_FIXED_SIZE StrategyInitialize() calls InitBufTable() with the maximum number of entries that the buffer lookup table can ever have. Thus there should not be any need to allocate more elements after initialization. Hence mark the hash table as fixed-size. Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CAExHW5v0jh3F_wj86yC=qBfWk0uiT94qy=Z41uzAHLHh0SerRA@mail.gmail.com	2025-09-17 20:28:43 -04:00
Tom Lane	b0cc0a71e0	Calculate agglevelsup correctly when Aggref contains a CTE. If an aggregate function call contains a sub-select that has an RTE referencing a CTE outside the aggregate, we must treat that reference like a Var referencing the CTE's query level for purposes of determining the aggregate's level. Otherwise we might reach the nonsensical conclusion that the aggregate should be evaluated at some query level higher than the CTE, ending in a planner error or a broken plan tree that causes executor failures. Bug: #19055 Reported-by: BugForge <dllggyx@outlook.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19055-6970cfa8556a394d@postgresql.org Backpatch-through: 13	2025-09-17 16:32:57 -04:00
Thomas Munro	0951942bba	jit: Fix type used for Datum values in LLVM IR. Commit `2a600a93` made Datum 8 bytes wide everywhere. It was no longer appropriate to use TypeSizeT on 32 bit systems, and JIT compilation would fail with various type check errors. Introduce a separate LLVMTypeRef with the name TypeDatum. TypeSizeT is still used in some places for actual size_t values. Reported-by: Dmitry Mityugov <d.mityugov@postgrespro.ru> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Tested-by: Dmitry Mityugov <d.mityugov@postgrespro.ru> Discussion: https://postgr.es/m/0a9f0be59171c2e8f1b3bc10f4fcf267%40postgrespro.ru	2025-09-17 13:38:35 +12:00
Michael Paquier	158c48303e	Fix shared memory calculation size of PgAioCtl The shared memory size was calculated based on an offset of io_handles, which is itself a pointer included in the structure. We tend to overestimate the shared memory size overall, so this was unlikely an issue in practice, but let's be correct and use the full size of the structure in the calculation, so as the pointer for io_handles is included. Oversight in `da7226993f`. Author: Madhukar Prasad <madhukarprasad@google.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAKi+wrbC2dTzh_vKJoAZXV5wqTbhY0n4wRNpCjJ=e36aoo0kFw@mail.gmail.com Backpatch-through: 18	2025-09-17 09:33:32 +09:00
David Rowley	ac06ea8f7b	Add missing EPQ recheck for TID Range Scan The EvalPlanQual recheck for TID Range Scan wasn't rechecking that the TID qual still passed after following update chains. This could result in tuples being updated or deleted by plans using TID Range Scans where the ctid of the new (updated) tuple no longer matches the clause of the scan. This isn't desired behavior, and isn't consistent with what would happen if the chosen plan had used an Index or Seq Scan, and that could lead to hard to predict behavior for scans that contain TID quals and other quals as the planner has freedom to choose TID Range or some other non-TID scan method for such queries, and the chosen plan could change at any moment. Here we fix this by properly implementing the recheck function for TID Range Scans. Backpatch to 14, where TID Range Scans were added. Reported-by: Sophie Alpert <pg@sophiebits.com> Author: Sophie Alpert <pg@sophiebits.com> Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/4a6268ff-3340-453a-9bf5-c98d51a6f729@app.fastmail.com Backpatch-through: 14	2025-09-17 12:19:15 +12:00
David Rowley	dee21ea6d6	Add missing EPQ recheck for TID Scan The EvalPlanQual recheck for TID Scan wasn't rechecking that the TID qual still passed after following update chains. This could result in tuples being updated or deleted by plans using TID Scans where the ctid of the new (updated) tuple no longer matches the clause of the scan. This isn't desired behavior, and isn't consistent with what would happen if the chosen plan had used an Index or Seq Scan, and that could lead to hard to predict behavior for scans that contain TID quals and other quals as the planner has freedom to choose TID or some other scan method for such queries, and the chosen plan could change at any moment. Here we fix this by properly implementing the recheck function for TID Scans. Backpatch to 13, oldest supported version. Reported-by: Sophie Alpert <pg@sophiebits.com> Author: Sophie Alpert <pg@sophiebits.com> Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/4a6268ff-3340-453a-9bf5-c98d51a6f729@app.fastmail.com Backpatch-through: 13	2025-09-17 11:48:55 +12:00
Tom Lane	8abbbbae61	Revert "Avoid race condition between "GRANT role" and "DROP ROLE"". This reverts commit `98fc31d649`. That change allowed DROP OWNED BY to drop grants of the target role to other roles, arguing that nobody would need those privileges anymore. But that's not so: if you're not superuser, you still need admin privilege on the target role so you can drop it. It's not clear whether or how the dependency-based approach to solving the original problem can be adapted to keep these grants. Since v18 release is fast approaching, the sanest thing to do seems to be to revert this patch for now. The race-condition problem is low severity and not worth taking risks for. I didn't force a catversion bump in `98fc31d64`, so I won't do so here either. Reported-by: Dipesh Dhameliya <dipeshdhameliya125@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CABgZEgczOFicCJoqtrH9gbYMe_BV3Hq8zzCBRcMgmU6LRsihUA@mail.gmail.com Backpatch-through: 18	2025-09-16 13:05:53 -04:00
Tom Lane	83a5641945	Provide more-specific error details/hints for function lookup failures. Up to now we've contented ourselves with a one-size-fits-all error hint when we fail to find any match to a function or procedure call. That was mostly okay in the beginning, but it was never great, and since the introduction of named arguments it's really not adequate. We at least ought to distinguish "function name doesn't exist" from "function name exists, but not with those argument names". And the rules for named-argument matching are arcane enough that some more detail seems warranted if we match the argument names but the call still doesn't work. This patch creates a framework for dealing with these problems: FuncnameGetCandidates and related code will now pass back a bitmask of flags showing how far the match succeeded. This allows a considerable amount of granularity in the reports. The set-bits-in-a-bitmask approach means that when there are multiple candidate functions, the report will reflect the match(es) that got the furthest, which seems correct. Also, we can avoid mentioning "maybe add casts" unless failure to match argument types is actually the issue. Extend the same return-a-bitmask approach to OpernameGetCandidates. The issues around argument names don't apply to operator syntax, but it still seems worth distinguishing between "there is no operator of that name" and "we couldn't match the argument types". While at it, adjust these messages and related ones to more strictly separate "detail" from "hint", following our message style guidelines' distinction between those. Reported-by: Dominique Devienne <ddevienne@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/1756041.1754616558@sss.pgh.pa.us	2025-09-16 12:17:02 -04:00
Richard Guo	b63a822452	Treat JsonConstructorExpr as non-strict JsonConstructorExpr can produce non-NULL output with a NULL input, so it should be treated as a non-strict construct. Failing to do so can lead to incorrect query behavior. For example, in the reported case, when pulling up a subquery that is under an outer join, if the subquery's target list contains a JsonConstructorExpr that uses subquery variables and it is mistakenly treated as strict, it will be pulled up without being wrapped in a PlaceHolderVar. As a result, the expression will be evaluated at the wrong place and will not be forced to null when the outer join should do so. Back-patch to v16 where JsonConstructorExpr was introduced. Bug: #19046 Reported-by: Runyuan He <runyuan@berkeley.edu> Author: Tender Wang <tndrwang@gmail.com> Co-authored-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/19046-765b6602b0a8cfdf@postgresql.org Backpatch-through: 16	2025-09-16 18:42:20 +09:00
John Naylor	cfa6cd2927	Generate GB18030 mappings from the Unicode Consortium's UCM file Previously we built the .map files for GB18030 (version 2000) from an XML file. The 2022 version for this encoding is only available as a Unicode Character Mapping (UCM) file, so as preparatory refactoring switch to this format as the source for building version 2000. As we do with most input files for the conversion mappings, download the file on demand. In order to generate the same mappings we have now, we must download from a previous upstream commit, rather than the head since the latter contains a correction not present in our current .map files. The XML file is still used by EUC_CN, so we cannot delete it from our repository. GB18030 is a superset of EUC_CN, so it may be possible to build EUC_CN from the same UCM file, but that is left for future work. Author: Chao Li <lic@highgo.com> Discussion: https://postgr.es/m/966d9fc.169.198741fe60b.Coremail.jiaoshuntian%40highgo.com	2025-09-16 16:29:08 +07:00
Peter Eisentraut	bce18ef3c6	Fix incorrect const qualifier Commit `7202d72787` added in passing some const qualifiers, but the one on the postmaster_child_launch() startup_data argument was incorrect, because the function itself modifies the pointed-to data. This is hidden from the compiler because of casts. The qualifiers on the functions called by postmaster_child_launch() are still correct.	2025-09-16 07:27:32 +02:00
Peter Geoghegan	7d9cd2df5f	Teach nbtree to avoid evaluating row compare keys. Add logic to _bt_set_startikey that determines whether row compare keys are guaranteed to be satisfied by every tuple on a page that is about to be read by _bt_readpage. This works in essentially the same way as the existing scalar inequality logic. Testing has shown that the new logic improves performance to about the same degree as the existing scalar inequality logic (compared to the unoptimized case). In other words, the new logic makes many row compare scans significantly faster. Note that the new row compare inequality logic is only effective when the same individual row member is the deciding subkey for all tuples on the page (obviously, all tuples have to satisfy the row compare, too). This is what makes the new row compare logic very similar to the existing logic for scalar inequalities. Note, in particular, that this makes it safe to ignore whether all row compare members are against either ASC or DESC index attributes (i.e. it doesn't matter if individual subkeys don't all use the same inequality strategy). Also stop refusing to set pstate.startikey to an offset beyond any nonrequired key (don't add logic that'll do that for an individual row compare subkey, either). We can fully rely on our firstchangingattnum tests instead. This will do the right thing when a page has a group of tuples with NULLs in a lower-order attribute that makes the tuples fail to satisfy a row compare key -- we won't incorrectly conclude that all tuples must satisfy the row compare, just because firsttup and lasttup happen to. Our firstchangingattnum test prevents that from happening. (Note that the original "avoid evaluating nbtree scan keys" mechanism added by commit `e0b1ee17` couldn't support row compares due to issues with tuples that contain NULLs in a lower-order subkey's attribute. 
That original mechanism relied on requiredness markings, which the replacement _bt_set_startikey mechanism never really needed.) Follow up to commit `8a510275`, which added the _bt_set_startikey optimization. _bt_set_startikey is now feature complete; there's no remaining kind of nbtree scan key that it still doesn't support. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAH2-WznL6Z3H_GTQze9d8T_Ls=cYbnd-_9f-Jo7aYgTGRUD58g@mail.gmail.com	2025-09-15 16:56:49 -04:00
Peter Eisentraut	ce71993ae4	Expand virtual generated columns in constraint expressions Virtual generated columns in constraint expressions need to be expanded because the optimizer matches these expressions to qual clauses. Failing to do so can cause us to miss opportunities for constraint exclusion. Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/204804c0-798f-4c72-bd1f-36116024fda3%40eisentraut.org	2025-09-15 16:27:50 +02:00
Peter Eisentraut	9ec0b29976	CREATE STATISTICS: improve misleading error message The previous change (commit `f225473cba`) was still not on target, because it talked about relation kinds, which are not what is being checked here. Provide a more accurate message. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxEZ48toGH0Em_6vdsT57Y3L8pLF=DZCQ_gCii6=C3MeXw@mail.gmail.com	2025-09-15 11:43:34 +02:00
Peter Eisentraut	4bd9191298	Change fmgr.h typedefs to use original names fmgr.h defined some types such as fmNodePtr which is just Node *, but it made its own types to avoid having to include various header files. With C11, we can now instead typedef the original names without fear of conflicts. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/10d32190-f31b-40a5-b177-11db55597355@eisentraut.org	2025-09-15 11:04:10 +02:00
Peter Eisentraut	dc41d7415f	Remove hbaPort type This was just a workaround to avoid including the header file that defines the Port type. With C11, we can now just re-define the Port type without the possibility of a conflict. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/10d32190-f31b-40a5-b177-11db55597355@eisentraut.org	2025-09-15 11:04:10 +02:00
Amit Kapila	0d48d393d4	Resume conflict-relevant data retention automatically. This commit resumes automatic retention of conflict-relevant data for a subscription. Previously, retention would stop if the apply process failed to advance its xmin (oldest_nonremovable_xid) within the configured max_retention_duration, and the user needed to manually re-enable the retain_dead_tuples option. With this change, retention will resume automatically once the apply worker catches up and begins advancing its xmin (oldest_nonremovable_xid) within the configured threshold. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com	2025-09-15 08:46:55 +00:00
Peter Eisentraut	282d0bdee6	jit: fix build with LLVM-21 LLVM-21 renamed llvm::GlobalValue::getGUID() to getGUIDAssumingExternalLinkage(), so add a version guard. Author: Holger Hoffstätte <holger@applied-asynchrony.com> Discussion: https://www.postgresql.org/message-id/flat/d25e6e4a-d1b4-84d3-2f8a-6c45b975f53d%40applied-asynchrony.com	2025-09-15 08:31:11 +02:00
Peter Eisentraut	748caa9dcb	Some stylistic improvements in toast_save_datum() Move some variables to a smaller scope. Initialize chunk_data before storing a pointer to it; this avoids compiler warnings on clang-21, or respectively us having to work around it by initializing it to zero before the variable is used (as was done in commit `e92677e863`). Discussion: https://www.postgresql.org/message-id/flat/6604ad6e-5934-43ac-8590-15113d6ae4b1%40eisentraut.org	2025-09-15 07:43:23 +02:00
Peter Eisentraut	bf5da5d6ca	Hide duplicate names from extension views If extensions of equal names were installed in different directories in the path, the views pg_available_extensions and pg_available_extension_versions would show all of them, even though only the first one was actually reachable by CREATE EXTENSION. To fix, have those views skip extensions found later in the path if they have names already found earlier. Also add a bit of documentation that only the first extension in the path can be used. Reported-by: Pierrick <pierrick.chovelon@dalibo.com> Discussion: https://www.postgresql.org/message-id/flat/8f5a0517-1cb8-4085-ae89-77e7454e27ba%40dalibo.com	2025-09-15 07:30:31 +02:00
Peter Geoghegan	454c046094	nbtree: Always set skipScan flag on rescan. The TimescaleDB extension expects to be able to change an nbtree scan's keys across rescans. The issue arises in the extension's implementation of loose index scan. This is arguably a misuse of the index AM API, though apparently it worked until recently. It stopped working when the skipScan flag was added to BTScanOpaqueData by commit `8a510275`, though. The flag wouldn't reliably track whether the scan (actually, the current rescan) has any skip arrays, leading to confusion in _bt_set_startikey. nbtree preprocessing will now defensively initialize the scan's skipScan flag in all cases, including the case where _bt_preprocess_array_keys returns early due to the (re)scan not using arrays. While nbtree isn't obligated to support this use case (at least not according to my reading of the index AM API), it still seems like a good idea to be consistent here, on general robustness grounds. Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Natalya Aksman <natalya@timescale.com> Discussion: https://postgr.es/m/CAJumhcirfMojbk20+W0YimbNDkwdECvJprQGQ-XqK--ph09nQw@mail.gmail.com Backpatch-through: 18	2025-09-13 21:01:33 -04:00
Tom Lane	cdf7feb965	Amend recent fix for SIMILAR TO regex conversion. Commit `e3ffc3e91` fixed the translation of character classes in SIMILAR TO regular expressions. Unfortunately the fix broke a corner case: if there is an escape character right after the opening bracket (for example in "[\q]"), a closing bracket right after the escape sequence would not be seen as closing the character class. There were two more oversights: a backslash or a nested opening bracket right at the beginning of a character class should remove the special meaning from any following caret or closing bracket. This bug suggests that this code needs to be more readable, so also rename the variables "charclass_depth" and "charclass_start" to something more meaningful, rewrite an "if" cascade to be more consistent, and improve the commentary. Reported-by: Dominique Devienne <ddevienne@gmail.com> Reported-by: Stephan Springl <springl-psql@bfw-online.de> Author: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAFCRh-8NwJd0jq6P=R3qhHyqU7hw0BTor3W0SvUcii24et+zAw@mail.gmail.com Backpatch-through: 13	2025-09-13 16:55:51 -04:00
Nathan Bossart	7e9c216b52	Re-pgindent nbtpreprocesskeys.c after commit `796962922e`. Backpatch-through: 18	2025-09-13 14:50:02 -05:00
Tom Lane	9a71989a8f	Reject "ALTER DATABASE/USER ... RESET foo" with invalid GUC name. If the database or user had no entry in pg_db_role_setting, RESET silently did nothing --- including not checking the validity of the given GUC name. This is quite inconsistent and surprising, because you would get such an error if there were any pg_db_role_setting entry, even though it contains values for unrelated GUCs. While this is clearly a bug, changing it in stable branches seems unwise. The effect will be that some ALTER commands that formerly were no-ops will now be errors, and people don't like that sort of thing in minor releases. Author: Vitaly Davydov <v.davydov@postgrespro.ru> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/30783e-68c28a00-9-41004480@130449754	2025-09-12 18:10:11 -04:00
Tom Lane	f14ea34d6e	Fix oversights in pg_event_trigger_dropped_objects() fixes. Commit `a0b99fc12` caused pg_event_trigger_dropped_objects() to not fill the object_name field for schemas, which it should have; and caused it to fill the object_name field for default values, which it should not have. In addition, triggers and RLS policies really should behave the same way as we're making column defaults do; that is, they should have is_temporary = true if they belong to a temporary table. Fix those things, and upgrade event_trigger.sql's woefully inadequate test coverage of these secondary output columns. As before, back-patch only to v15. Reported-by: Sergey Shinderuk <s.shinderuk@postgrespro.ru> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/bd7b4651-1c26-4d30-832b-f942fabcb145@postgrespro.ru Backpatch-through: 15	2025-09-12 17:43:17 -04:00
Peter Geoghegan	796962922e	Always commute strategy when preprocessing DESC keys. A recently added nbtree preprocessing step failed to account for the fact that DESC columns already had their B-Tree strategy number commuted at this point in preprocessing. As a result, preprocessing could output a set of scan keys where one or more keys had the correct strategy number, but used the wrong comparison routine. To fix, make the faulty code path that looks up a more restrictive replacement operator/comparison routine commute its requested inequality strategy (while outputting the transformed strategy number as before). This makes the final transformed scan key comport with the approach preprocessing has always used to deal with DESC columns (which is described by comments above _bt_fix_scankey_strategy). Oversight in commit `b3f1a13f`, which made nbtree preprocessing perform transformations on skip array inequalities that can reduce the total number of index searches. Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Natalya Aksman <natalya@timescale.com> Discussion: https://postgr.es/m/19049-b7df801e71de41b2@postgresql.org Backpatch-through: 18	2025-09-12 13:23:00 -04:00
Álvaro Herrera	7dcea51c2a	Avoid unexpected changes of CurrentResourceOwner and CurrentMemoryContext Users of logical decoding can encounter an unexpected change of CurrentResourceOwner and CurrentMemoryContext. The problem is that, unlike other call sites of RollbackAndReleaseCurrentSubTransaction(), in reorderbuffer.c we fail to restore the original values of these global variables after being clobbered by subtransaction abort. This patch saves the values prior to the call and restores them eventually. In addition, logical.c and logicalfuncs.c had a hack to restore resource owner, presumably because of lack of this restore. Remove that. Instead, because the test coverage here is not very consistent, add an Assert() to ensure that the resowner is kept identical; this would make it easy to detect other cases of bugs where we fail to restore resowner properly. This could be removed later. This is arguably an old bug, but there appears to be no reason to backpatch it and it's risky to do so, so refrain for now. Author: Antonin Houska <ah@cybertec.at> Reported-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Discussion: https://postgr.es/m/119497.1756892972@localhost	2025-09-12 18:47:25 +02:00
Peter Eisentraut	ae0e1be9f2	Allow redeclaration of typedef yyscan_t This is allowed in C11, so we don't need the workaround guards against it anymore. This effectively reverts commit `382092a0cd` that put these guards in place. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/10d32190-f31b-40a5-b177-11db55597355@eisentraut.org	2025-09-12 08:16:00 +02:00
Peter Eisentraut	2aac62be8c	Default to log_lock_waits=on If someone is stuck behind a lock for more than a second, that is almost always a problem that is worth a log entry. Author: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-By: Michael Banck <mbanck@gmx.net> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Christoph Berg <myon@debian.org> Reviewed-By: Stephen Frost <sfrost@snowman.net> Discussion: https://postgr.es/m/b8b8502915e50f44deb111bc0b43a99e2733e117.camel%40cybertec.at	2025-09-12 07:57:06 +02:00
Peter Eisentraut	25f36066dd	Remove traces of support for Sun Studio compiler Per discussion, this compiler suite is no longer maintained, and it has not been able to compile PostgreSQL since at least PostgreSQL 17. This removes all the remaining support code for this compiler. Note that the Solaris operating system continues to be supported, but using GCC as the compiler. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/a0f817ee-fb86-483a-8a14-b6f7f5991b6e%40eisentraut.org	2025-09-12 07:39:05 +02:00
Peter Eisentraut	e92677e863	Silence compiler warnings on clang 21 Clang 21 shows some new compiler warnings, for example: warning: variable 'dstsize' is uninitialized when passed as a const pointer argument here [-Wuninitialized-const-pointer] The fix is to initialize the variables when they are defined. This is similar to, for example, the existing situation in gistKeyIsEQ(). Discussion: https://www.postgresql.org/message-id/flat/6604ad6e-5934-43ac-8590-15113d6ae4b1%40eisentraut.org	2025-09-12 07:28:32 +02:00
Richard Guo	2d756ebbe8	Fix misuse of Relids for storing attribute numbers The typedef Relids (Bitmapset *) is intended to represent sets of relation identifiers, but was incorrectly used in several places to store sets of attribute numbers. This is my oversight in `e2debb643`. Fix that by replacing such usages with Bitmapset * to reflect the correct semantics. Author: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAEG8a3LJhp_xriXf39iCz0TsK+M-2biuhDhpLC6Baxw8+ZYT3A@mail.gmail.com	2025-09-12 11:12:19 +09:00
Michael Paquier	528dadf691	Add more information for WAL records of hash index AMs hashdesc.c was missing a couple of fields in its record descriptions, as of: - is_prev_bucket_same_wrt for SQUEEZE_PAGE. - procid for INIT_META_PAGE. - old_bucket_flag and new_bucket_flag for SPLIT_ALLOCATE_PAGE. The author has noted the first hole, and I have spotted the others while double-checking this area of the code. Note that the only data missing now are the offsets stored in VACUUM_ONE_PAGE. We could perhaps add them, if somebody sees value in this data, even if it makes the output larger. These are discarded here. Author: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/CALdSSPjc-OVwtZH0Xrkvg7n=2ZwdbMJzqrm_ed_CfjiAzuKVGg@mail.gmail.com	2025-09-12 10:29:02 +09:00
Nathan Bossart	ed1aad15e0	Move named LWLock tranche requests to shared memory. In EXEC_BACKEND builds, GetNamedLWLockTranche() can segfault when called outside of the postmaster process, as it might access NamedLWLockTrancheRequestArray, which won't be initialized. Given the lack of reports, this is apparently unusual, presumably because it is usually called from a shmem_startup_hook like this: mystruct = ShmemInitStruct(..., &found); if (!found) { mystruct->locks = GetNamedLWLockTranche(...); ... } This genre of shmem_startup_hook evades the aforementioned segfaults because the struct is initialized in the postmaster, so all other callers skip the !found path. We considered modifying the documentation or requiring GetNamedLWLockTranche() to be called from the postmaster, but ultimately we decided to simply move the request array to shared memory (and add it to the BackendParameters struct), thereby allowing calls outside postmaster on all platforms. Since the main shared memory segment is initialized after accepting LWLock tranche requests, postmaster builds the request array in local memory first and then copies it to shared memory later. Given the lack of reports, back-patching seems unnecessary. Reported-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0v1_15QPg5Sqd2Qz5rh_qcsyCeHHmRDY89xVHcy2yt5BQ%40mail.gmail.com	2025-09-11 16:13:55 -05:00
Tom Lane	a0b99fc122	Report the correct is_temporary flag for column defaults. pg_event_trigger_dropped_objects() would report a column default object with is_temporary = false, even if it belongs to a temporary table. This seems clearly wrong, so adjust it to report the table's temp-ness. While here, refactor EventTriggerSQLDropAddObject to make its handling of namespace objects less messy and avoid duplication of the schema-lookup code. And add some explicit test coverage of dropped-object reports for dependencies of temp tables. Back-patch to v15. The bug exists further back, but the GetAttrDefaultColumnAddress function this patch depends on does not, and it doesn't seem worth adjusting it to cope with the older code. Author: Antoine Violin <violin.antuan@gmail.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAFjUV9x3-hv0gihf+CtUc-1it0hh7Skp9iYFhMS7FJjtAeAptA@mail.gmail.com Backpatch-through: 15	2025-09-11 17:11:57 -04:00
Peter Eisentraut	368c38dd47	Remove stray semicolon at global scope The Sun Studio compiler complains about an empty declaration here. Note for future historians: This does not mean that this compiler is still of current interest for anyone using PostgreSQL. But we can let this small fix be its parting gift. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/a0f817ee-fb86-483a-8a14-b6f7f5991b6e%40eisentraut.org	2025-09-11 12:03:15 +02:00
Tom Lane	09036dc71c	Avoid faulty alignment of Datums in build_sorted_items(). If sizeof(Pointer) is 4 then sizeof(SortItem) will be 12, so that if data->numrows is odd then we placed the values array at a location that's not a multiple of 8. That was fine when sizeof(Datum) was also 4, but in the wake of commit `2a600a93c` it makes some alignment-picky machines unhappy. (You need a 32-bit machine that nonetheless expects 8-byte alignment of 8-byte quantities, which is an odd-seeming combination but it does exist outside the Intel universe.) To fix, MAXALIGN the space allocated to the SortItem array. In passing, let's make the "len" variable be Size not int, just for paranoia's sake. This code was arguably not too safe even before `2a600a93c`, but at present I don't see a strong argument for back-patching. Reported-by: Tomas Vondra <tomas@vondra.me> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/87036018-8d70-40ad-a0ac-192b07bd7b04@vondra.me	2025-09-10 17:51:24 -04:00
Tom Lane	bdc6cfcd12	Eliminate duplicative hashtempcxt in nodeSubplan.c. Instead of building a separate memory context that's used just for running hash functions, make the hash functions run in the per-tuple context of the node's innerecontext. This saves a little space at runtime, and it avoids needing to reset two contexts instead of one inside buildSubPlanHash's main loop. This largely reverts commit `133924e13`. That's safe to do now because `bf6c614a2` decoupled the evaluation context used by TupleHashTableMatch from that used for hash function evaluation, so that there's no longer a risk of resetting the innerecontext too soon. Per discussion of bug #19040, although this is not directly a fix for that. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Haiyang Li <mohen.lhy@alibaba-inc.com> Reviewed-by: Fei Changhong <feichanghong@qq.com> Discussion: https://postgr.es/m/19040-c9b6073ef814f48c@postgresql.org	2025-09-10 16:15:08 -04:00
Tom Lane	abdeacdb09	Fix memory leakage in nodeSubplan.c. If the hash functions used for hashing tuples leaked any memory, we failed to clean that up, resulting in query-lifespan memory leakage in queries using hashed subplans. One way that could happen is if the values being hashed require de-toasting, since most of our hash functions don't trouble to clean up de-toasted inputs. Prior to commit `bf6c614a2`, this leakage was largely masked because TupleHashTableMatch would reset hashtable->tempcxt (via execTuplesMatch). But it doesn't do that anymore, and that's not really the right place for this anyway: doing it there could reset the tempcxt many times per hash lookup, or not at all. Instead put reset calls into ExecHashSubPlan and buildSubPlanHash. Along the way to that, rearrange ExecHashSubPlan so that there's just one place to call MemoryContextReset instead of several. This amounts to accepting the de-facto API spec that the caller of the TupleHashTable routines is responsible for resetting the tempcxt adequately often. Although the other callers seem to get this right, it was not documented anywhere, so add a comment about it. Bug: #19040 Reported-by: Haiyang Li <mohen.lhy@alibaba-inc.com> Author: Haiyang Li <mohen.lhy@alibaba-inc.com> Reviewed-by: Fei Changhong <feichanghong@qq.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19040-c9b6073ef814f48c@postgresql.org Backpatch-through: 13	2025-09-10 16:05:03 -04:00
Nathan Bossart	9016fa7e3b	meson: Build numeric.c with -ftree-vectorize. autoconf builds have compiled this file with -ftree-vectorize since commit `8870917623`, but meson builds seem to have missed the memo. Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/aL85CeasM51-0D1h%40nathan Backpatch-through: 16	2025-09-10 11:21:12 -05:00
Peter Eisentraut	33eec80940	Fix CREATE TABLE LIKE with not-valid check constraint In CREATE TABLE ... LIKE, any check constraints copied from the source table should be set to valid if they are ENFORCED (the default). Bug introduced in commit `ca87c415e2`. Author: jian he <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/CACJufxH%3D%2Bod8Wy0P4L3_GpapNwLUP3oAes5UFRJ7yTxrM_M5kg%40mail.gmail.com	2025-09-10 13:25:58 +02:00
Michael Paquier	e6da68a6e1	Remove dynahash.h All the callers of my_log2() are now limited inside dynahash.c, so let's remove this header. The same capability is provided by pg_bitutils.h already. Discussion: https://postgr.es/m/CAEZATCUJPQD_7sC-wErak2CQGNa6bj2hY-mr8wsBki=kX7f2_A@mail.gmail.com	2025-09-10 14:11:50 +09:00
Michael Paquier	b1187266e0	Replace callers of dynahash.h's my_log2() by equivalent in pg_bitutils.h All the calls replaced by this commit use 4-byte integers for their variables used in input of my_log2(). Hence, the limit against too-large inputs does not really apply. Thresholds are also applied, as of: - In nodeAgg.c, the number of partitions is limited by HASHAGG_MAX_PARTITIONS. - In nodeHash.c, ExecChooseHashTableSize() caps its maximum number of buckets based on HashJoinTuple and palloc() allocation limit. - In worker.c, the number of subxacts tracked by ApplySubXactData uses uint32, making pg_ceil_log2_64() safe to use directly. Several approaches have been discussed, like an integration with thresholds in pg_bitutils.h, but it was found confusing. This uses Dean's idea, which gives a simpler result than what I came up with to be able to remove dynahash.h. dynahash.h will be removed in a follow-up commit, removing some duplication with the ceil log2 routines. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAEZATCUJPQD_7sC-wErak2CQGNa6bj2hY-mr8wsBki=kX7f2_A@mail.gmail.com	2025-09-10 11:20:46 +09:00
Michael Paquier	8c8f7b199d	Fix leak with SMgrRelations in startup process The startup process does not process shared invalidation messages, only sending them, and never calls AtEOXact_SMgr(), which cleans up any unpinned SMgrRelations. Hence, it is never able to free SMgrRelations on a periodic basis, bloating its hashtable over time. Like the checkpointer and the bgwriter, this commit takes a conservative approach by periodically freeing SMgrRelations when replaying a checkpoint record, either online or shutdown, so that the startup process has a way to perform a periodic cleanup. Issue caused by `21d9c3ee4e`, so backpatch down to v17. Author: Jingtang Zhang <mrdrivingduck@gmail.com> Reviewed-by: Yuhang Qiu <iamqyh@gmail.com> Discussion: https://postgr.es/m/28C687D4-F335-417E-B06C-6612A0BD5A10@gmail.com Backpatch-through: 17	2025-09-10 07:23:05 +09:00
Peter Eisentraut	81a61fde84	Fix typo in comment Author: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/CAK98qZ0whQ%3Dc%2BJGXbGSEBxCtLgy6sf-YGYqsKTAGsS-wt0wj%2BA%40mail.gmail.com	2025-09-09 15:33:46 +02:00
Dean Rasheed	faf071b553	Add date and timestamp variants of random(min, max). This adds 3 new variants of the random() function: random(min date, max date) returns date random(min timestamp, max timestamp) returns timestamp random(min timestamptz, max timestamptz) returns timestamptz Each returns a random value x in the range min <= x <= max. Author: Damien Clochard <damien@dalibo.info> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Vik Fearing <vik@postgresfriends.org> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/f524d8cab5914613d9e624d9ce177d3d@dalibo.info	2025-09-09 10:39:30 +01:00
Amit Kapila	5ac3c1ac22	Fix Coverity issue reported in commit `a850be2fe`. Address a potential SIGSEGV that may occur when the tablesync worker attempts to locate a deleted row while applying changes. This situation arises during conflict detection for update-deleted scenarios. To prevent this crash, ensure that the operation is errored out early if the leader apply worker is unavailable. Since the leader worker maintains the necessary conflict detection metadata, proceeding without it serves no purpose and risks reporting an incorrect conflict type. In passing, improve a nearby comment. Reported by Tom Lane as per Coverity Author: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/334468.1757280992@sss.pgh.pa.us	2025-09-09 03:18:22 +00:00
Melanie Plageman	8ec97e78a7	Add error codes when vacuum discovers VM corruption Commit `fd6ec93bf8` and other previous work established the principle that when an error is potentially reachable in case of on-disk corruption but is not expected to be reached otherwise, ERRCODE_DATA_CORRUPTED should be used. This allows log monitoring software to search for evidence of corruption by filtering on the error code. Enhance the existing log messages emitted when the heap page is found to be inconsistent with the VM by adding this error code. Suggested-by: Andrey Borodin <x4mmm@yandex-team.ru> Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/87DD95AA-274F-4F4F-BAD9-7738E5B1F905%40yandex-team.ru	2025-09-08 17:13:31 -04:00
Jeff Davis	9af672bcb2	meson: build checksums with extra optimization flags. Use -funroll-loops and -ftree-vectorize when building checksum.c to match what autoconf does. Discussion: https://postgr.es/m/a81f2f7ef34afc24a89c613671ea017e3651329c.camel@j-davis.com Reviewed-by: Andres Freund <andres@anarazel.de>	2025-09-08 12:29:42 -07:00
Nathan Bossart	3bcfcd815e	pg_upgrade: Transfer pg_largeobject_metadata's files when possible. Commit `161a3e8b68` taught pg_upgrade to use COPY for large object metadata for upgrades from v12 and newer, which is much faster to restore than the proper large object commands. For upgrades from v16 and newer, we can take this a step further and transfer the large object metadata files as if they were user tables. We can't transfer the files from older versions because the aclitem data type (needed by pg_largeobject_metadata.lomacl) changed its storage format in v16 (see commit `7b378237aa`). Note that this commit is essentially a revert of commit `12a53c732c`. There are a couple of caveats. First, we still need to COPY the corresponding pg_shdepend rows for large objects. Second, we need to COPY anything in pg_largeobject_metadata with a comment or security label, else restoring those will fail. This means that an upgrade in which every large object has a comment or security label won't gain anything from this commit, but it should at least avoid making those unusual use-cases any worse. pg_upgrade must also take care to transfer the relfilenodes of pg_largeobject_metadata and its index, as was done for pg_largeobject in commits `d498e052b4` and `bbe08b8869`. Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aJ3_Gih_XW1_O2HF%40nathan	2025-09-08 14:19:48 -05:00
Robert Haas	5a170e992a	Don't generate fake "TLOCRN" or "TROCRN" aliases, either. This is just like the previous two commits, except that this fix actually doesn't change any regression test outputs. Author: Robert Haas <rhaas@postgresql.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CA+TgmoYSYmDA2GvanzPMci084n+mVucv0bJ0HPbs6uhmMN6HMg@mail.gmail.com	2025-09-08 12:58:07 -04:00
Robert Haas	6f79024df3	Don't generate fake "ANY_subquery" aliases, either. This is just like the previous commit, but for a different invented alias name. Author: Robert Haas <rhaas@postgresql.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CA+TgmoYSYmDA2GvanzPMci084n+mVucv0bJ0HPbs6uhmMN6HMg@mail.gmail.com	2025-09-08 12:24:02 -04:00
Robert Haas	585e31fcb6	Don't generate fake "SELECT" or "SELECT %d" subquery aliases. rte->alias should point only to a user-written alias, but in these cases that principle was violated. Fixing this causes some regression test output changes: wherever rte->alias previously had a value and is now NULL, rte->eref is now set to a generated name rather than to rte->alias; and the scheme used to generate eref names differs from what we were doing for aliases. The upshot is that instead of "SELECT" or "SELECT %d", EXPLAIN will now emit "unnamed_subquery" or "unnamed_subquery_%d". But that's a reasonable descriptor, and we were already producing that in yet other cases, so this seems not too objectionable. Author: Tom Lane <tgl@sss.pgh.pa.us> Co-authored-by: Robert Haas <rhaas@postgresql.org> Discussion: https://postgr.es/m/CA+TgmoYSYmDA2GvanzPMci084n+mVucv0bJ0HPbs6uhmMN6HMg@mail.gmail.com	2025-09-08 11:50:33 -04:00
Melanie Plageman	3399c26554	Remove unneeded VM pin from VM replay Previously, heap_xlog_visible() called visibilitymap_pin() even after getting a buffer from XLogReadBufferForRedoExtended() -- which returns a pinned buffer containing the specified block of the visibility map. This would just have resulted in visibilitymap_pin() returning early since the specified page was already present and pinned, but it was confusing extraneous code, so remove it. It doesn't seem worth backporting, though. It appears to be an oversight in `2c03216`. While we are at it, remove two VM-related redundant asserts in the COPY FREEZE code path. visibilitymap_set() already asserts that PD_ALL_VISIBLE is set on the heap page and checks that the vmbuffer contains the bits corresponding to the specified heap block, so callers do not also need to check this. Author: Melanie Plageman <melanieplageman@gmail.com> Reported-by: Melanie Plageman <melanieplageman@gmail.com> Reported-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CALdSSPhu7WZd%2BEfQDha1nz%3DDC93OtY1%3DUFEdWwSZsASka_2eRQ%40mail.gmail.com	2025-09-08 10:22:42 -04:00
Amit Kapila	6456c6e2c4	Add test to prevent premature removal of conflict-relevant data. A test has been added to ensure that conflict-relevant data is not prematurely removed when a concurrent prepared transaction is being committed on the publisher. This test introduces an injection point that simulates the presence of a prepared transaction in the commit phase, validating that the system correctly delays conflict slot advancement until the transaction is fully committed. Additionally, the test serves as a safeguard for developers, ensuring that the acquisition of the commit timestamp does not occur before marking DELAY_CHKPT_IN_COMMIT in RecordTransactionCommitPrepared. Reported-by: Robert Haas <robertmhaas@gmail.com> Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OS9PR01MB16913F67856B0DA2A909788129400A@OS9PR01MB16913.jpnprd01.prod.outlook.com	2025-09-08 12:06:03 +00:00
Michael Paquier	8191e0c16a	Fix corruption of pgstats shared hashtable due to OOM failures A new pgstats entry is created as a two-step process: - The entry is looked up in the shared hashtable of pgstats, and is inserted if not found. - When not found and inserted, its fields are then initialized. This part includes a DSA chunk allocation for the stats data of the new entry. As currently coded, if the DSA chunk allocation fails due to an out-of-memory failure, an ERROR is generated, leaving in the pgstats shared hashtable an inconsistent entry due to the first step, as the entry has already been inserted in the hashtable. These broken entries can then be found by other backends, crashing them. There are only two callers of pgstat_init_entry(), when loading the pgstats file at startup and when creating a new pgstats entry. This commit changes pgstat_init_entry() so that we use dsa_allocate_extended() with DSA_ALLOC_NO_OOM, making it return NULL on allocation failure instead of failing. This way, a backend failing an entry creation can take appropriate cleanup actions in the shared hashtable before throwing an error. Currently, this means removing the entry from the shared hashtable before throwing the error for the allocation failure. Out-of-memory errors are unlikely to happen in the wild, and we do not bother with back-patches when these are fixed, usually. However, the problem dealt with here is a degree worse as it breaks the shared memory state of pgstats, impacting other processes that may look at an inconsistent entry that a different process has failed to create. Author: Mikhail Kot <mikhail.kot@databricks.com> Discussion: https://postgr.es/m/CAAi9E7jELo5_-sBENftnc2E8XhW2PKZJWfTC3i2y-GMQd2bcqQ@mail.gmail.com Backpatch-through: 15	2025-09-08 15:52:23 +09:00
Amit Kapila	1f7e9ba3ac	Post-commit review fixes for `228c370868`. This commit fixes three issues: 1) When a disabled subscription is created with retain_dead_tuples set to true, the launcher is not woken up immediately, which may lead to delays in creating the conflict detection slot. Creating the conflict detection slot is essential even when the subscription is not enabled. This ensures that dead tuples are retained, which is necessary for accurately identifying the type of conflict during replication. 2) Conflict-related data was unnecessarily retained when the subscription does not have a table. 3) Conflict-relevant data could be prematurely removed before applying prepared transactions on the publisher that are in the commit critical section. This issue occurred because the backend executing COMMIT PREPARED was not accounted for during the computation of oldestXid in the commit phase on the publisher. As a result, the subscriber could advance the conflict slot's xmin without waiting for such COMMIT PREPARED transactions to complete. We fixed this issue by identifying prepared transactions that are in the commit critical section during computation of oldestXid in commit phase. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OS9PR01MB16913DACB64E5721872AA5C02943BA@OS9PR01MB16913.jpnprd01.prod.outlook.com Discussion: https://postgr.es/m/OS9PR01MB16913F67856B0DA2A909788129400A@OS9PR01MB16913.jpnprd01.prod.outlook.com	2025-09-08 06:10:15 +00:00
Michael Paquier	43eb2c5419	Update parser README to include parse_jsontable.c The README was missing parse_jsontable.c which handles JSON_TABLE. Oversight in `de3600452b`. Author: Karthik S <karthikselvaam@gmail.com> Discussion: https://postgr.es/m/CAK4gQD9gdcj+vq_FZGp=Rv-W+41v8_C7cmCUmDeu=cfrOdfXEw@mail.gmail.com Backpatch-through: 17	2025-09-08 10:07:14 +09:00
Tatsuo Ishii	06473f5a34	Allow logging of the raw parse tree. This commit allows the raw parse tree to be logged in the same way we currently log the parse tree, rewritten tree, and plan tree. To avoid unnecessary log noise for users not interested in this detail, a new GUC option, "debug_print_raw_parse", has been added. When starting the PostgreSQL process with "-d N", and N is 3 or higher, debug_print_raw_parse is enabled automatically, alongside debug_print_parse. Author: Chao Li <lic@highgo.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Tatsuo Ishii <ishii@postgresql.org> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Discussion: https://postgr.es/m/CAEoWx2mcO0Gpo4vd8kPMAFWeJLSp0MeUUnaLdE1x0tSVd-VzUw%40mail.gmail.com	2025-09-06 07:49:51 +09:00
Andres Freund	2c78940527	bufmgr: Remove freelist, always use clock-sweep This set of changes removes the list of available buffers and instead simply uses the clock-sweep algorithm to find and return an available buffer. This also removes the have_free_buffer() function and simply caps the pg_autoprewarm process to at most NBuffers. While on the surface this appears to be removing an optimization it is in fact eliminating code that induces overhead in the form of synchronization that is problematic for multi-core systems. The main reason for removing the freelist, however, is not the moderate improvement in scalability, but that having the freelist would require dedicated complexity in several upcoming patches. As we have not been able to find a case benefiting from the freelist... Author: Greg Burd <greg@burd.me> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/70C6A5B5-2A20-4D0B-BC73-EB09DD62D61C@getmailspring.com	2025-09-05 12:25:59 -04:00
Andres Freund	50e4c6ace5	bufmgr: Use consistent naming of the clock-sweep algorithm Minor edits to comments only. Author: Greg Burd <greg@burd.me> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/70C6A5B5-2A20-4D0B-BC73-EB09DD62D61C@getmailspring.com	2025-09-05 12:25:59 -04:00
Melanie Plageman	e3d5ddb7ca	Add assert and log message to visibilitymap_set Add an assert to visibilitymap_set() that the provided heap buffer is exclusively locked, which is expected. Also, enhance the debug logging message to specify which VM flags were set. Based on a related suggestion by Kirill Reshke on an in-progress patchset. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CALdSSPhAU56g1gGVT0%2BwG8RrSWE6qW8TOfNJS1HNAWX6wPgbFA%40mail.gmail.com	2025-09-05 09:33:36 -04:00
Dean Rasheed	6ede13d1b5	Fix concurrent update issue with MERGE. When executing a MERGE UPDATE action, if there is more than one concurrent update of the target row, the lock-and-retry code would sometimes incorrectly identify the latest version of the target tuple, leading to incorrect results. This was caused by using the ctid field from the TM_FailureData returned by table_tuple_lock() in a case where the result was TM_Ok, which is unsafe because the TM_FailureData struct is not guaranteed to be fully populated in that case. Instead, it should use the tupleid passed to (and updated by) table_tuple_lock(). To reduce the chances of similar errors in the future, improve the commentary for table_tuple_lock() and TM_FailureData to make it clearer that table_tuple_lock() updates the tid passed to it, and most fields of TM_FailureData should not be relied on in non-failure cases. An exception to this is the "traversed" field, which is set in both success and failure cases. Reported-by: Dmitry <dsy.075@yandex.ru> Author: Yugo Nagata <nagata@sraoss.co.jp> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/1570d30e-2b95-4239-b9c3-f7bf2f2f8556@yandex.ru Backpatch-through: 15	2025-09-05 08:18:18 +01:00
Michael Paquier	567d27e8e2	Fix outdated comments in slru.c SlruRecentlyUsed() is an inline function since `53c2a97a92`, not a macro. The description of long_segment_names was missing at the top of SimpleLruInit(), part forgotten in `4ed8f0913b`. Author: Julien Rouhaud <rjuju123@gmail.com> Discussion: https://postgr.es/m/aLpBLMOYwEQkaleF@jrouhaud Backpatch-through: 17	2025-09-05 14:10:08 +09:00
Michael Paquier	4246a977ba	Switch some numeric-related functions to use soft error reporting This commit changes some functions related to the data type numeric to use the soft error reporting rather than a custom boolean flag (called "have_error") that callers of these functions could rely on to bypass the generation of ERROR reports, letting the callers do their own error handling (timestamp, jsonpath and numeric_to_char() require them). This results in the removal of some boilerplate code that was required to handle both the ereport() and the "have_error" code paths bypassing ereport(), unifying everything under the soft error reporting facility. While at it, some duplicated error messages are removed. The functions upgraded in this commit were suffixed with "_opt_error" in their names. They are renamed to "_safe" instead. This change relies on `d9f7f5d32f`, which introduced the soft error reporting infrastructure. Author: Amul Sul <sulamul@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAAJ_b96No5h5tRuR+KhcC44YcYUCw8WAHuLoqqyyop8_k3+JDQ@mail.gmail.com	2025-09-05 13:53:47 +09:00
Michael Paquier	ae45312008	Change pg_lsn_in_internal() to use soft error reporting pg_lsn includes pg_lsn_in_internal() for the purpose of parsing an LSN position for the GUC recovery_target_lsn (`21f428ebde`). It relies on a boolean called "have_error" that would be set when the LSN parsing fails, then lets its callers handle any errors. `d9f7f5d32f` has added support for soft error reporting. This commit removes some boilerplate code and switches the routine to use soft error reporting directly, giving the callers of pg_lsn_in_internal() the possibility to be fed the error message generated on failure. The pg_lsn_in_internal() routine is renamed to pg_lsn_in_safe(), for consistency with other similar routines that are given an escontext. Author: Amul Sul <sulamul@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAAJ_b96No5h5tRuR+KhcC44YcYUCw8WAHuLoqqyyop8_k3+JDQ@mail.gmail.com	2025-09-05 12:59:29 +09:00
Nathan Bossart	d814d7fc3d	Revert recent change to RequestNamedLWLockTranche(). Commit `38b602b028` modified this function to allocate enough space for MAX_NAMED_TRANCHES (256) requests, which is likely far more than most clusters need. This commit reverts that change so that it first allocates enough space for only 16 requests and resizes the array when necessary. While at it, remove the check for too many tranches from this function. We can now rely on InitializeLWLocks() to do that check via its calls to LWLockNewTrancheId() for the named tranches. Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/aLmzwC2dRbqk14y6%40nathan	2025-09-04 15:34:48 -05:00
Nathan Bossart	1129d3e4c8	Adjust commentary for WaitEventLWLock in wait_event_names.txt. In addition to changing a couple of references for clarity, this commit combines the two similar comments.	2025-09-04 10:18:42 -05:00
Dean Rasheed	fc6600fc1c	Fix replica identity check for MERGE. When executing a MERGE, check that the target relation supports all actions mentioned in the MERGE command. Specifically, check that it has a REPLICA IDENTITY if it publishes updates or deletes and the MERGE command contains update or delete actions. Failing to do this can silently break replication. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Tested-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/OS3PR01MB57180C87E43A679A730482DF94B62@OS3PR01MB5718.jpnprd01.prod.outlook.com Backpatch-through: 15	2025-09-04 11:45:44 +01:00
Dean Rasheed	5386bfb9c1	Fix replica identity check for INSERT ON CONFLICT DO UPDATE. If an INSERT has an ON CONFLICT DO UPDATE clause, the executor must check that the target relation supports UPDATE as well as INSERT. In particular, it must check that the target relation has a REPLICA IDENTITY if it publishes updates. Formerly, it was not doing this check, which could lead to silently breaking replication. Fix by adding such a check to CheckValidResultRel(), which requires adding a new onConflictAction argument. In back-branches, preserve ABI compatibility by introducing a wrapper function with the original signature. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Tested-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/OS3PR01MB57180C87E43A679A730482DF94B62@OS3PR01MB5718.jpnprd01.prod.outlook.com Backpatch-through: 13	2025-09-04 11:27:53 +01:00
Michael Paquier	09119238a1	Fix incorrect comment in pgstat_backend.c The counters saved from pgWalUsage, used for the difference calculations when flushing the backend WAL stats, are updated when calling pgstat_flush_backend() under PGSTAT_BACKEND_FLUSH_WAL, and not pgstat_report_wal(). The comment updated in this commit referenced the latter, but it is perfectly OK to flush the backend stats independently of the WAL stats. Noticed while looking at this area of the code, introduced by `76def4cdd7` as a copy-pasto. Backpatch-through: 18	2025-09-04 08:34:51 +09:00
Nathan Bossart	38b602b028	Move dynamically-allocated LWLock tranche names to shared memory. There are two ways for shared libraries to allocate their own LWLock tranches. One way is to call RequestNamedLWLockTranche() in a shmem_request_hook, which requires the library to be loaded via shared_preload_libraries. The other way is to call LWLockNewTrancheId(), which is not subject to the same restrictions. However, LWLockNewTrancheId() does require each backend to store the tranche's name in backend-local memory via LWLockRegisterTranche(). This API is a little cumbersome and leads to things like unhelpful pg_stat_activity.wait_event values in backends that haven't loaded the library. This commit moves these LWLock tranche names to shared memory, thus eliminating the need for each backend to call LWLockRegisterTranche(). Instead, the tranche name must be provided to LWLockNewTrancheId(), which immediately makes the name available to all backends. Since the tranche name array is append-only, lookups can ordinarily avoid locking as long as their local copy of the LWLock counter is greater than the requested tranche ID. One downside of this approach is that we now have a hard limit on both the length of tranche names (NAMEDATALEN-1 bytes) and the number of dynamically-allocated tranches (256). Besides a limit of NAMEDATALEN-1 bytes for tranche names registered via RequestNamedLWLockTranche(), no such limits previously existed. We could avoid these new limits by using dynamic shared memory, but the complexity involved didn't seem worth it. We briefly considered making the tranche limit user-configurable but ultimately decided against that, too. Since there is still a lot of time left in the v19 development cycle, it's possible we will revisit this choice. 
Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAA5RZ0vvED3naph8My8Szv6DL4AxOVK3eTPS0qXsaKi%3DbVdW2A%40mail.gmail.com	2025-09-03 13:57:48 -05:00
Peter Eisentraut	01d6e5b2cf	Fix mistake in new GUC tables source Commit `6359989654` had it so that the parameter "debug_discard_caches" did not exist unless DISCARD_CACHES_ENABLED was defined (typically via enabling asserts). This was a mistake, it did not correspond to the prior setup. Several tests use this parameter, so they were now failing if you did not have asserts enabled.	2025-09-03 11:48:35 +02:00
Peter Eisentraut	6359989654	Generate GUC tables from .dat file Store the information in guc_tables.c in a .dat file similar to the catalog data in src/include/catalog/, and generate a part of guc_tables.c from that. The goal is to make it easier to edit that information, and to be able to make changes to the downstream data structures more easily. (Essentially, those are the same reasons as for the original adoption of the .dat format.) Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: David E. Wheeler <david@justatheory.com> Discussion: https://www.postgresql.org/message-id/flat/dae6fe89-1e0c-4c3f-8d92-19d23374fb10%40eisentraut.org	2025-09-03 09:45:17 +02:00
Richard Guo	aba8f61c30	Fix planner error when estimating SubPlan cost SubPlan nodes are typically built very early, before any RelOptInfos have been constructed for the parent query level. As a result, the simple_rel_array in the parent root has not yet been initialized. Currently, during cost estimation of a SubPlan's testexpr, we may call examine_variable() to look up statistical data about the expressions. This can lead to "no relation entry for relid" errors. To fix, pass root as NULL to cost_qual_eval() in cost_subplan(), since the root does not yet contain enough information to safely consult statistics. One exception is SubPlan nodes built for the initplans of MIN/MAX aggregates from indexes. In this case, having a NULL root is safe because testexpr will be NULL. Additionally, an initplan will by definition not consult anything from the parent plan. Backpatch to all supported branches. Although the reported call path that triggers this error is not reachable prior to v17, there's no guarantee that other code paths -- especially in extensions -- could not encounter the same issue when cost_qual_eval() is called with a root that lacks a valid simple_rel_array. The test case is not included in pre-v17 branches though. Bug: #19037 Reported-by: Alexander Lakhin <exclusion@gmail.com> Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19037-3d1c7bb553c7ce84@postgresql.org Backpatch-through: 13	2025-09-03 16:00:38 +09:00
Amit Kapila	f2dbc83501	Fix use-after-free issue in slot synchronization. Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 18, where it was introduced Discussion: https://postgr.es/m/CANhcyEXMrcEdzj-RNGJam0nJHM4y+ttdWsgUCFmXciM7BNKc7A@mail.gmail.com	2025-09-03 06:31:05 +00:00
Michael Paquier	c6ea528b47	Update outdated references to the SLRU ControlLock SLRU bank locks are referred to as "bank locks" or "SLRU bank locks" in the code comments. The comments updated in this commit use the latter term. Oversight in `53c2a97a92`, which replaced the single ControlLock with the bank control locks. Author: Julien Rouhaud <julien.rouhaud@free.fr> Discussion: https://postgr.es/m/aLUT2UO8RjJOzZNq@jrouhaud Backpatch-through: 17	2025-09-03 10:20:28 +09:00
Fujii Masao	229911c4bf	Add HINT for COPY TO when WHERE clause is used. COPY TO does not support a WHERE clause, and currently fails with the error: ERROR: WHERE clause not allowed with COPY TO Since the intended behavior can be achieved by using COPY (SELECT ... WHERE ...) TO, this commit adds a HINT to the error message: HINT: Try the COPY (SELECT ... WHERE ...) TO variant. This makes the error more informative and helps users quickly find the alternative usage. Author: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Discussion: https://postgr.es/m/3520c224c5ffac0113aef84a9179f37e@oss.nttdata.com	2025-09-03 08:35:55 +09:00
Tom Lane	1b1960c8c9	Improve error message for duplicate labels when creating an enum type. Previously, duplicate labels in CREATE TYPE AS ENUM were caught by the unique index on pg_enum, resulting in a generic error message. While this was evidently intentional, it's not terribly user-friendly, nor consistent with the ALTER TYPE cases which take more care with such errors. This patch adds an explicit check to produce a more user-friendly and descriptive error message. A potential objection to this implementation is that it adds O(N^2) work to the creation operation. However, quick testing finds that that's pretty negligible below 1000 enum labels, and tolerable even at 10000. So it doesn't really seem worth being smarter. Author: Yugo Nagata <nagata@sraoss.co.jp> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/20250704000402.37e605ab0c59c300965a17ee@sraoss.co.jp	2025-09-02 13:50:56 -04:00
Michael Paquier	eccba079c2	Generate pgstat_count_slru*() functions for slru using macros This change replaces seven functions definitions by macros, reducing a bit some repetitive patterns in the code. An interesting side effect is that this removes an inconsistency in the naming of SLRU increment functions with the field names. This change is similar to `850f4b4c8c`, `8018ffbf58` or `83a1a1b566`. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aLHA//gr4dTpDHHC@ip-10-97-1-34.eu-west-3.compute.internal	2025-09-02 16:22:03 +09:00
Amit Kapila	a850be2fe6	Add max_retention_duration option to subscriptions. This commit introduces a new subscription parameter, max_retention_duration, aimed at mitigating excessive accumulation of dead tuples when retain_dead_tuples is enabled and the apply worker lags behind the publisher. When the time spent advancing a non-removable transaction ID exceeds the max_retention_duration threshold, the apply worker will stop retaining conflict detection information. In such cases, the conflict slot's xmin will be set to InvalidTransactionId, provided that all apply workers associated with the subscription (with retain_dead_tuples enabled) confirm the retention duration has been exceeded. To ensure retention status persists across server restarts, a new column subretentionactive has been added to the pg_subscription catalog. This prevents unnecessary reactivation of retention logic after a restart. The conflict detection slot will not be automatically re-initialized unless a new subscription is created with retain_dead_tuples = true, or the user manually re-enables retain_dead_tuples. A future patch will introduce support for automatic slot re-initialization once at least one apply worker confirms that the retention duration is within the configured max_retention_duration. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com	2025-09-02 03:20:18 +00:00
Richard Guo	317c117d6d	Fix const-simplification for constraints and stats Constraint expressions and statistics expressions loaded from the system catalogs need to be run through const-simplification, because the planner will be comparing them to similarly-processed qual clauses. Without this step, the planner may fail to detect valid matches. Currently, NullTest clauses in these expressions may not be reduced correctly during const-simplification. This happens because their Var nodes do not yet have the correct varno when eval_const_expressions is applied. Since eval_const_expressions relies on varno to reduce NullTest quals, incorrect varno can cause problems. Additionally, for statistics expressions, eval_const_expressions is called with root set to NULL, which also inhibits NullTest reduction. This patch fixes the issue by ensuring that Vars are updated to have the correct varno before const-simplification, and that a valid root is passed to eval_const_expressions when needed. Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/19007-4cc6e252ed8aa54a@postgresql.org	2025-08-31 08:59:48 +09:00
Nathan Bossart	5487058b56	Prepare DSM registry for upcoming changes to LWLock tranche names. A proposed patch would place a limit of NAMEDATALEN-1 (i.e., 63) bytes on the names of dynamically-allocated LWLock tranches, but GetNamedDSA() and GetNamedDSHash() may register tranches with longer names. This commit lowers the maximum DSM registry entry name length to NAMEDATALEN-1 bytes and modifies GetNamedDSHash() to create only one tranche, thereby allowing us to keep the DSM registry's tranche names below NAMEDATALEN bytes. Author: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/aKzIg1JryN1qhNuy%40nathan	2025-08-29 20:34:53 -05:00
Tom Lane	f727b63e81	Provide error context when an error is thrown within WaitOnLock(). Show the requested lock level and the object being waited on, in the same format we use for deadlock reports and similar errors. This is particularly helpful for debugging lock-timeout errors, since otherwise the user has very little to go on about which lock timed out. The performance cost of setting up the callback should be negligible compared to the other tracing support already present in WaitOnLock. As in the deadlock-report case, we just show numeric object OIDs, because it seems too scary to try to perform catalog lookups in this context. Reported-by: Steve Baldwin <steve.baldwin@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1602369.1752167154@sss.pgh.pa.us	2025-08-29 15:43:34 -04:00
Nathan Bossart	67fcf48c3b	Make LWLockCounter a global variable. Using the LWLockCounter requires first calculating its address in shared memory like this: LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int)); Commit `82e861fbe1` started this trend in order to fix EXEC_BACKEND builds, but it could also be fixed by adding it to the BackendParameters struct. The current approach is somewhat difficult to follow, so this commit switches to the latter. While at it, swap around the code in LWLockShmemSize() to match the order of assignments in CreateLWLocks() for added readability. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aLDLnan9gNCS9fHx%40nathan	2025-08-29 12:13:37 -05:00
Nathan Bossart	6fbd7b93c6	Remove unused parameter from ProcessSlotSyncInterrupts(). Oversight in commit `93db6cbda0`. Author: ChangAo Chen <cca5507@qq.com> Discussion: https://postgr.es/m/tencent_7B42BBE8D0A5C28DDAB91436192CBCCB8307%40qq.com	2025-08-29 10:56:10 -05:00
David Rowley	da9f9f75e5	Fix possible use after free in expand_partitioned_rtentry() It's possible that if the only live partition is concurrently dropped and try_table_open() fails, that the bms_del_member() will pfree the live_parts Bitmapset. Since the bms_del_member() call does not assign the result back to the live_parts local variable, the while loop could segfault as that variable would still reference the pfree'd Bitmapset. Backpatch to 15. `52f3de874` was backpatched to 14, but there's no bms_del_member() there due to live_parts not yet existing in RelOptInfo in that version. Technically there's no bug in version 15 as bms_del_member() didn't pfree when the set became empty prior to `00b41463c` (from v16). Applied to v15 anyway to keep the code similar and to avoid the bad coding pattern. Author: Bernd Reiß <bd_reiss@gmx.at> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/6b88f27a-c45c-4826-8e37-d61a04d90182@gmx.at Backpatch-through: 15	2025-08-30 00:50:50 +12:00
Álvaro Herrera	f225473cba	CREATE STATISTICS: improve misleading error message I think the error message for a different condition was inadvertently copied. This problem seems to have been introduced by commit `a4d75c86bf`. Author: Álvaro Herrera <alvherre@kurilemu.de> Reported-by: jian he <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Backpatch-through: 14 Discussion: https://postgr.es/m/CACJufxEZ48toGH0Em_6vdsT57Y3L8pLF=DZCQ_gCii6=C3MeXw@mail.gmail.com	2025-08-29 14:43:47 +02:00
Peter Eisentraut	991295f387	Mark ItemPointer arguments as const in tuple/table lock functions The functions LockTuple, ConditionalLockTuple, UnlockTuple, and XactLockTableWait take an ItemPointer argument that they do not modify, so the argument can be const-qualified to better convey intent and allow the compiler to enforce immutability. Author: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2m9e4rECHBwpRE4%2BGCH%2BpbYZXLh2f4rB1Du5hDfKug%2BOg%40mail.gmail.com	2025-08-29 07:39:58 +02:00
Peter Eisentraut	710e6c4301	Remove unneeded casts of BufferGetPage() result BufferGetPage() already returns type Page, so casting it to Page doesn't achieve anything. A sizable number of call sites does this casting; remove that. This was already done inconsistently in the code in the first import in 1996 (but didn't exist in the pre-1995 code), and it was then apparently just copied around. Author: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/CALdSSPgFhc5=vLqHdk-zCcnztC0zEY3EU_Q6a9vPEaw7FkE9Vw@mail.gmail.com	2025-08-29 07:18:29 +02:00
Richard Guo	97b0f36bde	Fix semijoin unique-ification for child relations For a child relation, we should not assume that its parent's unique-ified relation (or unique-ified path in v18) always exists. In cases where all RHS columns that need to be unique-ified are equated to constants, the unique-ified relation/path for the parent table is not built, as there are no columns left to unique-ify. Failing to account for this can result in a SIGSEGV crash during planning. This patch checks whether the parent's unique-ified relation or path exists and skips unique-ification of the child relation if it does not. Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs49MOdLW2c+qbLHHBt8VBu=4ONpM91D19=AWeW93eFUF6A@mail.gmail.com Backpatch-through: 18	2025-08-29 13:14:12 +09:00
Masahiko Sawada	fabd8b8e2a	Use LW_SHARED in walsummarizer.c for WALSummarizerLock lock where possible. Previously, we used LW_EXCLUSIVE in several places despite only reading WalSummarizerCtl fields. This patch reduces the lock level to LW_SHARED where we are only reading the shared fields. Backpatch to 17, where wal summarization was introduced. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/CAD21AoDdKhf_9oriEYxY-JCdF+Oe_muhca3pcdkMEdBMzyHyKw@mail.gmail.com Backpatch-through: 17	2025-08-28 17:06:42 -07:00
Tom Lane	b8a1bdc458	Fix "variable not found in subplan target lists" in semijoin de-duplication. One mechanism we have for implementing semi-joins is to de-duplicate the output of the RHS and then treat the join as a plain inner join. Initial construction of the join's SpecialJoinInfo identifies the RHS columns that need to be de-duplicated, but later we may find that some of those don't need to be handled explicitly, either because they're known to be constant or because they are redundant with some previous column. Up to now, while sort-based de-duplication handled such cases well, hash-based de-duplication didn't: we'd still hash on all of the originally-identified columns. This is probably not a very big deal performance-wise, but in the wake of commit `a3179ab69` it can cause planner errors. That happens when join elimination causes recalculation of variables' attr_needed bitmapsets, and we decide that a variable mentioned in a semijoin clause doesn't need to be propagated up to the join level anymore. There are a number of ways we could slice the blame for this, but the only fix that doesn't result in pessimizing plans for loosely-related cases is to be more careful about not hashing columns we don't actually need to de-duplicate. We can install that consideration into create_unique_paths in master, or the predecessor code in create_unique_path in v18, without much refactoring. (As follow-up work, it might be a good idea to look at more-invasive refactoring, in hopes of preventing other bugs in this area. But with v18 release so close, there's not time for that now, nor would we be likely to want to put such refactoring into v18 anyway.) 
Reported-by: Sergey Soloviev <sergey.soloviev@tantorlabs.ru> Diagnosed-by: Richard Guo <guofenglinux@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/1fd1a421-4609-4d46-a1af-ab74d5de504a@tantorlabs.ru Backpatch-through: 18	2025-08-28 13:49:23 -04:00
Álvaro Herrera	325fc0ab14	Avoid including commands/dbcommands.h in so many places This has been done historically because of get_database_name (which since commit `cb98e6fb8f` belongs in lsyscache.c/h, so let's move it there) and get_database_oid (which is in the right place, but whose declaration should appear in pg_database.h rather than dbcommands.h). Clean this up. Also, xlogreader.h and stringinfo.h are no longer needed by dbcommands.h since commit `f1fd515b39`, so remove them. Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/202508191031.5ipojyuaswzt@alvherre.pgsql	2025-08-28 12:39:04 +02:00
Peter Eisentraut	80f1106132	Message style improvements An improvement pass over the new stats import functionality.	2025-08-28 09:09:26 +02:00
Andres Freund	5865150b6d	aio: Stop using enum bitfields due to bad code generation During an investigation into rather odd aio related errors on macOS, observed by Alexander and Konstantin, we started to wonder if bitfield access is related to the error. At the moment it looks like it is related, we cannot reproduce the failures when replacing the bitfields. In addition, the problem can only be reproduced with some compiler [versions] and not everyone has been able to reproduce the issue. The observed problem is that, very rarely, PgAioHandle->{state,target} are in an inconsistent state, after having been checked to be in a valid state not long before, triggering an assertion failure. Unfortunately, this could be caused by wrong compiler code generation or by missing memory barriers - we don't really know. In theory there should not be any concurrent write access to the handle in the state the bug is triggered, as the handle was idle and is just being initialized. Separately from the bug, we observed that at least gcc and clang generate rather terrible code for the bitfield access. Even if it's not clear if the observed assertion failure is actually caused by the bitfield somehow, the bad code generation alone is sufficient reason to stop using bitfields. Therefore, replace the enum bitfields with uint8s and instead cast in each switch statement. Reported-by: Alexander Lakhin <exclusion@gmail.com> Reported-by: Konstantin Knizhnik <knizhnik@garret.ru> Discussion: https://postgr.es/m/1500090.1745443021@sss.pgh.pa.us Backpatch-through: 18	2025-08-27 19:12:11 -04:00
Peter Eisentraut	e36fa9319b	Improve objectNamesToOids() comment Commit `d31bbfb659` removed the comment at objectNamesToOids() that there is no locking, because that commit added locking. But to fix all the problems, we'd still need a stronger lock. So put the comment back with a more detailed explanation. Co-authored-by: Noah Misch <noah@leadboat.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://www.postgresql.org/message-id/flat/bf72b82c-124d-4efa-a484-bb928e9494e4@eisentraut.org	2025-08-27 17:46:26 +02:00
Peter Eisentraut	990c8db182	Fix: Don't strip $libdir from nested module_pathnames This patch fixes a bug in how 'load_external_function' handles '$libdir/' prefixes in module paths. Previously, 'load_external_function' would unconditionally strip '$libdir/' from the beginning of the 'filename' string. This caused an issue when the path was nested, such as "$libdir/nested/my_lib". Stripping the prefix resulted in a path of "nested/my_lib", which would fail to be found by the 'expand_dynamic_library_name' function because the original '$libdir' macro was removed. To fix this, the code now checks for the presence of an additional directory separator ('/' or '\') after the '$libdir/' prefix. The prefix is only stripped if the remaining string does not contain a separator. This ensures that simple filenames like "$libdir/my_lib" are correctly handled, while nested paths are left intact for 'expand_dynamic_library_name' to process correctly. Reported-by: Dilip Kumar <dilipbalaut@gmail.com> Co-authored-by: Matheus Alcantara <matheusssilv97@gmail.com> Co-authored-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAFiTN-uKNzAro4tVwtJhF1UqcygfJ%2BR%2BRL%3Db-_ZMYE3LdHoGhA%40mail.gmail.com	2025-08-27 15:49:58 +02:00
Peter Eisentraut	e567e22290	Message style improvements Mostly adding some quoting.	2025-08-26 22:52:11 +02:00
Tom Lane	327b7324d0	Put "excludeOnly" GIN scan keys at the end of the scankey array. Commit `4b754d6c1` introduced the concept of an excludeOnly scan key, which cannot select matching index entries but can reject non-matching tuples, for example a tsquery such as '!term'. There are poorly-documented assumptions that such scan keys do not appear as the first scan key. ginNewScanKey did nothing to ensure that, however, with the result that certain GIN index searches could go into an infinite loop while apparently-equivalent queries with the clauses in a different order were fine. Fix by teaching ginNewScanKey to place all excludeOnly scan keys after all not-excludeOnly ones. So far as we know at present, it might be sufficient to avoid the case where the very first scan key is excludeOnly; but I'm not very convinced that there aren't other dependencies on the ordering. Bug: #19031 Reported-by: Tim Wood <washwithcare@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19031-0638148643d25548@postgresql.org Backpatch-through: 13	2025-08-26 12:08:57 -04:00
Tom Lane	b55068236c	Do CHECK_FOR_INTERRUPTS inside, not before, scanGetItem. The CHECK_FOR_INTERRUPTS call in gingetbitmap turns out to be inadequate to prevent a long uninterruptible loop, because we now know a case where looping occurs within scanGetItem. While the next patch will fix the bug that caused that, it seems foolish to assume that no similar patterns are possible. Let's do the CFI within scanGetItem's retry loop, instead. This demonstrably allows canceling out of the loop exhibited in bug #19031. Bug: #19031 Reported-by: Tim Wood <washwithcare@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19031-0638148643d25548@postgresql.org Backpatch-through: 13	2025-08-26 11:38:41 -04:00
Alexander Korotkov	5f6f951f88	Improve RowMark handling during Self-Join Elimination The Self-Join Elimination (SJE) feature mixes up the kept and removed RowMarks in remove_self_joins_one_group(). That didn't lead to a user-level error, because the planned RowMark is only used to reference an rtable entry in later execution stages. The RTEs for the kept and removed relations are identical and refer to the same relation OID. To reduce confusion and prevent future issues, this commit cleans up the code and fixes the incorrect behaviour. Furthermore, it includes sanity checks in setrefs.c on existing non-null RTE and RelOptInfo entries for each RowMark. Discussion: https://postgr.es/m/18c6bd6c-6d2a-419a-b0da-dfedef34b585%40gmail.com Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Backpatch-through: 18	2025-08-26 13:23:18 +03:00
Alexander Korotkov	d713cf9b65	Refactor variable names in remove_self_joins_one_group() Rename inner and outer to rrel and krel, respectively, to highlight their connection to r and k indexes. For the same reason, rename imark and omark to rmark and kmark. Discussion: https://postgr.es/m/18c6bd6c-6d2a-419a-b0da-dfedef34b585%40gmail.com Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Backpatch-through: 18	2025-08-26 13:22:43 +03:00
Peter Eisentraut	99234e9ddc	Message wording improvements Use "row" instead of "tuple" for user-facing information for logical replication conflicts.	2025-08-25 23:15:24 +02:00
Nathan Bossart	989b2e4d5c	Use PqMsg_* macros in applyparallelworker.c. Oversight in commit `f4b54e1ed9`. Author: Ranier Vilela <ranier.vf@gmail.com> Discussion: https://postgr.es/m/CAEudQAobFsHaLMypA6C96-9YExvF4AcU1xNPoPuNYRVm3mq4dg%40mail.gmail.com	2025-08-25 14:11:01 -05:00
Peter Eisentraut	878656dbde	Formatting cleanup of guc_tables.c This cleans up a few minor formatting inconsistencies. Reviewed-by: John Naylor <johncnaylorls@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/dae6fe89-1e0c-4c3f-8d92-19d23374fb10%40eisentraut.org	2025-08-25 09:10:27 +02:00
Alexander Korotkov	c13070a27b	Revert "Get rid of WALBufMappingLock" This reverts commit `bc22dc0e0d`. It appears that condition variables are not suitable for use inside critical sections. If WaitLatch()/WaitEventSetWaitBlock() face postmaster death, they exit, releasing all locks instead of PANIC. In certain situations, this leads to data corruption. Reported-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Yura Sokolov <y.sokolov@postgrespro.ru> Reviewed-by: Michael Paquier <michael@paquier.xyz> Backpatch-through: 18	2025-08-22 19:26:38 +03:00
Heikki Linnakangas	661f821ef0	Use ereport() rather than elog() Noah pointed this out before I committed `50f770c3d9`, but I accidentally pushed the old version with elog() anyway. Oops. Reported-by: Noah Misch <noah@leadboat.com> Discussion: https://www.postgresql.org/message-id/20250820003756.31.nmisch@google.com	2025-08-22 13:35:05 +03:00
Heikki Linnakangas	50f770c3d9	Revert GetTransactionSnapshot() to return historic snapshot during LR Commit `1585ff7387` changed GetTransactionSnapshot() to throw an error if it's called during logical decoding, instead of returning the historic snapshot. I made that change for extra protection, because a historic snapshot can only be used to access catalog tables while GetTransactionSnapshot() is usually called when you're executing arbitrary queries. You might get very subtle visibility problems if you tried to use the historic snapshot for arbitrary queries. There's no built-in code in PostgreSQL that calls GetTransactionSnapshot() during logical decoding, but it turns out that the pglogical extension does just that, to evaluate row filter expressions. You would get weird results if the row filter runs arbitrary queries, but it is sane as long as you don't access any non-catalog tables. Even though there are no checks to enforce that in pglogical, a typical row filter expression does not access any tables and works fine. Accessing tables marked with the user_catalog_table = true option is also OK. To fix pglogical with row filters, and any other extensions that might do similar things, revert GetTransactionSnapshot() to return a historic snapshot during logical decoding. To try to still catch the unsafe usage of historic snapshots, add checks in heap_beginscan() and index_beginscan() to complain if you try to use a historic snapshot to scan a non-catalog table. We're very close to the version 18 release however, so add those new checks only in master. Backpatch-through: 18 Reported-by: Noah Misch <noah@leadboat.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://www.postgresql.org/message-id/20250809222338.cc.nmisch@google.com	2025-08-22 13:07:46 +03:00
Peter Eisentraut	16a0039dc0	Reduce lock level for ALTER DOMAIN ... VALIDATE CONSTRAINT Reduce from ShareLock to ShareUpdateExclusiveLock. Validation during ALTER DOMAIN ... ADD CONSTRAINT keeps using ShareLock. Example: create domain d1 as int; create table t (a d1); alter domain d1 add constraint cc10 check (value > 10) not valid; begin; alter domain d1 validate constraint cc10; -- another session insert into t values (8); Now we should still be able to perform DML operations on table t while the domain constraint is being validated. The equivalent works already on table constraints. Author: jian he <jian.universality@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CACJufxHz92A88NLRTA2msgE2dpXpE-EoZ2QO61od76-6bfqurA%40mail.gmail.com	2025-08-22 08:56:11 +02:00
Michael Paquier	13b935cd52	Change dynahash.c and hsearch.h to use int64 instead of long This code was relying on "long", which is signed 8 bytes everywhere except on Windows where it is 4 bytes, that could potentially expose it to overflows, even if the current uses in the code are fine as far as I know. This code is now able to rely on the same sizeof() variable everywhere, with int64. long was used for sizes, partition counts and entry counts. Some callers of the dynahash.c routines used long declarations, that can be cleaned up to use int64 instead. There was one shortcut based on SIZEOF_LONG, that can be removed. long is entirely removed from dynahash.c and hsearch.h. Similar work was done in `b1e5c9fa9a`. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aKQYp-bKTRtRauZ6@paquier.xyz	2025-08-22 11:59:02 +09:00
Michael Paquier	ef03ea01fe	Ignore temporary relations in RelidByRelfilenumber() Temporary relations may share the same RelFileNumber with a permanent relation, or other temporary relations associated with other sessions. Being able to uniquely identify a temporary relation would require RelidByRelfilenumber() to know about the proc number of the temporary relation it wants to identify, something it is not designed for since its introduction in `f01d1ae3a1`. There are currently three callers of RelidByRelfilenumber(): - autoprewarm. - Logical decoding, reorder buffer. - pg_filenode_relation(), that attempts to find a relation OID based on a tablespace OID and a RelFileNumber. This makes the situation problematic particularly for the first two cases, leading to the possibility of random ERRORs due to inconsistencies that temporary relations can create in the cache maintained by RelidByRelfilenumber(). The third case should be less of an issue, as I suspect that there are few direct callers of pg_filenode_relation(). The window where the ERRORs happen is very narrow, requiring an OID wraparound to create a lookup conflict in RelidByRelfilenumber() with a temporary table reusing the same OID as another relation already cached. The problem is easier to reach in workloads with a high OID consumption rate, especially with a higher number of temporary relations created. We could get pg_filenode_relation() and RelidByRelfilenumber() to work with temporary relations if provided the means to identify them with an optional proc number given in input, but the years have also shown that we do not have a use case for it, yet. Note that this could not be backpatched if pg_filenode_relation() needs changes. It is simpler to ignore temporary relations. 
Reported-by: Shenhao Wang <wangsh.fnst@fujitsu.com> Author: Vignesh C <vignesh21@gmail.com> Reviewed-By: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-By: Takamichi Osumi <osumi.takamichi@fujitsu.com> Reviewed-By: Michael Paquier <michael@paquier.xyz> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Reported-By: Shenhao Wang <wangsh.fnst@fujitsu.com> Discussion: https://postgr.es/m/bbaaf9f9-ebb2-645f-54bb-34d6efc7ac42@fujitsu.com Backpatch-through: 13	2025-08-22 09:03:59 +09:00
Peter Eisentraut	47932f3cdc	Use consistent type for pgaio_io_get_id() result The result of pgaio_io_get_id() was being assigned to a mix of int and uint32 variables. This fixes it to use int consistently, which seems the most correct. Also change the queue empty special value in method_worker.c to -1 from UINT32_MAX. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/70c784b3-f60b-4652-b8a6-75e5f051243e%40eisentraut.org	2025-08-21 19:45:25 +02:00
Fujii Masao	12da45742c	Disallow server start with sync_replication_slots = on and wal_level < logical. Replication slot synchronization (sync_replication_slots = on) requires wal_level to be logical. This commit prevents the server from starting if sync_replication_slots is enabled but wal_level is set to minimal or replica. Failing early during startup helps users catch invalid configurations immediately, which is important because changing wal_level requires a server restart. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Discussion: https://postgr.es/m/CAH0PTU_pc3oHi__XESF9ZigCyzai1Mo3LsOdFyQA4aUDkm01RA@mail.gmail.com	2025-08-21 22:18:11 +09:00
Tom Lane	a67d4847a4	Fix re-execution of a failed SQLFunctionCache entry. If we error out during execution of a SQL-language function, we will often leave behind non-null pointers in its SQLFunctionCache's cplan and eslist fields. This is problematic if the SQLFunctionCache is re-used, because those pointers will point at resources that were released during error cleanup. This problem escaped detection so far because ordinarily we won't re-use an FmgrInfo+SQLFunctionCache struct after a query error. However, in the rather improbable case that someone implements an opclass support function in SQL language, there will be long-lived FmgrInfos for it in the relcache, and then the problem is reachable after the function throws an error. To fix, add a flag to SQLFunctionCache that tracks whether execution escapes out of fmgr_sql, and clear out the relevant fields during init_sql_fcache if so. (This is going to need more thought if we ever try to share FmgrInfos across threads; but it's very far from being the only problem such a project will encounter, since many functions regard fn_extra as being query-local state.) This broke at commit `0313c5dc6`; before that we did not try to re-use SQLFunctionCache state across calls. Hence, back-patch to v18. Bug: #19026 Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19026-90aed5e71d0c8af3@postgresql.org Backpatch-through: 18	2025-08-20 16:09:18 -04:00
Peter Eisentraut	e9c043a11a	Minor error message enhancement In refuseDupeIndexAttach(), change from errdetail("Another index is already attached for partition \"%s\"."...) to errdetail("Another index \"%s\" is already attached for partition \"%s\"."...) so we can easily understand which index is already attached for partition \"%s\". Author: Jian He <jian.universality@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/CACJufxGBfykJ_1ztk9T%2BL_gLmkOSOF%2BmL9Mn4ZPydz-rh%3DLccQ%40mail.gmail.com	2025-08-20 18:14:24 +02:00
Michael Paquier	1f2e51e3c7	Fix assertion failure with replication slot release in single-user mode Some replication slot manipulations (logical decoding via SQL, advancing) were failing an assertion when releasing a slot in single-user mode, because active_pid was not set in a ReplicationSlot when its slot is acquired. ReplicationSlotAcquire() has some logic to be able to work with the single-user mode. This commit sets ReplicationSlot->active_pid to MyProcPid, to let the slot-related logic fall-through, considering the single process as the one holding the slot. Some TAP tests are added for various replication slot functions with the single-user mode, while on it, for slot creation, drop, advancing, copy and logical decoding with multiple slot types (temporary, physical vs logical). These tests are skipped on Windows, as direct calls of postgres --single would fail on permission failures. There is no platform-specific behavior that needs to be checked, so living with this restriction should be fine. The CI is OK with that, now let's see what the buildfarm tells. Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Paul A. Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Mutaamba Maasha <maasha@gmail.com> Discussion: https://postgr.es/m/OSCPR01MB14966ED588A0328DAEBE8CB25F5FA2@OSCPR01MB14966.jpnprd01.prod.outlook.com Backpatch-through: 13	2025-08-20 15:00:04 +09:00
Nathan Bossart	3eec0e6533	Fix comment for MAX_SIMUL_LWLOCKS. This comment mentions that pg_buffercache locks all buffer partitions simultaneously, but it hasn't done so since v10. Oversight in commit `6e654546fb`. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/aKTuAHVEuYCUmmIy%40nathan	2025-08-19 16:48:22 -05:00
Amit Kapila	aa21e49225	Fix self-deadlock during DROP SUBSCRIPTION. The DROP SUBSCRIPTION command performs several operations: it stops the subscription workers, removes subscription-related entries from system catalogs, and deletes the replication slot on the publisher server. Previously, this command acquired an AccessExclusiveLock on pg_subscription before initiating these steps. However, while holding this lock, the command attempts to connect to the publisher to remove the replication slot. In cases where the connection is made to a newly created database on the same server as subscriber, the cache-building process during connection tries to acquire an AccessShareLock on pg_subscription, resulting in a self-deadlock. To resolve this issue, we reduce the lock level on pg_subscription during DROP SUBSCRIPTION from AccessExclusiveLock to RowExclusiveLock. Earlier, the higher lock level was used to prevent the launcher from starting a new worker during the drop operation, as a restarted worker could become orphaned. Now, instead of relying on a strict lock, we acquire an AccessShareLock on the specific subscription being dropped and re-validate its existence after acquiring the lock. If the subscription is no longer valid, the worker exits gracefully. This approach avoids the deadlock while still ensuring that orphan workers are not created. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 13 Discussion: https://postgr.es/m/18988-7312c868be2d467f@postgresql.org	2025-08-19 05:33:17 +00:00
Michael Paquier	a977e419ee	Refactor ReadMultiXactCounts() into GetMultiXactInfo() This provides a single entry point to access some information about the state of MultiXacts, able to return some data about multixacts offsets and counts. Originally this function was only able to return some information about the number of multixacts and multixact members, extended here to provide some data about the oldest multixact ID in use and the oldest offset, if known. This change has been proposed in a patch that aims at providing more monitoring capabilities for multixacts, and it is useful on its own. GetMultiXactInfo() is added to multixact.h, becoming available for out-of-core code. Extracted from a larger patch by the same author. Author: Naga Appani <nagnrik@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com	2025-08-19 14:04:09 +09:00
Michael Paquier	9b7eb6f02e	Remove useless pointer update in StatsShmemInit() This pointer was not used after its last update. This variable assignment was most likely a vestigial artifact of the earlier versions of the patch set that have led to `5891c7a8ed`. This pointer update is useless, so let's remove it. It removes one call to pgstat_dsa_init_size(), making the code slightly easier to grasp. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aKLsu2sdpnyeuSSc@ip-10-97-1-34.eu-west-3.compute.internal	2025-08-19 09:54:18 +09:00
Richard Guo	bf9ee294e5	Simplify relation_has_unique_index_for() Now that the only call to relation_has_unique_index_for() that supplied an exprlist and oprlist has been removed, the loop handling those lists is effectively dead code. This patch removes that loop and simplifies the function accordingly. Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-EBnaRvEs7frTLbsXiweSTUXifsteF-d3rvv01FKO86w@mail.gmail.com	2025-08-19 09:37:04 +09:00
Richard Guo	24225ad9aa	Pathify RHS unique-ification for semijoin planning There are two implementation techniques for semijoins: one uses the JOIN_SEMI jointype, where the executor emits at most one matching row per left-hand side (LHS) row; the other unique-ifies the right-hand side (RHS) and then performs a plain inner join. The latter technique currently has some drawbacks related to the unique-ification step. * Only the cheapest-total path of the RHS is considered during unique-ification. This may cause us to miss some optimization opportunities; for example, a path with a better sort order might be overlooked simply because it is not the cheapest in total cost. Such a path could help avoid a sort at a higher level, potentially resulting in a cheaper overall plan. * We currently rely on heuristics to choose between hash-based and sort-based unique-ification. A better approach would be to generate paths for both methods and allow add_path() to decide which one is preferable, consistent with how path selection is handled elsewhere in the planner. * In the sort-based implementation, we currently pay no attention to the pathkeys of the input subpath or the resulting output. This can result in redundant sort nodes being added to the final plan. This patch improves semijoin planning by creating a new RelOptInfo for the RHS rel to represent its unique-ified version. It then generates multiple paths that represent elimination of distinct rows from the RHS, considering both a hash-based implementation using the cheapest total path of the original RHS rel, and sort-based implementations that either exploit presorted input paths or explicitly sort the cheapest total path. All resulting paths compete in add_path(), and those deemed worthy of consideration are added to the new RelOptInfo. Finally, the unique-ified rel is joined with the other side of the semijoin using a plain inner join. 
As a side effect, most of the code related to the JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER jointypes -- used to indicate that the LHS or RHS path should be made unique -- has been removed. Besides, the T_Unique path now has the same meaning for both semijoins and upper DISTINCT clauses: it represents adjacent-duplicate removal on presorted input. This patch unifies their handling by sharing the same data structures and functions. This patch also removes the UNIQUE_PATH_NOOP related code along the way, as it is dead code -- if the RHS rel is provably unique, the semijoin should have already been simplified to a plain inner join by analyzejoins.c. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-EBnaRvEs7frTLbsXiweSTUXifsteF-d3rvv01FKO86w@mail.gmail.com	2025-08-19 09:35:40 +09:00
Michael Paquier	24e71d53f8	Remove unneeded header declarations in multixact.c Two header declarations were related to SQL-callable functions, that should have been cleaned up in `df9133fa63`. Some more includes can be removed on closer inspection, so let's clean up these as well, while on it. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/345438.1755524834@sss.pgh.pa.us	2025-08-19 08:57:20 +09:00
David Rowley	a98ccf727e	Remove HASH_DEBUG output from dynahash.c This existed in a semi-broken state from `be0a66666` until `296cba276`. Recent discussion has questioned the value of having this at all, as it only outputs static information about various hash table properties when the hash table is created. Author: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/OSCPR01MB1496650D03FA0293AB9C21416F534A@OSCPR01MB14966.jpnprd01.prod.outlook.com	2025-08-19 11:14:21 +12:00
David Rowley	05fcb9667c	Use elog(DEBUG4) for dynahash.c statistics output Previously this was being output to stderr. This commit adjusts things to use elog(DEBUG4). Here we also adjust the format of the message to add the hash table name and also put the message on a single line. This should make grepping the logs for this information easier. Also get rid of the global hash table statistics. This seems very dated and didn't fit very well with trying to put all the statistics for a specific hash table on a single log line. The main aim here is to allow it so we can have at least one buildfarm member build with HASH_STATISTICS to help prevent future changes from breaking things in that area. `ca3891251` recently fixed some issues here. In passing, switch to using uint64 data types rather than longs for the usage counters. The long type is 32 bits on some platforms we support. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAApHDvoccvJ9CG5zx+i-EyCzJbcL5K=CzqrnL_YN59qaL5hiaw@mail.gmail.com	2025-08-19 10:57:44 +12:00
Peter Eisentraut	c61d51d500	Detect buffer underflow in get_th() Input with zero length can result in a buffer underflow when accessing *(num + (len - 1)), as (len - 1) would produce a negative index. Add an assertion for zero-length input to prevent it. This was found by ALT Linux Team. Reviewing the call sites shows that get_th() currently cannot be applied to an empty string: it is always called on a string containing a number we've just printed. Therefore, an assertion rather than a user-facing error message is sufficient. Co-authored-by: Alexander Kuznetsov <kuznetsovam@altlinux.org> Discussion: https://www.postgresql.org/message-id/flat/e22df993-cdb4-4d0a-b629-42211ebed582@altlinux.org	2025-08-18 11:03:22 +02:00
Michael Paquier	df9133fa63	Move SQL-callable code related to multixacts into its own file A patch is under discussion to add more SQL capabilities related to multixacts, and this move avoids bloating the file more than necessary. This affects pg_get_multixact_members(). A side effect of this move is the requirement to add mxstatus_to_string() to multixact.h. Extracted from a larger patch by the same author, tweaked by me. Author: Naga Appani <nagnrik@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com	2025-08-18 14:57:55 +09:00
Michael Paquier	ba3d93b2e8	Refactor init_params() in sequence.c to not use FormData_pg_sequence_data init_params() sets up "last_value" and "is_called" for a sequence relation holding its metadata, based on the sequence properties in pg_sequences. "log_cnt" is the third property that can be updated in this routine for FormData_pg_sequence_data, tracking when WAL records should be generated for a sequence after nextval() iterations. This routine is called when creating or altering a sequence. This commit refactors init_params() to not depend anymore on FormData_pg_sequence_data, removing traces of it in sequence.c, making the manipulation of metadata related to sequences easier. The knowledge about "log_cnt" is replaced with a more general "reset_state" flag, to let the caller know if the sequence state should be reset. In the case of in-core sequences, this relates to WAL logging. We still need to depend on FormData_pg_sequence. Author: Michael Paquier <michael@paquier.xyz> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/ZWlohtKAs0uVVpZ3@paquier.xyz	2025-08-18 11:38:44 +09:00
Masahiko Sawada	928da6ff12	Fix typos in comments. Oversight in commit `fd5a1a0c3e`. Author: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/CAHewXNmTT3M_w4NngG=6G3mdT3iJ6DdncTqV9YnGXBPHW8XYtA@mail.gmail.com	2025-08-16 01:11:40 -07:00
Masahiko Sawada	37265ca01f	Fix constant when extracting timestamp from UUIDv7. When extracting a timestamp from a UUIDv7, a conversion from milliseconds to microseconds was using the incorrect constant NS_PER_US instead of US_PER_MS. Although both constants have the same value, this fix improves code clarity by using the semantically correct constant. Backpatch to v18, where UUIDv7 was introduced. Author: Erik Nordström <erik@tigerdata.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CACAa4V+i07eaP6h4MHNydZeX47kkLPwAg0sqe67R=M5tLdxNuQ@mail.gmail.com Backpatch-through: 18	2025-08-15 11:58:53 -07:00
David Rowley	296cba2760	Fix invalid format string in HASH_DEBUG code This seems to have been broken back in `be0a66666`. Reported-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/OSCPR01MB14966E11EEFB37D7857FCEDB7F535A@OSCPR01MB14966.jpnprd01.prod.outlook.com Backpatch-through: 14	2025-08-15 18:05:44 +12:00
David Rowley	ca38912512	Fix failing -D HASH_STATISTICS builds This seems to have been broken for a few years by `cc5ef90ed`. Author: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/OSCPR01MB14966E11EEFB37D7857FCEDB7F535A@OSCPR01MB14966.jpnprd01.prod.outlook.com Backpatch-through: 17	2025-08-15 17:23:45 +12:00
David Rowley	b4632883d4	Add Asserts to validate prevbit values in bms_prev_member bms_prev_member() could attempt to access memory outside of the words[] array in cases where the prevbit was a number < -1 or > a->nwords * BITS_PER_BITMAPWORD + 1. Here we add the Asserts to help draw attention to bogus callers so we're more likely to catch them during development. In passing, fix wording of bms_prev_member's header comment which talks about how we expect the callers to ensure only valid prevbit values are used. Author: Greg Burd <greg@burd.me> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/2000A717-1FFE-4031-827B-9330FB2E9065%40getmailspring.com	2025-08-15 16:33:07 +12:00
Álvaro Herrera	d0e7e04ede	Avoid including tableam.h and xlogreader.h in nbtree.h Doing that seems rather random and unnecessary. This commit removes those and fixes fallout, which is pretty minimal. We do need to add a forward declaration of struct TM_IndexDeleteOp (whose full definition appears in tableam.h) so that _bt_delitems_delete_check()'s declaration can use it. Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/202508051109.lzk3lcuzsaxo@alvherre.pgsql	2025-08-14 17:48:46 +02:00
Tom Lane	ed07361721	Don't leak memory during failure exit from SelectConfigFiles(). Make sure the memory allocated by make_absolute_path() is freed when SelectConfigFiles() fails. Since all the callers will exit immediately in that case, there's no practical gain here, but silencing Valgrind leak complaints seems useful. In any case, it was inconsistent that only one of the failure exits did this. Author: Aleksander Alekseev <aleksander@tigerdata.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAJ7c6TMByXE8dc7zDvDWTQjk6o-XXAdRg_RAg5CBaUOgFPV3LQ%40mail.gmail.com	2025-08-14 11:39:19 -04:00
Heikki Linnakangas	4ec6e22b43	Fix LSN format in debug message Commit `2633dae2e4` standardized all existing messages to use `%X/%08X` for LSNs, but this one crept back in after the commit.	2025-08-14 13:31:18 +03:00
Michael Paquier	6304256e79	Fix compilation warning with SerializeClientConnectionInfo() This function uses an argument named "maxsize" that is only used in assertions, being set once outside the assertion area. Recent gcc versions with -Wunused-but-set-parameter complain about a warning when building without assertions enabled, because of that. In order to fix this issue, PG_USED_FOR_ASSERTS_ONLY is added to the function argument of SerializeClientConnectionInfo(), which is the first time we are doing so in the tree. The CI is fine with the change, but let's see what the buildfarm has to say on the matter. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Jacob Champion <jchampion@postgresql.org> Discussion: https://postgr.es/m/pevajesswhxafjkivoq3yvwxga77tbncghlf3gq5fvchsvfuda@6uivg25sb3nx Backpatch-through: 16	2025-08-14 16:21:50 +09:00
Fujii Masao	e9a31c0cc6	Revert logical snapshot filename format change in SnapBuildSnapshotExists(). Commit `2633dae2e4` standardized LSN formatting but mistakenly changed the logical snapshot filename format in SnapBuildSnapshotExists() from "%X-%X.snap" to "%08X-%08X.snap". Other code still used the original "%X-%X.snap" format, causing the replication slot synchronization worker to fail to find existing snapshot files and produce excessive log messages. This commit restores the original "%X-%X.snap" format in SnapBuildSnapshotExists() to resolve the issue. Author: Shveta Malik <shveta.malik@gmail.com> Discussion: https://postgr.es/m/CAHGQGwHuHPB-ucAk_Tq3uSs4Fdziu1Jp_AA_RD3m5Ycky7m48w@mail.gmail.com	2025-08-14 12:33:14 +09:00
Tom Lane	ee54046601	Grab the low-hanging fruit from forcing USE_FLOAT8_BYVAL to true. Remove conditionally-compiled code for the other case. Replace uses of FLOAT8PASSBYVAL with constant "true", mainly because it was quite confusing in cases where the type we were dealing with wasn't float8. I left the associated pg_control and Pg_magic_struct fields in place. Perhaps we should get rid of them, but it would save little, so it doesn't seem worth thinking hard about the compatibility implications. I just labeled them "vestigial" in places where that seemed helpful. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us	2025-08-13 17:18:22 -04:00
Tom Lane	6aebedc384	Grab the low-hanging fruit from forcing sizeof(Datum) to 8. Remove conditionally-compiled code for smaller Datum widths, and simplify comments that describe cases no longer of interest. I also fixed up a few more places that were not using DatumGetIntXX where they should, and made some cosmetic adjustments such as using sizeof(int64) not sizeof(Datum) in places where that fit better with the surrounding code. One thing I remembered while preparing this part is that SP-GiST stores pass-by-value prefix keys as Datums, so that the on-disk representation depends on sizeof(Datum). That's even more unfortunate than the existing commentary makes it out to be, because now there is a hazard that the change of sizeof(Datum) will break SP-GiST indexes on 32-bit machines. It appears that there are no existing SP-GiST opclasses that are actually affected; and if there are some that I didn't find, the number of installations that are using them on 32-bit machines is doubtless tiny. So I'm proceeding on the assumption that we can get away with this, but it's something to worry about. (gininsert.c looks like it has a similar problem, but it's okay because the "tuples" it's constructing are just transient data within the tuplesort step. That's pretty poorly documented though, so I added some comments.) Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us	2025-08-13 17:18:22 -04:00
Tom Lane	2a600a93c7	Make type Datum be 8 bytes wide everywhere. This patch makes sizeof(Datum) be 8 on all platforms including 32-bit ones. The objective is to allow USE_FLOAT8_BYVAL to be true everywhere, and in consequence to remove a lot of code that is specific to pass-by-reference handling of float8, int8, etc. The code for abbreviated sort keys can be simplified similarly. In this way we can reduce the maintenance effort involved in supporting 32-bit platforms, without going so far as to actually desupport them. Since Datum is strictly an in-memory concept, this has no impact on on-disk storage, though an initdb or pg_upgrade will be needed to fix affected catalog entries. We have required platforms to support [u]int64 for ages, so this breaks no supported platform. We can expect that this change will make 32-bit builds a bit slower and more memory-hungry, although being able to use pass-by-value handling of 8-byte types may buy back some of that. But we stopped optimizing for 32-bit cases a long time ago, and this seems like just another step on that path. This initial patch simply forces the correct type definition and USE_FLOAT8_BYVAL setting, and cleans up a couple of minor compiler complaints that ensued. This is sufficient for testing purposes. In the wake of a bunch of Datum-conversion cleanups by Peter Eisentraut, this now compiles cleanly with gcc on a 32-bit platform. (I'd only tested the previous version with clang, which it turns out is less picky than gcc about width-changing coercions.) There is a good deal of now-dead code that I'll remove in separate follow-up patches. A catversion bump is required because this affects initial catalog contents (on 32-bit machines) in two ways: pg_type.typbyval changes for some built-in types, and Const nodes in stored views/rules will now have 8 bytes not 4 for pass-by-value types. 
Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us	2025-08-13 17:18:22 -04:00
Tom Lane	21fddb3d76	Don't treat EINVAL from semget() as a hard failure. It turns out that on some platforms (at least current macOS, NetBSD, OpenBSD) semget(2) will return EINVAL if there is a pre-existing semaphore set with the same key and too few semaphores. Our code expects EEXIST in that case and treats EINVAL as a hard failure, resulting in failure during initdb or postmaster start. POSIX does document EINVAL for too-few-semaphores-in-set, and is silent on its priority relative to EEXIST, so this behavior arguably conforms to spec. Nonetheless it's quite problematic because EINVAL is also documented to mean that nsems is greater than the system's limit on the number of semaphores per set (SEMMSL). If that is where the problem lies, retrying would just become an infinite loop. To resolve this contradiction, retry after EINVAL, but also install a loop limit that will make us give up regardless of the specific errno after trying 1000 different keys. (1000 is a pretty arbitrary number, but it seems like it should be sufficient.) I like this better than the previous infinite-looping behavior, since it will also keep us out of trouble if (say) we get EACCES due to a system-level permissions problem rather than anything to do with a specific semaphore set. This problem has only been observed in the field in PG 17, which uses a higher nsems value than other branches (cf. `38da05346`, `810a8b1c8`). That makes it possible to get the failure if a new v17 postmaster has a key collision with an existing postmaster of another branch. In principle though, we might see such a collision against a semaphore set created by some other application, in which case all branches are vulnerable on these platforms. Hence, backpatch. Reported-by: Gavin Panella <gavinpanella@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CALL7chmzY3eXHA7zHnODUVGZLSvK3wYCSP0RmcDFHJY8f28Q3g@mail.gmail.com Backpatch-through: 13	2025-08-13 12:00:03 -04:00
Andres Freund	b227b0bb4e	Reduce ExecSeqScan* code size using pg_assume() `fb9f955025` optimized code generation by using specialized variants of ExecSeqScan* for [not] having a qual, projection etc. This allowed the compiler to optimize out the code for qual / projection. However, as observed by David Rowley at the time, the compiler couldn't prove the opposite, i.e. that the qual etc are present. By using pg_assume(), introduced in `d65eb5b1b8`, we can tell the compiler that the relevant variables are non-null. This reduces the code size to a surprising degree and seems to lead to a small but reproducible performance gain. Reviewed-by: Amit Langote <amitlangote09@gmail.com> Discussion: https://postgr.es/m/CA+HiwqFk-MbwhfX_kucxzL8zLmjEt9MMcHi2YF=DyhPrSjsBEA@mail.gmail.com	2025-08-11 15:41:34 -04:00
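The effect of an "assume" hint on a specialized variant can be sketched roughly as follows. `demo_assume` is a hypothetical stand-in built on the GCC/Clang `__builtin_unreachable()` builtin; it is not claimed to match the definition of pg_assume(), and the scan functions are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for an "assume" macro; a no-op on other compilers */
#if defined(__GNUC__) || defined(__clang__)
#define demo_assume(expr) \
    do { if (!(expr)) __builtin_unreachable(); } while (0)
#else
#define demo_assume(expr) ((void) 0)
#endif

typedef int (*qual_fn) (int row);

/* Generic step: the qual may or may not be present */
static inline int
scan_step(int row, qual_fn qual)
{
    if (qual != NULL && !qual(row))
        return 0;               /* row filtered out */
    return 1;                   /* row passes */
}

/*
 * Variant for the "has a qual" case.  The hint tells the compiler that
 * qual is non-null, so the NULL test in the inlined body can be dropped.
 */
static int
scan_step_with_qual(int row, qual_fn qual)
{
    demo_assume(qual != NULL);
    return scan_step(row, qual);
}

/* Demo qual: keep even rows only */
static int
is_even(int row)
{
    return row % 2 == 0;
}
```

The behavior is unchanged for valid inputs; only the generated code can shrink, which is the point made in the commit message.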
Andres Freund	01d6832c10	meson: add and use stamp files for generated headers Without using stamp files, meson lists the generated headers as the dependency for every .c file, bloating build.ninja by more than 2x. Processing all the dependencies also increases the time to generate build.ninja. The immediate benefit is that this makes re-configuring and clean builds a bit faster. The main motivation however is that I have other patches that introduce additional build targets that would further increase the size of build.ninja, making re-configuring more noticeably slower. Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/cgkdgvzdpinkacf4v33mky7tbmk467oda5dd4dlmucjjockxzi@xkqfvjoq4uiy	2025-08-11 15:18:23 -04:00
Dean Rasheed	22424953cd	Fix security checks in selectivity estimation functions. Commit `e2d4ef8de8` (the fix for CVE-2017-7484) added security checks to the selectivity estimation functions to prevent them from running user-supplied operators on data obtained from pg_statistic if the user lacks privileges to select from the underlying table. In cases involving inheritance/partitioning, those checks were originally performed against the child RTE (which for plain inheritance might actually refer to the parent table). Commit `553d2ec271` then extended that to also check the parent RTE, allowing access if the user had permissions on either the parent or the child. It turns out, however, that doing any checks using the child RTE is incorrect, since securityQuals is set to NULL when creating an RTE for an inheritance child (whether it refers to the parent table or the child table), and therefore such checks do not correctly account for any RLS policies or security barrier views. Therefore, do the security checks using only the parent RTE. This is consistent with how RLS policies are applied, and the executor's ACL checks, both of which use only the parent table's permissions/policies. Similar checks are performed in the extended stats code, so update that in the same way, centralizing all the checks in a new function. In addition, note that these checks by themselves are insufficient to ensure that the user has access to the table's data because, in a query that goes via a view, they only check that the view owner has permissions on the underlying table, not that the current user has permissions on the view itself. In the selectivity estimation functions, there is no easy way to navigate from underlying tables to views, so add permissions checks for all views mentioned in the query to the planner startup code. If the user lacks permissions on a view, a permissions error will now be reported at planner-startup, and the selectivity estimation functions will not be run. 
Checking view permissions at planner-startup in this way is a little ugly, since the same checks will be repeated at executor-startup. Longer-term, it might be better to move all the permissions checks from the executor to the planner so that permissions errors can be reported sooner, instead of creating a plan that won't ever be run. However, such a change seems too far-reaching to be back-patched. Back-patch to all supported versions. In v13, there is the added complication that UPDATEs and DELETEs on inherited target tables are planned using inheritance_planner(), which plans each inheritance child table separately, so that the selectivity estimation functions do not know that they are dealing with a child table accessed via its parent. Handle that by checking access permissions on the top parent table at planner-startup, in the same way as we do for views. Any securityQuals on the top parent table are moved down to the child tables by inheritance_planner(), so they continue to be checked by the selectivity estimation functions. Author: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Noah Misch <noah@leadboat.com> Backpatch-through: 13 Security: CVE-2025-8713	2025-08-11 09:03:11 +01:00
Thomas Munro	b421223172	Fix rare bug in read_stream.c's split IO handling. The internal queue of buffers could become corrupted in a rare edge case that failed to invalidate an entry, causing a stale buffer to be "forwarded" to StartReadBuffers(). This is a simple fix for the immediate problem. A small API change might be able to remove this and related fragility entirely, but that will have to wait a bit. Defect in commit `ed0b87ca`. Bug: 19006 Backpatch-through: 18 Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/19006-80fcaaf69000377e%40postgresql.org	2025-08-09 13:04:38 +12:00
Tom Lane	665c3dbba4	Mop-up for Datum conversion cleanups. Fix a couple more places where an explicit Datum conversion is needed (not clear how we missed these in `ff89e182d` and previous commits). Replace the minority usage "(Datum) NULL" with "(Datum) 0". The former depends on the assumption that Datum is the same width as Pointer, the latter doesn't. Anyway consistency is a good thing. This is, I believe, the last of the notational mop-up needed before we can consider changing Datum to uint64 everywhere. It's also important cleanup for more aggressive ideas such as making Datum a struct. Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us Discussion: https://postgr.es/m/8246d7ff-f4b7-4363-913e-827dadfeb145@eisentraut.org	2025-08-08 18:44:57 -04:00
Peter Eisentraut	ff89e182d4	Add missing Datum conversions Add various missing conversions from and to Datum. The previous code mostly relied on implicit conversions or its own explicit casts instead of using the correct DatumGet() or GetDatum() functions. We think these omissions are harmless. Some actual bugs that were discovered during this process have been committed separately (`80c758a2e1`, `fd2ab03fea`). Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org	2025-08-08 22:06:57 +02:00
Peter Eisentraut	dcfc0f8912	Remove useless/superfluous Datum conversions Remove useless DatumGetFoo() and FooGetDatum() calls. These are places where no conversion from or to Datum was actually happening. We think these extra calls covered here were harmless. Some actual bugs that were discovered during this process have been committed separately (`80c758a2e1`, `2242b26ce4`). Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org	2025-08-08 22:06:57 +02:00
Thomas Munro	b5cd74612c	Remove obsolete comment. Remove a comment about potential for AIO in StartReadBuffersImpl(), because that change happened.	2025-08-09 01:46:04 +12:00
Etsuro Fujita	9e63f83a7e	Fix oversight in FindTriggerIncompatibleWithInheritance. This function is called from ATExecAttachPartition/ATExecAddInherit, which prevent tables with row-level triggers with transition tables from becoming partitions or inheritance children, to check if there is such a trigger on the given table, but failed to check if a found trigger is row-level, causing the caller functions to needlessly prevent a table with only a statement-level trigger with transition tables from becoming a partition or inheritance child. Repair. Oversight in commit `501ed02cf`. Author: Etsuro Fujita <etsuro.fujita@gmail.com> Discussion: https://postgr.es/m/CAPmGK167mXzwzzmJ_0YZ3EZrbwiCxtM1vogH_8drqsE6PtxRYw%40mail.gmail.com Backpatch-through: 13	2025-08-08 17:35:00 +09:00
Etsuro Fujita	62a1211d33	Disallow collecting transition tuples from child foreign tables. Commit `9e6104c66` disallowed transition tables on foreign tables, but failed to account for cases where a foreign table is a child table of a partitioned/inherited table on which transition tables exist, leading to incorrect transition tuples collected from such foreign tables for queries on the parent table triggering transition capture. This occurred not only for inherited UPDATE/DELETE but for partitioned INSERT later supported by commit `3d956d956`, which should have handled it at least for the INSERT case, but didn't. To fix, modify ExecARTriggers to throw an error if the given relation is a foreign table requesting transition capture. Also, this commit fixes make_modifytable so that in case of an inherited UPDATE/DELETE triggering transition capture, FDWs choose normal operations to modify child foreign tables, not DirectModify; which is needed because they would otherwise skip the calls to ExecARTriggers at execution, causing unexpected behavior. Author: Etsuro Fujita <etsuro.fujita@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Discussion: https://postgr.es/m/CAPmGK14QJYikKzBDCe3jMbpGENnQ7popFmbEgm-XTNuk55oyHg%40mail.gmail.com Backpatch-through: 13	2025-08-08 10:50:00 +09:00
Michael Paquier	84b32fd228	Add information about "generation" when dropping a pgstats entry twice. Dropping a pgstats entry twice should not happen, and the error report generated was missing the "generation" counter (tracking when an entry is reused) that was added in `818119afcc`. Like `d92573adcb`, backpatch down to v15 where this information is useful to have, to gather more information from instances where the problem shows up. A report has shown that this error path has been reached on a standby based on 17.3, for a relation stats entry and an OID close to wraparound. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/CAN4RuQvYth942J2+FcLmJKgdpq6fE5eqyFvb_PuskxF2eL=Wzg@mail.gmail.com Backpatch-through: 15	2025-08-08 09:07:10 +09:00
Dean Rasheed	d699687b32	Extend int128.h to support more numeric code. This adds a few more functions to int128.h, allowing more of numeric.c to use 128-bit integers on all platforms. Specifically, int64_div_fast_to_numeric() and the following aggregate functions can now use 128-bit integers for improved performance on all platforms, rather than just platforms with native support for int128: - SUM(int8) - AVG(int8) - STDDEV_POP(int2 or int4) - STDDEV_SAMP(int2 or int4) - VAR_POP(int2 or int4) - VAR_SAMP(int2 or int4) In addition to improved performance on platforms lacking native 128-bit integer support, this significantly simplifies this numeric code by allowing a lot of conditionally compiled code to be deleted. A couple of numeric functions (div_var_int64() and sqrt_var()) still contain conditionally compiled 128-bit integer code that only works on platforms with native 128-bit integer support. Making those work more generally would require rolling our own higher precision 128-bit division, which isn't supported for now. Author: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Discussion: https://postgr.es/m/CAEZATCWgBMc9ZwKMYqQpaQz2X6gaamYRB+RnMsUNcdMcL2Mj_w@mail.gmail.com	2025-08-07 15:49:24 +01:00
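As a rough illustration of how a 128-bit accumulator can be emulated on platforms without native int128, the sketch below adds a signed 64-bit value to a high/low pair. `DemoInt128` and `demo_add_int64` are hypothetical names; this is not the code in int128.h.

```c
#include <stdint.h>

/* Hypothetical two-word representation: signed high word, unsigned low word */
typedef struct
{
    int64_t     hi;
    uint64_t    lo;
} DemoInt128;

/* Add a signed 64-bit value to the accumulator, propagating the carry */
static void
demo_add_int64(DemoInt128 *acc, int64_t v)
{
    uint64_t    old_lo = acc->lo;

    acc->lo += (uint64_t) v;
    if (acc->lo < old_lo)
        acc->hi++;              /* unsigned wraparound: carry into hi */
    if (v < 0)
        acc->hi--;              /* sign extension of v contributes -1 to hi */
}
```

An accumulator like this lets a SUM(int8)-style aggregate keep an exact running total without falling back to arbitrary-precision arithmetic for every row.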
Alexander Korotkov	466c5435fd	Fix checkpointer shared memory allocation Use Min(NBuffers, MAX_CHECKPOINT_REQUESTS) instead of NBuffers in CheckpointerShmemSize() to match the actual array size limit set in CheckpointerShmemInit(). This prevents wasting shared memory when NBuffers > MAX_CHECKPOINT_REQUESTS. Also, fix the comment. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1439188.1754506714%40sss.pgh.pa.us Author: Xuneng Zhou <xunengzhou@gmail.com> Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-08-07 14:29:02 +03:00
Michael Paquier	2242b26ce4	Fix incorrect Datum conversion in timestamptz_trunc_internal() The code used a PG_RETURN_TIMESTAMPTZ() where the return type is TimestampTz and not a Datum. On 64-bit systems, there is no effect since this just ends up casting 64-bit integers back and forth. On 32-bit systems, timestamptz is pass-by-reference. PG_RETURN_TIMESTAMPTZ() allocates new memory and returns the address, meaning that the caller could interpret this as a timestamp value. The effect is that using "date_trunc(..., 'infinity'::timestamptz)" will return random values (instead of the correct return value 'infinity'). Bug introduced in commit `d85ce012f9`. Author: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/2d320b6f-b4af-4fbc-9eec-5d0fa15d187b@eisentraut.org Discussion: https://postgr.es/m/4bf60a84-2862-4a53-acd5-8eddf134a60e@eisentraut.org Backpatch-through: 18	2025-08-07 11:02:04 +09:00
Nathan Bossart	9ea3b6f751	Expand usage of macros for protocol characters. This commit makes use of the existing PqMsg_* macros in more places and adds new PqReplMsg_* and PqBackupMsg_* macros for use in special replication and backup messages, respectively. Author: Dave Cramer <davecramer@gmail.com> Co-authored-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Euler Taveira <euler@eulerto.com> Discussion: https://postgr.es/m/aIECfYfevCUpenBT@nathan Discussion: https://postgr.es/m/CAFcNs%2Br73NOUb7%2BqKrV4HHEki02CS96Z%2Bx19WaFgE087BWwEng%40mail.gmail.com	2025-08-06 13:37:00 -05:00
Nathan Bossart	35baa60cc7	Rename transformRelOptions()'s "namspace" parameter to "nameSpace". The name "namspace" looks like a typo, but it was presumably meant to avoid using the "namespace" C++ keyword. This commit renames the parameter to "nameSpace" to prevent future confusion while still avoiding the keyword. Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/aJJxpfsDfiQ1VbJ5%40nathan	2025-08-06 12:08:07 -05:00
Peter Eisentraut	73d33be4da	Remove INT64_HEX_FORMAT and UINT64_HEX_FORMAT These were introduced (commit `efdc7d7475`) at the same time as we were moving to using the standard inttypes.h format macros (commit `a0ed19e0a9`). It doesn't seem useful to keep a new already-deprecated interface like this with only a few users, so remove the new symbols again and have the callers use PRIx64. (Also, INT64_HEX_FORMAT was kind of a misnomer, since hex formats all use unsigned types.) Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/0ac47b5d-e5ab-4cac-98a7-bdee0e2831e4%40eisentraut.org	2025-08-06 11:08:10 +02:00
Masahiko Sawada	b5c53b403c	Suppress maybe-uninitialized warning. Following commit `e035863c9a`, building with -O0 began triggering warnings about potentially uninitialized 'workbuf' usage. While theoretically the initialization isn't necessary since VARDATA() doesn't access the contents of the pointed-to object, this commit explicitly initializes the workbuf variable to suppress the warning. Buildfarm members adder and flaviventris have shown the warning. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAD21AoCOZxfqnNgfM5yVKJZYnOq5m2Q96fBGy1fovEqQ9V4OZA@mail.gmail.com	2025-08-05 15:30:28 -07:00
Tom Lane	80c758a2e1	Fix incorrect return value in brin_minmax_multi_distance_numeric(). The result of "DirectFunctionCall1(numeric_float8, d)" is already in Datum form, but the code was incorrectly applying PG_RETURN_FLOAT8() to it. On machines where float8 is pass-by-reference, this would result in complete garbage, since an unpredictable pointer value would be treated as an integer and then converted to float. It's not entirely clear how much of a problem would ensue on 64-bit hardware, but certainly interpreting a float8 bitpattern as uint64 and then converting that to float isn't the intended behavior. As luck would have it, even the complete-garbage case doesn't break BRIN indexes, since the results are only used to make choices about how to merge values into ranges: at worst, we'd make poor choices resulting in an inefficient index. Doubtless that explains the lack of field complaints. However, users with BRIN indexes that use the numeric_minmax_multi_ops opclass may wish to reindex in hopes of making their indexes more efficient. Author: Peter Eisentraut <peter@eisentraut.org> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/2093712.1753983215@sss.pgh.pa.us Backpatch-through: 14	2025-08-05 16:51:10 -04:00
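The double-conversion bug pattern described above can be sketched with stand-in types. Everything here (`DemoDatum`, `demo_float8_get_datum`, `demo_distance`, and so on) is hypothetical and models only the pass-by-value case, assuming an 8-byte double; it is not PostgreSQL's Datum machinery.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t DemoDatum;     /* hypothetical stand-in for a Datum */

/* Store a double's bit pattern in a DemoDatum, and read it back */
static DemoDatum
demo_float8_get_datum(double x)
{
    DemoDatum   d;

    memcpy(&d, &x, sizeof d);
    return d;
}

static double
demo_datum_get_float8(DemoDatum d)
{
    double      x;

    memcpy(&x, &d, sizeof x);
    return x;
}

/* Stand-in for a call whose result is already in Datum form */
static DemoDatum
demo_distance(void)
{
    return demo_float8_get_datum(1.5);
}

/*
 * Bug pattern: the Datum's bit pattern is read as an integer, converted
 * to double, and then wrapped a second time.
 */
static DemoDatum
wrong_return(void)
{
    return demo_float8_get_datum((double) demo_distance());
}

/* Correct: pass the existing Datum through unchanged */
static DemoDatum
right_return(void)
{
    return demo_distance();
}
```

In this model the wrong variant yields a huge, meaningless number rather than 1.5, which corresponds to the "float8 bitpattern as uint64" case mentioned in the commit message.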
Masahiko Sawada	deb674454c	Add backup_type column to pg_stat_progress_basebackup. This commit introduces a new column backup_type that indicates the type of backup being performed: either 'full' or 'incremental'. Bump catalog version. Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp> Discussion: https://postgr.es/m/CAOzEurQuzbHwTj1ehk1a+eeQDidJPyrE5s6mYumkjwjZnurhkQ@mail.gmail.com	2025-08-05 10:50:45 -07:00
Jeff Davis	295a39770e	Don't copy datlocale from template unless provider matches. During CREATE DATABASE, if changing the locale provider, require that a new locale is specified rather than trying to reinterpret the template's locale using the new provider. This only affects the behavior when the template uses the builtin provider and CREATE DATABASE specifies the ICU provider without specifying the locale. Previously, that may have succeeded due to loose validation by ICU, whereas now that will cause an error. Because it can cause an error, backport only to unreleased versions. Discussion: https://postgr.es/m/5038b33a6dc639009f4b3d43fa6ae0c5ba9e04f7.camel@j-davis.com Backpatch-through: 18	2025-08-05 09:25:23 -07:00
Tom Lane	f291751ef8	Mop-up for commit `e035863c9`. Neither Peter nor I had tried this with USE_VALGRIND ... Per buildfarm member skink.	2025-08-05 12:11:33 -04:00
Peter Eisentraut	0f5ade7a36	Fix varatt versus Datum type confusions Macros like VARDATA() and VARSIZE() should be thought of as taking values of type pointer to struct varlena or some other related struct. The way they are implemented, you can pass anything to it and it will cast it right. But this is in principle incorrect. To fix, add the required DatumGetPointer() calls. Or in a couple of cases, remove superfluous PointerGetDatum() calls. It is planned in a subsequent patch to change macros like VARDATA() and VARSIZE() to inline functions, which will enforce stricter typing. This is in preparation for that. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/928ea48f-77c6-417b-897c-621ef16685a6%40eisentraut.org	2025-08-05 12:11:36 +02:00
Peter Eisentraut	2ad6e80de9	Fix various hash function uses These instances were using Datum-returning functions where a lower-level function returning uint32 would be more appropriate. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org	2025-08-05 11:47:23 +02:00
Amit Kapila	c9a5860f7a	Throw ERROR when publish_generated_columns is specified without a value. Previously, specifying the publication option 'publish_generated_columns' without an explicit value would incorrectly default to 'stored', which is not the intended behavior. This patch fixes the issue by raising an ERROR when no value is provided for 'publish_generated_columns', ensuring that users must explicitly specify a valid option. Author: Peter Smith <smithpb2250@gmail.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Backpatch-through: 18, where it was introduced Discussion: https://postgr.es/m/CAHut+PsCUCWiEKmB10DxhoPfXbF6jw5RD9ib2LuaQeA_XraW7w@mail.gmail.com	2025-08-05 09:34:22 +00:00
Peter Eisentraut	1469e31297	Fix mixups of FooGetDatum() vs. DatumGetFoo() Some of these were accidentally reversed, but there was no ill effect. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org	2025-08-05 10:53:49 +02:00
Fujii Masao	4614d53d4e	Avoid unexpected shutdown when sync_replication_slots is enabled. Previously, enabling sync_replication_slots while wal_level was not set to logical could cause the server to shut down. This was because the postmaster performed a configuration check before launching the slot synchronization worker and raised an ERROR if the settings were incompatible. Since ERROR is treated as FATAL in the postmaster, this resulted in the entire server shutting down unexpectedly. This commit changes the postmaster to log that message at LOG level instead of raising an ERROR, allowing the server to continue running even with the misconfiguration. Back-patch to v17, where slot synchronization was introduced. Reported-by: Hugo DUBOIS <hdubois@scaleway.com> Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Hugo DUBOIS <hdubois@scaleway.com> Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Discussion: https://postgr.es/m/CAH0PTU_pc3oHi__XESF9ZigCyzai1Mo3LsOdFyQA4aUDkm01RA@mail.gmail.com Backpatch-through: 17	2025-08-04 20:51:42 +09:00
David Rowley	bca9a1900c	Fix incorrect comment regarding mod_since_analyze Author: Yugo Nagata <nagata@sraoss.co.jp> Discussion: https://postgr.es/m/20250804140120.280c2d6a9d2ea687cd167743@sraoss.co.jp	2025-08-04 17:43:22 +12:00
Amit Kapila	fd5a1a0c3e	Detect and report update_deleted conflicts. This enhancement builds upon the infrastructure introduced in commit `228c370868`, which enables the preservation of deleted tuples and their origin information on the subscriber. This capability is crucial for handling concurrent transactions replicated from remote nodes. The update introduces support for detecting update_deleted conflicts during the application of update operations on the subscriber. When an update operation fails to locate the target row, typically because it has been concurrently deleted, we perform an additional table scan. This scan uses the SnapshotAny mechanism and we do this additional scan only when the retain_dead_tuples option is enabled for the relevant subscription. The goal of this scan is to locate the most recently deleted tuple (matching the old column values from the remote update) that has not yet been removed by VACUUM and is still visible according to our slot (i.e., its deletion is not older than conflict-detection-slot's xmin). If such a tuple is found, the system reports an update_deleted conflict, including the origin and transaction details responsible for the deletion. This provides the groundwork for a more robust and accurate conflict resolution process, preventing unexpected behavior by correctly identifying cases where a remote update clashes with a deletion from another origin. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com	2025-08-04 04:02:47 +00:00
Tom Lane	5c8eda1f72	Take a little more care in set_backtrace(). Coverity complained that the "errtrace" string is leaked if we return early because backtrace_symbols fails. Another criticism that could be leveled at this is that not providing any hint of what happened is user-unfriendly. Fix that. The odds of a leak here are small, and typically it wouldn't matter anyway since the leak will be in ErrorContext which will soon get reset. So I'm not feeling a need to back-patch.	2025-08-03 13:01:17 -04:00
Tom Lane	4fbfdde58e	Avoid leakage of zero-length arrays in partition_bounds_copy(). If ndatums is zero, the code would allocate zero-length boundKinds and boundDatums chunks, which would have nothing pointing to them, leading to Valgrind complaints. Rearrange the code to avoid the useless pallocs, and also to not bother computing byval/typlen when they aren't used. I'm unsure why I didn't see this in my Valgrind testing back in May. This code hasn't changed since then, but maybe we added a regression test that reaches this edge case. Or possibly I just failed to notice the reports, which do say "0 bytes lost". Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us	2025-08-02 21:59:46 -04:00
Tom Lane	b102c8c473	Silence complaints about leaks in PlanCacheComputeResultDesc. CompleteCachedPlan intentionally doesn't worry about small leaks from PlanCacheComputeResultDesc. However, Valgrind knows nothing of engineering tradeoffs and complains anyway. Silence it by doing things the hard way if USE_VALGRIND. I don't really love this patch, because it makes the handling of plansource->resultDesc different from the handling of query dependencies and search_path just above, which likewise are willing to accept small leaks into the cached plan's context. However, those cases aren't provoking Valgrind complaints. (Perhaps in a CLOBBER_CACHE_ALWAYS build, they would?) For the moment, this makes the src/pl/plpgsql tests leak-free according to Valgrind. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us	2025-08-02 21:59:46 -04:00
Tom Lane	7f6ededa76	Suppress complaints about leaks in TS dictionary loading. Like the situation with function cache loading, text search dictionary loading functions tend to leak some cruft into the dictionary's long-lived cache context. To judge by the examples in the core regression tests, not very many bytes are at stake. Moreover, I don't see a way to prevent such leaks without changing the API for TS template initialization functions: right now they do not have to worry about making sure that their results are long-lived. Hence, I think we should install a suppression rule rather than trying to fix this completely. However, I did grab some low-hanging fruit: several places were leaking the result of get_tsearch_config_filename. This seems worth doing mostly because they are inconsistent with other dictionaries that were freeing it already. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us	2025-08-02 21:59:46 -04:00
Tom Lane	db01c90b2f	Silence Valgrind leakage complaints in more-or-less-hackish ways. These changes don't actually fix any leaks. They just make sure that Valgrind will find pointers to data structures that remain allocated at process exit, and thus not falsely complain about leaks. In particular, we are trying to avoid situations where there is no pointer to the beginning of an allocated block (except possibly within the block itself, which Valgrind won't count). * Because dynahash.c never frees hashtable storage except by deleting the whole hashtable context, it doesn't bother to track the individual blocks of elements allocated by element_alloc(). This results in "possibly lost" complaints from Valgrind except when the first element of each block is actively in use. (Otherwise it'll be on a freelist, but very likely only reachable via "interior pointers" within element blocks, which doesn't satisfy Valgrind.) To fix, if we're building with USE_VALGRIND, expend an extra pointer's worth of space in each element block so that we can chain them all together from the HTAB header. Skip this in shared hashtables though: Valgrind doesn't track those, and we'd need additional locking to make it safe to manipulate a shared chain. While here, update a comment obsoleted by `9c911ec06`. * Put the dlist_node fields of catctup and catclist structs first. This ensures that the dlist pointers point to the starts of these palloc blocks, and thus that Valgrind won't consider them "possibly lost". * The postmaster's PMChild structs and the autovac launcher's avl_dbase structs also have the dlist_node-is-not-first problem, but putting it first still wouldn't silence the warning because we bulk-allocate those structs in an array, so that Valgrind sees a single allocation. Commonly the first array element will be pointed to only from some later element, so that the reference would be an interior pointer even if it pointed to the array start. 
(This is the same issue as for dynahash elements.) Since these are pretty simple data structures, I don't feel too bad about faking out Valgrind by just keeping a static pointer to the array start. (This is all quite hacky, and it's not hard to imagine usages where we'd need some other idea in order to have reasonable leak tracking of structures that are only accessible via dlist_node lists. But these changes seem to be enough to silence this class of leakage complaints for the moment.) * Free a couple of data structures manually near the end of an autovacuum worker's run when USE_VALGRIND, and ensure that the final vac_update_datfrozenxid() call is done in a non-permanent context. This doesn't have any real effect on the process's total memory consumption, since we're going to exit as soon as that last transaction is done. But it does pacify Valgrind. * Valgrind complains about the postmaster's socket-files and lock-files lists being leaked, which we can silence by just not nulling out the static pointers to them. * Valgrind seems not to consider the global "environ" variable as a valid root pointer; so when we allocate a new environment array, it claims that data is leaked. To fix that, keep our own statically-allocated copy of the pointer, similarly to the previous item. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us	2025-08-02 21:59:46 -04:00
Tom Lane	e78d1d6d47	Fix assorted pretty-trivial memory leaks in the backend. In the current system architecture, none of these are worth obsessing over; most are once-per-process leaks. However, Valgrind complains about all of them, and if we get to using threads rather than processes for backend sessions, it will become more interesting to avoid per-session leaks. * Fix leaks in StartupXLOG() and ShutdownWalRecovery(). * Fix leakage of pq_mq_handle in a parallel worker. While at it, move mq_putmessage's "Assert(pq_mq_handle != NULL)" to someplace where it's not trivially useless. * Fix leak in logicalrep_worker_detach(). * Don't leak the startup-packet buffer in ProcessStartupPacket(). * Fix leak in evtcache.c's DecodeTextArrayToBitmapset(). If the presented array is toasted, this neglected to free the detoasted copy, which was then leaked into EventTriggerCacheContext. * I'm distressed by the amount of code that BuildEventTriggerCache is willing to run while switched into a long-lived cache context. Although the detoasted array is the only leak that Valgrind reports, let's tighten things up while we're here. (DecodeTextArrayToBitmapset is still run in the cache context, so doing this doesn't remove the need for the detoast fix. But it reduces the surface area for other leaks.) * load_domaintype_info() intentionally leaked some intermediate cruft into the long-lived DomainConstraintCache's memory context, reasoning that the amount of leakage will typically not be much so it's not worth doing a copyObject() of the final tree to avoid that. But Valgrind knows nothing of engineering tradeoffs and complains anyway. On the whole, the copyObject doesn't cost that much and this is surely not a performance-critical code path, so let's do it the clean way. * MarkGUCPrefixReserved didn't bother to clean up removed placeholder GUCs at all, which shows up as a leak in one regression test. 
It seems appropriate for it to do as much cleanup as define_custom_variable does when replacing placeholders, so factor that code out into a helper function. define_custom_variable's logic was one brick shy of a load too: it forgot to free the separate allocation for the placeholder's name. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us	2025-08-02 21:59:46 -04:00
Tom Lane	9e9190154e	Fix MemoryContextAllocAligned's interaction with Valgrind. Arrange that only the "aligned chunk" part of the allocated space is included in a Valgrind vchunk. This suppresses complaints about that vchunk being possibly lost because PG is retaining only pointers to the aligned chunk. Also make sure that trailing wasted space is marked NOACCESS. As a tiny performance improvement, arrange that MCXT_ALLOC_ZERO zeroes only the returned "aligned chunk", not the wasted padding space. In passing, fix GetLocalBufferStorage to use MemoryContextAllocAligned instead of rolling its own implementation, which was equally broken according to Valgrind. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us	2025-08-02 21:59:46 -04:00
Tom Lane	bb049a79d3	Improve our support for Valgrind's leak tracking. When determining whether an allocated chunk is still reachable, Valgrind will consider only pointers within what it believes to be allocated chunks. Normally, all of a block obtained from malloc() would be considered "allocated" --- but it turns out that if we use VALGRIND_MEMPOOL_ALLOC to designate sub-section(s) of a malloc'ed block as allocated, all the rest of that malloc'ed block is ignored. This leads to lots of false positives of course. In particular, in any multi-malloc-block context, all but the primary block were reported as leaked. We also had a problem with context "ident" strings, which were reported as leaked unless there was some other pointer to them besides the one in the context header. To fix, we need to use VALGRIND_MEMPOOL_ALLOC to designate a context's management structs (the context struct itself and any per-block headers) as allocated chunks. That forces moving the VALGRIND_CREATE_MEMPOOL/VALGRIND_DESTROY_MEMPOOL calls into the per-context-type code, so that the pool identifier can be made as soon as we've allocated the initial block, but otherwise it's fairly straightforward. Note that in Valgrind's eyes there is no distinction between these allocations and the allocations that the mmgr modules hand out to user code. That's fine for now, but perhaps someday we'll want to do better yet. When reading this patch, it's helpful to start with the comments added at the head of mcxt.c. Author: Andres Freund <andres@anarazel.de> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us Discussion: https://postgr.es/m/20210317181531.7oggpqevzz6bka3g@alap3.anarazel.de	2025-08-02 21:59:46 -04:00
Michael Paquier	3b3fa94900	Fix use-after-free with INSERT ON CONFLICT changes in reorderbuffer.c. In ReorderBufferProcessTXN(), used to send the data of a transaction to an output plugin, INSERT ON CONFLICT changes (INTERNAL_SPEC_INSERT) are delayed until a confirmation record arrives (INTERNAL_SPEC_CONFIRM), updating the change being processed. `8c58624df4` added an extra step after processing a change to update the progress of the transaction, by calling the callback update_progress_txn() based on the LSN stored in a change after a threshold of CHANGES_THRESHOLD (100) is reached. This logic missed the fact that for an INSERT ON CONFLICT change the data is freed once processed, hence update_progress_txn() could be called with an LSN read from a change that had already been freed. This could result in random crashes, depending on the workload. Per discussion, this issue is fixed by reusing in update_progress_txn() the LSN from the change processed found at the beginning of the loop, meaning that for an INTERNAL_SPEC_CONFIRM change the progress is updated using the LSN of the INTERNAL_SPEC_CONFIRM change, and not the LSN from its INTERNAL_SPEC_INSERT change. This is actually more correct, as we want to update the progress to point to the INTERNAL_SPEC_CONFIRM change. Masahiko Sawada has found a nice trick to reproduce the issue: hardcode CHANGES_THRESHOLD at 1 and run test_decoding (test "ddl" being enough) on an instance running valgrind. The bug has been analyzed by Ethan Mertz, who also originally suggested the solution used in this patch. Issue introduced by `8c58624df4`, so backpatch down to v16. Author: Ethan Mertz <ethan.mertz@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/aIsQqDZ7x4LAQ6u1@paquier.xyz Backpatch-through: 16	2025-08-02 17:08:45 +09:00
Nathan Bossart	9eb6068fb6	Allow resetting unknown custom GUCs with reserved prefixes. Currently, ALTER DATABASE/ROLE/SYSTEM RESET [ALL] with an unknown custom GUC with a prefix reserved by MarkGUCPrefixReserved() errors (unless a superuser runs a RESET ALL variant). This is problematic for cases such as an extension library upgrade that removes a GUC. To fix, simply make sure the relevant code paths explicitly allow it. Note that we require superuser or privileges on the parameter to reset it. This is perhaps a bit more restrictive than is necessary, but it's not clear whether further relaxing the requirements is safe. Oversight in commit `88103567cb`. The ALTER SYSTEM fix is dependent on commit `2d870b4aef`, which first appeared in v17. Unfortunately, back-patching that commit would introduce ABI breakage, and while that breakage seems unlikely to bother anyone, it doesn't seem worth the risk. Hence, the ALTER SYSTEM part of this commit is omitted on v15 and v16. Reported-by: Mert Alev <mert@futo.org> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Discussion: https://postgr.es/m/18964-ba09dea8c98fccd6%40postgresql.org Backpatch-through: 15	2025-08-01 16:52:11 -05:00
Masahiko Sawada	a2c6c4ed31	Fix typo in AutoVacLauncherMain(). Author: Yugo Nagata <nagata@sraoss.co.jp> Discussion: https://postgr.es/m/20250802002027.cd35c481f6c6bae7ca2a3e26@sraoss.co.jp	2025-08-01 18:02:41 +00:00
Amit Kapila	2ab2d6f970	Fix a deadlock during ALTER SUBSCRIPTION ... DROP PUBLICATION. A deadlock can occur when the DDL command and the apply worker acquire catalog locks in different orders while dropping replication origins. The issue is rare in PG16 and higher branches because, in most cases, the tablesync worker performs the origin drop in those branches, and its locking sequence does not conflict with DDL operations. This patch ensures consistent lock acquisition to prevent such deadlocks. As per buildfarm. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Ajin Cherian <itsajin@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 14, where it was introduced Discussion: https://postgr.es/m/bab95e12-6cc5-4ebb-80a8-3e41956aa297@gmail.com	2025-08-01 07:58:48 +00:00
Michael Paquier	e125e36002	Rename CachedPlanType to PlannedStmtOrigin for PlannedStmt. Commit `719dcf3c42` introduced a field called CachedPlanType in PlannedStmt to allow extensions to determine whether a cached plan is generic or custom. After discussion, the concepts that we want to track are a bit wider than initially anticipated, as it is closer to knowing from which "source" or "origin" a PlannedStmt has been generated or retrieved. Custom and generic cached plans are a subset of that. Based on the state of HEAD, we have been able to define two more origins: - "standard", for the case where PlannedStmt is generated in standard_planner(), the most common case. - "internal", for the fake PlannedStmt generated internally by some query patterns. This could be tuned in the future depending on what is needed. This looks like a good starting point, at least. The default value is called "UNKNOWN", provided as a fallback value. This value is not used in the core code; the idea is to let extensions building their own PlannedStmts know about this new field. Author: Michael Paquier <michael@paquier.xyz> Co-authored-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/aILaHupXbIGgF2wJ@paquier.xyz	2025-07-31 10:06:34 +09:00
Heikki Linnakangas	613f647122	Handle cancel requests with PID 0 gracefully. If the client sent a query cancel request with backend PID 0, it tripped an assertion. With assertions disabled, you got this in the log instead: "LOG: invalid cancel request with PID 0", "LOG: wrong key in cancel request for process 0". Query cancellations don't even require authentication, so we had better tolerate bogus requests. Fix by turning the assertion into a regular runtime check. Spotted while testing libpq behavior with a modified server that didn't send BackendKeyData to the client. Backpatch-through: 18	2025-07-30 00:39:49 +03:00
Tom Lane	4300d8b6a7	Don't put library-supplied -L/-I switches before user-supplied ones. For many optional libraries, we extract the -L and -l switches needed to link the library from a helper program such as llvm-config. In some cases we put the resulting -L switches into LDFLAGS ahead of -L switches specified via --with-libraries. That risks breaking the user's intention for --with-libraries. It's not such a problem if the library's -L switch points to a directory containing only that library, but on some platforms a library helper may "helpfully" offer a switch such as -L/usr/lib that points to a directory holding all standard libraries. If the user specified --with-libraries in hopes of overriding the standard build of some library, the -L/usr/lib switch prevents that from happening since it will come before the user-specified directory. To fix, avoid inserting these switches directly into LDFLAGS during configure, instead adding them to LIBDIRS or SHLIB_LINK. They will still eventually get added to LDFLAGS, but only after the switches coming from --with-libraries. The same problem exists for -I switches: those coming from --with-includes should appear before any coming from helper programs such as llvm-config. We have not heard field complaints about this case, but it seems certain that a user attempting to override a standard library could have issues. The changes for this go well beyond configure itself, however, because many Makefiles have occasion to manipulate CPPFLAGS to insert locally-desirable -I switches, and some of them got it wrong. The correct ordering is any -I switches pointing at within-the- source-tree-or-build-tree directories, then those from the tree-wide CPPFLAGS, then those from helper programs. There were several places that risked pulling in a system-supplied copy of libpq headers, for example, instead of the in-tree files. (Commit `cb36f8ec2` fixed one instance of that a few months ago, but this exercise found more.) 
The Meson build scripts may or may not have any comparable problems, but I'll leave it to someone else to investigate that. Reported-by: Charles Samborski <demurgos@demurgos.net> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/70f2155f-27ca-4534-b33d-7750e20633d7@demurgos.net Backpatch-through: 13	2025-07-29 15:17:40 -04:00
Peter Eisentraut	c3019bb778	Update comment The code being referred to was moved to a different function in commit `eb8312a22a`, so update the comment accordingly.	2025-07-29 18:57:14 +02:00
Tom Lane	902f922218	Remove unnecessary complication around xmlParseBalancedChunkMemory. When I prepared `71c0921b6` et al yesterday, I was thinking that the logic involving explicitly freeing the node_list output was still needed to dodge leakage bugs in libxml2. But I was misremembering: we introduced that only because with early 2.13.x releases we could not trust xmlParseBalancedChunkMemory's result code, so we had to look to see if a node list was returned or not. There's no reason to believe that xmlParseBalancedChunkMemory will fail to clean up the node list when required, so simplify. (This essentially completes reverting all the non-cosmetic changes in 6082b3d5d.) Reported-by: Jim Jones <jim.jones@uni-muenster.de> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/997668.1753802857@sss.pgh.pa.us Backpatch-through: 13	2025-07-29 12:47:38 -04:00
Robert Haas	1d1612aec7	Run pgindent. Per buildfarm member koel, Nathan Bossart, and David Rowley.	2025-07-29 09:10:41 -04:00
David Rowley	4bc62b8684	Display Memoize planner estimates in EXPLAIN There've been a few complaints that it can be overly difficult to figure out why the planner picked a Memoize plan. To help address that, here we adjust the EXPLAIN output to display the following additional details: 1) The estimated number of cache entries that can be stored at once 2) The estimated number of unique lookup keys that we expect to see 3) The number of lookups we expect 4) The estimated hit ratio Technically #4 can be calculated using #1, #2 and #3, but it's not a particularly obvious calculation, so we opt to display it explicitly. The original patch by Lukas Fittl only displayed the hit ratio, but there was a fear that might lead to more questions about how that was calculated. The idea with displaying all 4 is to be transparent which may allow queries to be tuned more easily. For example, if #2 isn't correct then maybe extended statistics or a manual n_distinct estimate can be used to help fix poor plan choices. Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CAP53Pky29GWAVVk3oBgKBDqhND0BRBN6yTPeguV_qSivFL5N_g%40mail.gmail.com	2025-07-29 15:18:01 +12:00
Tom Lane	71c0921b64	Avoid regression in the size of XML input that we will accept. This mostly reverts commit `6082b3d5d`, "Use xmlParseInNodeContext not xmlParseBalancedChunkMemory". It turns out that xmlParseInNodeContext will reject text chunks exceeding 10MB, while (in most libxml2 versions) xmlParseBalancedChunkMemory will not. The bleeding-edge libxml2 bug that we needed to work around a year ago is presumably no longer a factor, and the argument that xmlParseBalancedChunkMemory is semi-deprecated is not enough to justify a functionality regression. Hence, go back to doing it the old way. Reported-by: Michael Paquier <michael@paquier.xyz> Author: Michael Paquier <michael@paquier.xyz> Co-authored-by: Erik Wienhold <ewie@ewie.name> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/aIGknLuc8b8ega2X@paquier.xyz Backpatch-through: 13	2025-07-28 16:50:41 -04:00
Robert Haas	d5b9b2d402	Remove misleading hint for "unexpected data beyond EOF" error. Commit `ffae5cc5a6` added this hint in 2006, but it's now obsolete and doesn't reflect what users should really check in this situation. We were not able to agree on a new hint, so just delete the existing one and update the comments to mention one possibility that is known to cause problems of this kind: something other than PostgreSQL is modifying files in the PostgreSQL data directory. Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Robert Haas <rhaas@postgresql.org> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Christoph Berg <myon@debian.org> Discussion: https://postgr.es/m/CAKZiRmxNbcaL76x=09Sxf7aUmrRQJBf8drzDdUHo+j9_eM+VMg@mail.gmail.com	2025-07-28 11:15:47 -04:00
Robert Haas	dcc9820a35	Avoid throwing away the error message in syncrep_yyerror. Commit `473a575e05` purported to make this function stash the error message in *syncrep_parse_result_p, but it didn't actually. As a result, an attempt to set synchronous_standby_names to any value that does not parse resulted in a generic "parser failed." message rather than anything more specific. This fixes that. Discussion: http://postgr.es/m/CA+TgmoYF9wPNZ-Q_EMfib_espgHycY-eX__6Tzo2GpYpVXqCdQ@mail.gmail.com Backpatch-through: 18	2025-07-28 10:35:05 -04:00
Michael Paquier	793928c2d5	Fix performance regression with flush of pending fixed-numbered stats. The callback added in `fc415edf8c` to check whether there is any pending data to flush for fixed-numbered statistics, done by looping across all the builtin and custom stats kinds with a call to have_fixed_pending_cb, is proving able to show up in workloads that do not report any stats (read-only, no function calls, no WAL, no IO, etc). The code used in v17 was cheaper than what HEAD has introduced, relying on three boolean checks for WAL, SLRU and IO stats. This commit switches the code to use a more efficient approach than `fc415edf8c`, with a single boolean flag that can be switched to "true" by any fixed-numbered stats kind to force pgstat_report_stat() to go through one round of reports. The flag is reset by pgstat_report_stat() once a full round of reports is done. The flag being false means that fixed-numbered stats kinds saw no activity, and that there is no pending data to flush. `ac000fca74` took one step in improving the performance by reducing the number of stats kinds that the backend can hold. This commit takes a more drastic step by bringing back the code efficiency to what it was before v18 with a cheap check at the beginning of pgstat_report_stat() for its fast-exit path. The callback have_static_pending_cb is removed as an effect of all that. Reported-by: Andres Freund <andres@anarazel.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/eb224uegsga2hgq7dfq3ps5cduhpqej7ir2hjxzzozjthrekx5@dysei6buqthe Backpatch-through: 18	2025-07-28 08:15:11 +09:00
Alexander Korotkov	258bf0a2ea	Process sync requests incrementally in AbsorbSyncRequests. If the number of sync requests is big enough, the palloc() call in AbsorbSyncRequests() will attempt to allocate more than 1 GB of memory, resulting in failure. This can lead to an infinite loop in the checkpointer process, as it repeatedly fails to absorb the pending requests. This commit introduces the following changes to cope with this problem: 1. Turn the pending checkpointer requests array in shared memory into a bounded ring buffer. 2. Limit the maximum ring buffer size to 10M items. 3. Make AbsorbSyncRequests() process requests incrementally in 10K batches. Even #2 alone makes the whole queue fit within the maximum palloc() size of 1 GB; #3 additionally limits the duration of continuous lock holding. This commit is for master only. A simpler fix, which just limits the request queue size to 10M, will be backpatched. Reported-by: Ekaterina Sokolova <e.sokolova@postgrespro.ru> Discussion: https://postgr.es/m/db4534f83a22a29ab5ee2566ad86ca92%40postgrespro.ru Author: Maxim Orlov <orlovmg@gmail.com> Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-07-27 15:07:47 +03:00
Michael Paquier	6f22a82a40	Add assertions for all the required index AM callbacks Similar checks are done for the mandatory table AM callbacks. A portion of the index AM callbacks are optional and can be NULL; the rest is mandatory and is documented as such in the documentation and in amapi.h. These checks are useful to detect quickly if all the mandatory callbacks are defined when implementing a new index access method, as the assertions are run when loading the AM. Author: Japin Li <japinli@hotmail.com> Discussion: https://postgr.es/m/ME0P300MB0445795D31CEAB92C58B41FDB651A@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM	2025-07-27 17:48:47 +09:00
Tom Lane	80aa9848be	Reap the benefits of not having to avoid leaking PGresults. Remove a bunch of PG_TRY constructs, de-volatilize related variables, remove some PQclear calls in error paths. Aside from making the code simpler and shorter, this should provide some marginal performance gains. For ease of review, I did not re-indent code within the removed PG_TRY constructs. That'll be done in a separate patch. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/2976982.1748049023@sss.pgh.pa.us	2025-07-25 16:31:43 -04:00
Tom Lane	7d8f595779	Create infrastructure to reliably prevent leakage of PGresults. Commit `232d8caea` fixed a case where postgres_fdw could lose track of a PGresult object, resulting in a process-lifespan memory leak. But I have little faith that there aren't other potential PGresult leakages, now or in future, in the backend modules that use libpq. Therefore, this patch proposes infrastructure that makes all PGresults returned from libpq act as though they are palloc'd in the CurrentMemoryContext (with the option to relocate them to another context later). This should greatly reduce the risk of careless leaks, and it also permits removal of a bunch of code that attempted to prevent such leaks via PG_TRY blocks. This patch adds infrastructure that wraps each PGresult in a "libpqsrv_PGresult" that provides a memory context reset callback to PQclear the PGresult. Code using this abstraction is inherently memory-safe to the same extent as we are accustomed to in most backend code. Furthermore, we add some macros that automatically redirect calls of the libpq functions concerned with PGresults to use this infrastructure, so that almost no source-code changes are needed to wheel this infrastructure into place in all the backend code that uses libpq. Perhaps in future we could create similar infrastructure for PGconn objects, but there seems less need for that. This patch just creates the infrastructure and makes relevant code use it, including reverting `232d8caea` in favor of this mechanism. A good deal of follow-on simplification is possible now that we don't have to be so cautious about freeing PGresults, but I'll put that in a separate patch. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/2976982.1748049023@sss.pgh.pa.us	2025-07-25 16:30:00 -04:00
Tom Lane	5457ea46d1	Fix dynahash's HASH_FIXED_SIZE ("isfixed") option. This flag was effectively a no-op in EXEC_BACKEND (ie, Windows) builds, because it was kept in the process-local HTAB struct, and it could only ever become set in the postmaster's copy. The simplest fix is to move it to the shared HASHHDR struct. We could keep a copy in HTAB as well, as we do with keysize and some other fields, but the "too much contention" argument doesn't seem to apply here: we only examine isfixed during element_alloc(), which had better not get hit very often for a shared hashtable. This oversight dates to `7c797e719` which invented the option. But back-patching doesn't seem appropriate given the lack of field complaints. If there is anyone running an affected workload on Windows, they might be unhappy about the behavior changing in a minor release. Author: Aidar Imamov <a.imamov@postgrespro.ru> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/4d0cb35ff01c5c74d2b9a582ecb73823@postgrespro.ru	2025-07-25 10:56:55 -04:00
Álvaro Herrera	1dfe3ef3f9	Refactor grammar to create opt_utility_option_list This changes the grammar for REINDEX, CHECKPOINT, CLUSTER, ANALYZE/ANALYSE; they still accept the same options as before, but the grammar is written differently for convenience of future development. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/202507231538.ir7pjzoow6oe@alvherre.pgsql	2025-07-25 12:03:19 +02:00
Fujii Masao	b5d084c535	Fix background worker not restarting after crash-and-restart cycle. Previously, if a background worker crashed (e.g., due to a SIGKILL) and the server restarted due to restart_after_crash being enabled, the worker was not restarted as expected. Background workers without the never-restart flag should automatically restart in this case. This issue was introduced in commit `28a520c0b7`, which failed to reset the rw_pid field in the RegisteredBgWorker struct for the crashed worker. This commit fixes the problem by resetting rw_pid for all eligible background workers during the crash-and-restart cycle. Back-patched to v18, where the bug was introduced. Bug fix patches were proposed by Andrey Rudometov and ChangAo Chen, but this commit uses a different approach. Reported-by: Andrey Rudometov <unlimitedhikari@gmail.com> Reported-by: ChangAo Chen <cca5507@qq.com> Author: Andrey Rudometov <unlimitedhikari@gmail.com> Author: ChangAo Chen <cca5507@qq.com> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: ChangAo Chen <cca5507@qq.com> Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Discussion: https://postgr.es/m/CAF6JsWiO=i24qYitWe6ns1sXqcL86rYxdyU+pNYk-WueKPSySg@mail.gmail.com Discussion: https://postgr.es/m/tencent_E00A056B3953EE6440F0F40F80EC30427D09@qq.com Backpatch-through: 18	2025-07-25 18:38:36 +09:00
Michael Paquier	641f20d4c4	Fix assertion failure with latch wait in single-user mode LatchWaitSetPostmasterDeathPos, the latch event position for the postmaster death event, is initialized under IsUnderPostmaster. WaitLatch() considered it as a valid wait target in single-user mode (!IsUnderPostmaster), which was incorrect. One code path found to fail with an assertion failure is a database drop in single-user mode while waiting in WaitForProcSignalBarrier() after the drop. Oversight in commit `84e5b2f07a`. Author: Patrick Stählin <me@packi.ch> Co-authored-by: Ronan Dunklau <ronan.dunklau@aiven.io> Discussion: https://postgr.es/m/18996-3a2744c8140488de@postgresql.org Backpatch-through: 18	2025-07-25 16:17:13 +09:00
Nathan Bossart	15d33eb192	Fix return value of visibilitymap_get_status(). This function is declared as returning a uint8, but it returns a bool in one code path. To fix, return (uint8) 0 instead of false there. This should behave exactly the same as before, but it might prevent future compiler complaints. Oversight in commit `a892234f83`. Author: Julien Rouhaud <rjuju123@gmail.com> Discussion: https://postgr.es/m/aIHluT2isN58jqHV%40jrouhaud	2025-07-24 10:13:45 -05:00
Michael Paquier	719dcf3c42	Introduce field tracking cached plan type in PlannedStmt PlannedStmt gains a new field, called CachedPlanType, able to track if a given plan tree originates from the cache and if we are dealing with a generic or custom cached plan. This field can be used for monitoring or statistical purposes, in the executor hooks, for example, based on the planned statement attached to a QueryDesc. A patch is under discussion for pg_stat_statements to provide an equivalent of the counters in pg_prepared_statements for custom and generic plans, to provide a more global view of such data, as this data is now restricted to the current session. The concept introduced in this commit is useful on its own, and has been extracted from a larger patch by the same author. Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAA5RZ0uFw8Y9GCFvafhC=OA8NnMqVZyzXPfv_EePOt+iv1T-qQ@mail.gmail.com	2025-07-24 15:41:18 +09:00
Tom Lane	e6dfd068ed	Fix build breakage on Solaris-alikes with late-model GCC. Solaris has never bothered to add "const" to the second argument of PAM conversation procs, as all other Unixen did decades ago. This resulted in an "incompatible pointer" compiler warning when building --with-pam, but had no more serious effect than that, so we never did anything about it. However, as of GCC 14 the case is an error, not a warning, by default. To complicate matters, recent OpenIndiana (and maybe illumos in general?) does supply the "const" by default, so we can't just assume that platforms using our solaris template need help. What we can do, short of building a configure-time probe, is to make solaris.h #define _PAM_LEGACY_NONCONST, which causes OpenIndiana's pam_appl.h to revert to the traditional definition, and hopefully will have no effect anywhere else. Then we can use that same symbol to control whether we include "const" in the declaration of pam_passwd_conv_proc(). Bug: #18995 Reported-by: Andrew Watkins <awatkins1966@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18995-82058da9ab4337a7@postgresql.org Backpatch-through: 13	2025-07-23 15:44:29 -04:00
Nathan Bossart	2047ad0681	Cross-check lists of built-in LWLock tranches. lwlock.c, lwlock.h, and wait_event_names.txt each contain a list of built-in LWLock tranches. It is easy to miss one or the other when adding or removing tranches, and discrepancies have adverse effects (e.g., breaking JOINs between pg_stat_activity and pg_wait_events). This commit moves the lists of built-in tranches in lwlock.{c,h} to lwlocklist.h and adds a cross-check to the script that generates lwlocknames.h. If the lists do not match exactly, building will fail. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aHpOgwuFQfcFMZ/B%40ip-10-97-1-34.eu-west-3.compute.internal	2025-07-23 12:06:20 -05:00
Nathan Bossart	37c7a7eeb6	Use PqMsg_* macros in walsender.c Oversights in commits `f4b54e1ed9`, `dc21234005`, and `228c370868`. Author: Dave Cramer <davecramer@gmail.com> Discussion: https://postgr.es/m/CADK3HH%2BowWVdnbmWH4NHG8%3D%2BkXA_wjsyEVLoY719iJnb%3D%2BtT6A%40mail.gmail.com	2025-07-23 10:29:45 -05:00
Amit Kapila	228c370868	Preserve conflict-relevant data during logical replication. Logical replication requires reliable conflict detection to maintain data consistency across nodes. To achieve this, we must prevent premature removal of tuples deleted by other origins and their associated commit_ts data by VACUUM, which could otherwise lead to incorrect conflict reporting and resolution. This patch introduces a mechanism to retain deleted tuples on the subscriber during the application of concurrent transactions from remote nodes. Retaining these tuples allows us to correctly ignore concurrent updates to the same tuple. Without this, an UPDATE might be misinterpreted as an INSERT during resolutions due to the absence of the original tuple. Additionally, we ensure that origin metadata is not prematurely removed by vacuum freeze, which is essential for detecting update_origin_differs and delete_origin_differs conflicts. To support this, a new replication slot named pg_conflict_detection is created and maintained by the launcher on the subscriber. Each apply worker tracks its own non-removable transaction ID, which the launcher aggregates to determine the appropriate xmin for the slot, thereby retaining necessary tuples. Conflict information retention (deleted tuples and commit_ts) can be enabled per subscription via the retain_conflict_info option. This is disabled by default to avoid unnecessary overhead for configurations that do not require conflict resolution or logging. During upgrades, if any subscription on the old cluster has retain_conflict_info enabled, a conflict detection slot will be created to protect relevant tuples from deletion when the new cluster starts. This is foundational work for correctly detecting the update_deleted conflict, which will be done in a follow-up patch. 
Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com	2025-07-23 02:56:00 +00:00
Fujii Masao	a7ca73af66	Remove translation marker from libpq-be-fe-helpers.h. Commit `112faf1378` introduced a translation marker in libpq-be-fe-helpers.h, but this caused build failures on some platforms—such as the one reported by buildfarm member indri—due to linker issues with dblink. This is the same problem previously addressed in commit `213c959a29`. To fix the issue, this commit removes the translation marker from libpq-be-fe-helpers.h, following the approach used in `213c959a29`. It also removes the associated gettext_noop() calls added in commit `112faf1378`, as they are no longer needed. While reviewing this, a gettext_noop() call was also found in contrib/basic_archive. Since contrib modules don't support translation, this call has been removed as well. Per buildfarm member indri. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/0e6299d9-608a-4ffa-aeb1-40cb8a99000b@oss.nttdata.com	2025-07-22 22:08:36 +09:00
Andres Freund	d3f97fd1dd	aio: Fix assertion, clarify README The assertion wouldn't have triggered for a long while yet, but this won't accidentally fail to detect the issue if/when it occurs. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAEze2Wj-43JV4YufW23gm=Uwr7Lkj+p0yKctKHxNm1rwFC+_DQ@mail.gmail.com Backpatch-through: 18	2025-07-22 08:30:52 -04:00
Fujii Masao	112faf1378	Log remote NOTICE, WARNING, and similar messages using ereport(). Previously, NOTICE, WARNING, and similar messages received from remote servers over replication, postgres_fdw, or dblink connections were printed directly to stderr on the local server (e.g., the subscriber). As a result, these messages lacked log prefixes (e.g., timestamp), making them harder to trace and correlate with other log entries. This commit addresses the issue by introducing a custom notice receiver for replication, postgres_fdw, and dblink connections. These messages are now logged via ereport(), ensuring they appear in the logs with proper formatting and context, which improves clarity and aids in debugging. Author: Vignesh C <vignesh21@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CALDaNm2xsHpWRtLm-VL_HJCsaE3+1Y_n-jDEAr3-suxVqc3xoQ@mail.gmail.com	2025-07-22 14:16:45 +09:00
Richard Guo	e2debb6438	Reduce "Var IS [NOT] NULL" quals during constant folding In commit `b262ad440`, we introduced an optimization that reduces an IS [NOT] NULL qual on a NOT NULL column to constant true or constant false, provided we can prove that the input expression of the NullTest is not nullable by any outer joins or grouping sets. This deduction happens quite late in the planner, during the distribution of quals to rels in query_planner. However, this approach has some drawbacks: we can't perform any further folding with the constant, and it turns out to be prone to bugs. Ideally, this deduction should happen during constant folding. However, the per-relation information about which columns are defined as NOT NULL is not available at that point. This information is currently collected from catalogs when building RelOptInfos for base or "other" relations. This patch moves the collection of NOT NULL attribute information for relations before pull_up_sublinks, storing it in a hash table keyed by relation OID. It then uses this information to perform the NullTest deduction for Vars during constant folding. This also makes it possible to leverage this information to pull up NOT IN subqueries. Note that this patch does not get rid of restriction_is_always_true and restriction_is_always_false. Removing them would prevent us from reducing some IS [NOT] NULL quals that we were previously able to reduce, because (a) the self-join elimination may introduce new IS NOT NULL quals after constant folding, and (b) if some outer joins are converted to inner joins, previously irreducible NullTest quals may become reducible. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAMbWs4-bFJ1At4btk5wqbezdu8PLtQ3zv-aiaY3ry9Ymm=jgFQ@mail.gmail.com	2025-07-22 11:21:36 +09:00
Richard Guo	904f6a593a	Centralize collection of catalog info needed early in the planner There are several pieces of catalog information that need to be retrieved for a relation during the early stage of planning. These include relhassubclass, which is used to clear the inh flag if the relation has no children, as well as a column's attgenerated and default value, which are needed to expand virtual generated columns. More such information may be required in the future. Currently, these pieces of catalog data are collected in multiple places, resulting in repeated table_open/table_close calls for each relation in the rangetable. This patch centralizes the collection of all required early-stage catalog information into a single loop over the rangetable, allowing each relation to be opened and closed only once. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAMbWs4-bFJ1At4btk5wqbezdu8PLtQ3zv-aiaY3ry9Ymm=jgFQ@mail.gmail.com	2025-07-22 11:20:40 +09:00
Richard Guo	e0d0529526	Expand virtual generated columns before sublink pull-up Currently, we expand virtual generated columns after we have pulled up any SubLinks within the query's quals. This ensures that the virtual generated column references within SubLinks that should be transformed into joins are correctly expanded. This approach works well and has posed no issues. In an upcoming patch, we plan to centralize the collection of catalog information needed early in the planner. This will help avoid repeated table_open/table_close calls for relations in the rangetable. Since this information is required during sublink pull-up, we are moving the expansion of virtual generated columns to occur beforehand. To achieve this, if any EXISTS SubLinks can be pulled up, their rangetables are processed just before pulling them up. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAMbWs4-bFJ1At4btk5wqbezdu8PLtQ3zv-aiaY3ry9Ymm=jgFQ@mail.gmail.com	2025-07-22 11:19:17 +09:00
Tom Lane	aadf7db66e	Mostly-cosmetic adjustments to estimate_multivariate_bucketsize(). The only practical effect of these changes is to avoid a useless list_copy() operation when there is a single hashclause. That's never going to make any noticeable performance difference, but the code is arguably clearer this way, especially if we take the opportunity to add some comments so that readers don't have to reverse-engineer the usage of these local variables. Also add some braces for better/more consistent style. Author: Tender Wang <tndrwang@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAHewXNnHBOO9NEa=NBDYOrwZL4oHu2NOcTYvqyNyWEswo8f5OQ@mail.gmail.com	2025-07-19 14:23:02 -04:00
Alexander Korotkov	d3917d8f13	Fix infinite wait when reading a partially written WAL record If a crash occurs while writing a WAL record that spans multiple pages, the recovery process marks the page with the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag. However, logical decoding currently attempts to read the full WAL record based on its expected size before checking this flag, which can lead to an infinite wait if the remaining data is never written (e.g., no activity after crash). This patch updates the logic first to read the page header and check for the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to reconstruct the full WAL record. If the flag is set, decoding correctly identifies the record as incomplete and avoids waiting for WAL data that will never arrive. Discussion: https://postgr.es/m/CAAKRu_ZCOzQpEumLFgG_%2Biw3FTa%2BhJ4SRpxzaQBYxxM_ZAzWcA%40mail.gmail.com Discussion: https://postgr.es/m/CALDaNm34m36PDHzsU_GdcNXU0gLTfFY5rzh9GSQv%3Dw6B%2BQVNRQ%40mail.gmail.com Author: Vignesh C <vignesh21@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Backpatch-through: 13	2025-07-19 13:45:51 +03:00
Tom Lane	3683af6170	Speed up byteain by not parsing traditional-style input twice. Instead of laboriously computing the exact output length, use strlen to get an upper bound cheaply. (This is still O(N) of course, but the constant factor is a lot less.) This will typically result in overallocating the output datum, but that's of little concern since it's a short-lived allocation in just about all use-cases. A simple microbenchmark showed about 40% speedup for long input strings. While here, make some cosmetic cleanups and add a test case that covers the double-backslash code path in byteain and byteaout. Author: Steven Niu <niushiji@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Stepan Neretin <slpmcf@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/ca315729-140b-426e-81a6-6cd5cfe7ecc5@gmail.com	2025-07-18 16:42:10 -04:00
Nathan Bossart	84409ed640	Remove unused variable in generate-lwlocknames.pl. Oversight in commit `da952b415f`. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aHpOgwuFQfcFMZ/B%40ip-10-97-1-34.eu-west-3.compute.internal	2025-07-18 11:27:19 -05:00
Dean Rasheed	5022ff250e	Fix concurrent update trigger issues with MERGE in a CTE. If a MERGE inside a CTE attempts an UPDATE or DELETE on a table with BEFORE ROW triggers, and a concurrent UPDATE or DELETE happens, the merge code would fail (crashing in the case of an UPDATE action, and potentially executing the wrong action for a DELETE action). This is the same issue that `9321c79c86` attempted to fix, except now for a MERGE inside a CTE. As noted in `9321c79c86`, what needs to happen is for the trigger code to exit early, returning the TM_Result and TM_FailureData information to the merge code, if a concurrent modification is detected, rather than attempting to do an EPQ recheck. The merge code will then do its own rechecking, and rescan the action list, potentially executing a different action in light of the concurrent update. In particular, the trigger code must never call ExecGetUpdateNewTuple() for MERGE, since that is bound to fail because MERGE has its own per-action projection information. Commit `9321c79c86` did this using estate->es_plannedstmt->commandType in the trigger code to detect that a MERGE was being executed, which is fine for a plain MERGE command, but does not work for a MERGE inside a CTE. Fix by passing that information to the trigger code as an additional parameter passed to ExecBRUpdateTriggers() and ExecBRDeleteTriggers(). Back-patch as far as v17 only, since MERGE cannot appear inside a CTE prior to that. Additionally, take care to preserve the trigger ABI in v17 (though not in v18, which is still in beta). Bug: #18986 Reported-by: Yaroslav Syrytsia <me@ys.lc> Author: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/18986-e7a8aac3d339fa47@postgresql.org Backpatch-through: 17	2025-07-18 09:55:43 +01:00
Álvaro Herrera	b8926a5b4b	Remove assertion from PortalRunMulti We have an assertion to ensure that a command tag has been assigned by the time we're done executing, but if we happen to execute a command with no queries, the assertion would fail. Per discussion, rather than contort things to get a tag assigned, just remove the assertion. Oversight in `2f9661311b`. That commit also retained a comment that explained logic that had been adjacent to it but diffused into various places, leaving none apt to keep part of the comment. Remove that part, and rewrite what remains for extra clarity. Bug: #18984 Backpatch-through: 13 Reported-by: Aleksander Alekseev <aleksander@tigerdata.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Michaël Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/18984-0f4778a6599ac3ae@postgresql.org	2025-07-17 17:40:22 +02:00
Amit Langote	afa5c365ec	Remove duplicate line In `231b7d670b`, while copy-pasting some code into ExecEvalJsonCoercionFinish(), I (amitlan) accidentally introduced a duplicate line. Remove it. Reported-by: Jian He <jian.universality@gmail.com> Discussion: https://postgr.es/m/CACJufxHcf=BpmRAJcjgfjOUfV76MwKnyz1x3ErXsWL26EAFmng@mail.gmail.com	2025-07-17 14:37:06 +09:00
Michael Paquier	a493e741d3	Fix inconsistent LWLock tranche names for MultiXact* The terms used in wait_event_names.txt and lwlock.c were inconsistent for MultiXactOffsetSLRU and MultiXactMemberSLRU, which could cause joins between pg_wait_events and pg_stat_activity to fail. lwlock.c is adjusted in this commit to what the historical name of the event has always been, and what is documented. Oversight in `53c2a97a92`. `08b9b9e043` has fixed a similar inconsistency some time ago. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/aHdxN0D0hKXzHFQG@ip-10-97-1-34.eu-west-3.compute.internal Backpatch-through: 17	2025-07-17 09:30:26 +09:00
Jeff Davis	5e6e42e44f	Force LC_COLLATE to C in postmaster. Avoid dependence on setlocale(). strcoll(), etc., are not called directly; all collation-sensitive calls should go through pg_locale.c and use the appropriate provider. By setting LC_COLLATE to C, we avoid accidentally depending on libc behavior when using a different provider. No behavior change in the backend, but it's possible that some extensions will be affected. Such extensions should be updated to use the pg_locale_t APIs. Discussion: https://postgr.es/m/9875f7f9-50f1-4b5d-86fc-ee8b03e8c162@eisentraut.org Reviewed-by: Peter Eisentraut <peter@eisentraut.org>	2025-07-16 14:13:18 -07:00
Peter Geoghegan	4c8ad67a98	nbtree: Use only one notnullkey ScanKeyData. _bt_first need only store one ScanKeyData struct on the stack for the purposes of building an IS NOT NULL key based on an implied NOT NULL constraint. We don't need INDEX_MAX_KEYS-many ScanKeyData structs. This saves us a little over 2KB in stack space. It's possible that this has some performance benefit. It also seems simpler and more direct. It isn't possible for more than a single index attribute to need its own implied IS NOT NULL key: the first such attribute/IS NOT NULL key always makes _bt_first stop adding additional boundary keys to startKeys[]. Using INDEX_MAX_KEYS-many ScanKeyData entries was (at best) misleading. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Mircea Cadariu <cadariu.mircea@gmail.com> Discussion: https://postgr.es/m/CAH2-Wzm=1kJMSZhhTLoM5BPbwQNWxUj-ynOEh=89ptDZAVgauw@mail.gmail.com	2025-07-16 13:05:44 -04:00
Michael Paquier	1dbe6f7667	Refactor non-supported compression error message in toast_compression.c This code used a NO_LZ4_SUPPORT() macro to issue an error in the code paths where LZ4 [de]compression is attempted but the build does not support it. This commit refactors the code to use a more flexible error message so that it can be used for other compression methods, with the method passed as an argument to the macro. Extracted from a larger patch by the same author. Author: Nikhil Kumar Veldanda <veldanda.nikhilkumar17@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/CAFAfj_HX84EK4hyRYw50AOHOcdVi-+FFwAAPo7JHx4aShCvunQ@mail.gmail.com	2025-07-16 11:59:22 +09:00
Fujii Masao	b8341ae856	pgoutput: Initialize missing default for "origin" parameter. The pgoutput plugin initializes optional parameters like "binary" with default values at the start of processing. However, the "origin" parameter was previously missed and left without explicit initialization. Although the PGOutputData struct, which holds these settings, is zero-initialized at allocation (resulting in publish_no_origin field for "origin" parameter being false by default), this default was not set explicitly, unlike other parameters. This commit adds explicit initialization of the "origin" parameter to ensure consistency and clarity in how defaults are handled. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Discussion: https://postgr.es/m/d2790f10-238d-4cb5-a743-d9d2a9dd900f@oss.nttdata.com	2025-07-16 10:31:51 +09:00
Tom Lane	5fe55a0fe4	Doc: clarify description of regexp fields in pg_ident.conf. The grammar was a little shaky and confusing here, so word-smith it a bit. Also, adjust the comments in pg_ident.conf.sample to use the same terminology as the SGML docs, in particular "DATABASE-USERNAME" not "PG-USERNAME". Back-patch appropriate subsets. I did not risk changing pg_ident.conf.sample in released branches, but it still seems OK to change it in v18. Reported-by: Alexey Shishkin <alexey.shishkin@enterprisedb.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Discussion: https://postgr.es/m/175206279327.3157504.12519088928605422253@wrigleys.postgresql.org Backpatch-through: 13	2025-07-15 18:53:00 -04:00
Tom Lane	2a3a396432	Clarify the ra != rb case in compareJsonbContainers(). It's impossible to reach this case with either ra or rb being WJB_DONE, because our earlier checks that the structure and length of the inputs match should guarantee that we reach their ends simultaneously. However, the comment completely fails to explain this, and the Asserts don't cover it either. The comment is pretty obscure anyway, so rewrite it, and extend the Asserts to reject WJB_DONE. This is only cosmetic, so no need for back-patch. Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/0c623e8a204187b87b4736792398eaf1@postgrespro.ru	2025-07-15 18:21:12 -04:00
Tom Lane	aad1617b76	Silence uninitialized-value warnings in compareJsonbContainers(). Because not every path through JsonbIteratorNext() sets val->type, some compilers complain that compareJsonbContainers() is comparing possibly-uninitialized values. The paths that don't set it return WJB_DONE, WJB_END_ARRAY, or WJB_END_OBJECT, so it's clear by manual inspection that the "(ra == rb)" code path is safe, and indeed we aren't seeing warnings about that. But the (ra != rb) case is much less obviously safe. In Assert-enabled builds it seems that the asserts rejecting WJB_END_ARRAY and WJB_END_OBJECT persuade gcc 15.x not to warn, which makes little sense because it's impossible to believe that the compiler can prove of its own accord that ra/rb aren't WJB_DONE here. (In fact they never will be, so the code isn't wrong, but why is there no warning?) Without Asserts, the appearance of warnings is quite unsurprising. We discussed fixing this by converting those two Asserts into pg_assume, but that seems not very satisfactory when it's so unclear why the compiler is or isn't warning: the warning could easily reappear with some other compiler version. Let's fix it in a less magical, more future-proof way by changing JsonbIteratorNext() so that it always does set val->type. The cost of that should be pretty negligible, and it makes the function's API spec less squishy. Reported-by: Erik Rijkers <er@xs4all.nl> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/988bf1bc-3f1f-99f3-bf98-222f1cd9dc5e@xs4all.nl Discussion: https://postgr.es/m/0c623e8a204187b87b4736792398eaf1@postgrespro.ru Backpatch-through: 13	2025-07-15 18:11:18 -04:00
Michael Paquier	006fc975a2	Fix comments in index.c This comment paragraph referred to text_eq(), but the name of the function in charge of "text" comparisons is called texteq(). Author: Jian He <jian.universality@gmail.com> Discussion: https://postgr.es/m/CACJufxHL--XNcCCO1LgKsygzYGiVHZMfTcAxOSG8+ezxWtjddw@mail.gmail.com	2025-07-15 16:05:59 +09:00
Tom Lane	3c4e26a62c	In username-map substitution, cope with more than one \1. If the system-name field of a pg_ident.conf line is a regex containing capturing parentheses, you can write \1 in the user-name field to represent the captured part of the system name. But what happens if you write \1 more than once? The only reasonable expectation IMO is that each \1 gets replaced, but presently our code replaces only the first. Fix that. Also, improve the tests for this feature to exercise cases where a non-empty string needs to be substituted for \1. The previous testing didn't inspire much faith that it was verifying correct operation of the substitution code. Given the lack of field complaints about this, I don't feel a need to back-patch. Reported-by: David G. Johnston <david.g.johnston@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAKFQuwZu6kZ8ZPvJ3pWXig+6UX4nTVK-hdL_ZS3fSdps=RJQQQ@mail.gmail.com	2025-07-13 13:52:32 -04:00
Nathan Bossart	8893c3ab36	Remove XLogCtl->ckptFullXid. A few code paths set this variable, but its value is never used. Oversight in commit `2fc7af5e96`. Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com> Discussion: https://postgr.es/m/aHFyE1bs9YR93dQ1%40nathan	2025-07-12 14:34:57 -05:00
Tom Lane	84ce258707	Replace float8 with int in date2isoweek() and date2isoyear(). The values of the "result" variables in these functions are always integers; using a float8 variable accomplishes nothing except to incur useless conversions to and from float. While that wastes a few nanoseconds, these functions aren't all that time-critical. But it seems worth fixing to remove possible reader confusion. Also, in the case of date2isoyear(), "result" is a very poorly chosen variable name because it is not the function's result. Rename it to "week", and do the same in date2isoweek() for consistency. Since this is mostly cosmetic, there seems little need for back-patch. Author: Sergey Fukanchik <s.fukanchik@postgrespro.ru> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/6323a-68726500-1-7def9d00@137821581	2025-07-12 11:50:37 -04:00
Andres Freund	f2c87ac04e	Remove long-unused TransactionIdIsActive() TransactionIdIsActive() has not been used since `bb38fb0d43`, in 2014. There are no known uses in extensions either and it's hard to see valid uses for it. Therefore remove TransactionIdIsActive(). Discussion: https://postgr.es/m/odgftbtwp5oq7cxjgf4kjkmyq7ypoftmqy7eqa7w3awnouzot6@hrwnl5tdqrgu	2025-07-12 11:00:44 -04:00
Thomas Munro	b8e1f2d96b	aio: Fix configuration reload in IO workers. method_worker.c installed SignalHandlerForConfigReload, but it failed to actually process reload requests. That hasn't yet produced any concrete problem reports in terms of GUC changes it should have cared about in v18, but it was inconsistent. It did cause problems for a couple of patches in development that need IO workers to react to ALTER SYSTEM + pg_reload_conf(). Fix extracted from one of those patches. Back-patch to 18. Reported-by: Dmitry Dolgov <9erthalion6@gmail.com> Discussion: https://postgr.es/m/sh5uqe4a4aqo5zkkpfy5fobe2rg2zzouctdjz7kou4t74c66ql%40yzpkxb7pgoxf	2025-07-12 16:33:02 +12:00
Thomas Munro	177c1f0593	aio: Remove obsolete IO worker ID references. In an ancient ancestor of this code, the postmaster assigned IDs to IO workers. Now it tracks them in an unordered array and doesn't know their IDs, so it might be confusing to readers that it still referred to their indexes as IDs. No change in behavior, just variable name and error message cleanup. Back-patch to 18. Discussion: https://postgr.es/m/CA%2BhUKG%2BwbaZZ9Nwc_bTopm4f-7vDmCwLk80uKDHj9mq%2BUp0E%2Bg%40mail.gmail.com	2025-07-12 14:44:22 +12:00
Thomas Munro	01d618bcd7	aio: Regularize IO worker internal naming. Adopt PgAioXXX convention for pgaio module type names. Rename a function that didn't use a pgaio_worker_ submodule prefix. Rename the internal submit function's arguments to match the indirectly relevant function pointer declaration and nearby examples. Rename the array of handle IDs in PgAioSubmissionQueue to sqes, a term of art seen in the systems it emulates, also clarifying that they're not IO handle pointers as the old name might imply. No change in behavior, just type, variable and function name cleanup. Back-patch to 18. Discussion: https://postgr.es/m/CA%2BhUKG%2BwbaZZ9Nwc_bTopm4f-7vDmCwLk80uKDHj9mq%2BUp0E%2Bg%40mail.gmail.com	2025-07-12 14:44:09 +12:00
Thomas Munro	40e105042a	Fix stale idle flag when IO workers exit. Otherwise we could choose a worker that has exited and crash while trying to wake it up. Back-patch to 18. Reported-by: Tomas Vondra <tomas@vondra.me> Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/t5aqjhkj6xdkido535pds7fk5z4finoxra4zypefjqnlieevbg%40357aaf6u525j	2025-07-12 13:11:47 +12:00
Tom Lane	64840e4624	Fix inconsistent quoting of role names in ACLs. getid() and putid(), which parse and deparse role names within ACL input/output, applied isalnum() to see if a character within a role name requires quoting. They did this even for non-ASCII characters, which is problematic because the results would depend on encoding, locale, and perhaps even platform. So it's possible that putid() could elect not to quote some string that, later in some other environment, getid() will decide is not a valid identifier, causing dump/reload or similar failures. To fix this in a way that won't risk interoperability problems with unpatched versions, make getid() treat any non-ASCII as a legitimate identifier character (hence not requiring quotes), while making putid() treat any non-ASCII as requiring quoting. We could remove the resulting excess quoting once we feel that no unpatched servers remain in the wild, but that'll be years. A lesser problem is that getid() did the wrong thing with an input consisting of just two double quotes (""). That has to represent an empty string, but getid() read it as a single double quote instead. The case cannot arise in the normal course of events, since we don't allow empty-string role names. But let's fix it while we're here. Although we've not heard field reports of problems with non-ASCII role names, there's clearly a hazard there, so back-patch to all supported versions. Reported-by: Peter Eisentraut <peter@eisentraut.org> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/3792884.1751492172@sss.pgh.pa.us Backpatch-through: 13	2025-07-11 18:50:13 -04:00
Nathan Bossart	8d33fbacba	Add FLUSH_UNLOGGED option to CHECKPOINT command. This option, which is disabled by default, can be used to request the checkpoint also flush dirty buffers of unlogged relations. As with the MODE option, the server may consolidate the options for concurrently requested checkpoints. For example, if one session uses (FLUSH_UNLOGGED FALSE) and another uses (FLUSH_UNLOGGED TRUE), the server may perform one checkpoint with FLUSH_UNLOGGED enabled. Author: Christoph Berg <myon@debian.org> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de	2025-07-11 11:51:25 -05:00
Nathan Bossart	2f698d7f4b	Add MODE option to CHECKPOINT command. This option may be set to FAST (the default) to request the checkpoint be completed as fast as possible, or SPREAD to request the checkpoint be spread over a longer interval (based on the checkpoint-related configuration parameters). Note that the server may consolidate the options for concurrently requested checkpoints. For example, if one session requests a "fast" checkpoint and another requests a "spread" checkpoint, the server may perform one "fast" checkpoint. Author: Christoph Berg <myon@debian.org> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de	2025-07-11 11:51:25 -05:00
Nathan Bossart	a4f126516e	Add option list to CHECKPOINT command. This commit adds the boilerplate code for supporting a list of options in CHECKPOINT commands. No actual options are supported yet, but follow-up commits will add support for MODE and FLUSH_UNLOGGED. While at it, this commit refactors the code for executing CHECKPOINT commands to its own function since it's about to become significantly larger. Author: Christoph Berg <myon@debian.org> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de	2025-07-11 11:51:25 -05:00
Nathan Bossart	bb938e2c3c	Rename CHECKPOINT_IMMEDIATE to CHECKPOINT_FAST. The new name more accurately reflects the effects of this flag on a requested checkpoint. Checkpoint-related log messages (i.e., those controlled by the log_checkpoints configuration parameter) will now say "fast" instead of "immediate", too. Likewise, references to "immediate" checkpoints in the documentation have been updated to say "fast". This is preparatory work for a follow-up commit that will add a MODE option to the CHECKPOINT command. Author: Christoph Berg <myon@debian.org> Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de	2025-07-11 11:51:25 -05:00
Nathan Bossart	cd8324cc89	Rename CHECKPOINT_FLUSH_ALL to CHECKPOINT_FLUSH_UNLOGGED. The new name more accurately reflects the effects of this flag on a requested checkpoint. Checkpoint-related log messages (i.e., those controlled by the log_checkpoints configuration parameter) will now say "flush-unlogged" instead of "flush-all", too. This is preparatory work for a follow-up commit that will add a FLUSH_UNLOGGED option to the CHECKPOINT command. Author: Christoph Berg <myon@debian.org> Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de	2025-07-11 11:51:25 -05:00
Amit Kapila	72e6c08fea	Fix the handling of two GUCs during upgrade. Previously, the check_hook functions for max_slot_wal_keep_size and idle_replication_slot_timeout would incorrectly raise an ERROR for values set in postgresql.conf during upgrade, even though those values were not actively used in the upgrade process. To prevent logical slot invalidation during upgrade, we used to set special values for these GUCs. Now, instead of relying on those values, we directly prevent WAL removal and logical slot invalidation caused by max_slot_wal_keep_size and idle_replication_slot_timeout. Note: PostgreSQL 17 does not include the idle_replication_slot_timeout GUC, so related changes were not backported. BUG #18979 Reported-by: jorsol <jorsol@gmail.com> Author: Dilip Kumar <dilipbalaut@gmail.com> Reviewed by: vignesh C <vignesh21@gmail.com> Reviewed by: Alvaro Herrera <alvherre@alvh.no-ip.org> Backpatch-through: 17, where it was introduced Discussion: https://postgr.es/m/219561.1751826409@sss.pgh.pa.us Discussion: https://postgr.es/m/18979-a1b7fdbb7cd181c6@postgresql.org	2025-07-11 10:46:43 +05:30
Fujii Masao	110e6dcaa6	doc: Clarify meaning of "idle" in idle_replication_slot_timeout. This commit updates the documentation to clarify that "idle" in idle_replication_slot_timeout means the replication slot is inactive, that is, not currently used by any replication connection. Without this clarification, "idle" could be misinterpreted to mean that the slot is not advancing or that no data is being streamed, even if a connection exists. Back-patch to v18 where idle_replication_slot_timeout was added. Author: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Reviewed-by: Gunnar Morling <gunnar.morling@googlemail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CADGJaX_0+FTguWpNSpgVWYQP_7MhoO0D8=cp4XozSQgaZ40Odw@mail.gmail.com Backpatch-through: 18	2025-07-11 08:44:32 +09:00
Fujii Masao	05dedf43d3	Change unit of idle_replication_slot_timeout to seconds. Previously, the idle_replication_slot_timeout parameter used minutes as its unit, based on the assumption that values would typically exceed one minute in production environments. However, this caused unexpected behavior: specifying a value below 30 seconds would round down to 0, effectively disabling the timeout. This could be surprising to users. To allow finer-grained control and avoid such confusion, this commit changes the unit of idle_replication_slot_timeout to seconds. Larger values can still be specified easily using standard time suffixes, for example, '24h' for 24 hours. Back-patch to v18 where idle_replication_slot_timeout was added. Reported-by: Gunnar Morling <gunnar.morling@googlemail.com> Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CADGJaX_0+FTguWpNSpgVWYQP_7MhoO0D8=cp4XozSQgaZ40Odw@mail.gmail.com Backpatch-through: 18	2025-07-11 08:39:24 +09:00
Jeff Davis	53cd0b71ee	Change wchar2char() and char2wchar() to accept a locale_t. These are libc-specific functions, so should require a locale_t rather than a pg_locale_t (which could use another provider). Discussion: https://postgr.es/m/a8666c391dfcabe79868d95f7160eac533ace718.camel%40j-davis.com	2025-07-09 08:45:34 -07:00
Nathan Bossart	167ed8082f	Introduce pg_dsm_registry_allocations view. This commit adds a new system view that provides information about entries in the dynamic shared memory (DSM) registry. Specifically, it returns the name, type, and size of each entry. Note that since we cannot discover the size of dynamic shared memory areas (DSAs) and hash tables backed by DSAs (dshashes) without first attaching to them, the size column is left as NULL for those. Bumps catversion. Author: Florents Tselai <florents.tselai@gmail.com> Reviewed-by: Sungwoo Chang <swchangdev@gmail.com> Discussion: https://postgr.es/m/4D445D3E-81C5-4135-95BB-D414204A0AB4%40gmail.com	2025-07-09 09:17:56 -05:00
Tom Lane	e03c952877	Fix low-probability memory leak in XMLSERIALIZE(... INDENT). xmltotext_with_options() did not consider the possibility that pg_xml_init() could fail --- most likely due to OOM. If that happened, the already-parsed xmlDoc structure would be leaked. Oversight in commit `483bdb2af`. Bug: #18981 Author: Dmitry Kovalenko <d.kovalenko@postgrespro.ru> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18981-9bc3c80f107ae925@postgresql.org Backpatch-through: 16	2025-07-08 12:50:33 -04:00
Andres Freund	f54af9f267	aio: Combine io_uring memory mappings, if supported By default io_uring creates a shared memory mapping for each io_uring instance, leading to a large number of memory mappings. Unfortunately a large number of memory mappings slows things down, backend exit is particularly affected. To address that, newer kernels (6.5) support using user-provided memory for the memory. By putting the relevant memory into shared memory we don't need any additional mappings. On a system with a new enough kernel and liburing, there is no discernible overhead when doing a pgbench -S -C anymore. Reported-by: MARK CALLAGHAN <mdcallag@gmail.com> Reviewed-by: "Burd, Greg" <greg@burd.me> Reviewed-by: Jim Nasby <jnasby@upgrade.com> Discussion: https://postgr.es/m/CAFbpF8OA44_UG+RYJcWH9WjF7E3GA6gka3gvH6nsrSnEe9H0NA@mail.gmail.com Backpatch-through: 18	2025-07-07 22:57:07 -04:00
Richard Guo	55a780e947	Consider explicit incremental sort for Append and MergeAppend For an ordered Append or MergeAppend, we need to inject an explicit sort into any subpath that is not already well enough ordered. Currently, only explicit full sorts are considered; incremental sorts are not yet taken into account. In this patch, for subpaths of an ordered Append or MergeAppend, we choose to use explicit incremental sort if it is enabled and there are presorted keys. The rationale is based on the assumption that incremental sort is always faster than full sort when there are presorted keys, a premise that has been applied in various parts of the code. In addition, the current cost model tends to favor incremental sort as being cheaper than full sort in the presence of presorted keys, making it reasonable not to consider full sort in such cases. No backpatch as this could result in plan changes. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CAMbWs4_V7a2enTR+T3pOY_YZ-FU8ZsFYym2swOz4jNMqmSgyuw@mail.gmail.com	2025-07-08 10:21:44 +09:00
Álvaro Herrera	c616785516	Refactor some repetitive SLRU code Functions to bootstrap and zero pages in various SLRU callers were fairly duplicative. We can slash almost two hundred lines with a couple of simple helpers: - SimpleLruZeroAndWritePage: Does the equivalent of SimpleLruZeroPage followed by flushing the page to disk - XLogSimpleInsertInt64: Does a XLogBeginInsert followed by XLogInsert of a trivial record whose data is just an int64. Author: Evgeny Voropaev <evgeny.voropaev@tantorlabs.com> Reviewed by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://www.postgresql.org/message-id/flat/97820ce8-a1cd-407f-a02b-47368fadb14b%40tantorlabs.com	2025-07-07 16:49:19 +02:00
Álvaro Herrera	2633dae2e4	Standardize LSN formatting by zero padding This commit standardizes the output format for LSNs to ensure consistent representation across various tools and messages. Previously, LSNs were inconsistently printed as `%X/%X` in some contexts, while others used zero-padding. This often led to confusion when comparing. To address this, the LSN format is now uniformly set to `%X/%08X`, ensuring the lower 32-bit part is always zero-padded to eight hexadecimal digits. Author: Japin Li <japinli@hotmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/ME0P300MB0445CA53CA0E4B8C1879AF84B641A@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM	2025-07-07 13:57:43 +02:00
Michael Paquier	62a17a9283	Integrate FullTransactionIds deeper into two-phase code This refactoring is a follow-up of the work done in `5a1dfde833`, that has switched 2PC file names to use FullTransactionIds when written on disk. This will help with the integration of a follow-up solution related to the handling of two-phase files during recovery, to address older defects while reading these from disk after a crash. This change is useful in itself as it reduces the need to build the file names from epoch numbers and TransactionIds, because we can use directly FullTransactionIds from which the 2PC file names are guessed. So this avoids a lot of back-and-forth between the FullTransactionIds retrieved from the file names and how these are passed around in the internal 2PC logic. Note that the core of the change is the use of a FullTransactionId instead of a TransactionId in GlobalTransactionData, that tracks 2PC file information in shared memory. The change in TwoPhaseCallback makes this commit unfit for stable branches. Noah has contributed a good chunk of this patch. I have spent some time on it as well while working on the issues with two-phase state files and recovery. Author: Noah Misch <noah@leadboat.com> Co-Authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/Z5sd5O9JO7NYNK-C@paquier.xyz Discussion: https://postgr.es/m/20250116205254.65.nmisch@google.com	2025-07-07 12:50:40 +09:00
Michael Paquier	5a6c39b6df	Disable commit timestamps during bootstrap Attempting to use commit timestamps during bootstrapping leads to an assertion failure, that can be reached for example with an initdb -c that enables track_commit_timestamp. It makes little sense to register a commit timestamp for a BootstrapTransactionId, so let's disable the activation of the module in this case. This problem has been independently reported once by each author of this commit. Each author has proposed basically the same patch, relying on IsBootstrapProcessingMode() to skip the use of commit_ts during bootstrap. The test addition is a suggestion by me, and is applied down to v16. Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Author: Andy Fan <zhihuifan1213@163.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/OSCPR01MB14966FF9E4C4145F37B937E52F5102@OSCPR01MB14966.jpnprd01.prod.outlook.com Discussion: https://postgr.es/m/87plejmnpy.fsf@163.com Backpatch-through: 13	2025-07-04 15:09:24 +09:00
Fujii Masao	78ebda66bf	Speed up truncation of temporary relations. Previously, truncating a temporary relation required scanning the entire local buffer pool once per relation fork to invalidate buffers. This could be slow, especially with large local buffers, as the scan was repeated multiple times. A similar issue with regular tables (shared buffers) was addressed in commit `6d05086c0a` by scanning the buffer pool only once for all forks. This commit applies the same optimization to temporary relations, improving truncation performance. Author: Daniil Davydov <3danissimo@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Maxim Orlov <orlovmg@gmail.com> Discussion: https://postgr.es/m/CAJDiXggNqsJOH7C5co4jA8nDk8vw-=sokyh5s1_TENWnC6Ofcg@mail.gmail.com	2025-07-04 09:03:58 +09:00
Tom Lane	931766aaec	Simplify COALESCE() with one surviving argument. If, after removal of useless null-constant arguments, a CoalesceExpr has exactly one remaining argument, we can just take that argument as the result, without bothering to wrap a new CoalesceExpr around it. This isn't likely to produce any great improvement in runtime per se, but it can lead to better plans since the planner no longer has to treat the expression as non-strict. However, there were a few regression test cases that intentionally wrote COALESCE(x) as a shorthand way of creating a non-strict subexpression. To avoid ruining the intent of those tests, write COALESCE(x,x) instead. (If anyone ever proposes de-duplicating COALESCE arguments, we'll need another iteration of this arms race. But it seems pretty unlikely that such an optimization would be worthwhile.) Author: Maksim Milyutin <maksim.milyutin@tantorlabs.ru> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/8e8573c3-1411-448d-877e-53258b7b2be0@tantorlabs.ru	2025-07-03 17:39:53 -04:00
Tom Lane	0059bbe1ec	Break out xxx2yyy_opt_overflow APIs for more datetime conversions. Previous commits invented timestamp2timestamptz_opt_overflow, date2timestamp_opt_overflow, and date2timestamptz_opt_overflow functions to perform non-error-throwing conversions between datetime types. This patch completes the set by adding timestamp2date_opt_overflow, timestamptz2date_opt_overflow, and timestamptz2timestamp_opt_overflow. In addition, adjust timestamp2timestamptz_opt_overflow so that it doesn't throw error if timestamp2tm fails, but treats that as an overflow case. The situation probably can't arise except with an invalid timestamp value, and I can't think of a way that that would happen except data corruption. However, it's pretty silly to have a function whose entire reason for existence is to not throw errors for out-of-range inputs nonetheless throw an error for out-of-range input. The new APIs are not used in this patch, but will be needed in upcoming btree_gin changes. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> Discussion: https://postgr.es/m/262624.1738460652@sss.pgh.pa.us	2025-07-03 16:17:08 -04:00
Tom Lane	a10f21e6ce	Obtain required table lock during cross-table updates, redux. Commits `8319e5cb5` et al missed the fact that ATPostAlterTypeCleanup contains three calls to ATPostAlterTypeParse, and the other two also need protection against passing a relid that we don't yet have lock on. Add similar logic to those code paths, and add some test cases demonstrating the need for it. In v18 and master, the test cases demonstrate that there's a behavioral discrepancy between stored generated columns and virtual generated columns: we disallow changing the expression of a stored column if it's used in any composite-type columns, but not that of a virtual column. Since the expression isn't actually relevant to either sort of composite-type usage, this prohibition seems unnecessary; but changing it is a matter for separate discussion. For now we are just documenting the existing behavior. Reported-by: jian he <jian.universality@gmail.com> Author: jian he <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: CACJufxGKJtGNRRSXfwMW9SqVOPEMdP17BJ7DsBf=tNsv9pWU9g@mail.gmail.com Backpatch-through: 13	2025-07-03 13:46:07 -04:00
Álvaro Herrera	647cffd2f3	Prevent creation of duplicate not-null constraints for domains This was previously harmless, but now that we create pg_constraint rows for those, duplicates are not welcome anymore. Backpatch to 18. Co-authored-by: jian he <jian.universality@gmail.com> Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/CACJufxFSC0mcQ82bSk58sO-WJY4P-o4N6RD2M0D=DD_u_6EzdQ@mail.gmail.com	2025-07-03 11:46:12 +02:00
Álvaro Herrera	87251e1149	Fix bogus grammar for a CREATE CONSTRAINT TRIGGER error If certain constraint characteristic clauses (NO INHERIT, NOT VALID, NOT ENFORCED) are given to CREATE CONSTRAINT TRIGGER, the resulting error message is ERROR: TRIGGER constraints cannot be marked NO INHERIT which is a bit silly, because these aren't "constraints of type TRIGGER". Hardcode a better error message to prevent it. This is a cosmetic fix for quite a fringe problem with no known complaints from users, so no backpatch. While at it, silently accept ENFORCED if given. Author: Amul Sul <sulamul@gmail.com> Reviewed-by: jian he <jian.universality@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/CAAJ_b97hd-jMTS7AjgU6TDBCzDx_KyuKxG+K-DtYmOieg+giyQ@mail.gmail.com Discussion: https://postgr.es/m/CACJufxHSp2puxP=q8ZtUGL1F+heapnzqFBZy5ZNGUjUgwjBqTQ@mail.gmail.com	2025-07-03 11:25:39 +02:00
Michael Paquier	8ec04c8577	Refactor subtype field of AlterDomainStmt AlterDomainStmt.subtype used characters for its subtypes of commands, SET\|DROP DEFAULT\|NOT NULL and ADD\|DROP\|VALIDATE CONSTRAINT, which were hardcoded in a couple of places of the code. The code is improved by using an enum instead, with the same character values as the original code. Note that the field was documented in parsenodes.h and that it forgot to mention 'V' (VALIDATE CONSTRAINT). Author: Quan Zongliang <quanzongliang@yeah.net> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/41ff310b-16bd-44b9-a3ef-97e20f14b709@yeah.net	2025-07-03 16:34:28 +09:00
Fujii Masao	bc2f348e87	Support multi-line headers in COPY FROM command. The COPY FROM command now accepts a non-negative integer for the HEADER option, allowing multiple header lines to be skipped. This is useful when the input contains multi-line headers that should be ignored during data import. Author: Shinya Kato <shinya11.kato@gmail.com> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp> Discussion: https://postgr.es/m/CAOzEurRPxfzbxqeOPF_AGnAUOYf=Wk0we+1LQomPNUNtyZGBZw@mail.gmail.com	2025-07-03 15:27:26 +09:00
Michael Paquier	fd7d7b7191	Improve checks for GUC recovery_target_timeline Currently check_recovery_target_timeline() converts any value that is not "current", "latest", or a valid integer to 0. So, for example, the following configuration added to postgresql.conf followed by a startup: recovery_target_timeline = 'bogus' recovery_target_timeline = '9999999999' ... results in the following error patterns: FATAL: 22023: recovery target timeline 0 does not exist FATAL: 22023: recovery target timeline 1410065407 does not exist This is confusing, because the server does not reflect the intention of the user, and just reports incorrect data unrelated to the GUC. The origin of the problem is that we do not perform a range check in the GUC value passed-in for recovery_target_timeline. This commit improves the situation by using strtou64() and by providing stricter range checks. Some test cases are added for the cases of an incorrect, an upper-bound and a lower-bound timeline value, checking the sanity of the reports based on the contents of the server logs. Author: David Steele <david@pgmasters.net> Discussion: https://postgr.es/m/e5d472c7-e9be-4710-8dc4-ebe721b62cea@pgbackrest.org	2025-07-03 11:14:20 +09:00
Richard Guo	0da29e4cb1	Enable use of Memoize for ANTI joins Currently, we do not support Memoize for SEMI and ANTI joins because nested loop SEMI/ANTI joins do not scan the inner relation to completion, which prevents Memoize from marking the cache entry as complete. One might argue that we could mark the cache entry as complete after fetching the first inner tuple, but that would not be safe: if the first inner tuple and the current outer tuple do not satisfy the join clauses, a second inner tuple matching the parameters would find the cache entry already marked as complete. However, if the inner side is provably unique, this issue doesn't arise, since there would be no second matching tuple. That said, this doesn't help in the case of SEMI joins, because a SEMI join with a provably unique inner side would already have been reduced to an inner join by reduce_unique_semijoins. Therefore, in this patch, we check whether the inner relation is provably unique for ANTI joins and enable the use of Memoize in such cases. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/CAMbWs48FdLiMNrmJL-g6mDvoQVt0yNyJAqMkv4e2Pk-5GKCZLA@mail.gmail.com	2025-07-03 10:57:26 +09:00
Michael Paquier	7b2eb72b1b	Add InjectionPointList() to retrieve list of injection points This routine has come as a useful piece to be able to know the list of injection points currently attached in a system. One area would be to use it in a set-returning function, or just let out-of-core code play with it. This hides the internals of the shared memory array lookup holding the information about the injection points (point name, library and function name), allocating the result in a palloc'd List consumable by the caller. Reviewed-by: Jeff Davis <pgsql@j-davis.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Discussion: https://postgr.es/m/Z_xYkA21KyLEHvWR@paquier.xyz Discussion: https://postgr.es/m/aBG2rPwl3GE7m1-Q@paquier.xyz	2025-07-03 08:41:25 +09:00
Nathan Bossart	bb109382ef	Make more use of RELATION_IS_OTHER_TEMP(). A few places were open-coding it instead of using this handy macro. Author: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CAEG8a3LjTGJcOcxQx-SUOGoxstG4XuCWLH0ATJKKt_aBTE5K8w%40mail.gmail.com	2025-07-02 12:32:19 -05:00
Nathan Bossart	fe07100e82	Add GetNamedDSA() and GetNamedDSHash(). Presently, the dynamic shared memory (DSM) registry only provides GetNamedDSMSegment(), which allocates a fixed-size segment. To use the DSM registry for more sophisticated things like dynamic shared memory areas (DSAs) or a hash table backed by a DSA (dshash), users need to create a DSM segment that stores various handles and LWLock tranche IDs and to write fairly complicated initialization code. Furthermore, there is likely little variation in this initialization code between libraries. This commit introduces functions that simplify allocating a DSA or dshash within the DSM registry. These functions are very similar to GetNamedDSMSegment(). Notable differences include the lack of an initialization callback parameter and the prohibition of calling the functions more than once for a given entry in each backend (which should be trivially avoidable in most circumstances). While at it, this commit bumps the maximum DSM registry entry name length from 63 bytes to 127 bytes. Also note that even though one could presumably detach/destroy the DSAs and dshashes created in the registry, such use-cases are not yet well-supported, if for no other reason than the associated DSM registry entries cannot be removed. Adding such support is left as a future exercise. The test_dsm_registry test module contains tests for the new functions and also serves as a complete usage example. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Florents Tselai <florents.tselai@gmail.com> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Discussion: https://postgr.es/m/aEC8HGy2tRQjZg_8%40nathan	2025-07-02 11:50:52 -05:00
Peter Geoghegan	9ca30a0b04	Update obsolete row compare preprocessing comments. Restore nbtree preprocessing comments describing how we mark nbtree row compare members required to how they were prior to 2016 bugfix commit `a298a1e0`. Oversight in commit `bd3f59fd`, which made nbtree preprocessing revert to the original 2006 rules, but neglected to revert these comments. Backpatch-through: 18	2025-07-02 12:36:35 -04:00
Tom Lane	7374b3a536	Allow width_bucket()'s "operand" input to be NaN. The array-based variant of width_bucket() has always accepted NaN inputs, treating them as equal but larger than any non-NaN, as we do in ordinary comparisons. But up to now, the four-argument variants threw errors for a NaN operand. This is inconsistent and unnecessary, since we can perfectly well regard NaN as falling after the last bucket. We do still throw error for NaN or infinity histogram-bound inputs, since there's no way to compute sensible bucket boundaries. Arguably this is a bug fix, but given the lack of field complaints I'm content to fix it in master. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/2822872.1750540911@sss.pgh.pa.us	2025-07-02 11:34:40 -04:00
Álvaro Herrera	c989affb52	Fix error message for ALTER CONSTRAINT ... NOT VALID Trying to alter a constraint so that it becomes NOT VALID results in an error that assumes the constraint is a foreign key. This is potentially wrong, so give a more generic error message. While at it, give CREATE CONSTRAINT TRIGGER a better error message as well. Co-authored-by: jian he <jian.universality@gmail.com> Co-authored-by: Fujii Masao <masao.fujii@oss.nttdata.com> Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de> Co-authored-by: Amul Sul <sulamul@gmail.com> Discussion: https://postgr.es/m/CACJufxHSp2puxP=q8ZtUGL1F+heapnzqFBZy5ZNGUjUgwjBqTQ@mail.gmail.com	2025-07-02 17:02:27 +02:00
Peter Geoghegan	bd3f59fdb7	Make row compares robust during nbtree array scans. Recent nbtree bugfix commit `5f4d98d4` added a special case to the code that sets up a page-level prefix of keys that are definitely satisfied by every tuple on the page: whenever _bt_set_startikey reached a row compare key, we'd refuse to apply the pstate.forcenonrequired behavior in scans where that usually happens (scans with a higher-order array key). That hack made the scan avoid essentially the same infinite cycling behavior that also affected nbtree scans with redundant keys (keys that preprocessing could not eliminate) prior to commit `f09816a0`. There are now serious doubts about this row compare workaround. Testing has shown that a scan with a row compare key and an array key could still read the same leaf page twice (without the scan's direction changing), which isn't supposed to be possible following the SAOP enhancements added by Postgres 17 commit `5bf748b8`. Also, we still allowed a required row compare key to be used with forcenonrequired mode when its header key happened to be beyond the pstate.ikey set by _bt_set_startikey, which was complicated and brittle. The underlying problem was that row compares had inconsistent rules around how scans start (which keys can be used for initial positioning purposes) and how scans end (which keys can set continuescan=false). Quals with redundant keys that could not be eliminated by preprocessing also had that same quality to them prior to today's bugfix `f09816a0`. It now seems prudent to bring row compare keys in line with the new charter for required keys, by making the start and end rules symmetric. This commit fixes two points of disagreement between _bt_first and _bt_check_rowcompare. Firstly, _bt_check_rowcompare was capable of ending the scan at the point where it needed to compare an ISNULL-marked row compare member that came immediately after a required row compare member. _bt_first now has symmetric handling for NULL row compares. Secondly, _bt_first had its own ideas about which keys were safe to use for initial positioning purposes. It could use fewer or more keys than _bt_check_rowcompare. _bt_first now uses the same requiredness markings as _bt_check_rowcompare for this. Now that _bt_first and _bt_check_rowcompare agree on how to start and end scans, we can get rid of the forcenonrequired special case, without any risk of infinite cycling. This approach also makes row compare keys behave more like regular scalar keys, particularly within _bt_first. Fixing these inconsistencies necessitates dealing with a related issue with the way that row compares were marked required by preprocessing: we didn't mark any lower-order row members required following 2016 bugfix commit `a298a1e0`. That approach was over broad. The bug in question was actually an oversight in how _bt_check_rowcompare dealt with tuple NULL values that failed to satisfy a scan key marked required in the opposite scan direction (it was a bug in 2011 commits `6980f817` and `882368e8`, not a bug in 2006 commit `3a0a16cb`). Go back to marking row compare members as required using the original 2006 rules, and fix the 2016 bug in a more principled way: by limiting use of the "set continuescan=false with a key required in the opposite scan direction upon encountering a NULL tuple value" optimization to the first/most significant row member key. While it isn't safe to use an implied IS NOT NULL qualifier to end the scan when it comes from a required lower-order row compare member key, it _is_ generally safe for such a required member key to end the scan -- provided the key is marked required in the _current_ scan direction. This fixes what was arguably an oversight in either commit `5f4d98d4` or commit `8a510275`. It is a direct follow-up to today's commit `f09816a0`. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Discussion: https://postgr.es/m/CAH2-Wz=pcijHL_mA0_TJ5LiTB28QpQ0cGtT-ccFV=KzuunNDDQ@mail.gmail.com Backpatch-through: 18	2025-07-02 09:48:15 -04:00
Peter Geoghegan	f09816a0a7	Make handling of redundant nbtree keys more robust. nbtree preprocessing's handling of redundant (and contradictory) keys created problems for scans with = arrays. It was just about possible for a scan with an = array key and one or more redundant keys (keys that preprocessing could not eliminate due to an incomplete opfamily and a cross-type key) to get stuck. Testing has shown that infinite cycling where the scan never manages to make forward progress was possible. This could happen when the scan's arrays were reset in _bt_readpage's forcenonrequired=true path (added by bugfix commit `5f4d98d4`) when the arrays weren't at least advanced up to the same point that they were in at the start of the _bt_readpage call. Earlier redundant keys prevented the finaltup call to _bt_advance_array_keys from reaching lower-order keys that needed to be used to sufficiently advance the scan's arrays. To fix, make preprocessing leave the scan's keys in a state that is as close as possible to how it'll usually leave them (in the common case where there's no redundant keys that preprocessing failed to eliminate). Now nbtree preprocessing _reliably_ leaves behind at most one required >/>= key per index column, and at most one required </<= key per index column. Columns that have one or more = keys that are eligible to be marked required (based on the traditional rules) prioritize the = keys over redundant inequality keys; they'll _reliably_ be left with only one of the = keys as the index column's only required key. Keys that are not marked required (whether due to the new preprocessing step running or for some other reason) are relocated to the end of the so->keyData[] array as needed. That way they'll always be evaluated after the scan's required keys, and so cannot prevent code in places like _bt_advance_array_keys and _bt_first from reaching a required key. Also teach _bt_first to decide which initial positioning keys to use based on the same requiredness markings that have long been used by _bt_checkkeys/_bt_advance_array_keys. This is a necessary condition for reliably avoiding infinite cycling. _bt_advance_array_keys expects to be able to reason about what'll happen in the next _bt_first call should it start another primitive index scan, by evaluating inequality keys that were marked required in the opposite-to-scan scan direction only. Now everybody (_bt_first, _bt_checkkeys, and _bt_advance_array_keys) will always agree on which exact key will be used on each index column to start and/or end the scan (except when row compare keys are involved, which have similar problems not addressed by this commit). An upcoming commit will finish off the work started by this commit by harmonizing how _bt_first, _bt_checkkeys, and _bt_advance_array_keys apply row compare keys to start and end scans. This fixes what was arguably an oversight in either commit `5f4d98d4` or commit `8a510275`. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Discussion: https://postgr.es/m/CAH2-Wz=ds4M+3NXMgwxYxqU8MULaLf696_v5g=9WNmWL2=Uo2A@mail.gmail.com Backpatch-through: 18	2025-07-02 09:40:49 -04:00
Peter Eisentraut	f039c22441	meson: Increase minimum version to 0.57.2 The previous minimum was to maintain support for Python 3.5, but we now require Python 3.6 anyway (commit `45363fca63`), so that reason is obsolete. A small raise to Meson 0.57 allows getting rid of a fair amount of version conditionals and silences some future-deprecated warnings. With the version bump, the following deprecation warnings appeared and are fixed: WARNING: Project targets '>=0.57' but uses feature deprecated since '0.55.0': ExternalProgram.path. use ExternalProgram.full_path() instead WARNING: Project targets '>=0.57' but uses feature deprecated since '0.56.0': meson.build_root. use meson.project_build_root() or meson.global_build_root() instead. It turns out that meson 0.57.0 and 0.57.1 are buggy for our use, so the minimum is actually set to 0.57.2. This is specific to this version series; in the future we won't necessarily need to be this precise. Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/flat/42e13eb0-862a-441e-8d84-4f0fd5f6def0%40eisentraut.org	2025-07-02 11:14:53 +02:00
Masahiko Sawada	3811ca3600	Fix missing FSM vacuum opportunities on tables without indexes. Commit `c120550edb` optimized the vacuuming of relations without indexes (a.k.a. one-pass strategy) by directly marking dead item IDs as LP_UNUSED. However, the periodic FSM vacuum was still checking if dead item IDs had been marked as LP_DEAD when attempting to vacuum the FSM every VACUUM_FSM_EVERY_PAGES blocks. This condition was never met due to the optimization, resulting in missed FSM vacuum opportunities. This commit modifies the periodic FSM vacuum condition to use the number of tuples deleted during HOT pruning. This count includes items marked as either LP_UNUSED or LP_REDIRECT, both of which are expected to result in new free space to report. Back-patch to v17 where the vacuum optimization for tables with no indexes was introduced. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAD21AoBL8m6B9GSzQfYxVaEgvD7-Kr3AJaS-hJPHC+avm-29zw@mail.gmail.com Backpatch-through: 17	2025-07-01 23:25:20 -07:00
Michael Paquier	b45242fd30	Move code for the bytea data type from varlena.c to new bytea.c This commit moves all the routines related to the bytea data type into its own new file, called bytea.c, clearing some of the bloat in varlena.c. This includes the routines for: - Input, output, receive and send - Comparison - Casts to integer types - bytea-specific functions The internals of the routines moved here are unchanged, with one exception. This comes with a twist in bytea_string_agg_transfn(), where the call to makeStringAggState() is replaced by the internals of this routine, still located in varlena.c. This simplifies the move to the new file by not having to expose makeStringAggState(). Author: Aleksander Alekseev <aleksander@timescale.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/CAJ7c6TMPVPJ5DL447zDz5ydctB8OmuviURtSwd=PHCRFEPDEAQ@mail.gmail.com	2025-07-02 09:52:21 +09:00
Michael Paquier	bee23ea4dd	Show sizes of FETCH queries as constants in pg_stat_statements Prior to this patch, every FETCH call would generate a unique queryId with a different size specified. Depending on the workload, this could lead to significant bloat in pg_stat_statements, as repeatedly calling a specific cursor would result in a new queryId each time. For example, FETCH 1 c1; and FETCH 2 c1; would produce different queryIds. This patch improves the situation by normalizing the fetch size, so that semantically similar statements generate the same queryId. As a result, statements like the below, which differ syntactically but have the same effect, will now share a single queryId: FETCH FROM c1 FETCH NEXT c1 FETCH 1 c1 In order to do a normalization based on the keyword used in FETCH, FetchStmt is tweaked with a new FetchDirectionKeywords. This matters for "howMany", which could be set to a negative value depending on the direction, and we want to normalize the queries with enough information about the direction keywords provided, including RELATIVE, ABSOLUTE or all the ALL variants. Author: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0tA6LbHCg2qSS+KuM850BZC_+ZgHV7Ug6BXw22TNyF+MA@mail.gmail.com	2025-07-02 08:39:25 +09:00
Nathan Bossart	32bcf568cb	Make more use of binaryheap_empty() and binaryheap_size(). A few places were accessing bh_size directly instead of via these handy macros. Author: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://postgr.es/m/CAJ7c6TPQMVL%2B028T4zuw9ZqL5Du9JavOLhBQLkJeK0RznYx_6w%40mail.gmail.com	2025-07-01 14:19:07 -05:00
Peter Eisentraut	fff0d1edf5	Improve code comment The previous wording was potentially confusing about the impact of the OVERRIDING clause on generated columns. Reword slightly to avoid that. Reported-by: jian he <jian.universality@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CACJufxFMBe0nPXOQZMLTH4Ry5Gyj4m%2B2Z05mRi9KB4hk8rGt9w%40mail.gmail.com	2025-07-01 18:42:07 +02:00
Tom Lane	29213636e6	Make safeguard against incorrect flags for fsync more portable. The existing code assumed that O_RDONLY is defined as 0, but this is not required by POSIX and is not true on GNU Hurd. We can avoid the assumption by relying on O_ACCMODE to mask the fcntl() result. (Hopefully, all supported platforms define that.) Author: Michael Banck <mbanck@gmx.net> Co-authored-by: Samuel Thibault Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/6862e8d1.050a0220.194b8d.76fa@mx.google.com Discussion: https://postgr.es/m/68480868.5d0a0220.1e214d.68a6@mx.google.com Backpatch-through: 13	2025-07-01 12:08:20 -04:00
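The O_ACCMODE idiom mentioned in the entry above can be sketched as follows. This is a minimal Python illustration of the POSIX technique, not the PostgreSQL code itself; the helper names `access_mode` and `is_writable` are hypothetical and chosen only for this example.

```python
import fcntl
import os


def access_mode(fd: int) -> int:
    """Return only the access-mode bits (O_RDONLY, O_WRONLY or O_RDWR) of fd."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    # Mask with O_ACCMODE instead of assuming that O_RDONLY is defined as 0;
    # POSIX does not guarantee that value.
    return flags & os.O_ACCMODE


def is_writable(fd: int) -> bool:
    """True if the descriptor was opened with write access (hypothetical helper)."""
    return access_mode(fd) in (os.O_WRONLY, os.O_RDWR)
```

Comparing the masked value against the named constants keeps the check correct on platforms where O_RDONLY is nonzero, since no numeric value is assumed.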
Jeff Davis	8af0d0ab01	Remove provider field from pg_locale_t. The behavior of pg_locale_t is specified by methods, so a separate provider field is no longer necessary. Reviewed-by: Andreas Karlsson <andreas@proxel.se> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/2830211e1b6e6a2e26d845780b03e125281ea17b.camel%40j-davis.com	2025-07-01 07:50:46 -07:00
Jeff Davis	5a38104b36	Control ctype behavior internally with a method table. Previously, pattern matching and case mapping behavior branched based on the provider. Refactor to use a method table, which is less error-prone. This is also a step toward multiple provider versions, which we may want to support in the future. Reviewed-by: Andreas Karlsson <andreas@proxel.se> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/2830211e1b6e6a2e26d845780b03e125281ea17b.camel%40j-davis.com	2025-07-01 07:44:47 -07:00
Jeff Davis	d81dcc8d62	Use pg_ascii_tolower()/pg_ascii_toupper() where appropriate. Avoids unnecessary dependence on setlocale(). No behavior change. This commit reverts `e1458f2f1b`, which reverted some changes unintentionally committed before the branch for 19. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/a8666c391dfcabe79868d95f7160eac533ace718.camel@j-davis.com Discussion: https://postgr.es/m/7efaaa645aa5df3771bb47b9c35df27e08f3520e.camel@j-davis.com	2025-07-01 07:24:23 -07:00
Tomas Vondra	81f287dc92	Silence valgrind about pg_numa_touch_mem_if_required When querying NUMA status of pages in shared memory, we need to touch the memory first to get valid results. This may trigger valgrind reports, because some of the memory (e.g. unpinned buffers) may be marked as noaccess. Solved by adding a valgrind suppression. An alternative would be to adjust the access/noaccess status before touching the memory, but that seems far too invasive. It would require all those places to have detailed knowledge of what the shared memory stores. The pg_numa_touch_mem_if_required() macro is replaced with a function. Macros are invisible to suppressions, so we'd have to suppress reports for the caller - e.g. pg_get_shmem_allocations_numa(). So we'd suppress reports for the whole function, and that seems too heavy-handed. It might easily hide other valid issues. Reviewed-by: Christoph Berg <myon@debian.org> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aEtDozLmtZddARdB@msg.df7cb.de Backpatch-through: 18	2025-07-01 12:32:23 +02:00
Amit Langote	c67989789c	Fix typos in comments Commit `19d8e2308b` added enum values with the prefix TU_, but a few comments still referred to TUUI_, which was used in development versions of the patches committed as `19d8e2308b`. Author: Yugo Nagata <nagata@sraoss.co.jp> Discussion: https://postgr.es/m/20250701110216.8ac8a9e4c6f607f1d954f44a@sraoss.co.jp Backpatch-through: 16	2025-07-01 13:13:48 +09:00
Michael Paquier	a3df0d43d9	Fix typo in system_views.sql's definition of pg_stat_activity backend_xmin used a lowercase 's' instead of the uppercase 'S' like the other attributes. This is harmless, but let's be consistent. Issue introduced in `dd1a3bccca`. Author: Daisuke Higuchi <higuchi.daisuke11@gmail.com> Discussion: https://postgr.es/m/CAEVT6c8M39cqWje-df39wWr0KWcDgGKd5fMvQo84zvCXKoEL9Q@mail.gmail.com	2025-07-01 09:41:42 +09:00
Michael Paquier	2e94721747	Improve error handling of libxml2 calls in xml.c This commit fixes some defects in the backend's xml.c, found upon inspection of the internals of libxml2: - xmlEncodeSpecialChars() can fail on malloc(), returning NULL back to the caller. xmltext() assumed that this could never happen. Like other code paths, a TRY/CATCH block is added there, covering also the fact that cstring_to_text_with_len() could fail a memory allocation, where the backend would fail to free the buffer allocated by xmlEncodeSpecialChars(). - Some libxml2 routines called in xmlelement() can return NULL, like xmlAddChildList() or xmlTextWriterStartElement(). Dedicated errors are added for them. - xml_xmlnodetoxmltype() missed that xmlXPathCastNodeToString() can fail on an allocation failure. In this case, the call can just be moved to the existing TRY/CATCH block. All these code paths would cause the server to crash. As this is unlikely to be a problem in practice, no backpatch is done. Jim and I have caught these defects, not sure who has scored the most. The contrib module xml2/ has similar defects, which will be addressed in a separate change. Reported-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Discussion: https://postgr.es/m/aEEingzOta_S_Nu7@paquier.xyz	2025-07-01 08:57:05 +09:00
Nathan Bossart	bd09f024a1	Add new OID alias type regdatabase. This provides a convenient way to look up a database's OID. For example, the query SELECT * FROM pg_shdepend WHERE dbid = (SELECT oid FROM pg_database WHERE datname = current_database()); can now be simplified to SELECT * FROM pg_shdepend WHERE dbid = current_database()::regdatabase; Like the regrole type, regdatabase has cluster-wide scope, so we disallow regdatabase constants from appearing in stored expressions. Bumps catversion. Author: Ian Lawrence Barwick <barwick@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/aBpjJhyHpM2LYcG0%40nathan	2025-06-30 15:38:54 -05:00
Peter Eisentraut	cc2ac0e6f9	Remove unused #include's in src/backend/utils/adt/* Author: Aleksander Alekseev <aleksander@timescale.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAJ7c6TOowVbR-0NEvvDm6a_mag18krR0XJ2FKrc9DHXj7hFRtQ%40mail.gmail.com	2025-06-30 12:00:00 +02:00
Michael Paquier	2252fcd427	Rationalize handling of VacuumParams This commit refactors the vacuum routines that rely on VacuumParams, adding const markers where necessary to force a new policy in the code. This structure should not use a pointer as it may be used across multiple relations, and its contents should never be updated. vacuum_rel() stands as an exception as it touches the "index_cleanup" and "truncate" options. VacuumParams has been introduced in `0d83138974`, and `661643deda` has fixed a bug impacting VACUUM operating on multiple relations. The changes done in tableam.h break ABI compatibility, so this commit can only happen on HEAD. Author: Shihao Zhong <zhong950419@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CAGRkXqTo+aK=GTy5pSc-9cy8H2F2TJvcrZ-zXEiNJj93np1UUw@mail.gmail.com	2025-06-30 15:42:50 +09:00
Tom Lane	66e9df9f6e	Fix some new issues with planning of PlaceHolderVars. In the wake of commit `a16ef313f`, we need to deal with more cases involving PlaceHolderVars in NestLoopParams than we did before. For one thing, `a16ef313f` was incorrect to suppose that we could rely on the required-outer relids of the lefthand path to decide placement of nestloop-parameter PHVs. As Richard Guo argued at the time, we must look at the required-outer relids of the join path itself. For another, we have to apply replace_nestloop_params() to such a PHV's expression, in case it contains references to values that will be supplied from NestLoopParams of higher-level nestloops. For another, we need to be more careful about the phnullingrels of the PHV than we were being. identify_current_nestloop_params only bothered to ensure that the phnullingrels didn't contain "too many" relids, but now it has to be exact, because setrefs.c will apply both NRM_SUBSET and NRM_SUPERSET checks in different places. We can compute the correct relids by determining the set of outer joins that should be able to null the PHV and then subtracting whatever's been applied at or below this join. Do the same for plain Vars, too. (This should make it possible to use NRM_EQUAL to process nestloop params in setrefs.c, but I won't risk making such a change in v18 now.) Lastly, if a nestloop parameter PHV was pulled up out of a subquery and it contains a subquery that was originally pushed down from this query level, then that will still be represented as a SubLink, because SS_process_sublinks won't recurse into outer PHVs, so it didn't get transformed during expression preprocessing in the subquery. We can substitute the version of the PHV's expression appearing in its PlaceHolderInfo to ensure that that preprocessing has happened. (Seems like this processing sequence could stand to be redesigned, but again, late in v18 development is not the time for that.) 
It's not very clear to me why the old have_dangerous_phv join-order restriction prevented us from seeing the last three of these problems. But given the lack of field complaints, it must have done so. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18953-1c9883a9d4afeb30@postgresql.org	2025-06-29 15:04:32 -04:00
Tom Lane	8319e5cb54	Obtain required table lock during cross-table constraint updates. Sometimes a table's constraint may depend on a column of another table, so that we have to update the constraint when changing the referenced column's type. We need to have lock on the constraint's table to do that. ATPostAlterTypeCleanup believed that this case was only possible for FOREIGN KEY constraints, but it's wrong at least for CHECK and EXCLUDE constraints; and in general, we'd probably need exclusive lock to alter any sort of constraint. So just remove the contype check and acquire lock for any other table. This prevents a "you don't have lock" assertion failure, though no ill effect is observed in production builds. We'll error out later anyway because we don't presently support physically altering column types within stored composite columns. But the catalog-munging is basically all there, so we may as well make that part work. Bug: #18970 Reported-by: Alexander Lakhin <exclusion@gmail.com> Diagnosed-by: jian he <jian.universality@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18970-a7d1cfe1f8d5d8d9@postgresql.org Backpatch-through: 13	2025-06-29 13:56:03 -04:00
Peter Eisentraut	50fd428b2b	Message style improvements	2025-06-28 19:18:06 +02:00
Alexander Korotkov	7195c804bd	Fix CheckPointReplicationSlots() with max_replication_slots == 0 `ca307d5cec` made CheckPointReplicationSlots() unconditionally call ReplicationSlotsComputeRequiredLSN(). This causes an assertion trap when max_replication_slots equals 0. This commit makes CheckPointReplicationSlots() call ReplicationSlotsComputeRequiredLSN() only when at least one slot gets its last_saved_restart_lsn updated. That avoids an assert trap and also saves some cycles when no slot has last_saved_restart_lsn updated. Based on ideas from Dilip Kumar <dilipbalaut@gmail.com> and Hayato Kuroda <kuroda.hayato@fujitsu.com>. Reported-by: Zhijie Hou <houzj.fnst@fujitsu.com> Discussion: https://postgr.es/m/OS0PR01MB5716BB506AF934376FF3A8BB947BA%40OS0PR01MB5716.jpnprd01.prod.outlook.com	2025-06-27 11:49:00 +03:00
Michael Paquier	94e2e150ec	Correct list of files in src/backend/lib/README binaryheap.c and stringinfo.c have been moved to src/common/ by respectively `5af0263afd` and `26aaf97b68`, and the README patched here still mentioned these two files as available in src/backend/lib/. Author: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://postgr.es/m/CAJ7c6TPg-=tC+fzq0tGTtmL7r79-aWeCmpwAyQiGu0N+sKGj8Q@mail.gmail.com	2025-06-27 09:31:23 +09:00
Peter Eisentraut	95e12d4d9b	Correct misleading error messages Commit `7d6d2c4bbd` dropped opcintype from the index AM strategy translation API. But some error messages about failed lookups still mentioned it, even though it was not used for the lookup. Fix by removing opcintype from the error messages as well.	2025-06-26 22:02:16 +02:00
Melanie Plageman	483f7246f3	Remove unused check in heap_xlog_insert() `8e03eb92e9` reverted the commit `39b66a91bd` which allowed freezing in the heap_insert() code path but forgot to remove the corresponding check in heap_xlog_insert(). This code is extraneous but not harmful. However, cleaning it up makes it very clear that, as of now, we do not support any freezing of pages in the heap_insert() path. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/flat/CAAKRu_Zp4Pi-t51OFWm1YZ-cctDfBhHCMZ%3DEx6PKxv0o8y2GvA%40mail.gmail.com Backpatch-through: 14	2025-06-26 15:03:48 -04:00
Melanie Plageman	060f420a03	Simplify vacuum VM update logging counters We can simplify the VM counters added in `dc6acfd910` to lazy_vacuum_heap_page() and lazy_scan_new_or_empty(). We won't invoke lazy_vacuum_heap_page() unless there are dead line pointers, so we know the page can't be all-visible. In lazy_scan_new_or_empty(), we only update the VM if the page-level hint PD_ALL_VISIBLE is clear, and the VM bit cannot be set if the page level bit is clear because a subsequent page update would fail to clear the visibility map bit. Simplify the logic for determining which log counters to increment based on this knowledge. Doing so is worthwhile because the old logic was confusing and misguided. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_a9w_n2mwY%3DG4LjfWTvRTJtjbfvnYAKi4WjO8QXHHrA0g%40mail.gmail.com	2025-06-26 14:25:45 -04:00
Fujii Masao	81ce602d48	Make CREATE TABLE LIKE copy comments on NOT NULL constraints when requested. Commit `14e87ffa5c` introduced support for adding comments to NOT NULL constraints. However, CREATE TABLE LIKE INCLUDING COMMENTS did not copy these comments to the new table. This was an oversight in that commit. This commit corrects the behavior by ensuring that CREATE TABLE LIKE also copies the comments on NOT NULL constraints when INCLUDING COMMENTS is specified. Author: Jian He <jian.universality@gmail.com> Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/127debef-e558-4784-9e24-0d5eaf91e2d1@oss.nttdata.com	2025-06-26 20:25:34 +09:00
Richard Guo	5069fef1cf	Expand virtual generated columns for ALTER COLUMN TYPE For the subcommand ALTER COLUMN TYPE of the ALTER TABLE command, the USING expression may reference virtual generated columns. These columns must be expanded before the expression is fed through expression_planner and the expression-execution machinery. Failing to do so can result in incorrect rewrite decisions, and can also lead to "ERROR: unexpected virtual generated column reference". Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/b5f96b24-ccac-47fd-9e20-14681b894f36@gmail.com	2025-06-26 12:17:12 +09:00
Peter Eisentraut	0cd69b3d7e	Restrict virtual columns to use built-in functions and types Just like selecting from a view is exploitable (CVE-2024-7348), selecting from a table with virtual generated columns is exploitable. Users who are concerned about this can avoid selecting from views, but telling them to avoid selecting from tables is less practical. To address this, this changes it so that generation expressions for virtual generated columns are restricted to using built-in functions and types, and the columns are restricted to having a built-in type. We assume that built-in functions and types cannot be exploited for this purpose. In the future, this could be expanded by some new mechanism to declare other functions and types as safe or trusted for this purpose, but that is to be designed. (An alternative approach might have been to expand the restrict_nonsystem_relation_kind GUC to handle this, like the fix for CVE-2024-7348. But that is kind of an ugly approach. That fix had to fit in the constraints of fixing an ancient vulnerability in all branches. Since virtual generated columns are new, we're free from the constraints of the past, and we can and should use cleaner options.) Reported-by: Feike Steenbergen <feikesteenbergen@gmail.com> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAK_s-G2Q7de8Q0qOYUR%3D_CTB5FzzVBm5iZjOp%2BmeVWpMpmfO0w%40mail.gmail.com	2025-06-25 09:56:49 +02:00
Michael Paquier	661643deda	Avoid scribbling of VACUUM options This fixes two issues with the handling of VacuumParams in vacuum_rel(). This code path has the idea to change the passed-in pointer of VacuumParams for the "truncate" and "index_cleanup" options for the relation worked on, impacting the two following scenarios where incorrect options may be used because a VacuumParams pointer is shared across multiple relations: - Multiple relations in a single VACUUM command. - TOAST relations vacuumed with their main relation. The problem is avoided by providing to the two callers of vacuum_rel() copies of VacuumParams, before the pointer is updated for the "truncate" and "index_cleanup" options. The refactoring of the VACUUM option and parameters done in `0d83138974` did not introduce an issue, but it has encouraged the problem we are dealing with in this commit, with `b84dbc8eb8` for "truncate" and `a96c41feec` for "index_cleanup" that have been added a couple of years after the initial refactoring. HEAD will be improved with a different patch that hardens the uses of VacuumParams across the tree. This cannot be backpatched as it introduces an ABI breakage. The backend portion of the patch has been authored by Nathan, while I have implemented the tests. The tests rely on injection points to check the option values, making them faster, more reliable than the tests originally proposed by Shihao, and they also provide more coverage. This part can only be backpatched down to v17. Reported-by: Shihao Zhong <zhong950419@gmail.com> Author: Nathan Bossart <nathandbossart@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAGRkXqTo+aK=GTy5pSc-9cy8H2F2TJvcrZ-zXEiNJj93np1UUw@mail.gmail.com Backpatch-through: 13	2025-06-25 10:03:46 +09:00
Tom Lane	fd519419c9	Prevent excessive delays before launching new logrep workers. The logical replication launcher process would sometimes sleep for as much as 3 minutes before noticing that it is supposed to launch a new worker. This could happen if (1) WaitForReplicationWorkerAttach absorbed a process latch wakeup that was meant to cause ApplyLauncherMain to do work, or (2) logicalrep_worker_launch reported failure, either because of resource limits or because the new worker terminated immediately. In case (2), the expected behavior is that we retry the launch after wal_retrieve_retry_interval, but that didn't reliably happen. It's not clear how often such conditions would occur in the field, but in our subscription test suite they are somewhat common, especially in tests that exercise cases that cause quick worker failure. That causes the tests to take substantially longer than they ought to do on typical setups. To fix (1), make WaitForReplicationWorkerAttach re-set the latch before returning if it cleared it while looping. To fix (2), ensure that we reduce wait_time to no more than wal_retrieve_retry_interval when logicalrep_worker_launch reports failure. In passing, fix a couple of perhaps-hypothetical race conditions, e.g. examining worker->in_use without a lock. Backpatch to v16. Problem (2) didn't exist before commit `5a3a95385` because the previous code always set wait_time to wal_retrieve_retry_interval when launching a worker, regardless of success or failure of the launch. That behavior also greatly mitigated problem (1), so I'm not excited about adapting the remainder of the patch to the substantially-different code in older branches. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/817604.1750723007@sss.pgh.pa.us Backpatch-through: 16	2025-06-24 14:14:07 -04:00
Álvaro Herrera	c2da1a5d63	Make query jumbling also squash PARAM_EXTERN params Commit `62d712ecfd` made query jumbling squash lists of Consts as a single element, but there's no reason not to treat PARAM_EXTERN parameters the same. For these purposes, these values are indeed constants for any particular execution of a query. In particular, this should make list squashing more useful for applications using extended query protocol, which would use parameters extensively. A complication arises: if a query has both external parameters and squashable lists, then the parameter number used as placeholder for the squashed list might be inconsistent with regards to the parameter numbers used by the query literal. To reduce the surprise factor, all parameters are renumbered starting from 1 in that case. Author: Sami Imseih <samimseih@gmail.com> Author: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAA5RZ0tRXoPG2y6bMgBCWNDt0Tn=unRerbzYM=oW0syi1=C1OA@mail.gmail.com	2025-06-24 19:36:32 +02:00
Álvaro Herrera	debad29d22	Improve jumble squashing through CoerceViaIO and RelabelType There's no principled reason for query jumbling to only remove the first layer of RelabelType and CoerceViaIO. Change it to see through as many layers as there are.	2025-06-24 19:36:12 +02:00
Peter Eisentraut	49fe1c83ec	Fix virtual generated column type checking for ALTER TABLE Virtual generated columns have some special checks in CheckAttributeType(), mainly to check that domains are not used. But this check was only applied during CREATE TABLE, not during ALTER TABLE. This fixes that. Reported-by: jian he <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/CACJufxE0KHR__-h=zHXbhSNZXMMs4LYo4-dbj8H3YoStYBok1Q@mail.gmail.com	2025-06-24 11:31:26 +02:00
Amit Kapila	6531f36283	Fix missing comment update in `1462aad2e4`. Remove the part of comment that says we don't allow toggling two_phase option as that is supported in commit `1462aad2e4`. Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Author: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OSCPR01MB1496656725F3951AEE8749EBDF579A@OSCPR01MB14966.jpnprd01.prod.outlook.com	2025-06-24 09:51:07 +05:30
Alexander Korotkov	70d8a91f82	Remove excess assert from InvalidatePossiblyObsoleteSlot() `ca307d5cec` introduced keeping WAL segments by slot's last saved restart LSN. It also added an assertion that the slot's restart LSN never goes backward. However, situations when the restart LSN goes backward have been spotted by buildfarm animals and investigated in the thread. When pg_receivewal starts the replication, it sets the last replayed LSN to the beginning of the segment, which is older than what ReplicationSlotReserveWal() set for the slot. A similar situation can happen to pg_basebackup. When standby reconnects to the primary, it sends the last replayed LSN, which might be older than the last confirmed flush LSN. In both these situations, a concurrent checkpoint may trigger an assert trap. Based on ideas from Vitaly Davydov <v.davydov@postgrespro.ru>, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>, Vignesh C <vignesh21@gmail.com>, Amit Kapila <amit.kapila16@gmail.com>. Reported-by: Vignesh C <vignesh21@gmail.com> Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CALDaNm3s-jpQTe1MshsvQ8GO%3DTLj233JCdkQ7uZ6pwqRVpxAdw%40mail.gmail.com Reviewed-by: Vignesh C <vignesh21@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>	2025-06-23 21:27:42 +03:00
Tom Lane	ea06263c4a	Doc: improve documentation about width_bucket(). Specify whether the bucket bounds are inclusive or exclusive, and improve some other vague language. Explain the behavior that occurs when the "low" bound is greater than the "high" bound. Make width_bucket_numeric's comment more like that for width_bucket_float8, in particular noting that infinite bounds are rejected (since they became possible in v14). Reported-by: Ben Peachey Higdon <bpeacheyhigdon@gmail.com> Author: Robert Treat <rob@xzilla.net> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/2BD74F86-5B89-4AC1-8F13-23CED3546AC1@gmail.com Backpatch-through: 13	2025-06-21 12:52:37 -04:00
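The bucket-bound behavior that the width_bucket() entry above refers to can be illustrated with a small model. This is a rough Python sketch of my understanding of the documented semantics (lower bounds inclusive in the normal case, mirror-reversed when "low" is greater than "high"), not the server's C implementation; NaN handling and overflow protection are deliberately left out, and the error messages are illustrative only.

```python
import math


def width_bucket(operand: float, low: float, high: float, count: int) -> int:
    """Simplified model of width_bucket() for finite float bounds."""
    if count <= 0:
        raise ValueError("count must be greater than zero")
    if low == high:
        raise ValueError("lower bound cannot equal upper bound")
    if math.isinf(low) or math.isinf(high):
        raise ValueError("lower and upper bounds must be finite")

    if low < high:
        # Normal case: each bucket includes its lower bound, excludes its upper.
        if operand < low:
            return 0
        if operand >= high:
            return count + 1
        return int((operand - low) / (high - low) * count) + 1

    # Reversed case (low > high): bucket 1 sits just below "low", and the
    # inclusive bound of each bucket is now its numerically larger end.
    if operand > low:
        return 0
    if operand <= high:
        return count + 1
    return int((low - operand) / (low - high) * count) + 1
```

For instance, `width_bucket(5.35, 0.024, 10.06, 5)` lands in bucket 3 under this model, while an operand equal to "high" falls into the overflow bucket `count + 1` in both orientations.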
Tom Lane	a16ef313f2	Remove planner's have_dangerous_phv() join-order restriction. Commit `85e5e222b`, which added (a forerunner of) this logic, argued that Adding the necessary complexity to make this work doesn't seem like it would be repaid in significantly better plans, because in cases where such a PHV exists, there is probably a corresponding join order constraint that would allow a good plan to be found without using the star-schema exception. The flaw in this claim is that there may be other join-order restrictions that prevent us from finding a join order that doesn't involve a "dangerous" PHV. In particular we now recognize that small join_collapse_limit or from_collapse_limit could prevent it. Therefore, let's bite the bullet and make the case work. We don't have to extend the executor's support for nestloop parameters as I thought at the time, because we can instead push the evaluation of the placeholder's expression into the left-hand input of the NestLoop node. So there's not really a lot of downside to this solution, and giving the planner more join-order flexibility should have value beyond just avoiding failure. Having said that, there surely is a nonzero risk of introducing new bugs. Since this failure mode escaped detection for ten years, such cases don't seem common enough to justify a lot of risk. Therefore, let's put this fix into master but leave the back branches alone (for now anyway). Bug: #18953 Reported-by: Alexander Lakhin <exclusion@gmail.com> Diagnosed-by: Richard Guo <guofenglinux@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18953-1c9883a9d4afeb30@postgresql.org	2025-06-20 15:55:12 -04:00
Tom Lane	5861b1f343	Use SnapshotDirty when checking for conflicting index names. While choosing an autogenerated name for an index, look for pre-existing relations using a SnapshotDirty snapshot, instead of the previous behavior that considered only committed-good pg_class rows. This allows us to detect and avoid conflicts against indexes that are still being built. It's still possible to fail due to a race condition, but the window is now just the amount of time that it takes DefineIndex to validate all its parameters, call smgrcreate(), and enter the index's pg_class row. Formerly the race window covered the entire time needed to create and fill an index, which could be very long if the table is large. Worse, if the conflicting index creation is part of a larger transaction, it wouldn't be visible till COMMIT. So this isn't a complete solution, but it should greatly ameliorate the problem, and the patch is simple enough to be back-patchable. It might at some point be useful to do the same for pg_constraint entries (cf. ChooseConstraintName, ConstraintNameExists, and related functions). However, in the absence of field complaints, I'll leave that alone for now. The relation-name test should be good enough for index-based constraints, while foreign-key constraints seem to be okay since they require exclusive locks to create. Bug: #18959 Reported-by: Maximilian Chrzan <maximilian.chrzan@here.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Discussion: https://postgr.es/m/18959-f63b53b864bb1417@postgresql.org Backpatch-through: 13	2025-06-20 13:41:11 -04:00
Amit Kapila	1546e17f9d	Improve log messages and docs for slot synchronization. Improve the clarity of LOG messages when a failover logical slot synchronization fails, making the reasons more explicit for easier debugging. Update the documentation to outline scenarios where slot synchronization can fail, especially during the initial sync, and emphasize that pg_sync_replication_slot() is primarily intended for testing and debugging purposes. We also discussed improving the functionality of pg_sync_replication_slot() so that it can be used reliably, but we would take up that work for next version after some more discussion and review. Reported-by: Suraj Kharage <suraj.kharage@enterprisedb.com> Author: shveta malik <shveta.malik@gmail.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 17, where it was introduced Discussion: https://postgr.es/m/CAF1DzPWTcg+m+x+oVVB=y4q9=PYYsL_mujVp7uJr-_oUtWNGbA@mail.gmail.com	2025-06-19 09:48:08 +05:30
Fujii Masao	db0c93f172	doc: Mention GIN indexes support parallel builds. Commit `8492feb98f` added support for parallel CREATE INDEX on GIN indexes. However, previously two places in the documentation and two in the source code comments still stated that only B-tree and BRIN indexes support parallel builds. This commit updates those references to correctly include GIN indexes. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Discussion: https://postgr.es/m/7d27d068-90e2-4022-9bd7-09b0fd3d4f47@oss.nttdata.com	2025-06-19 09:12:34 +09:00
Michael Paquier	9e1183953f	Document "relrewrite" at the top of heap_create_with_catalog() This parameter has been introduced in `325f2ec555`, and it was not documented contrary to all the other arguments of heap_create_with_catalog(). Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp> Reviewed-by: Steven Niu <niushiji@gmail.com> Discussion: https://postgr.es/m/aE--bmEv-gJUTH5v@paquier.xyz	2025-06-18 11:03:21 +09:00
Masahiko Sawada	d87d07b7ad	Fix re-distributing previously distributed invalidation messages during logical decoding. Commit `4909b38af0` introduced logic to distribute invalidation messages from catalog-modifying transactions to all concurrent in-progress transactions. However, since each transaction distributes not only its original invalidation messages but also previously distributed messages to other transactions, this leads to an exponential increase in allocation request size for invalidation messages, ultimately causing memory allocation failure. This commit fixes this issue by tracking distributed invalidation messages separately per decoded transaction and not redistributing these messages to other in-progress transactions. The maximum size of distributed invalidation messages that one transaction can store is limited to MAX_DISTR_INVAL_MSG_PER_TXN (8MB). Once the size of the distributed invalidation messages exceeds this threshold, we invalidate all caches in locations where distributed invalidation messages need to be executed. Back-patch to all supported versions where we introduced the fix by commit `4909b38af0`. Note that this commit adds two new fields to ReorderBufferTXN to store the distributed transactions. This change breaks ABI compatibility in back branches, affecting third-party extensions that depend on the size of the ReorderBufferTXN struct, though this scenario seems unlikely. Additionally, it adds a new flag to the txn_flags field of ReorderBufferTXN to indicate distributed invalidation message overflow. This should not affect existing implementations, as it is unlikely that third-party extensions use unused bits in the txn_flags field. 
Bug: #18938 #18942 Author: vignesh C <vignesh21@gmail.com> Reported-by: Duncan Sands <duncan.sands@deepbluecap.com> Reported-by: John Hutchins <john.hutchins@wicourts.gov> Reported-by: Laurence Parry <greenreaper@hotmail.com> Reported-by: Max Madden <maxmmadden@gmail.com> Reported-by: Braulio Fdo Gonzalez <brauliofg@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Discussion: https://postgr.es/m/680bdaf6-f7d1-4536-b580-05c2760c67c6@deepbluecap.com Discussion: https://postgr.es/m/18942-0ab1e5ae156613ad@postgresql.org Discussion: https://postgr.es/m/18938-57c9a1c463b68ce0@postgresql.org Discussion: https://postgr.es/m/CAD1FGCT2sYrP_70RTuo56QTizyc+J3wJdtn2gtO3VttQFpdMZg@mail.gmail.com Discussion: https://postgr.es/m/CANO2=B=2BT1hSYCE=nuuTnVTnjidMg0+-FfnRnqM6kd23qoygg@mail.gmail.com Backpatch-through: 13	2025-06-16 17:36:01 -07:00
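The growth pattern described in the entry above can be illustrated with a toy model. This is a minimal Python sketch, not the ReorderBuffer code: it assumes every transaction starts with one invalidation message of its own, that all transactions are concurrent, and that they commit in order. The function `simulate` and its parameters are hypothetical, and the 8MB MAX_DISTR_INVAL_MSG_PER_TXN cap is not modeled.

```python
def simulate(n_txns: int, redistribute: bool) -> list[int]:
    """Return how many messages each transaction distributes at commit.

    redistribute=True models the old behavior: a committing transaction
    forwards its own messages plus everything it received earlier.
    redistribute=False models the fix: only its own messages are forwarded.
    """
    own = [1] * n_txns        # messages each transaction generated itself
    received = [0] * n_txns   # messages distributed to it by earlier commits
    sent = []
    for i in range(n_txns):
        payload = own[i] + (received[i] if redistribute else 0)
        sent.append(payload)
        # Every transaction that is still in progress receives the payload.
        for j in range(i + 1, n_txns):
            received[j] += payload
    return sent


old_behavior = simulate(5, redistribute=True)
new_behavior = simulate(5, redistribute=False)
print(old_behavior)
print(new_behavior)
```

In this model the payload doubles with every commit when received messages are forwarded again, and stays constant when they are tracked separately and not redistributed.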
David Rowley	33b06a2001	Fix possible Assert failure in verify_compact_attribute() Sometimes the TupleDesc used in verify_compact_attribute() is shared among backends, and since CompactAttribute.attcacheoff gets updated during tuple deformation, it was possible that another backend would set attcacheoff on a given CompactAttribute in the small window of time from when the attcacheoff from the live CompactAttribute was being set in the 'tmp' CompactAttribute and before the Assert verifying that the live and tmp CompactAttributes matched. Here we adjust the code to make a copy of the live CompactAttribute so that we're not trying to Assert against a shared copy of it. Author: David Rowley <dgrowleyml@gmail.com> Reported-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/7195e408-758c-4031-8e61-4f842c716ac0@gmail.com	2025-06-17 10:49:36 +12:00
Andres Freund	e9a3615a52	aio: Add missing memory barrier when waiting for IO handle Previously there was no memory barrier enforcing correct memory ordering when waiting for a free IO handle. However, in the much more common case of waiting for IO to complete, memory barriers already were present. On strongly ordered architectures like x86 this had no negative consequences, but on some armv8 hardware (observed on Apple hardware), it was possible for the update, in the IO worker, to PgAioHandle->state to become visible before ->distilled_result becoming visible, leading to rather confusing assertion failures. The failures were rare enough that the bug sometimes took days to reproduce when running 027_stream_regress in a loop. Once finally debugged, it was easy enough to come up with a much quicker repro: Trigger a lot of very fast IO by limiting io_combine_limit to 1 and ensure that we always have to wait for a free handle by setting io_max_concurrency to 1. Triggering lots of concurrent seqscans in that setup triggers the issue within seconds. One reason this was hard to debug was that the assertion failure most commonly happened in WaitReadBuffers(), rather than in the AIO subsystem itself. The assertions added in this commit make problems like this easier to understand. Also add a comment to the IO worker explaining that we rely on the lwlock acquisition for correct memory ordering. I think it'd be good to add a tap test that stress tests buffer IO, but that's material for a separate patch. Thanks a lot to Alexander and Konstantin for all the debugging help. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Reported-by: Alexander Lakhin <exclusion@gmail.com> Investigated-by: Andres Freund <andres@anarazel.de> Investigated-by: Alexander Lakhin <exclusion@gmail.com> Investigated-by: Konstantin Knizhnik <knizhnik@garret.ru> Discussion: https://postgr.es/m/2dkz7azclpeiqcmouamdixyn5xhlzy4rvikxrbovyzvi6rnv5c@pz7o7osv2ahf	2025-06-16 12:36:01 -04:00
Tom Lane	b27644bade	Sync typedefs.list with the buildfarm. Our maintenance of typedefs.list has been a little haphazard (and apparently we can't alphabetize worth a darn). Replace the file with the authoritative list from our buildfarm, and run pgindent using that. I also updated the additions/exclusions lists in pgindent where necessary to keep pgindent from messing things up significantly. Notably, now that regex_t and some related names are macros not real typedefs, we have to whitelist them explicitly. The exclusions list has also drifted noticeably, presumably due to changes of system headers on the buildfarm animals that contribute to the list. Unlike in prior years, I've not manually added typedef names that are missing from the buildfarm's list because they are not used to declare any variables or fields. So there are a few places where the typedef declaration itself is formatted worse than before, e.g. typedef enum IoMethod. I could preserve the names that were manually added to the list previously, but I'd really prefer to find a less manual way of dealing with these cases. A quick grep finds about 75 such symbols, most of which have never gotten any special treatment. Per discussion among pgsql-release, doing this now seems appropriate even though we're still a week or two away from making the v18 branch.	2025-06-15 13:04:24 -04:00
David Rowley	2f98f967fa	Improve comments for TidRangeEval Here we provide a bit more detail on why TidRangeEval() does return false when trss_mintid is greater than trss_maxtid. Reported-by: Junwang Zhao <zhjwpku@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CAEG8a3KUbUUqQgfK5X8Sj-%2BppPtGNTU%2BZiep0Rxr7SLjoR%2BB6w%40mail.gmail.com	2025-06-14 17:18:31 +12:00
Alexander Korotkov	eb124c3d6d	Add TAP tests to check replication slot advance during the checkpoint The new tests verify that logical and physical replication slots are still valid after an immediate restart on checkpoint completion when the slot was advanced during the checkpoint. This commit introduces two new injection points to make these tests possible: * checkpoint-before-old-wal-removal - triggered in the checkpointer process just before old WAL segments cleanup; * logical-replication-slot-advance-segment - triggered in LogicalConfirmReceivedLocation() when restart_lsn was changed enough to point to the next WAL segment. Discussion: https://postgr.es/m/flat/1d12d2-67235980-35-19a406a0%4063439497 Author: Vitaly Davydov <v.davydov@postgrespro.ru> Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 17	2025-06-14 03:55:21 +03:00
Alexander Korotkov	ca307d5cec	Keep WAL segments by slot's last saved restart LSN The patch fixes the issue with the unexpected removal of old WAL segments after checkpoint, followed by an immediate restart. The issue occurs when a slot is advanced after the start of the checkpoint and before old WAL segments are removed at the end of the checkpoint. The patch introduces a new in-memory state for slots: last_saved_restart_lsn, which is used to calculate the oldest LSN for removing WAL segments. This state is updated every time with the current restart_lsn at the moment when the slot is saved to disk. This fix changes the shared memory layout. It's applied to HEAD only because we don't have to preserve ABI compatibility during the beta stage. Another fix that doesn't affect the ABI is committed to back branches. Discussion: https://postgr.es/m/1d12d2-67235980-35-19a406a0%4063439497 Author: Vitaly Davydov <v.davydov@postgrespro.ru> Author: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>	2025-06-14 03:36:04 +03:00
Peter Geoghegan	c45a1dba0d	nbtree: _bt_readnextpage doesn't affect markPos. _bt_readnextpage expects so->currPos.buf to be InvalidBuffer (and for the position's page to be unlocked) when called. However, it does not expect there to be no pins held on any page. In particular, so->markPos might hold a separate pin, both before and after the call. Fix some comments that seemed to suggest otherwise. Follow-up commit to commit `7c319f54`, which made _bt_killitems drop pins it acquired itself.	2025-06-13 19:58:47 -04:00
Jeff Davis	a0c7b76537	Comment fixups from `626df47ad9`. Reported-by: Peter Smith <smithpb2250@gmail.com> Discussion: https://postgr.es/m/CAHut+PspbHQmRCBL1c-opoJeTUKUaFFfUQJd2rhDZqwUrWCi7w@mail.gmail.com	2025-06-13 10:02:24 -07:00
Michael Paquier	2c76c6ac47	Replace %llu by PRIu64 in AIO io_uring code This is a continuation of `15a79c7311`, cleaning up the AIO io_uring code that has been committed after that while still using %llu. The code changed here is new in v18, so cleaning things now means fewer conflicts if this area of the code changes on backpatch once the 18 stable branch is created. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/aEZcGCnYFq642q8k@paquier.xyz	2025-06-13 08:59:47 +09:00
Álvaro Herrera	0f65f3eec4	Fix squashing algorithm for query texts The algorithm to squash lists of constants added by commit `62d712ecfd` was a bit too simplistic; we wanted to avoid adding unnecessary complexity, but cases like direct function calls of typecasting functions (and others) were missed, and bogus SQL syntax was being shown in pg_stat_statements normalized query text field. To fix normalization for those cases, we need the parser to transmit information about where each list of constant values starts and ends, so add that to a couple of nodes. Also add a few more test cases to make sure we're doing the right thing. The patch initially submitted by Sami added a new private struct in gram.y to carry the start/end information for A_Expr, but I (Álvaro) decided that a better fix was to remove the parser indirection via the in_expr production, and instead create separate components in the a_expr rule. I'm surprised that this works and doesn't require more changes, but I assume (without checking) that the grammar used to be more complex and got simplified at some point. Bump catversion. Author: Sami Imseih <samimseih@gmail.com> Author: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAA5RZ0tRXoPG2y6bMgBCWNDt0Tn=unRerbzYM=oW0syi1=C1OA@mail.gmail.com	2025-06-12 14:21:21 +02:00
Michael Paquier	f85f6ab051	Revert support for improved tracking of nested queries This commit reverts the two following commits: - `499edb0974`, track more precisely query locations for nested statements. - `06450c7b8c`, a follow-up fix of `499edb0974` with query locations. The test introduced in this commit is not reverted. This is proving useful to track a problem that only pgaudit was able to detect. These prove to have issues with the tracking of SELECT statements, when these use multiple parentheses, which is something supported by the grammar. Incorrect location and lengths are causing pg_stat_statements to become confused, failing its job in query normalization with potential out-of-bound writes because the location and the length may not match with what can be handled. A lot of the query patterns discussed when this issue was reported have no test coverage in the main regression test suite, or the recovery test 027_stream_regress.pl would have caught the problems as pg_stat_statements is loaded by the node running the regression tests. A first step would be to improve the test coverage to stress more the query normalization logic. A different portion of this work was done in `45e0ba30fc`, with the addition of tests for nested queries. These can be left in the tree. They are useful to track the way inner queries are currently tracked by PGSS with non-top-level entries, and will be useful when reconsidering in the future the work reverted here. Reported-by: Alexander Kozhemyakin <a.kozhemyakin@postgrespro.ru> Discussion: https://postgr.es/m/18947-cdd2668beffe02bf@postgresql.org	2025-06-12 10:08:55 +09:00
Peter Geoghegan	dd2ce37927	Revert "nbtree: Remove useless row compare arg." This reverts commit `54c6ea8c81`. Further analysis has shown that the forcenonrequired row compare behavior is in fact necessary, despite the new restrictions on RowCompares imposed by _bt_set_startikey following commit `5f4d98d4`. Discussion: https://postgr.es/m/CAH2-Wzm3bKcz3TbHGem3_+SinEyG=VZVPbApQghp7YiZj+MM3g@mail.gmail.com	2025-06-11 18:16:15 -04:00
Jeff Davis	e1458f2f1b	Revert a few small patches that were intended for version 19. - `4c787a24e7` - `78bd364ee3` - `7a6880fadc` - `8898082a5d` Suggested-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CA+TgmoZ=J=PVNZUNKaxULu+KUVSt3Y-aJ1DZ9Y3Co6mu0z62jA@mail.gmail.com Discussion: https://postgr.es/m/60e8c6d0a6c08e67f15dbbe9e53df0119c710065.camel@j-davis.com	2025-06-11 15:10:12 -07:00
Peter Geoghegan	7c319f5491	Make _bt_killitems drop pins it acquired itself. Teach nbtree's _bt_killitems to leave the so->currPos page that it sets LP_DEAD items on in whatever state it was in when _bt_killitems was called. In particular, make sure that so->dropPin scans don't acquire a pin whose reference is saved in so->currPos.buf. Allowing _bt_killitems to change so->currPos.buf like this is wrong. The immediate consequence of allowing it is that code in _bt_steppage (that copies so->currPos into so->markPos) will behave as if the scan is a !so->dropPin scan. so->markPos will therefore retain the buffer pin indefinitely, even though _bt_killitems only needs to acquire a pin (along with a lock) for long enough to mark known-dead items LP_DEAD. This issue came to light following a report of a failure of an assertion from recent commit `e6eed40e`. The test case in question involves the use of mark and restore. An initial call to _bt_killitems takes place that leaves so->currPos.buf in a state that is inconsistent with the scan being so->dropPin. A subsequent call to _bt_killitems for the same position (following so->currPos being saved in so->markPos, and then restored as so->currPos) resulted in the failure of an assertion that tests that so->currPos.buf is InvalidBuffer when the scan is so->dropPin (non-assert builds got a "resource was not closed" WARNING instead). The same problem exists on earlier releases, though the issue is far more subtle there. Recent commit `e6eed40e` introduced the so->dropPin field as a partial replacement for testing so->currPos.buf directly. Earlier releases won't get an assertion failure (or buffer pin leak), but they will allow the second _bt_killitems call from the test case to behave as if a buffer pin was consistently held since the original call to _bt_readpage. 
This is wrong; there will have been an initial window during which no pin was held on the so->currPos page, and yet the second _bt_killitems call will neglect to check if so->currPos.lsn continues to match the page's now-current LSN. As a result of all this, it's just about possible that _bt_killitems will set the wrong items LP_DEAD (on release branches). This could only happen with merge joins (the sole user of nbtree mark/restore support), when a concurrently inserted index tuple used a recently-recycled TID (and only when the new tuple was inserted onto the same page as a distinct concurrently-removed tuple with the same TID). This is exactly the scenario that _bt_killitems' check of the page's now-current LSN against the LSN stashed in currPos was supposed to prevent. A follow-up commit will make nbtree completely stop conditioning whether or not a position's pin needs to be dropped on whether the 'buf' field is set. All call sites that might need to drop a still-held pin will be taught to rely on the scan-level so->dropPin field recently introduced by commit `e6eed40e`. That will make bugs of the same general nature as this one impossible (or make them much easier to detect, at least). Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/545be1e5-3786-439a-9257-a90d30f8b849@gmail.com Backpatch-through: 13	2025-06-11 09:17:35 -04:00
Tom Lane	137935bd11	Don't reduce output request size on non-Unix-socket connections. Traditionally, libpq's pqPutMsgEnd has rounded down the amount-to-send to be a multiple of 8K when it is eagerly writing some data. This still seems like a good idea when sending through a Unix socket, as pipes typically have a buffer size of 8K or some fraction/multiple of that. But there's not much argument for it on a TCP connection, since (a) standard MTU values are not commensurate with that, and (b) the kernel typically applies its own packet splitting/merging logic. Worse, our SSL and GSSAPI code paths both have API stipulations that if they fail to send all the data that was offered in the previous write attempt, we mustn't offer less data in the next attempt; else we may get "SSL error: bad length" or "GSSAPI caller failed to retransmit all data needing to be retried". The previous write attempt might've been pqFlush attempting to send everything in the buffer, so pqPutMsgEnd can't safely write less than the full buffer contents. (Well, we could add some more state to track exactly how much the previous write attempt was, but there's little value evident in such extra complication.) Hence, apply the round-down only on AF_UNIX sockets, where we never use SSL or GSSAPI. Interestingly, we had a very closely related bug report before, which I attempted to fix in commit `d053a879b`. But the test case we had then seemingly didn't trigger this pqFlush-then-pqPutMsgEnd scenario, or at least we failed to recognize this variant of the bug. Bug: #18907 Reported-by: Dorjpalam Batbaatar <htgn.dbat.95@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18907-d41b9bcf6f29edda@postgresql.org Backpatch-through: 13	2025-06-10 18:39:34 -04:00
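The rule described in the entry above can be sketched as follows. This is a minimal Python illustration, not libpq's C code: it assumes the only inputs are the number of buffered bytes and whether the connection is a Unix-domain socket, and the names `CHUNK` and `bytes_to_offer` are hypothetical.

```python
CHUNK = 8192  # the 8K granularity mentioned in the commit message


def bytes_to_offer(pending: int, is_unix_socket: bool) -> int:
    """Return how many buffered bytes an eager write should offer.

    On a Unix-domain socket the amount is rounded down to a multiple of
    CHUNK. On any other connection the full buffer is offered, so a retry
    after a partial SSL or GSSAPI write never offers less data than the
    previous attempt did.
    """
    if is_unix_socket:
        return pending - (pending % CHUNK)
    return pending


unix_amount = bytes_to_offer(20000, is_unix_socket=True)
tcp_amount = bytes_to_offer(20000, is_unix_socket=False)
print(unix_amount, tcp_amount)
```

With 20000 bytes buffered, the Unix-socket path offers 16384 bytes and leaves the remainder for a later write, while the TCP path offers all 20000.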
Jeff Davis	8898082a5d	inet_net_pton.c: use pg_ascii_tolower() rather than tolower(). Avoid dependence on setlocale(). No behavior change. Discussion: https://postgr.es/m/9875f7f9-50f1-4b5d-86fc-ee8b03e8c162@eisentraut.org Reviewed-by: Peter Eisentraut <peter@eisentraut.org>	2025-06-10 11:23:20 -07:00
Jeff Davis	4c787a24e7	copyfromparse.c: use pg_ascii_tolower() rather than tolower(). Avoid dependence on setlocale(). No behavior change. Discussion: https://postgr.es/m/9875f7f9-50f1-4b5d-86fc-ee8b03e8c162@eisentraut.org Reviewed-by: Peter Eisentraut <peter@eisentraut.org>	2025-06-10 11:22:57 -07:00
Etsuro Fujita	7d4667c620	Revert "postgres_fdw: Inherit the local transaction's access/deferrable modes." We concluded that commit `e5a3c9d9b` is a feature rather than a fix; since it was added after feature freeze, revert it. Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reported-by: Michael Paquier <michael@paquier.xyz> Reported-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/ed2296f1-1a6b-4932-b870-5bb18c2591ae%40oss.nttdata.com	2025-06-08 17:30:00 +09:00
Jeff Davis	5b40feab59	Improve CREATE DATABASE error message for invalid libc locale. Discussion: https://postgr.es/m/73959a14-267b-49c1-8293-291b175682cb@manitou-mail.org Reviewed-by: Daniel Verite <daniel@manitou-mail.org>	2025-06-06 15:28:51 -07:00
Nathan Bossart	a31767fc09	Use NULL instead of 0 for pointer arguments. Commit `5fe08c006c` fixed this for calls to dshash_create(). This commit fixes calls to dshash_attach() and dsa_create_in_place(). Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aECi_gSD9JnVWQ8T%40nathan	2025-06-06 12:08:17 -05:00
Peter Geoghegan	e6eed40e44	Avoid BufferGetLSNAtomic() calls during nbtree scans. Delay calling BufferGetLSNAtomic() until we finish reading a page that actually contains items that btgettuple will return to the executor. This reduces the number of calls during plain index scans (we'll only call BufferGetLSNAtomic() when _bt_readpage returns true), and totally eliminates calls during index-only scans, bitmap index scans, and plain index scans of an unlogged relation. Currently, when checksums (or wal_log_hints) are enabled, acquiring a page's LSN in BufferGetLSNAtomic() involves locking the buffer header (which involves the use of spinlocks). Testing has shown that enabling page-level checksums causes large regressions with certain workloads, especially on larger multi-socket systems. The regression isn't tied to any Postgres 18 commit. However, Postgres 18 commit `04bec894` made initdb use checksums by default, so it seems prudent to address the problem now. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/941f0190-e3c6-4622-9ac7-c04e936e5fdb@vondra.me Discussion: https://postgr.es/m/CAH2-Wzk-Dg5XWs_jDuiHt4_7ryrSY+n=vxmHY51EVqPDFsKXmg@mail.gmail.com	2025-06-06 10:19:44 -04:00
Peter Geoghegan	54c6ea8c81	nbtree: Remove useless row compare arg. Use of a RowCompare key makes nbtree index scans ineligible to use pstate.forcenonrequired following recent bugfix commit `5f4d98d4`. There's no longer any need for _bt_check_rowcompare to accept a forcenonrequired argument, so remove it.	2025-06-05 14:50:43 -04:00
Álvaro Herrera	e6f98d8848	Avoid bogus scans of partitions when marking FKs enforced Similar to commit `cc733ed164`: when an unenforced foreign key that references a partitioned table is altered to be enforced, we scan the constrained table based on each partition on the referenced partitioned table. This is bogus and likely to cause the ALTER TABLE to fail: we must only scan the constrained table as pointing to the top-level partitioned table. Oversight in commit `eec0040c4b`. Fix by eliding those scans. Author: Amul Sul <sulamul@gmail.com> Reported-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/CACJufxF1e_gPOLtsDoaE4VCgQPC8KZW_kPAjPR5Rvv4Ew=fb2A@mail.gmail.com	2025-06-05 18:39:06 +02:00
Álvaro Herrera	cc733ed164	Avoid bogus scans of partitions when validating FKs to partitioned tables Validating an unvalidated foreign key that references a partitioned table would try to queue validations for each individual partition of the referenced table, but this is wrong: each individual partition would not necessarily have all the referenced rows, so errors would be raised. Avoid doing that. The pg_constraint rows that cause this to happen are only there to support the action triggers that implement the DELETE/ UPDATE actions of the FK, so no validating scan is necessary. This was an oversight in commit `b663b9436e`. An equivalent oversight exists for NOT ENFORCED constraints, which is not fixed in this commit. Author: Amul Sul <sulamul@gmail.com> Reported-by: Antonin Houska <ah@cybertec.at> Reviewed-by: jian he <jian.universality@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/26983.1748418675@localhost	2025-06-05 17:17:13 +02:00
Michael Paquier	b87163e5f3	Fix copy-pasto with process count calculation in method_io_uring.c This commit replaces the formula used for "TotalProcs" with a call to pgaio_uring_procs() in pgaio_uring_shmem_init() for the shared memory initialization, which is exactly the same, removing a duplication. pgaio_uring_procs() is used for shared memory sizing and a sanity check, and it has some documentation explaining some reasoning behind the formula. Author: Japin Li <japinli@hotmail.com> Discussion: https://postgr.es/m/ME0P300MB044521067A1EDDA9EDEC3793B66DA@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM	2025-06-05 09:39:24 +09:00
Peter Eisentraut	f777d77387	Don't strip $libdir from LOAD command Commit `4f7f7b0375` implemented the extension_control_path GUC, and to make it work it was decided that we should strip the $libdir/ on module_pathname from .control files, so that extensions don't need to worry about this change. This strip logic was implemented on expand_dynamic_library_name() which works fine when executing the SQL functions from extensions, but this function is also called when the LOAD command is executed, and since the user may explicitly pass the $libdir prefix on LOAD parameter, we should not strip in this case. This commit fixes this issue by moving the strip logic from expand_dynamic_library_name() to load_external_function() that is called when running the SQL script from extensions. Reported-by: Evan Si <evsi@amazon.com> Author: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Bug: #18920 Discussion: https://www.postgresql.org/message-id/flat/18920-b350b1c0a30af006%40postgresql.org	2025-06-04 11:38:12 +02:00
Peter Eisentraut	58fbfde152	Fix incorrect format placeholders	2025-06-03 21:38:04 +02:00
Fujii Masao	73bdcfab35	Rename log_lock_failure GUC to log_lock_failures for consistency. This commit renames the GUC log_lock_failure to log_lock_failures to align with the existing similar setting log_lock_waits, which uses the plural form. This improves naming consistency across related GUCs. Suggested-by: Peter Eisentraut <peter@eisentraut.org> Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/7a8198b6-d5b8-4910-b41e-8d3efcbb015d@eisentraut.org	2025-06-03 10:02:55 +09:00
Tom Lane	aa87f69c00	Disallow "=" in names of reloptions and foreign-data options. We store values for these options as array elements with the syntax "name=value", hence a name containing "=" confuses matters when it's time to read the array back in. Since validation of the options is often done (long) after this conversion to array format, that leads to confusing and off-point error messages. We can improve matters by rejecting names containing "=" up-front. (Probably a better design would have involved pairs of array elements, but it's too late now --- and anyway, there's no evident use-case for option names like this. We already reject such names in some other contexts such as GUCs.) Reported-by: Chapman Flack <jcflack@acm.org> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Chapman Flack <jcflack@acm.org> Discussion: https://postgr.es/m/6830EB30.8090904@acm.org Backpatch-through: 13	2025-06-02 15:22:44 -04:00
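The ambiguity described in the entry above can be shown with a short example. This is a minimal Python sketch, not PostgreSQL's code: it assumes options are flattened to "name=value" strings and split at the first "=" when read back. The helper names `flatten_option`, `parse_option`, and `flatten_option_checked` are hypothetical.

```python
def flatten_option(name: str, value: str) -> str:
    """Store an option as a single 'name=value' array element."""
    return f"{name}={value}"


def parse_option(element: str) -> tuple[str, str]:
    """Read an element back, splitting at the first '=' (an assumption)."""
    name, _, value = element.partition("=")
    return name, value


def flatten_option_checked(name: str, value: str) -> str:
    """Reject names containing '=' up front, as the commit describes."""
    if "=" in name:
        raise ValueError(f'invalid option name "{name}": must not contain "="')
    return flatten_option(name, value)


stored = flatten_option("a=b", "c")
round_tripped = parse_option(stored)
print(stored, round_tripped)
```

The name "a=b" with value "c" is stored as "a=b=c" and read back as name "a" with value "b=c", so any later validation complains about the wrong option. Rejecting the name before it is flattened avoids the misleading error.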
Melanie Plageman	31a7e175fd	Correct heap vacuum boundary state setup ordering `052026c9b9` mistakenly reordered setup steps in heap_vacuum_rel(), incorrectly moving RelationGetNumberOfBlocks() before vacuum_get_cutoffs(). OldestXmin must be determined before RelationGetNumberOfBlocks() calculates the number of blocks in the relation that will be vacuumed. Otherwise tuples older than OldestXmin may be inserted into the end of the relation into blocks that are not vacuumed. If additional tuples newer than those inserted into unscanned blocks but older than OldestXmin are inserted into free space earlier in the relation, the result could be advancing pg_class.relfrozenxid to a newer value than an unfrozen XID in one of the unscanned heap pages. Assigning an incorrect relfrozenxid can lead to data loss, so it is imperative that it correctly reflect the oldest unfrozen xid. Reported-by: Peter Geoghegan <pg@bowt.ie> Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzntqvVEdbbpqG5JqSZGuLWmy4PBfUO-OswfivKchr2gvw%40mail.gmail.com	2025-06-02 10:54:07 -04:00
Peter Eisentraut	fc32be3c94	Fix incorrect format placeholders Fixes for return type of dclist_count().	2025-06-02 10:12:58 +02:00
Peter Eisentraut	32edf732e8	Rename gist stratnum support function Commit `7406ab623f` added a gist support function that we internally refer to by the symbol GIST_STRATNUM_PROC. This translated from "well-known" strategy numbers to opfamily-specific strategy numbers. However, we later (commit `630f9a43ce`) changed this to fit into index-AM-level compare type mapping, so this function actually now maps from compare type to opfamily-specific strategy numbers. So this name is no longer fitting. Moreover, the index AM level also supports the opposite, a function to map from strategy number to compare type. This is currently not supported in gist, but one might wonder what this function is supposed to be called when it is added. This patch changes the naming of the gist-level functionality to be more in line with the index-AM-level functionality. This makes sense because these are essentially the same thing on different levels. This also changes the names of the externally visible functions that are provided for use as such a support function. Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Discussion: https://www.postgresql.org/message-id/37ebb1d9-9036-485f-a215-e55435689917%40eisentraut.org	2025-06-02 08:41:27 +02:00
Michael Paquier	5231ed8262	Use replay LSN as target for cascading logical WAL senders A cascading WAL sender doing logical decoding (also known as doing its work on a standby) has been using as flush LSN the value returned by GetStandbyFlushRecPtr() (last position safely flushed to disk). This is incorrect as such processes are only able to decode changes up to the LSN that has been replayed by the startup process. This commit changes cascading logical WAL senders to use the replay LSN, as returned by GetXLogReplayRecPtr(). This distinction is important particularly during shutdown, when WAL senders need to send any remaining available data to their clients, switching WAL senders to a caught-up state. Using the latest flush LSN rather than the replay LSN could cause the WAL senders to be stuck in an infinite loop preventing them from shutting down, as the startup process does not run when WAL senders attempt to catch up, so they could keep waiting for work that would never happen. Backpatch down to v16, where logical decoding on standbys has been introduced. Author: Alexey Makhmutov <a.makhmutov@postgrespro.ru> Reviewed-by: Ajin Cherian <itsajin@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/52138028-7246-421c-9161-4fa108b88070@postgrespro.ru Backpatch-through: 16	2025-06-02 12:03:59 +09:00
Etsuro Fujita	e5a3c9d9b5	postgres_fdw: Inherit the local transaction's access/deferrable modes. Previously, postgres_fdw always 1) opened a remote transaction in READ WRITE mode even when the local transaction was READ ONLY, causing a READ ONLY transaction using it that references a foreign table mapped to a remote view executing a volatile function to write in the remote side, and 2) opened the remote transaction in NOT DEFERRABLE mode even when the local transaction was DEFERRABLE, causing a SERIALIZABLE READ ONLY DEFERRABLE transaction using it to abort due to a serialization failure in the remote side. To avoid these, modify postgres_fdw to open a remote transaction in the same access/deferrable modes as the local transaction. This commit also modifies it to open a remote subtransaction in the same access mode as the local subtransaction. Although these issues exist since the introduction of postgres_fdw, there have been no reports from the field. So it seems fine to just fix them in master only. Author: Etsuro Fujita <etsuro.fujita@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAPmGK16n_hcUUWuOdmeUS%2Bw4Q6dZvTEDHb%3DOP%3D5JBzo-M3QmpQ%40mail.gmail.com	2025-06-01 17:30:00 +09:00
Dean Rasheed	b006bcd531	Fix MERGE into a plain inheritance parent table. When a MERGE's target table is the parent of an inheritance tree, any INSERT actions insert into the parent table using ModifyTableState's rootResultRelInfo. However, there are two bugs in the way it is initialized: 1. ExecInitMerge() incorrectly uses a different ResultRelInfo entry from ModifyTableState's resultRelInfo array to build the insert projection, which may not be compatible with rootResultRelInfo. 2. ExecInitModifyTable() does not fully initialize rootResultRelInfo. Specifically, ri_WithCheckOptions, ri_WithCheckOptionExprs, ri_returningList, and ri_projectReturning are not initialized. This can lead to crashes, or incorrect query results due to failing to check WCO's or process the RETURNING list for INSERT actions. Fix both these bugs in ExecInitMerge(), noting that it is only necessary to fully initialize rootResultRelInfo if the MERGE has INSERT actions and the target table is a plain inheritance parent. Backpatch to v15, where MERGE was introduced. Reported-by: Andres Freund <andres@anarazel.de> Author: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/4rlmjfniiyffp6b3kv4pfy4jw3pciy6mq72rdgnedsnbsx7qe5@j5hlpiwdguvc Backpatch-through: 15	2025-05-31 12:12:58 +01:00
Michael Paquier	e050af2868	Change internal plan ID type from uint64 to int64 uint64 was chosen to be consistent with the type used by the query ID, but the conclusion of a recent discussion for the query ID is that int64 is a better fit as the signed form is shown to the user, for PGSS or EXPLAIN outputs. This commit changes the plan ID to use int64, following `c3eda50b06` that has done the same for the query ID. The plan ID is new to v18, introduced in `2a0cd38da5`. Author: Michael Paquier <michael@paquier.xyz> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/aCvzJNwetyEI3Sgo@paquier.xyz	2025-05-31 09:40:45 +09:00
Nathan Bossart	706054b11b	Ensure we have a snapshot when updating various system catalogs. A few places that access system catalogs don't set up an active snapshot before potentially accessing their TOAST tables. To fix, push an active snapshot just before each section of code that might require accessing one of these TOAST tables, and pop it shortly afterwards. While at it, this commit adds some rather strict assertions in an attempt to prevent such issues in the future. Commit `16bf24e0e4` recently removed pg_replication_origin's TOAST table in order to fix the same problem for that catalog. On the back-branches, those bugs are left in place. We cannot easily remove a catalog's TOAST table on released major versions, and only replication origins with extremely long names are affected. Given the low severity of the issue, fixing older versions doesn't seem worth the trouble of significantly modifying the patch. Also, on v13 and v14, the aforementioned strict assertions have been omitted because commit `2776922201`, which added HaveRegisteredOrActiveSnapshot(), was not back-patched. While we could probably back-patch it now, I've opted against it because it seems unlikely that new TOAST snapshot issues will be introduced in the oldest supported versions. Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/18127-fe54b6a667f29658%40postgresql.org Discussion: https://postgr.es/m/18309-c0bf914950c46692%40postgresql.org Discussion: https://postgr.es/m/ZvMSUPOqUU-VNADN%40nathan Backpatch-through: 13	2025-05-30 15:17:28 -05:00
Tom Lane	d98cefe114	Allow larger packets during GSSAPI authentication exchange. Our GSSAPI code only allows packet sizes up to 16kB. However it emerges that during authentication, larger packets might be needed; various authorities suggest 48kB or 64kB as the maximum packet size. This limitation caused login failure for AD users who belong to many AD groups. To add insult to injury, we gave an unintelligible error message, typically "GSSAPI context establishment error: The routine must be called again to complete its function: Unknown error". As noted in code comments, the 16kB packet limit is effectively a protocol constant once we are doing normal data transmission: the GSSAPI code splits the data stream at those points, and if we change the limit then we will have cross-version compatibility problems due to the receiver's buffer being too small in some combinations. However, during the authentication exchange the packet sizes are not determined by us, but by the underlying GSSAPI library. So we might as well just try to send what the library tells us to. An unpatched recipient will fail on a packet larger than 16kB, but that's not worse than the sender failing without even trying. So this doesn't introduce any meaningful compatibility problem. We still need a buffer size limit, but we can easily make it be 64kB rather than 16kB until transport negotiation is complete. (Larger values were discussed, but don't seem likely to add anything.) Reported-by: Chris Gooch <cgooch@bamfunds.com> Fix-suggested-by: Jacob Champion <jacob.champion@enterprisedb.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Discussion: https://postgr.es/m/DS0PR22MB5971A9C8A3F44BCC6293C4DABE99A@DS0PR22MB5971.namprd22.prod.outlook.com Backpatch-through: 13	2025-05-30 12:55:15 -04:00
Fujii Masao	961553daf5	Make XactLockTableWait() and ConditionalXactLockTableWait() more interruptible. Previously, XactLockTableWait() and ConditionalXactLockTableWait() could enter a non-interruptible loop when they successfully acquired a lock on a transaction but the transaction still appeared to be running. Since this loop continued until the transaction completed, it could result in long, uninterruptible waits. Although this scenario is generally unlikely since XactLockTableWait() and ConditionalXactLockTableWait() can basically acquire a transaction lock only when the transaction is not running, it can occur in a hot standby. In such cases, the transaction may still appear active due to the KnownAssignedXids list, even while no lock on the transaction exists. For example, this situation can happen when creating a logical replication slot on a standby. The cause of the non-interruptible loop was the absence of CHECK_FOR_INTERRUPTS() within it. This commit adds CHECK_FOR_INTERRUPTS() to the loop in both functions, ensuring they can be interrupted safely. Back-patch to all supported branches. Author: Kevin K Biju <kevinkbiju@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAM45KeELdjhS-rGuvN=ZLJ_asvZACucZ9LZWVzH7bGcD12DDwg@mail.gmail.com Backpatch-through: 13	2025-05-31 00:08:40 +09:00
David Rowley	c3eda50b06	Change internal queryid type from uint64 to int64 uint64 was perhaps chosen in `cff440d36` as the type was uint32 prior to that widening work. Having this as uint64 doesn't make much sense and just adds the overhead of having to remember that we always output this in its signed form. Let's remove that overhead. The signed form output is seemingly required since we have no way to represent the full range of uint64 in an SQL type. We use BIGINT in places like pg_stat_statements, which maps directly to int64. The release notes "Source Code" section may want to mention this adjustment as some extensions may wish to adjust their code. Author: David Rowley <dgrowleyml@gmail.com> Suggested-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/50cb0c8b-994b-48f9-a1c4-13039eb3536b@eisentraut.org	2025-05-30 22:59:39 +12:00
Michael Paquier	c3623703f3	Add AioUringCompletion in wait_event_names.txt Oversight in `c325a7633f`, where the LWLock tranche AioUringCompletion has been added. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aDT5sBOxJTdulXnE@paquier.xyz	2025-05-29 13:25:05 +09:00
Tom Lane	e5d64fd654	Tighten parsing of datetime input. ParseFraction only expects to deal with fields that contain a decimal point and digit(s). However it's possible in some edge cases for it to be passed input that doesn't look like that. In particular the input could look like a valid floating-point number, such as ".123e6". strtod() will happily eat that, possibly producing a result that is not within the expected range 0..1, which can result in integer overflow in the callers. That doesn't have any security consequences, but it's still not very desirable. Fix by checking that the input has the expected form. Similarly, DecodeNumberField only expects to deal with fields that contain a decimal point and digit(s), but it's sometimes abused to parse strings that might not look like that. This could result in failure to reject bogus input, yielding silly results. Again, fix by rejecting input that doesn't look as-expected. That decision also means that we can affirmatively answer the very old comment questioning whether we couldn't save some duplicative code by using ParseFractionalSecond here. While these changes should only reject input that nobody would consider valid, it still doesn't seem like a change to make in stable branches. Apply to HEAD only. Reported-by: Evgeniy Gorbanev <gorbanev.es@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1328335.1748371099@sss.pgh.pa.us	2025-05-28 15:10:48 -04:00
Tom Lane	be86ca103a	Fix memory leakage when function compilation fails. In pl_comp.c, initially create the plpgsql function's cache context under the assumed-short-lived caller's context, and reparent it under CacheMemoryContext only upon success. This avoids a process-lifespan leak of 8kB or more if the function contains syntax errors. (This leakage has existed for a long time without many complaints, but as we move towards a possibly multi-threaded future, getting rid of process-lifespan leaks grows more important.) In funccache.c, arrange to reclaim the CachedFunction struct in case the language-specific compile callback function throws an error; previously, that resulted in an independent process-lifespan leak. This is arguably a new bug in v18, since the leakage now occurred for SQL-language functions as well as plpgsql. Also, don't fill fn_xmin/fn_tid/dcallback until after successful completion of the compile callback. This avoids a scenario where a partially-built function cache might appear already valid upon later inspection, and another scenario where dcallback might fail upon being presented with an incomplete cache entry. We would have to reach such a faulty cache entry via a pre-existing fn_extra pointer, so I'm not sure these scenarios correspond to any live bug. (The predecessor code in pl_comp.c never took any care about this, and we've heard no complaints about that.) Still, it's better to be careful. Given the lack of field complaints, I'm not very excited about back-patching any of this; but it seems still in-scope for v18. Discussion: https://postgr.es/m/999171.1748300004@sss.pgh.pa.us	2025-05-28 13:29:45 -04:00
Michael Paquier	d46911e584	Fix conversion of SIMILAR TO regexes for character classes The code that translates SIMILAR TO pattern matching expressions to POSIX-style regular expressions did not consider that square brackets can be nested. For example, in an expression like [[:alpha:]%_], the logic replaced the placeholders '_' and '%' but it should not. This commit fixes the conversion logic by tracking the nesting level of square brackets marking character class areas, while considering that in expressions like []] or [^]] the first closing square bracket is a regular character. Multiple tests are added to show how the conversions should or should not be applied while in a character class area, with specific cases added for all the characters converted outside character classes like an opening parenthesis '(', dollar sign '$', etc. Author: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/16ab039d1af455652bdf4173402ddda145f2c73b.camel@cybertec.at Backpatch-through: 13	2025-05-28 08:58:40 +09:00
Masahiko Sawada	4c08ecd161	Fix assertion when decrementing eager scanning success and failure counters. Previously, we asserted that the eager scan's success and failure counters were positive before decrementing them. However, this assumption was incorrect, as it's possible that some blocks have already been eagerly scanned by the time eager scanning is disabled. This commit replaces the assertions with guards to handle this scenario gracefully. With this change, we continue to allow read-ahead operations by the read stream that exceed the success and failure caps. While there is a possibility that overruns will trigger eager scans of additional pages, this does not pose a practical concern as the overruns will not be substantial and remain within an acceptable range. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAD21AoConf6tkVCv-=JhQJj56kYsDwo4jG5+WqgT+ukSkYomSQ@mail.gmail.com	2025-05-27 11:42:36 -07:00
Peter Eisentraut	c53f3b9cc8	Improve file_copy_method entry in postgresql.conf.sample Improve the wording of the comment a bit, fix whitespace. Also move the entry so that the section order is consistent with config.sgml.	2025-05-26 14:52:00 +02:00
Daniel Gustafsson	1f62dbf5f0	doc: Fix wording in JIT README Remove superfluous 'is' from sentence. Author: Yugo Nagata <nagata@sraoss.co.jp> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/20250526154412.5f77dfead87af9afc089cc48@sraoss.co.jp	2025-05-26 13:30:01 +02:00
Tom Lane	02502c1bca	Fix per-relation memory leakage in autovacuum. PgStat_StatTabEntry and AutoVacOpts structs were leaked until the end of the autovacuum worker's run, which is bad news if there are a lot of relations in the database. Note: pfree'ing the PgStat_StatTabEntry structs here seems a bit risky, because pgstat_fetch_stat_tabentry_ext does not guarantee anything about whether its result is long-lived. It appears okay so long as autovacuum forces PGSTAT_FETCH_CONSISTENCY_NONE, but I think that API could use a re-think. Also ensure that the VacuumRelation structure passed to vacuum() is in recoverable storage. Back-patch to v15 where we started to manage table statistics this way. (The AutoVacOpts leakage is probably older, but I'm not excited enough to worry about just that part.) Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us Backpatch-through: 15	2025-05-23 14:43:43 -04:00
Tom Lane	6aa33afe6d	Fix AlignedAllocRealloc to cope sanely with OOM. If the inner allocation call returns NULL, we should restore the previous state and return NULL. Previously this code pfree'd the old chunk anyway, which is surely wrong. Also, make it call MemoryContextAllocationFailure rather than summarily returning NULL. The fact that we got control back from the inner call proves that MCXT_ALLOC_NO_OOM was passed, so this change is just cosmetic, but someday it might be less so. This is just a latent bug at present: AFAICT no in-core callers use this function at all, let alone call it with MCXT_ALLOC_NO_OOM. Still, it's the kind of bug that might bite back-patched code pretty hard someday, so let's back-patch to v17 where the bug was introduced (by commit `743112a2e`). Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us Backpatch-through: 17	2025-05-23 11:47:33 -04:00
Daniel Gustafsson	fb844b9f06	Revert function to get memory context stats for processes Due to concerns raised about the approach, and memory leaks found in sensitive contexts the functionality is reverted. This reverts commits `45e7e8ca9`, `f8c115a6c`, `d2a1ed172`, `55ef7abf8` and `042a66291` for v18 with an intent to revisit this patch for v19. Discussion: https://postgr.es/m/594293.1747708165@sss.pgh.pa.us	2025-05-23 15:44:54 +02:00
Peter Eisentraut	70a13c528b	Move oauth_validator_libraries in postgresql.conf.sample Move oauth_validator_libraries in postgresql.conf.sample to be grouped with the other CONN_AUTH_AUTH settings, rather than making up a new ad-hoc category. This matches the internal categorization and also how it is listed in the documentation.	2025-05-23 09:03:09 +02:00
Melanie Plageman	cb1456423d	Replace deprecated log_connections values in docs and tests `9219093cab` modularized log_connections output to allow more granular control over which aspects of connection establishment are logged. It converted the boolean log_connections GUC into a list of strings and deprecated previously supported boolean-like values on, off, true, false, 1, 0, yes, and no. Those values still work, but they are supported mainly for backwards compatibility. As such, documented examples of log_connections should not use these deprecated values. Update references in the docs to deprecated log_connections values. Many of the tests use log_connections. This commit also updates the tests to use the new values of log_connections. In some of the tests, the updated log_connections value covers a narrower set of aspects (e.g. the 'authentication' aspect in the tests in src/test/authentication and the 'receipt' aspect in src/test/postmaster). In other cases, the new value for log_connections is a superset of the previous included aspects (e.g. 'all' in src/test/kerberos/t/001_auth.pl). Reported-by: Peter Eisentraut <peter@eisentraut.org> Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Discussion: https://postgr.es/m/e1586594-3b69-4aea-87ce-73a7488cdc97%40eisentraut.org	2025-05-22 17:14:54 -04:00
Tom Lane	d376ab570e	In ExecInitModifyTable, don't scribble on the source plan. The code carelessly modified mtstate->ps.plan->targetlist, which it's not supposed to do. Fortunately, there's not really any need to do that because the planner already set up a perfectly acceptable targetlist for the plan node. We just need to remove the erroneous assignments and update some relevant comments. As it happens, the erroneous assignments caused the targetlist to point to a different part of the source plan tree, so that there isn't really a risk of the pointer becoming dangling after executor termination. The only visible effect of this change we can find is that EXPLAIN will show upper references to the ModifyTable's output expressions using different variables. Formerly it showed Vars from the first target relation that survived executor-startup pruning. Now it always shows such references using the first relation appearing in the planner output, independently of what happens during executor pruning. On the whole that seems like a good thing. Also make a small tweak in ExplainPreScanNode to ensure that the first relation will receive a refname assignment in set_rtable_names, even if it got pruned at startup. Previously the Vars might be shown without any table qualification, which is confusing in a multi-table query. I considered back-patching this, but since the bug doesn't seem to have any really terrible consequences in existing branches, it seems better to not change their EXPLAIN output. It's not too late for v18 though, especially since v18 already made other changes in the EXPLAIN output for these cases. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Author: Andres Freund <andres@anarazel.de> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/213261.1747611093@sss.pgh.pa.us	2025-05-22 14:28:51 -04:00
Tom Lane	f24605e2dc	Fix memory leak in XMLSERIALIZE(... INDENT). xmltotext_with_options sometimes tries to replace the existing root node of a libxml2 document. In that case xmlDocSetRootElement will unlink and return the old root node; if we fail to free it, it's leaked for the remainder of the session. The amount of memory at stake is not large, a couple hundred bytes per occurrence, but that could still become annoying in heavy usage. Our only other xmlDocSetRootElement call is not at risk because it's working on a just-created document, but let's modify that code too to make it clear that it's dependent on that. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Discussion: https://postgr.es/m/1358967.1747858817@sss.pgh.pa.us Backpatch-through: 16	2025-05-22 13:52:46 -04:00
Amit Langote	1722d5eb05	Revert "Don't lock partitions pruned by initial pruning" As pointed out by Tom Lane, the patch introduced fragile and invasive design around plan invalidation handling when locking of prunable partitions was deferred from plancache.c to the executor. In particular, it violated assumptions about CachedPlan immutability and altered executor APIs in ways that are difficult to justify given the added complexity and overhead. This also removes the firstResultRels field added to PlannedStmt in commit `28317de72`, which was intended to support deferred locking of certain ModifyTable result relations. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/605328.1747710381@sss.pgh.pa.us	2025-05-22 17:02:35 +09:00
Michael Paquier	3d0c3a418f	Adjust operation names of pg_aios to match the documentation pg_aios used the terms "read" and "write" for vectored I/O read and write operations, respectively. The documentation refers to them as "readv" and "writev", and the code uses internally the terms PGAIO_OP_READV and PGAIO_OP_WRITEV for them, as of "vectored". This commit adjusts these operation names to match with the code and the documentation. Oversight in `8e293e689b`. Author: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Discussion: https://postgr.es/m/6df1e949d1d759ad2767c18e5845963e@oss.nttdata.com	2025-05-21 15:58:03 +09:00
Fujii Masao	0bd762e81f	Fix incorrect WAL description for PREPARE TRANSACTION record. Since commit `8b1dccd37c`, the PREPARE TRANSACTION WAL record includes information about dropped statistics entries. However, the WAL resource manager description function for PREPARE TRANSACTION record failed to parse this information correctly and always assumed there were no such entries. As a result, for example, pg_waldump could not display the dropped statistics entries stored in PREPARE TRANSACTION records. The root cause was that ParsePrepareRecord() did not set the number of statistics entries to drop on commit or abort. These values remained zero-initialized and were never updated from the parsed record. This commit fixes the issue by properly setting those values during parsing. With this fix, pg_waldump can now correctly report dropped statistics entries in PREPARE TRANSACTION records. Back-patch to v15, where commit `8b1dccd37c` was introduced. Author: Daniil Davydov <3danissimo@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAJDiXgh-6Epb2XiJe4uL0zF-cf0_s_7Lw1TfEHDMLzYjEmfGOw@mail.gmail.com Backpatch-through: 15	2025-05-21 11:55:14 +09:00
Michael Paquier	06450c7b8c	Fix regression with location calculation of nested statements The statement location calculated for some nested query cases was wrong when multiple queries are sent as a single string, these being separated by semicolons. As pointed out by Sami Imseih, the location calculation was incorrect when the last query of a nested statement with multiple queries does NOT finish with a semicolon. In this case, the statement length tracked by RawStmt is 0, which is equivalent to saying that the string should be used until its end. The code previously discarded this case entirely, causing the location to remain at 0, the same as pointing at the beginning of the string. This caused pg_stat_statements to store incorrect query strings. This issue has been introduced in `499edb0974`. I have looked at the diffs generated by pgaudit back then, and noticed the difference generated for this nested query case, but I have missed the point that it was an actual regression with an existing case. A test case is added in pg_stat_statements to provide some coverage, restoring the pre-17 behavior for the calculation of the query locations. Special thanks to David Steele, who, through an analysis of the test diffs generated by pgaudit with the new v18 logic, has poked me about the fact that my original analysis of the matter was wrong. The test output of pg_overexplain is updated to reflect the new logic, as the new locations refer to the beginning of the argument passed to the function explain_filter(). When the module was introduced in `8d5ceb113e`, which was after `499edb0974` (for the new calculation method), the locations of the test were not actually right: the plan generated for the query string given in input of the function pointed to the top-level query, not the nested one. Reported-by: David Steele <david@pgbackrest.org> Author: Michael Paquier <michael@paquier.xyz> Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: David Steele <david@pgbackrest.org> Discussion: https://postgr.es/m/844a3b38-bbf1-4fb2-9fd6-f58c35c09917@pgbackrest.org	2025-05-21 10:22:12 +09:00
Andres Freund	acad909321	aio: Fix possible state confusions due to interrupt processing elog()/ereport() process interrupts, iff the log message is < ERROR and the log message will be emitted. aio's debug messages are emitted via ereport(), but in some places the code is not ready for interrupts to be processed. Fix the issue using a few different methods: 1) handle interrupts arriving concurrently - in some places it's easy to detect that by fetching the handle's generation a bit earlier 2) Check if interrupts made the work needing to be done obsolete 3) Disallow interrupts, as there's no sane way to make interrupt processing safe To prevent some similar issues from being re-introduced, assert that interrupts are held in pgaio_io_update_state(). This commit also fixes the contents of a debug message I added in `039bfc457e`. Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/mvpm7ga3dfgz7bvum22hmuz26cariylmcppb3irayftc7bwk3r@l7gb6gr7azhc	2025-05-19 21:07:06 -04:00
Heikki Linnakangas	29f7ce6fe7	Fix deparsing FETCH FIRST <expr> ROWS WITH TIES In the grammar, <expr> is a c_expr, which accepts only a limited set of integer literals and simple expressions without parens. The deparsing logic didn't quite match the grammar rule, and failed to use parens e.g. for "5::bigint". To fix, always surround the expression with parens. Would be nice to omit the parens in simple cases, but unfortunately it's non-trivial to detect such simple cases. Even if the expression is a simple literal 123 in the original query, after parse analysis it becomes a FuncExpr with COERCE_IMPLICIT_CAST rather than a simple Const. Reported-by: yonghao lee Backpatch-through: 13 Discussion: https://www.postgresql.org/message-id/18929-077d6b7093b176e2@postgresql.org	2025-05-19 18:50:26 +03:00
Amit Kapila	ad5eaf390c	Don't retreat slot's confirmed_flush LSN. Prevent moving the confirmed_flush backwards, as this could lead to data duplication issues caused by replicating already replicated changes. This can happen when a client acknowledges an LSN it doesn't have to do anything for, and thus didn't store persistently. After a restart, the client can send the prior LSN that it stored persistently as an acknowledgement, but we need to ignore such an LSN to avoid retreating the confirmed_flush LSN. Diagnosed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Author: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Tested-by: Nisha Moond <nisha.moond412@gmail.com> Backpatch-through: 13 Discussion: https://postgr.es/m/CAJpy0uDZ29P=BYB1JDWMCh-6wXaNqMwG1u1mB4=10Ly0x7HhwQ@mail.gmail.com Discussion: https://postgr.es/m/OS0PR01MB57164AB5716AF2E477D53F6F9489A@OS0PR01MB5716.jpnprd01.prod.outlook.com	2025-05-19 12:13:06 +05:30
Alexander Korotkov	3d3a81fc24	Fix tuple_fraction calculation in generate_orderedappend_paths() `6b94e7a6da` adjusted generate_orderedappend_paths() to consider fractional paths. However, it didn't manage to interpret the tuple_fraction value correctly. According to the header comment of grouping_planner(), the tuple_fraction >= 1 specifies the absolute number of expected tuples. That number must be divided by the expected total number of tuples to get the actual fraction. Even though this is a bug fix, we don't backpatch it. The risks of the side effects of plan changes on stable branches are too high. Reported-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/3ca271fa-ca5c-458c-8934-eb148622b270%40gmail.com Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>	2025-05-18 23:49:50 +03:00
Daniel Gustafsson	0d4dad200d	Fix function name reference in comment Ensure that we refer to the function being used, rather than the name of the resulting function in question. Author: Paul A Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CA+renyVZNiHEv5ceKDjA4j5xC6NT6mRuW33BDERBQMi_90_t6A@mail.gmail.com	2025-05-18 10:05:38 +02:00
Richard Guo	fe29b2a1da	Fix Assert failure in XMLTABLE parser In an XMLTABLE expression, columns can be marked NOT NULL, and the parser internally fabricates an option named "is_not_null" to represent this. However, the parser also allows users to specify arbitrary option names. This creates a conflict: a user can explicitly use "is_not_null" as an option name and assign it a non-Boolean value, which violates internal assumptions and triggers an assertion failure. To fix, this patch checks whether a user-supplied name collides with the internally reserved option name and raises an error if so. Additionally, the internal name is renamed to "__pg__is_not_null" to further reduce the risk of collision with user-defined names. Reported-by: Евгений Горбанев <gorbanyoves@basealt.ru> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/6bac9886-65bf-4cec-96bd-e304159f28db@basealt.ru Backpatch-through: 15	2025-05-15 17:09:04 +09:00
Richard Guo	2c0ed86d39	Add explicit initialization for all PlannerGlobal fields When creating a new PlannerGlobal node in standard_planner(), most fields are explicitly initialized, but a few are not. This doesn't cause any functional issues, as makeNode() zeroes all fields by default. However, the inconsistency is undesirable from a clarity and maintenance perspective. This patch explicitly initializes the remaining fields to improve consistency and readability. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-TgQHNOiouqGcuHoBqbJjWyx4UxGKxUY3FrF4trGbcPA@mail.gmail.com	2025-05-14 09:59:31 +09:00
Álvaro Herrera	0588656366	Fix comment of tsquerysend() The comment describes the order in which fields are sent, and it had one of the fields in the wrong place. This has been wrong since `e6dbcb72fa` (2008), so backpatch all the way back. Author: Emre Hasegeli <emre@hasegeli.com> Discussion: https://postgr.es/m/CAE2gYzzf38bR_R=izhpMxAmqHXKeM5ajkmukh4mNs_oXfxcMCA@mail.gmail.com	2025-05-11 09:47:10 -04:00
Álvaro Herrera	dc9a2d54fd	relcache: Avoid memory leak on tables with no CHECK constraints As complained about by Valgrind, in commit `a379061a22` I failed to realize that I was causing rd_att->constr->check to become allocated when no CHECK constraints exist; previously it'd remain NULL. (This was my bug, not the mentioned commit author's). Fix by making the allocation conditional, and set ->check to NULL if unallocated. Reported-by: Yasir <yasir.hussain.shah@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/202505082025.57ijx3qrbx7u@alvherre.pgsql	2025-05-11 09:22:12 -04:00
Álvaro Herrera	7b2ad43426	Sort includes in alphabetical order Added by commit `042a66291b`, no backpatch needed.	2025-05-11 09:15:05 -04:00
Tom Lane	d4a7e4e179	Fix incorrect "return NULL" in BumpAllocLarge(). This must be "return MemoryContextAllocationFailure(context, size, flags)" instead. The effect of this oversight is that if we got a malloc failure right here, the code would act as though MCXT_ALLOC_NO_OOM had been specified, whether it was or not. That would likely lead to a null-pointer-dereference crash at the unsuspecting call site. Noted while messing with a patch to improve our Valgrind leak detection support. Back-patch to v17 where this code came in.	2025-05-10 20:22:39 -04:00
Noah Misch	4a4ee0c2c1	Remove GLOBALTABLESPACE_OID assert for locked buffers. Commit `f4ece891fc` added the assertion in an attempt to catch some defects even after VACUUM FULL or REINDEX. However, IsCatalogTextUniqueIndexOid(tag.relNumber) always returns false after a relfilenode change, provoking unintended assertion failures. Reported-by: Adam Guo <adamguo@amazon.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Bug: #18912 Discussion: https://postgr.es/m/18912-a41c9bd0e0ad19b1@postgresql.org	2025-05-10 07:36:27 -07:00
Michael Paquier	c259ba881c	aio: Use runtime arguments with injection points in tests This cleans up the code related to the testing infrastructure of AIO that used injection points, switching the test code to use the new facility for injection points added by `371f2db8b0` rather than tweaks to pass and reset arguments to the callbacks run. This removes all the dependencies on USE_INJECTION_POINTS in the AIO code. pgaio_io_call_inj(), pgaio_inj_io_get() and pgaio_inj_cur_handle are now gone. Reviewed-by: Greg Burd <greg@burd.me> Discussion: https://postgr.es/m/Z_y9TtnXubvYAApS@paquier.xyz	2025-05-10 12:36:57 +09:00
Michael Paquier	371f2db8b0	Add support for runtime arguments in injection points The macros INJECTION_POINT() and INJECTION_POINT_CACHED() are extended with an optional argument that can be passed down to the callback attached when an injection point is run, giving callbacks the possibility to manipulate a stack state given by the caller. The existing callbacks in modules injection_points and test_aio have their declarations adjusted based on that. `da7226993f` (core AIO infrastructure) and `93bc3d75d8` (test_aio) have been relying on a set of workarounds where a static variable called pgaio_inj_cur_handle is used as runtime argument in the injection point callbacks used by the AIO tests, in combination with a TRY/CATCH block to reset the argument value. The infrastructure introduced in this commit will be reused for the AIO tests, simplifying them. Reviewed-by: Greg Burd <greg@burd.me> Discussion: https://postgr.es/m/Z_y9TtnXubvYAApS@paquier.xyz	2025-05-10 06:56:26 +09:00
Heikki Linnakangas	b28c59a6cd	Use 'void *' for arbitrary buffers, 'uint8 *' for byte arrays A 'void *' argument suggests that the caller might pass an arbitrary struct, which is appropriate for functions like libc's read/write, or pq_sendbytes(). 'uint8 *' is more appropriate for byte arrays that have no structure, like the cancellation keys or SCRAM tokens. Some places used 'char *', but 'uint8 *' is better because 'char *' is commonly used for null-terminated strings. Change code around SCRAM, MD5 authentication, and cancellation key handling to follow these conventions. Discussion: https://www.postgresql.org/message-id/61be9e31-7b7d-49d5-bc11-721800d89d64@eisentraut.org	2025-05-08 22:01:25 +03:00
Richard Guo	c06e909c26	Track the number of presorted outer pathkeys in MergePath When creating an explicit Sort node for the outer path of a mergejoin, we need to determine the number of presorted keys of the outer path to decide whether explicit incremental sort can be applied. Currently, this is done by repeatedly calling pathkeys_count_contained_in. This patch caches the number of presorted outer pathkeys in MergePath, allowing us to save several calls to pathkeys_count_contained_in. It can be considered a complement to the changes in commit `828e94c9d`. Reported-by: David Rowley <dgrowleyml@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/CAApHDvqvBireB_w6x8BN5txdvBEHxVgZBt=rUnpf5ww5P_E_ww@mail.gmail.com	2025-05-08 18:21:32 +09:00
Richard Guo	773db22269	Suppress unnecessary explicit sorting for EPQ mergejoin path When building a ForeignPath for a joinrel, if there's a possibility that EvalPlanQual will be executed, we must identify a suitable path for EPQ checks. If the outer or inner path of the chosen path is a ForeignPath representing a pushed-down join, we replace it with its fdw_outerpath to ensure that the EPQ check path consists entirely of local joins. If the chosen path is a MergePath, and its outer or inner path is a ForeignPath that is not already well enough ordered, the MergePath will have non-NIL outersortkeys or innersortkeys indicating the desired ordering to be created by an explicit Sort node. If we then replace the outer or inner path with its corresponding fdw_outerpath, and that path is already sufficiently ordered, we end up in an inconsistent state: the MergePath has non-NIL outersortkeys or innersortkeys, and its input path is already properly ordered. This inconsistency can result in an Assert failure or the addition of a redundant Sort node. To fix, check if the new outer or inner path of a MergePath is already properly sorted, and set its outersortkeys or innersortkeys to NIL if so. Bug: #18902 Reported-by: Nikita Kalinin <n.kalinin@postgrespro.ru> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/18902-71c1bed2b9f7c46f@postgresql.org	2025-05-08 18:20:18 +09:00
Nathan Bossart	16bf24e0e4	Remove pg_replication_origin's TOAST table. A few places that access this catalog don't set up an active snapshot before potentially accessing its TOAST table. However, roname (the replication origin name) is the only varlena column, so this is only a problem if the name requires out-of-line storage. This commit removes its TOAST table to avoid needing to set up a snapshot. It also places a limit on replication origin names so that attempts to set long names will fail with a more user-friendly error. The chosen limit of 512 bytes should be sufficient to avoid "row is too big" errors independent of BLCKSZ, but it should also be lenient enough for all reasonable use-cases. Bumps catversion. Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/ZvMSUPOqUU-VNADN%40nathan	2025-05-07 14:47:36 -05:00
Peter Geoghegan	5f4d98d4f3	Prevent premature nbtree array advancement. nbtree array index scans could fail to return matching tuples in rare cases where the missed tuples cover key space that the scan's arrays incorrectly indicate has already been read. These cases involved nearby tuples with NULL values that were evaluated using a skip array key while in pstate.forcenonrequired mode. To fix, prevent forcenonrequired mode from prematurely advancing the scan's array keys beyond key space that the scan has yet to read tuples from: reset the scan's array keys (to the first elements in the current scan direction) before the _bt_checkkeys call for pstate.finaltup. That way _bt_checkkeys starts from a clean slate, which ensures that it will call _bt_advance_array_keys (while passing it sktrig_required=true). This reliably restores the invariant that the scan's arrays always accurately track its progress through the index's key space (at least when the scan is "between pages"). Oversight in commit `8a510275`, which optimized nbtree search scan key comparisons. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://postgr.es/m/CAH2-WzmodSE+gpTd1CRGU9ez8ytyyDS+Kns2r9NzgUp1s56kpw@mail.gmail.com	2025-05-07 15:20:42 -04:00
Peter Geoghegan	7e25c9363a	nbtree: tighten up array recheck rules. Be more conservative when performing a scheduled recheck of an nbtree scan's array keys once on the next page, having set so->scanBehind: back out of reading the page (perform another primitive scan instead) when the next page's high key/finaltup has an untruncated prefix of matching values and truncated suffix attributes associated with lower-order keys. In other words, stop assuming that the lower-order keys have been satisfied by the truncated suffix attributes in this context (only do so when considering scheduling a recheck within _bt_advance_array_keys). The new behavior is more logical: if the next page read after setting so->scanBehind can only contain tuples that are themselves "behind the scan", that's reason enough to cut our losses. In general, when we set so->scanBehind, we only expect to perform one recheck on the next page to make a final decision about whether or not to continue the current primitive index scan. It seems unprincipled for the recheck to allow a _bt_readpage to continue unless the scan's arrays will advance/unless the page might actually contain relevant tuples. In practice it is highly unlikely that things will line up like this (the untruncated prefix of attribute values from the next page's high key is seldom an exact match for their corresponding array's current element following array advancement on the original/previous page). That gives us all the more reason to keep things simple and consistent. This was arguably an oversight in commit `9a2e2a285a`, which improved nbtree array primitive scan scheduling. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzkXzJajgyW-pCQ7vaDPhaT3huU+Zw_j448rpCBEsu2YOQ@mail.gmail.com	2025-05-07 15:17:40 -04:00
Alexander Korotkov	ab42d643c1	Refactor ChangeVarNodesExtended() using the custom callback `fc069a3a63` implemented Self-Join Elimination (SJE) and put related logic to ChangeVarNodes_walker(). This commit provides refactoring to remove the SJE-related logic from ChangeVarNodes_walker() but adds a custom callback to ChangeVarNodesExtended(), which has a chance to process a node before ChangeVarNodes_walker(). Passing this callback to ChangeVarNodesExtended() allows SJE-related node handling to be kept within the analyzejoins.c. Reported-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs49PE3CvnV8vrQ0Dr%3DHqgZZmX0tdNbzVNJxqc8yg-8kDQQ%40mail.gmail.com Author: Andrei Lepikhov <lepihov@gmail.com> Author: Alexander Korotkov <aekorotkov@gmail.com>	2025-05-07 11:10:16 +03:00
Michael Paquier	c4c236ab5c	Fix some comments related to IO workers IO workers are treated as auxiliary processes. The comments fixed in this commit stated that there could be only one auxiliary process of each BackendType at the same time. This is not true for IO workers, as up to MAX_IO_WORKERS of them can co-exist at the same time. Author: Cédric Villemain <Cedric.Villemain@data-bene.io> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/e4a3ac45-abce-4b58-a043-b4a31cd11113@Data-Bene.io	2025-05-07 14:55:57 +09:00
Noah Misch	627acc3caa	With GB18030, prevent SIGSEGV from reading past end of allocation. With GB18030 as source encoding, applications could crash the server via SQL functions convert() or convert_from(). Applications themselves could crash after passing unterminated GB18030 input to libpq functions PQescapeLiteral(), PQescapeIdentifier(), PQescapeStringConn(), or PQescapeString(). Extension code could crash by passing unterminated GB18030 input to jsonapi.h functions. All those functions have been intended to handle untrusted, unterminated input safely. A crash required allocating the input such that the last byte of the allocation was the last byte of a virtual memory page. Some malloc() implementations take measures against that, making the SIGSEGV hard to reach. Back-patch to v13 (all supported versions). Author: Noah Misch <noah@leadboat.com> Author: Andres Freund <andres@anarazel.de> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Backpatch-through: 13 Security: CVE-2025-4207	2025-05-05 04:52:04 -07:00
Peter Eisentraut	18c4fff640	Translation updates Source-Git-URL: https://git.postgresql.org/git/pgtranslation/messages.git Source-Git-Hash: f90ee4803c30491e5c49996b973b8a30de47bfb2	2025-05-05 12:04:49 +02:00
Alexander Korotkov	2782f3b845	Revert "Refactor ChangeVarNodesExtended() using the custom callback" This reverts commit `250a718aad`. It shouldn't be pushed during the release freeze. Reported-by: Tom Lane Discussion: https://postgr.es/m/E1uBIbY-000owH-0O%40gemulon.postgresql.org	2025-05-03 22:42:05 +03:00
Alexander Korotkov	250a718aad	Refactor ChangeVarNodesExtended() using the custom callback `fc069a3a63` implemented Self-Join Elimination (SJE) and put related logic to ChangeVarNodes_walker(). This commit provides refactoring to remove the SJE-related logic from ChangeVarNodes_walker() but adds a custom callback to ChangeVarNodesExtended(), which has a chance to process a node before ChangeVarNodes_walker(). Passing this callback to ChangeVarNodesExtended() allows SJE-related node handling to be kept within the analyzejoins.c. Reported-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs49PE3CvnV8vrQ0Dr%3DHqgZZmX0tdNbzVNJxqc8yg-8kDQQ%40mail.gmail.com Author: Andrei Lepikhov <lepihov@gmail.com> Author: Alexander Korotkov <aekorotkov@gmail.com>	2025-05-03 22:30:52 +03:00
Etsuro Fujita	5201bba266	Fix memory allocation/copy mistakes. The previous code was allocating more memory and copying more data than necessary because it specified the wrong PgStat_KindInfo member as the size argument for MemoryContextAlloc and memcpy, respectively. Although these issues exist since `5891c7a8e`, there have been no reports from the field. So for now, it seems sufficient to fix them in master. Author: Etsuro Fujita <etsuro.fujita@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Gurjeet Singh <gurjeet@singh.im> Discussion: https://postgr.es/m/CAPmGK15eTRCZTnfgQ4EuBNo%3DQLYGFEbXS_7m2dXqtkcT7L8qrQ%40mail.gmail.com	2025-05-03 20:00:00 +09:00
Etsuro Fujita	6e91b9c16f	Fix typos in comments. Also adjust the phrasing in the comments. Author: Etsuro Fujita <etsuro.fujita@gmail.com> Author: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Gurjeet Singh <gurjeet@singh.im> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAPmGK17%3DPHSDZ%2B0G6jcj12buyyE1bQQc3sbp1Wxri7tODT-SDw%40mail.gmail.com Backpatch-through: 15	2025-05-03 19:10:00 +09:00
Peter Geoghegan	0f08df4068	Avoid treating nonrequired nbtree keys as required. Consistently prevent nbtree array advancement from treating a scankey as required when operating in pstate.forcenonrequired mode. Otherwise, we risk a NULL pointer dereference. This was possible in the path where _bt_check_compare is called to recheck a tuple that advanced all of the scan's arrays to matching values: its continuescan=false handling expects _bt_advance_array_keys to have been called with a valid pstate, but it'll always be NULL during sktrig_required=false calls (which is how _bt_advance_array_keys must be called when pstate.forcenonrequired). Oversight in commit `8a510275`, which optimized nbtree search scan key comparisons. Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://postgr.es/m/CAHgHdKsn2W=gPBmj7p6MjQFvxB+zZDBkwTSg0o3f5Hh8rkRrsA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzmodSE+gpTd1CRGU9ez8ytyyDS+Kns2r9NzgUp1s56kpw@mail.gmail.com	2025-05-02 17:50:58 -04:00
Tomas Vondra	1681a70df3	Fix memory leak in _gin_parallel_merge To insert the merged GIN entries in _gin_parallel_merge, the leader calls ginEntryInsert(). This may allocate memory, e.g. for a new leaf tuple. This was allocated in the PortalContext, and kept until the end of the index build. For most GIN indexes the amount of leaked memory is negligible, but for custom opclasses with large keys it may cause OOMs. Fixed by calling ginEntryInsert() in a temporary memory context, reset after each insert. Other ginEntryInsert() callers do this too, except that the context is reset after batches of inserts. More frequent resets don't seem to hurt performance, it may even help it a bit. Report and fix by Vinod Sridharan. Author: Vinod Sridharan <vsridh90@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAFMdLD4p0VBd8JG=Nbi=BKv6rzFAiGJ_sXSFrw-2tNmNZFO5Kg@mail.gmail.com	2025-05-02 23:05:18 +02:00
Tom Lane	e83a8ae447	Don't use a tuplestore if we don't have to for SQL-language functions. We only need a tuplestore if we're actually going to accumulate multiple result tuples. Obviously then we don't need one for non-set- returning functions; but even a SRF doesn't need one if we decide to use "lazyEval" (one row at a time) mode. In these cases, it's sufficient to use the junkfilter's result slot to hold the single row that's due to be returned. We just need to "materialize" that slot to ensure it holds onto the data past shutdown of the sub-executor. The original intent of this patch was partially to save a few cycles (by not putting tuples into a tuplestore only to pull them back out immediately), but mostly to ensure that we don't use a tuplestore in non-set-returning functions. That's because I had concerns about whether a tuplestore is safe to keep across queries, which was possible for functions invoked via long-lived FmgrInfos such as those kept in the typcache. There are no cases where SRFs are called that way, so getting rid of the tuplestore in non-SRFs should make things safer. However, it emerges that running fmgr_sql in a short-lived context (as `595d1efed` made it do) makes the existing coding unsafe anyway: we can end up with a long-lived TupleTableSlot holding a freeable reference to a short-lived tuple, resulting in a double-free crash. Not trying to pull tuples out of the tuplestore using that slot dodges the problem, so I'm going to commit this now rather than invent a band-aid solution for v18. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/2443532.1744919968@sss.pgh.pa.us Discussion: https://postgr.es/m/9f975803-1a1c-4f21-b987-f572e110e860@gmail.com	2025-05-02 16:16:20 -04:00
Álvaro Herrera	c83a38758d	Handle self-referencing FKs correctly in partitioned tables For self-referencing foreign keys in partitioned tables, we weren't handling creation of pg_constraint rows during CREATE TABLE ... PARTITION OF as well as ALTER TABLE ATTACH PARTITION. This is an old bug -- mostly, we broke this in `614a406b4f` while trying to fix it (so 12.13, 13.9, 14.6 and 15.0 and up all behave incorrectly). This commit reverts part of that with additional fixes for full correctness, and installs more tests to verify the parts we broke, not just the catalog contents but also the user-visible behavior. Backpatch to all live branches. In branches 13 and 14, commit `46a8c27a72` changed the behavior during DETACH to drop a FK constraint rather than trying to repair it, because the complete fix of repairing catalog constraints was problematic due to lack of previous fixes. For this reason, the test behavior in those branches is a bit different. However, as best as I can tell, the fix works correctly there. In release notes we have to recommend that all self-referencing foreign keys on partitioned tables be recreated if partitions have been created or attached after the FK was created, keeping in mind that violating rows might already be present on the referencing side. Reported-by: Guillaume Lelarge <guillaume@lelarge.info> Reported-by: Matthew Gabeler-Lee <fastcat@gmail.com> Reported-by: Luca Vallisa <luca.vallisa@gmail.com> Discussion: https://postgr.es/m/CAECtzeWHCA+6tTcm2Oh2+g7fURUJpLZb-=pRXgeWJ-Pi+VU=_w@mail.gmail.com Discussion: https://postgr.es/m/18156-a44bc7096f0683e6@postgresql.org Discussion: https://postgr.es/m/CAAT=myvsiF-Attja5DcWoUWh21R12R-sfXECY2-3ynt8kaOqjw@mail.gmail.com	2025-05-02 21:25:50 +02:00
Peter Eisentraut	81eaaa2c41	Make "directory" setting work with extension_control_path The extension_control_path setting (commit `4f7f7b0375`) did not support extensions that set a custom "directory" setting in their control file. Very few extensions use that and during the discussion on the previous commit it was suggested to maybe remove that functionality. But a fix was easier than initially thought, so this just adds that support. The fix is to use the control->control_dir as a share dir to return the path of the extension script files. To make this work more sensibly overall, the directory suffix "extension" is no longer to be included in the extension_control_path value. To quote the patch, it would be -extension_control_path = '/usr/local/share/postgresql/extension:/home/my_project/share/extension:$system' +extension_control_path = '/usr/local/share/postgresql:/home/my_project/share:$system' During the initial patch, there was some discussion on which of these two approaches would be better, and the committed patch was a 50/50 decision. But the support for the "directory" setting pushed it the other way, and also it seems like many people didn't like the previous behavior much. Author: Matheus Alcantara <mths.dev@pm.me> Reviewed-by: Christoph Berg <myon@debian.org> Reviewed-by: David E. Wheeler <david@justatheory.com> Discussion: https://www.postgresql.org/message-id/flat/aAi1VACxhjMhjFnb%40msg.df7cb.de#0cdf7b7d727cc593b029650daa3c4fbc	2025-05-02 16:35:48 +02:00
Peter Geoghegan	9d924dbb37	Adjust overstrong nbtree skip array assertion. Make an nbtree array preprocessing assertion account for scans that add fewer skip arrays than initially expected due to preprocessing finding an unsatisfiable array qual. Oversight in commit `92fe23d9`. Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://postgr.es/m/CAHgHdKtQMhHy5qcB3KqCcGiW-Rp8P7KzUFRa9ZMKUiv6zen7LQ@mail.gmail.com	2025-04-30 23:15:51 -04:00
Daniel Gustafsson	45e7e8ca9e	Convert strncpy to strlcpy We try to avoid using strncpy() due to the ease with which it can be misused. Convert this callsite to use strlcpy() instead to match similar codepaths in this file. Suggested-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/2a796830-de2d-4030-b480-d673f6cc5d94@eisentraut.org	2025-04-30 23:00:47 +02:00
Daniel Gustafsson	f8c115a6cb	Typo and doc fixups for memory context reporting This fixes comment and docs typos as well as a small documentation change to make it clearer. Found via post-commit review. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CAH2L28vt16C9xTuK+K7QZvtA3kCNWXOEiT=gEekUw3Xxp9LVQw@mail.gmail.com	2025-04-30 11:10:27 +02:00
Daniel Gustafsson	d2a1ed1727	Add missing string terminator When copying the string, strncpy won't add nul termination since the string length is equal to the length specified. Explicitly set a nul terminator after copying to properly terminate. Found via post-commit review. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CAH2L28vt16C9xTuK+K7QZvtA3kCNWXOEiT=gEekUw3Xxp9LVQw@mail.gmail.com	2025-04-30 10:34:08 +02:00
David Rowley	918e7287ed	Fix broken indentation I forgot to run pgindent in `d8555e522`. Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com> Discussion: https://postgr.es/m/156083c9-eac0-418d-9667-92dec4d6d6cd@oss.nttdata.com	2025-04-30 19:18:30 +12:00
David Rowley	d8555e522e	Fix a couple of comment typos Author: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CAEG8a3+MRwDKc4YSFKKPKq7Y+vMufVC5u94wM5KZPB2CbgCxnQ@mail.gmail.com	2025-04-30 13:40:46 +12:00
Tom Lane	810a8b1c80	Give up on running with NetBSD/OpenBSD's default semaphore settings. This reverts commit `38da053463`, which attempted to preserve our ability to start with only 60 semaphores. Subsequent changes (particularly `55b454d0e`) have put that idea pretty much permanently out of reach: people wishing to use Postgres v18 on OpenBSD or NetBSD will have no choice but to increase those platforms' default values of SEMMNI and SEMMNS. Hence, revert 38da05346's changes in SEMAS_PER_SET and the minimum tested value of max_connections. Adjust a comment from the subsequent patch `6d0154196`, and tweak the wording in runtime.sgml to make it clear that changing SEMMNI/SEMMNS is no longer even a little bit optional on these platforms. Although `38da05346` was later back-patched into v17, leave that branch alone: it's still capable of starting with 60 semaphores, and there's no reason to break that. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/E1tuZNv-0037Gs-34@gemulon.postgresql.org Discussion: https://postgr.es/m/1052019.1745947915@sss.pgh.pa.us	2025-04-29 17:27:52 -04:00
Alexander Korotkov	2260c7f6d9	Fixes for ChangeVarNodes_walker() This commit fixes two bugs in the ChangeVarNodes_walker() function. * When considering RestrictInfo, walk down to its clauses based on the presence of the relid to be deleted not just in clause_relids but also in required_relids. * Incrementally adjust num_base_rels based on the change of clause_relids instead of recalculating it using clause_relids, which could contain outer-join relids. Reported-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs49PE3CvnV8vrQ0Dr%3DHqgZZmX0tdNbzVNJxqc8yg-8kDQQ%40mail.gmail.com Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-04-29 14:34:44 +03:00
Amit Kapila	3ff2a1f0c9	Fix assertion failure during decoding from synced slots. The slot synchronization skips updating the confirmed_flush LSN of the local slot if the local slot has a newer catalog_xmin or restart_lsn, but still allows updating the two_phase and two_phase_at fields of the slot. This opens up a window for the prepared transactions between the old confirmed_flush LSN and two_phase_at to unexpectedly get decoded and sent to the downstream after promotion. Then, while decoding the commit prepared, the assert will fail, which expects that the prepare hasn't been sent to the downstream. The fix is to skip updating the other slot fields when we are skipping the update of the slot's confirmed_flush LSN. We didn't backpatch this commit as two_phase_at was not synced in back branches, which means prepared transactions won't be unexpectedly sent to downstream. We discovered this problem while analyzing the BF failure reported in the discussion link. Reliably reproducing this issue without a debugger is difficult. Given its rarity, adding a specific injection point to test it doesn't seem worthwhile, so we won't be adding a dedicated test case. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OS0PR01MB5716B44052000EB91EFAE60E94BC2@OS0PR01MB5716.jpnprd01.prod.outlook.com	2025-04-29 12:52:05 +05:30
Melanie Plageman	f132815fd7	Add maintenance_io_concurrency flag to some read stream users Index vacuuming and [auto]prewarm AIO concurrency should be governed by maintenance_io_concurrency. As such, pass those read stream users the READ_STREAM_MAINTENANCE flag which will calculate their read stream distance with maintenance_io_concurrency instead of effective_io_concurrency. This was an oversight in the original commits making those operations use the read stream API. Discussion: https://postgr.es/m/flat/CAAKRu_aopDxTo4b41Mt_7Zc-z0_ngocrY8SFCCY6Aph1HgwuNw%40mail.gmail.com	2025-04-28 14:19:45 -04:00
Peter Geoghegan	ce72e7e02e	Fix obsolete nbtree array advancement comment. The check for whether another primitive scan is required after all, performed once on the next leaf page, was moved from _bt_checkkeys to its _bt_readpage caller by commit `9a2e2a28`. Update a comment that incorrectly described the recheck mechanism as something that takes place in _bt_checkkeys. Also fix an older typo in related code comments.	2025-04-28 12:49:17 -04:00
Peter Geoghegan	b75fedcab7	Make NULL tuple values always advance skip arrays. _bt_check_compare neglected to handle a case that can arise when the scan's keys are temporarily treated as nonrequired, as an optimization: whenever a NULL tuple value was encountered that had a skip array whose current element wasn't already NULL, _bt_check_compare failed to advance the array to the NULL element. This allowed _bt_check_compare to fail to return matching tuples containing a NULL value (though only with an array column that came before a skip array column with NULLs, and only during _bt_readpage calls that set pstate.forcenonrequired=true on a page where the higher-order column also had to advance). To fix, teach _bt_check_compare to handle this case just like any other case where a skip array key is unsatisfied and must be advanced directly (due to the key being considered a nonrequired key). Oversight in commit `8a510275`, which optimized nbtree search scan key comparisons with skip arrays. Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://postgr.es/m/CAHgHdKtLFWZcjr87hMH0hYDHgcifu4Tj7iHz-xh8qsJREt5cqA@mail.gmail.com	2025-04-28 12:11:08 -04:00
Alexander Korotkov	73e7361376	Restore comments in ChangeVarNodesExtended() This commit restores comments in ChangeVarNodesExtended(), which were accidentally removed by `fc069a3a63`. Reported-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs49PE3CvnV8vrQ0Dr%3DHqgZZmX0tdNbzVNJxqc8yg-8kDQQ%40mail.gmail.com	2025-04-28 11:20:22 +03:00
Amit Kapila	aaf9e95e87	Fix xmin advancement during fast_forward decoding. During logical decoding, we advance the catalog_xmin of the logical slot too early in fast_forward mode, resulting in required catalog data being removed by vacuum. This mode is normally used to advance the slot without processing the changes, but we still can't let the slot's xmin advance to an incorrect value. Commit `f49a80c481` fixed a similar issue where the logical slot's catalog_xmin was getting advanced prematurely during non-fast-forward mode. During xl_running_xacts processing, instead of directly advancing the slot's xmin to the oldest running xid in the record, it allowed the xmin to be held back for snapshots that can be used for not-yet-replayed transactions, as those might consider older txns as running too. However, it missed the fact that the same problem can happen during fast_forward mode decoding, as we won't build a base snapshot in that mode, and the future call to get_changes from the same slot can miss seeing the required catalog changes, leading to incorrect results. This commit allows building the base snapshot even in fast_forward mode to prevent the early advancement of xmin. Reported-by: Amit Kapila <amit.kapila16@gmail.com> Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 13 Discussion: https://postgr.es/m/CAA4eK1LqWncUOqKijiafe+Ypt1gQAQRjctKLMY953J79xDBgAg@mail.gmail.com Discussion: https://postgr.es/m/OS0PR01MB57163087F86621D44D9A72BF94BB2@OS0PR01MB5716.jpnprd01.prod.outlook.com	2025-04-28 11:35:54 +05:30
Michael Paquier	b225c5e76e	Remove circular #include's between wait_event.h and wait_event_types.h wait_event_types.h is generated by the code, and included wait_event.h. wait_event.h did the opposite move, including wait_event_types.h, causing a circular dependency between both. wait_event_types.h only needs to know about the wait event classes, so this information is moved into its own file, and wait_event_types.h uses this new header so that it no longer depends on wait_event.h. Note that such errors can be found with clang-tidy, with commands like this one: clang-tidy source_file.c --checks=misc-header-include-cycle -- \ -I/install/path/include/ -I/install/path/include/server/ Issue introduced by `fa88928470`. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/350192.1745768770@sss.pgh.pa.us	2025-04-28 09:08:15 +09:00
Alexander Korotkov	1aa7cf9eb8	Disallow removing placeholders during Self-Join Elimination. `fc069a3a63` implements Self-Join Elimination (SJE), which can remove base relations when appropriate. However, regression tests for SJE only cover the case when placeholder variables (PHVs) are evaluated and needed only in a single base rel. If this baserel is removed due to SJE, its clauses, including PHVs, will be transferred to the keeping relation. Removing these PHVs may trigger an error on plan creation -- thanks to `b3ff6c742f` for detecting that. This commit skips removal of PHVs during SJE. It may also happen that we skip the removal of some PHVs that could be removed. However, the overhead of extra PHVs is small compared to the complexity of analysis needed to remove them. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Alena Rybakina <a.rybakina@postgrespro.ru> Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Richard Guo <guofenglinux@gmail.com>	2025-04-28 01:40:42 +03:00
Tom Lane	94b84a6072	Don't use double-quotes in #include's of system headers, redux. This cleans up some loose ends left by commit `e8ca9ed1d`. I hadn't looked closely enough at these places before, but now I have. The use of double-quoted #includes for Perl headers in plperl_system.h seems to be simply a mistake introduced in `6c944bf3c` and faithfully copied forward since then. (I had thought possibly it was required by some weird Windows build setup, but there's no evidence of that in our history.) The occurrences in SectionMemoryManager.h and SectionMemoryManager.cpp evidently stem from those files' origin as LLVM code. It's understandable that LLVM would treat their own files as needing double-quoted #includes; but they're still system headers to us. I also applied the same check to *.c files, and found a few other random incorrect usages in both directions. Our ECPG headers and test files routinely use angle brackets to refer to ECPG headers. I left those usages alone, since it seems reasonable for an ECPG user to regard those headers as system headers.	2025-04-27 13:23:19 -04:00
David Rowley	936457419d	Eliminate divide in new fast-path locking code `c4d5cb71d2` adjusted the fast-path locking code to allow some configuration of the number of fast-path locking slots via the max_locks_per_transaction GUC. In that commit the FAST_PATH_REL_GROUP() macro used integer division to determine the fast-path locking group slot to use for the lock. The divisor in this case is always a power-of-two value. Here we swap out the divide by a bitwise-AND, which is a significantly faster operation to perform. In passing, adjust the code that's setting FastPathLockGroupsPerBackend so that it's more clear that the value being set is a power-of-two. Also, adjust some comments in the area which contained some magic numbers. It seems better to justify the 1024 upper limit in the location where the #define is made instead of where it is used. Author: David Rowley <drowleyml@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAApHDvodr3bcnpxcs7+k-3cFwYR0tP-BYhyd2PpDhe-bCx9i=g@mail.gmail.com	2025-04-27 11:53:40 +12:00
Andres Freund	039bfc457e	aio: Improve debug logging around waiting for IOs Trying to investigate a bug report by Alexander Lakhin made it apparent that the debug logging around waiting for IO completion is insufficient. Fix that. Discussion: https://postgr.es/m/h4in2db37vepagmi2oz5vvqymjasc5gyb4lpqkunj4eusu274i@37jpd3c2spd3	2025-04-25 13:31:25 -04:00
Andres Freund	500b61769f	Fix bug allowing io_combine_limit > io_max_combine_limit `10f6646847` intended to limit the value of io_combine_limit to the minimum of io_combine_limit and io_max_combine_limit. To avoid issues with interdependent GUCs, it introduced io_combine_limit_guc and set io_combine_limit in assign hooks. That plan was thwarted by guc_tables.c accidentally still referencing io_combine_limit, instead of io_combine_limit_guc. That led to the GUC machinery overriding the work done in the assign hooks, potentially leaving io_combine_limit with a too high value. The consequence of this bug was that when running with io_combine_limit > io_combine_limit_guc the AIO machinery would not have reserved large enough iovec and IO data arrays, with one IO's arrays overlapping with another IO's, leading to total confusion. To make such a problem easier to detect in the future, add assertions to pgaio_io_set_handle_data_* checking the length is smaller than io_max_combine_limit (not just PG_IOV_MAX). It'd be nice to have a few tests for this, but it's not entirely obvious how to do so portably. As remarked upon by Tom, the GUC assignment hooks really shouldn't set the underlying variable, that's the job of the GUC machinery. Change that as well. Discussion: https://postgr.es/m/c5jyqnuwrpigd35qe7xdypxsisdjrdba5iw63mhcse4mzjogxo@qdjpv22z763f	2025-04-25 13:31:24 -04:00
Andres Freund	0d9114b704	aio: Fix crash potential for pg_aios views due to late state update pgaio_io_reclaim() reset the fields in PgAioHandle before updating the state to IDLE or incrementing the generation. For most things that's OK, but for pg_get_aios() it is not - if it copied the PgAioHandle while fields were being reset, we wouldn't detect that and could call pgaio_io_get_target_description() with ioh->target == PGAIO_TID_INVALID, leading to a crash. Fix this issue by incrementing the generation and state earlier, before resetting. Also add an assertion to pgaio_io_get_target_description() for the target to be valid - that'd have made this case a bit easier to debug. While at it, add/update a few related assertions. Author: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/062daca9-dfad-4750-9da8-b13388301ad9@gmail.com	2025-04-25 13:31:13 -04:00
Michael Paquier	923ae50cf5	Add sanity check for dshash entries when reading pgstats file Not having this check would produce a core dump at startup when running pgstat_read_statsfile(), in the case where the information of a stats kind for an entry in the dshash could not be found. The same check already happens for fixed-numbered stats and entries that are stored with their names. This issue can be seen with custom stats kinds. Note that this problem can be reproduced with what is in the core code: - Tweak the test module injection_points to not load the fixed-numbered stats part, leaving only the variable-numbered stats. - Create an instance with injection_points defined in shared_preload_libraries. - Create a pgstats entry by attaching and running a point. - Restart the server without shared_preload_libraries. The startup process detects that something is wrong and reports a WARNING. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aAieZAvM+K1d89R2@ip-10-97-1-34.eu-west-3.compute.internal	2025-04-24 09:20:01 +09:00
Tom Lane	bc19f63f80	Avoid possibly-theoretical OOM crash hazard in hash_create(). One place in hash_create() used DynaHashAlloc() as a convenient shorthand for MemoryContextAlloc(). That was fine when it was written, but it stopped being fine when `9c911ec06` changed DynaHashAlloc() to use MCXT_ALLOC_NO_OOM (mea culpa). Change the code to call plain MemoryContextAlloc() as intended. I think that this bug may be unreachable in practice, since we now always create AllocSets with some space already allocated, so that an OOM failure here for a non-shared hash table should be impossible (with a hash table name of reasonable length anyway). And there aren't enough shared hash tables to make a crash for one of those probable. Nonetheless it's clearly not operating as designed, so back-patch to v16 where `9c911ec06` came in. Reported-by: Maksim Korotkov <m.korotkov@postgrespro.ru> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/219bdccd460510efaccf90b57e5e5ef2@postgrespro.ru Backpatch-through: 16	2025-04-23 16:04:55 -04:00
Alexander Korotkov	bb78e42678	Maintain RelIdToTypeIdCacheHash in TypeCacheOpcCallback() `b85a9d046e` introduced a new RelIdToTypeIdCacheHash, whose entries should exist for typecache entries with TCFLAGS_HAVE_PG_TYPE_DATA flag set or any of TCFLAGS_OPERATOR_FLAGS set or tupDesc set. However, TypeCacheOpcCallback(), which resets TCFLAGS_OPERATOR_FLAGS, did not update RelIdToTypeIdCacheHash. This commit adds a delete_rel_type_cache_if_needed() call to the TypeCacheOpcCallback() function to maintain RelIdToTypeIdCacheHash after resetting TCFLAGS_OPERATOR_FLAGS. Also, this commit fixes the name of the delete_rel_type_cache_if_needed() function in its mentions in the comments. Reported-by: Noah Misch Discussion: https://postgr.es/m/20250411203241.e9.nmisch%40google.com	2025-04-23 20:26:52 +03:00
Alexander Korotkov	9f404d7922	Properly prepare varinfos in estimate_multivariate_bucketsize() To estimate with extended statistics, we need to clear the varnullingrels field in the expression, and duplicates are not allowed in the GroupVarInfo list. We might re-use add_unique_group_var(), but we don't do so for two reasons. 1) We must keep the origin_rinfos list ordered exactly the same way as varinfos. 2) add_unique_group_var() is designed for estimate_num_groups(), where a larger number of groups is worse. While estimating the number of hash buckets, we have the opposite: a lesser number of groups is worse. Therefore, we don't have to remove "known equal" vars: the removed var may valuably contribute to the multivariate statistics to grow the number of groups. This commit adds custom code to estimate_multivariate_bucketsize() to initialize varinfos properly. Reported-by: Robins Tharakan <tharakan@gmail.com> Discussion: https://postgr.es/m/18885-da51324078588253%40postgresql.org Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-04-23 20:25:21 +03:00
Tom Lane	3db61db48e	Change the names generated for child foreign key constraints. When a foreign key constraint is placed on a partitioned table, we actually make two pg_constraint entries associated with that table. (I have my doubts about the wisdom of that, but it's been like that since v12 and post-feature-freeze is no time to be messing with such entrenched decisions.) The second "child" entry always had a name generated according to the default rule, "table_column(s)_fkey[nnn]", even if the primary entry had an unrelated user-specified name. The trouble with doing that is that the default name could collide with the user-specified name of some other constraint on the same table. While we were willing to adjust the generated name to avoid collisions, that only helps if it's made second; if it's made first then creation of the other constraint would fail, potentially causing dump/reload or pg_upgrade failures. The core of the problem here is that we're infringing on user namespace, so I doubt that there's any 100% solution other than to find a way to not need the "child" entry. In the meantime, it seems like it'd be an improvement to make the child's name be the name of the parent constraint with an underscore and digit(s) appended as necessary to make it unique. This rule can in theory fail in the same way, but it seems much less probable; for one thing, this rule is guaranteed not to match primary entries having auto-generated names. (While an auto-generated primary name isn't user-specified to begin with, it acts like that during dump/reload, so collisions against such names are definitely possible.) An additional bonus, visible in some of the regression test cases that change here, arises from the fact that some error messages cite the child constraint's name not the parent's. In the previous approach the two names could be completely unrelated, leading to user confusion --- the more so since psql's \d command hides child constraints. 
With this approach it's hopefully much clearer which constraint-the-user-knows-about is failing. However, that does mean that there's user-visible behavior change occurring here, making it seem like not something to back-patch. I feel it's not too late for v18, though. Reported-by: Kirill Reshke <reshkekirill@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/CALdSSPhGitjpTfzEMJN-Y2x+Q-5QChSxAsmSJ1-E8mQJLkHOqQ@mail.gmail.com	2025-04-23 12:03:02 -04:00
Amit Kapila	0e091ce409	Fix an oversight in `3f28b2fcac`. Commit `3f28b2fcac` tried to ensure that the replication origin shouldn't be advanced in case of an ERROR in the apply worker, so that it can request the same data again after restart. However, it is possible that an ERROR was caught and handled by a (say PL/pgSQL) function, and the apply worker continues to apply further changes, in which case, we shouldn't reset the replication origin. Ensure to reset the origin only when the apply worker exits after an ERROR. Commit `3f28b2fcac` added new function geterrlevel, which we removed in HEAD as part of this commit, but kept it in backbranches to avoid breaking any applications. A separate case can be made to have such a function even for HEAD. Reported-by: Shawn McCoy <shawn.the.mccoy@gmail.com> Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 16, where it was introduced Discussion: https://postgr.es/m/CALsgZNCGARa2mcYNVTSj9uoPcJo-tPuWUGECReKpNgTpo31_Pw@mail.gmail.com	2025-04-23 11:08:24 +05:30
Michael Paquier	1f7878c33c	Remove assertion based on pending_since in pgstat_report_stat() This assertion, based on pending_since (timestamp used to prevent stats reports to be too frequent or should a partial flush happen), is reached when it is found that no data can be flushed but a previous call of pgstat_report_stat() determined that some stats data has been found as in need of a flush. So pending_since is set when some stats data is pending (in non-force mode) or if report attempts are too frequent, and reset to 0 once all stats have been flushed. Since `5cbbe70a9c`, WAL senders have begun to report their stats on a periodic basis for IO stats in v16~ and backend stats on HEAD, creating some friction with the concurrent pgstat_report_stat() calls that can happen in the context of a WAL sender (shutdown callback doing a final report or backend-related code paths). This problem is the cause of spurious failures in the TAP tests. In theory, this assertion can be also reached in v15, even if that's very unlikely. For example, a process, say a background worker, could do periodic and direct stats flushes with concurrent calls of pgstat_report_stat() that could cause conflicting values of pending_since. This can be done with WAL or SLRU stats flushes using pgstat_flush_wal() or pgstat_slru_flush(). HEAD makes this situation easier to happen with custom cumulative stats. This commit removes the assertion altogether, per discussion, as it is more useful to keep the state of things as they are for the WAL sender. The assertion could use a special state based on for example am_walsender, but I doubt that this would be meaningful in the long run based on the other arguments raised while discussing this issue. 
Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/1489124.1744685908@sss.pgh.pa.us Discussion: https://postgr.es/m/dwrkeszz6czvtkxzr5mqlciy652zau5qqnm3cp5f3p2po74ppk@omg4g3cc6dgq Backpatch-through: 15	2025-04-23 13:53:29 +09:00
Tom Lane	eaf582806c	gen_node_support.pl: improve error message for unclosed struct. This error message was 'runaway "struct_name"', which isn't all that clear; I think 'could not find closing brace for "struct_name"' is better. Also, provide the location of the struct start using the script's usual '$file:$lineno' style. Bug: #18901 Reported-by: Clemens Ruck <clemens.ruck@t-online.de> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18901-424272abe01357e6@postgresql.org	2025-04-22 13:56:31 -04:00
Michael Paquier	02c63f9438	Rename injection point for invalidation messages at end of transaction This injection point was named "AtEOXact_Inval-with-transInvalInfo", not respecting the implied naming convention that injection points should use lower-case characters, with terms separated by dashes. All the other points defined in the tree follow this style, so let's be more consistent. Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://postgr.es/m/OSCPR01MB14966E14C1378DEE51FB7B7C5F5B32@OSCPR01MB14966.jpnprd01.prod.outlook.com Backpatch-through: 17	2025-04-22 10:01:38 +09:00
Jeff Davis	90260e2ec6	Fix INITCAP() word boundaries for PG_UNICODE_FAST. Word boundaries are based on whether a character is alphanumeric or not. For the PG_UNICODE_FAST collation, alphanumeric includes non-ASCII digits; whereas for the PG_C_UTF8 collation, it only includes digits 0-9. Pass down the right information from the pg_locale_t into initcap_wbnext to differentiate the behavior. Reported-by: Noah Misch <noah@leadboat.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250417135841.33.nmisch@google.com	2025-04-21 12:34:58 -07:00
Tom Lane	80b727eb9d	Use the same cmd_context throughout a walsender's lifetime. exec_replication_command created a cmd_context to work in and then deleted it on exit. This is pretty dangerous because some replication commands start/finish transactions. In the wake of commit `1afe31f03`, that could lead to re-selecting a CurrentMemoryContext that's already been deleted, leading to hilarity such as a memory context that is its own parent. To fix, let's make the cmd_context persist across exec_replication_command calls; instead of deleting it, we'll just reset it each time. In this way it retains the same identity and there's no problem if transaction abort restores it as the working context. It probably even saves a few microseconds to do this. This fix also ensures that exec_replication_command returns to the caller (PostgresMain) with the same context active that had been when it was called (probably MessageContext). The previous coding could get that wrong too. Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAO6_XqoJA7-_G6t7Uqe5nWF3nj+QBGn4F6Ptp=rUGDr0zo+KvA@mail.gmail.com	2025-04-21 12:09:36 -04:00
Tom Lane	5ec8b01c30	MemoryContextCreate: assert parent is valid and different from node. The case of "node == parent" might seem impossible, since we just allocated the new node. But it's possible if parent is a dangling reference to a recently-deleted context. In fact, given aset.c's habit of recycling contexts, it's actually rather likely if that's so. If we'd had this assertion before, it would have simplified debugging a recently-identified walsender issue. Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAO6_XqoJA7-_G6t7Uqe5nWF3nj+QBGn4F6Ptp=rUGDr0zo+KvA@mail.gmail.com	2025-04-21 11:34:36 -04:00
David Rowley	78eda9e264	Fix a few more duplicate words in comments Similar to `84fd3bc14` but these ones were found using a regex that can span multiple lines. Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvrMcr8XD107H3NV=WHgyBcu=sx5+7=WArr-n_cWUqdFXQ@mail.gmail.com	2025-04-21 13:50:50 +12:00
David Rowley	84fd3bc141	Fix a few duplicate words in comments These are all new to v18 Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvrMcr8XD107H3NV=WHgyBcu=sx5+7=WArr-n_cWUqdFXQ@mail.gmail.com	2025-04-21 10:41:18 +12:00
Noah Misch	8180136652	Comment on need to MarkBufferDirty() if omitting DELAY_CHKPT_START. Blocking checkpoint phase 2 requires MarkBufferDirty() and BUFFER_LOCK_EXCLUSIVE; neither suffices by itself. transam/README documents this, citing SyncOneBuffer(). Update the DELAY_CHKPT_START documentation to say this. Expand the heap_inplace_update_and_unlock() comment that cites XLogSaveBufferForHint() as precedent, since heap_inplace_update_and_unlock() could have opted not to use DELAY_CHKPT_START. Commit `8e7e672cda` added DELAY_CHKPT_START to heap_inplace_update_and_unlock(). Since commit `bc6bad8857` reverted it in non-master branches, no back-patch. Discussion: https://postgr.es/m/20250406180054.26.nmisch@google.com	2025-04-20 12:00:17 -07:00
Noah Misch	2d5350cfbd	Avoid ERROR at ON COMMIT DELETE ROWS after relhassubclass=f. Commit `7102070329` fixed a similar bug, but it missed the case of database-wide ANALYZE ("use_own_xacts" mode). Commit `a07e03fd8f` changed consequences from silent discard of a pg_class stats (relpages et al.) update to ERROR "tuple to be updated was already modified". Losing a relpages update of an ON COMMIT DELETE ROWS table was negligible, but a COMMIT-time error isn't negligible. Back-patch to v13 (all supported versions). Reported-by: Richard Guo <guofenglinux@gmail.com> Reported-by: Robins Tharakan <tharakan@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-XwMKMKJ_GT=p3_-_=j9rQSEs1FbDFUnW9zHuKPsPNEQ@mail.gmail.com Backpatch-through: 13	2025-04-20 08:28:48 -07:00
David Rowley	d47f922246	Fix issue with ORDER BY / DISTINCT aggregates and FILTER `1349d2790` added support so that aggregate functions with an ORDER BY or DISTINCT clause could make use of presorted inputs to avoid an implicit sort within nodeAgg.c. That commit failed to consider that a FILTER clause may exist that filters rows before the aggregate function arguments are evaluated. That can be problematic if an aggregate argument contains an expression which could error out during evaluation. It's perfectly valid to want to have a FILTER clause which eliminates such values, and with the pre-sorted path added in `1349d2790`, it was possible that the planner would produce a plan with a Sort node above the Aggregate to perform the sort on the aggregate's arguments long before the Aggregate node would filter out the non-matching values. Here we fix this by inspecting ORDER BY / DISTINCT aggregate functions which have a FILTER clause to see if the aggregate's arguments are anything more complex than a Var or a Const. Evaluating these isn't going to cause an error. If we find any non-Var, non-Const parameters then the planner will now opt to perform the sort in the Aggregate node for these aggregates, i.e. disable the presorted aggregate optimization. An alternative fix would have been to completely disallow the presorted optimization for Aggrefs with any FILTER clause, but that wasn't done as that could cause large performance regressions for queries that see significant gains from `1349d2790` due to presorted results coming in from an Index Scan. Backpatch to 16, where `1349d2790` was introduced Author: David Rowley <dgrowleyml@gmail.com> Reported-by: Kaimeh <kkaimeh@gmail.com> Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAK-%2BJz9J%3DQ06-M7cDJoPNeYbz5EZDqkjQbJnmRyQyzkbRGsYkA%40mail.gmail.com Backpatch-through: 16	2025-04-20 22:12:07 +12:00
Michael Paquier	88e947136b	Fix typos and grammar in the code The large majority of these have been introduced by recent commits done in the v18 development cycle. Author: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/9a7763ab-5252-429d-a943-b28941e0e28b@gmail.com	2025-04-19 19:17:42 +09:00
Michael Paquier	114f7fa81c	Rename injection points used in AIO tests The format of the injection point names used by the AIO code does not match the existing naming convention used everywhere else in the code, so let's be consistent. These points are used in test_aio. Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Discussion: https://postgr.es/m/Z_yTB80bdu1sYDqJ@paquier.xyz	2025-04-19 18:53:35 +09:00
David Rowley	d9e03864b6	Make levels 1-based in pg_log_backend_memory_contexts() Both pg_get_process_memory_contexts() and pg_backend_memory_contexts have 1-based levels, whereas pg_log_backend_memory_contexts() was using 0-based levels. Align these. This results in slightly saner behavior from MemoryContextStatsDetail() in regards to the max_level. Previously it would stop at 1 level before the maximum requested level rather than at that level. Reported-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Author: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Author: David Rowley <drowleyml@gmail.com> Reviewed-by: Melih Mutlu <m.melihmutlu@gmail.com> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Discussion: https://postgr.es/m/395ea5d4fe190480efa95bf533485c70@oss.nttdata.com	2025-04-18 09:04:28 +12:00
Tom Lane	fc5e966f73	Suppress "may be used uninitialized" warnings from older compilers. The "children" list won't be used until "got_children" has been set true, but older compilers don't get that; about half a dozen buildfarm animals are warning about this. Issue added by `11ff192b5`. While here, improve slightly-shaky grammar in comment. Discussion: https://postgr.es/m/2057835.1744833309@sss.pgh.pa.us	2025-04-17 16:47:04 -04:00
Tom Lane	0400ae4a68	Cache typlens of a SQL function's input arguments. This gets rid of repetitive get_typlen calls in postquel_sub_params, which show up as costing a few percent of the runtime in simple test cases (more with more parameters). In combination with the preceding patches, this gets us most of the way back down to the amount of per-call overhead that functions.c had before commit `0dca5d68d`. There are some more things that could be done, but this seems like an okay place to stop for v18.	2025-04-17 12:56:40 -04:00
Tom Lane	0313c5dc62	Make SQLFunctionCache long-lived again. At this point, the only data structures we allocate directly in fcontext are the SQLFunctionCache struct itself, the ParamListInfo struct, and the execution_state array, all of which are small and perfectly capable of being re-used across executions of the same FmgrInfo. Hence, let's give them the same lifespan as the FmgrInfo. This step gets rid of the separate SQLFunctionLink struct and makes fn_extra point to SQLFunctionCache again. We also get rid of the separate fcontext memory context and allocate these items directly in fn_mcxt. For notational simplicity, SQLFunctionCache still has an fcontext field, but it's just a copy of fn_mcxt. The motivation for this is to allow these structures to live as long as the FmgrInfo and be re-used across calls, restoring the original design without its propensity for memory leaks. This gets rid of some per-call overhead that we added in `0dca5d68d`. We also make an effort to re-use the JunkFilter and result slot. Those might need to change if the function definition changes, so we compromise by rebuilding them if the cached plan changes. This also moves the tuplestore into fn_mcxt so that it can be re-used across calls, again undoing a change made in `0dca5d68d`.	2025-04-17 12:56:31 -04:00
Tom Lane	f45a5444ee	Split some storage out to separate subcontexts of fcontext. Put the JunkFilter and its result slot (and thence also some subsidiary data such as the result tupledesc) into a separate subcontext "jfcontext". This doesn't accomplish a lot at this point, because we make a new JunkFilter each time through the SQL function. However, the plan is to make the fcontext long-lived, and that raises the possibility that we'll need a new JunkFilter because the plan for the result-generating query changes. A separate context makes it easy to free the obsoleted data when that happens. Also, instead of always running the sub-executor in fcontext, make a separate context for it if we're doing lazy eval of a SRF, and otherwise just run it inside CurrentMemoryContext.	2025-04-17 12:56:21 -04:00
Tom Lane	595d1efeda	Make functions.c mostly run in a short-lived memory context. Previously, much of this code ran with CurrentMemoryContext set to be the function's fcontext, so that we tended to leak a lot of stuff there. Commit `0dca5d68d` dealt with that by releasing the fcontext at the completion of each SQL function call, but we'd like to go back to the previous approach of allowing the fcontext to be query-lifespan. To control the leakage problem, rearrange the code so that we mostly run in the memory context that fmgr_sql is called in (which we expect to be short-lived). Notably, this means that parsing/planning is all done in the short-lived context and doesn't leak cruft into fcontext. This patch also fixes the allocation of execution_state records so that we don't leak them across executions. I set that up with a re-usable array that contains at least as many execution_state structs as we need for the current querytree. The chain structure is still there, but it's not really doing much for us, and maybe somebody will be motivated to get rid of it. I'm not though. This incidentally also moves the call of BlessTupleDesc to be with the code that creates the JunkFilter. That doesn't make much difference now, but a later patch will reduce the number of times the JunkFilter gets made, and we needn't bless the results any more often than that. We still leak a fair amount in fcontext, particularly when executing utility statements, but that's material for a separate patch step; the point here is only to get rid of unintentional allocations in fcontext.	2025-04-17 12:56:08 -04:00
Tom Lane	09b07c2953	Minor performance improvement for SQL-language functions. Late in the development of commit `0dca5d68d`, I added a step to copy the result tlist we extract from the cached final query, because I was afraid that that might not last as long as the JunkFilter that we're passing it off to. However, that turns out to cost a noticeable number of cycles, and it's really quite unnecessary because the JunkFilter will not examine that tlist after it's been created. (ExecFindJunkAttribute would use it, but we don't use that function on this JunkFilter.) Hence, remove the copy step. For safety, reset the might-become-dangling jf_targetList pointer to NIL. In passing, remove DR_sqlfunction.cxt, which we don't use anymore; it's confusing because it's not entirely clear which context it ought to point at.	2025-04-17 12:55:58 -04:00
Noah Misch	f4ece891fc	Assert lack of hazardous buffer locks before possible catalog read. Commit `0bada39c83` fixed a bug of this kind, which existed in all branches for six days before detection. While the probability of reaching the trouble was low, the disruption was extreme. No new backends could start, and service restoration needed an immediate shutdown. Hence, add this to catch the next bug like it. The new check in RelationIdGetRelation() suffices to make autovacuum detect the bug in commit `243e9b40f1` that led to commit `0bada39`. This also checks in a number of similar places. It replaces each Assert(IsTransactionState()) that pertained to a conditional catalog read. No back-patch for now, but a back-patch of commit `243e9b4` should back-patch this, too. A back-patch could omit the src/test/regress changes, since back branches won't gain new index columns. Reported-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/20250410191830.0e.nmisch@google.com Discussion: https://postgr.es/m/10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com	2025-04-17 05:00:30 -07:00
Jeff Davis	2e5353be25	Another unintentional behavior change in commit `e9931bfb75`. Reported-by: Noah Misch <noah@leadboat.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250412123430.8c.nmisch@google.com	2025-04-16 16:49:42 -07:00
Jeff Davis	b107744ce7	Improve comment in regc_pg_locale.c. Reported-by: Noah Misch <noah@leadboat.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250412123430.8c.nmisch@google.com	2025-04-16 16:49:35 -07:00
David Rowley	f3281f9f93	Improve comments for estimate_multivariate_ndistinct() estimate_multivariate_ndistinct() is coded to assume the caller handles passing it a list of GroupVarInfos with unique 'var' fields over the entire list. `6bb6a62f3` added code which didn't ensure this and that could result in estimate_multivariate_ndistinct() erroring out with: ERROR: corrupt MVNDistinct entry This occurred because estimate_multivariate_ndistinct() first searches for a set of stats that match to at least two of the given GroupVarInfos and then later assumes that the MVNDistinctItem.items array of the best matching stats will have an entry for those two columns. If the GroupVarInfos List contained a duplicate entry then the same column could be matched to twice and that could trick the code into thinking we have >= 2 columns matched in cases where only a single distinct column has been matched. This could result in a failure to find the correct MVNDistinctItem in the stats as the array containing those never contains an item for single columns. Here we make it more clear that the function needs a distinct set of GroupVarInfos and also tidy up a few other comments to make things a bit easier to follow. Author: David Rowley <drowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvocZCUhM9W9mJ39d6oQz7ePKoqFnao_347mvC-A7QatcQ@mail.gmail.com	2025-04-17 11:03:24 +12:00
Tom Lane	ab3d8afc7f	Sync declarations and definitions of two new tablecmds.c functions. Buildfarm member drongo complained because the definitions of these functions used "const Oid foo" where the forward declarations just had "Oid foo". (I'm a bit surprised that drongo seems to be the only complainant.) I chose to fix this by removing the "consts" because (a) I'm generally not a fan of using const that way, and (b) it was a minority usage even within these two functions, let alone compared to the rest of our code base. Oversight in commit `eec0040c4`, so no need for back-patch.	2025-04-16 17:59:08 -04:00
Álvaro Herrera	11ff192b5b	Elide not-null constraint checks on child tables during PK creation We were unnecessarily acquiring AccessExclusiveLock on all child tables when "ALTER TABLE ONLY sometab ADD PRIMARY KEY" was run on their parent table, an oversight in commit `14e87ffa5c`. This caused deadlocks during pg_restore of partitioned tables. The reason to acquire the AEL was that we need to verify that child tables have the involved columns already marked as not-null; but if the parent table has an inheritable not-null constraint, then all children must necessarily be in the correct state already, so we can skip the check, which avoids acquiring the lock. Reorder the code so that it works that way. This doesn't change things in the case where the constraint doesn't exist, but that case is of lesser importance because it doesn't occur during parallel pg_restore. While at it, reword some errmsg() and add errhint() to similar cases in related but not adjacent code. Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/67469c1c-38bc-7d94-918a-67033f5dd731@gmx.net Discussion: https://postgr.es/m/2045026.1743801143@sss.pgh.pa.us Discussion: https://postgr.es/m/1280408.1744650810@sss.pgh.pa.us	2025-04-16 21:51:23 +02:00
Richard Guo	3b35f9a4c5	Fix an incorrect check in get_memoize_path Memoize typically marks cache entries as complete after fully scanning the inner side of a join. However, in the case of unique joins, we skip to the next outer tuple as soon as the first matching inner tuple is found, leaving no opportunity to scan the inner side to completion. To work around that, we mark cache entries as complete after fetching the first matching inner tuple in unique joins. This approach is only safe when all of the join's restriction clauses are parameterized; otherwise, there is no guarantee that reading just one tuple from the inner side is sufficient. Currently, we check for this by verifying that the number of clauses in ppi_clauses is no less than the number of the join's restriction clauses. However, this check isn't entirely reliable, as ppi_clauses includes join clauses available from all outer rels, not just the current outer rel. This means the check could pass even if a restriction clause isn't parameterized, as long as another join clause, which doesn't belong to the current join, is included in ppi_clauses. To fix this, we explicitly check whether each restriction clause of the current join is present in ppi_clauses. While we're here, remove the XXX comment from the modified code, as it's not justified; in certain cases, it's not possible to move a join clause to the inner side. This is arguably a bugfix, but no backpatch given the lack of field reports. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-8JPouj=wBDj4DhK-WO4+Xdx=A2jbjvvyyTBQneJ1=BQ@mail.gmail.com	2025-04-16 10:55:44 +09:00
Tom Lane	7c87284940	Fix failure for generated column with a not-null domain constraint. If a GENERATED column is declared to have a domain data type where the domain's constraints disallow null values, INSERT commands failed because we built a targetlist that included coercing a null constant to the domain's type. The failure occurred even when the generated value would have been perfectly OK. This is adjacent to the issues fixed in `0da39aa76`, but we didn't notice for lack of testing a domain with such a constraint. We aren't going to use the result of the targetlist entry for the generated column --- ExecComputeStoredGenerated will overwrite it. So it's not really necessary that it have the exact datatype of the generated column. This patch fixes the problem by changing the targetlist entry to be a null Const of the domain's base type, which should be sufficiently legal. (We do have to tweak ExecCheckPlanOutput to accept the situation, though.) This has been broken since we implemented generated columns. However, this patch only applies easily as far back as v14, partly because I (tgl) only carried `0da39aa76` back that far, but mostly because v14 significantly refactored the handling of INSERT/UPDATE targetlists. Given the lack of field complaints and the short remaining support lifetime of v13, I judge the cost-benefit ratio not good for devising a version that would work in v13. Reported-by: jian he <jian.universality@gmail.com> Author: jian he <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxG59tip2+9h=rEv-ykOFjt0cbsPVchhi0RTij8bABBA0Q@mail.gmail.com Backpatch-through: 14	2025-04-15 12:08:34 -04:00
Tom Lane	e708ffe79d	Fix GIN's shimTriConsistentFn to not corrupt its input. Commit `0f21db36d` made an assumption that GIN triConsistentFns would not modify their input entryRes[] arrays. But in fact, the "shim" triConsistentFn that we use for opclasses that don't supply their own did exactly that, potentially leading to wrong answers from a GIN index search. Through bad luck, none of the test cases that we have for such opclasses exposed the bug. One response to this could be that the assumption of consistency check functions not modifying entryRes[] arrays is a bad one, but it still seems reasonable to me. Notably, shimTriConsistentFn is itself assuming that with respect to the underlying boolean consistentFn, so it's sure being self-centered in supposing that it gets to do so. Fortunately, it's quite simple to fix shimTriConsistentFn to restore the entry-time state of entryRes[], so let's do that instead. This issue doesn't affect any core GIN opclasses, since they all supply their own triConsistentFns. It does affect contrib modules btree_gin, hstore, and intarray. Along the way, I (tgl) noticed that shimTriConsistentFn failed to pick up on a "recheck" flag returned by its first call to the boolean consistentFn. This may be only a latent problem, since it would be unlikely for a consistentFn to set recheck for the all-false case and not any other cases. (Indeed, none of our contrib modules do that.) Nonetheless, it's formally wrong. Reported-by: Vinod Sridharan <vsridh90@gmail.com> Author: Vinod Sridharan <vsridh90@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAFMdLD7XzsXfi1+DpTqTgrD8XU0i2C99KuF=5VHLWjx4C1pkcg@mail.gmail.com Backpatch-through: 13	2025-04-12 12:28:02 -04:00
Peter Geoghegan	a6cab6a78e	Harmonize function parameter names for Postgres 18. Make sure that function declarations use names that exactly match the corresponding names from function definitions in a few places. These inconsistencies were all introduced during Postgres 18 development. This commit was written with help from clang-tidy, by mechanically applying the same rules as similar clean-up commits (the earliest such commit was commit `035ce1fe`).	2025-04-12 12:07:36 -04:00
Daniel Gustafsson	847bbb21f8	Fix recently introduced typos This fixes typos in docs and comments introduced during the v18 development cycle, to keep them from ending up in backbranches. Author: Jacob Brazeal <jacob.brazeal@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CA+COZaCgGua25f2hSrjrDLJcJJAHkwoKgTTqUy-wyL1=64JNjw@mail.gmail.com	2025-04-11 22:17:12 +02:00
Michael Paquier	2e57790836	Fix race with synchronous_standby_names at startup synchronous_standby_names cannot be reloaded safely by backends, and the checkpointer is in charge of updating a state in shared memory if the GUC is enabled in WalSndCtl, to let the backends know if they should wait or not for a given LSN. This provides a strict control on the timing of the waiting queues if the GUC is enabled or disabled, then reloaded. The checkpointer is also in charge of waking up the backends that could be waiting for a LSN when the GUC is disabled. This logic had a race condition at startup, where it would be possible for backends to not wait for a LSN even if synchronous_standby_names is enabled. This would cause visibility issues with transactions that we should be waiting for but they were not. The problem lasts until the checkpointer does its initial update of the shared memory state when it loads synchronous_standby_names. In order to take care of this problem, the shared memory state in WalSndCtl is extended to detect if it has been initialized by the checkpointer, and not only check if synchronous_standby_names is defined. In WalSndCtlData, sync_standbys_defined is renamed to sync_standbys_status, a bits8 able to know about two states: - If the shared memory state has been initialized. This flag is set by the checkpointer at startup once, and never removed. - If synchronous_standby_names is known as defined in the shared memory state. This is the same as the previous sync_standbys_defined in WalSndCtl. This method gives a way for backends to decide what they should do until the shared memory area is initialized, and they now ultimately fall back to a check on the GUC value in this case, which is the best thing that can be done. Fortunately, SyncRepUpdateSyncStandbysDefined() is called immediately by the checkpointer when this process starts, so the window is very narrow. 
It is possible to enlarge the problematic window by making the checkpointer wait at the beginning of SyncRepUpdateSyncStandbysDefined() with a hardcoded sleep for example, and doing so has shown that a 2PC visibility test is indeed failing. On machines slow enough, this bug would cause spurious failures. In 17~, we have looked at the possibility of adding an injection point to have a reproducible test, but as the problematic window happens at early startup, we would need to invent a way to make an injection point optionally persistent across restarts when attached, something that would be fine for this case as it would involve the checkpointer. This issue is quite old, and can be reproduced on all the stable branches. Author: Melnikov Maksim <m.melnikov@postgrespro.ru> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/163fcbec-900b-4b07-beaa-d2ead8634bec@postgrespro.ru Backpatch-through: 13	2025-04-11 10:00:21 +09:00
David Rowley	530050d8d2	Add code comment explaining ins_since_vacuum and aborted inserts Sami complained that there's a discrepancy between n_mod_since_analyze and n_ins_since_vacuum, as the former only accounts for committed changes and the latter tracks committed and aborted inserts. Nobody seemed overly concerned that this would cause any concerning issues. The repercussions, from what I can tell, are limited to causing an autovacuum to trigger for inserts sooner than it otherwise might. For typical ratios of commits to aborts, it's unlikely to ever be noticed. Fixing things to make it so n_ins_since_vacuum only displays committed inserts would require an additional field in PgStat_TableCounts, which does not quite seem worthwhile at this stage. This commit just adds a comment with some details to mention that we know about it, which will hopefully prevent repeat discussions. Reported-by: Sami Imseih <samimseih@gmail.com> Author: David Rowley <drowleyml@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAApHDvpgV3a-R2EGmPOh0L-x3pHbZpM3y4dySWfy+UqUazwDQA@mail.gmail.com	2025-04-11 11:36:21 +12:00
David Rowley	928394b664	Improve various new-to-v18 appendStringInfo calls Similar to `8461424fd`, here we adjust a few new locations which were not using the most suitable appendStringInfo* function for the intended purpose. Author: David Rowley <drowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvqJnNjueb=Eoj8K+8n0g7nj_AcPWSiCj5RNV4fDejAfqA@mail.gmail.com	2025-04-11 10:07:22 +12:00
Daniel Gustafsson	55ef7abf88	Rename global variable backing DSA area The global variable backing the DSA area for Memory Context stats reporting had a too generic name, rename to be more descriptive. Independently reported by Peter and Laurenz. Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Peter Eisentraut <peter@eisentraut.org> Reported-by: Laurenz Albe <laurenz.albe@cybertec.at> Discussion: https://postgr.es/m/d51172bd4e7f4b07a18a0288ca1b1c28a71a5f6a.camel@cybertec.at Discussion: https://postgr.es/m/25095db5-b595-4b85-9100-d358907c25b5@eisentraut.org	2025-04-10 22:40:27 +02:00
Tom Lane	f27eb0325b	Remove useless check for negative result of ip_addrsize(). By inspection, ip_addrsize() can't return a negative result. (If it could, we'd have way bigger problems elsewhere.) So delete useless check in network_send(). Most C compilers are probably perfectly capable of removing this code by themselves, but it's confusing/misleading. Bug: #18889 Reported-by: Daniel Elishakov <dan-eli@mail.ru> Discussion: https://postgr.es/m/18889-73d4f19e953a629e@postgresql.org	2025-04-10 14:18:07 -04:00
Amit Kapila	4909b38af0	Fix data loss in logical replication. Data loss can happen when the DDLs like ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ... that don't take a strong lock on table happens concurrently to DMLs on the tables involved in the DDL. This happens because logical decoding doesn't distribute invalidations to concurrent transactions and those transactions use stale cache data to decode the changes. The problem becomes bigger because we keep using the stale cache even after those in-progress transactions are finished and skip the changes required to be sent to the client. This commit fixes the issue by distributing invalidation messages from catalog-modifying transactions to all concurrent in-progress transactions. This allows the necessary rebuild of the catalog cache when decoding new changes after concurrent DDL. We observed performance regression primarily during frequent execution of publication DDL statements that modify the published tables. The regression is minor or nearly nonexistent for DDLs that do not affect the published tables or occur infrequently, making this a worthwhile cost to resolve a longstanding data loss issue. An alternative approach considered was to take a strong lock on each affected table during publication modification. However, this would only address issues related to publication DDLs (but not the ALTER TYPE ...) and require locking every relation in the database for publications created as FOR ALL TABLES, which is impractical. The bug exists in all supported branches, but we are backpatching till 14. The fix for 13 requires somewhat bigger changes than this fix, so the fix for that branch is still under discussion. 
Reported-by: hubert depesz lubaczewski <depesz@depesz.com> Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Tested-by: Benoit Lobréau <benoit.lobreau@dalibo.com> Backpatch-through: 14 Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com Discussion: https://postgr.es/m/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com	2025-04-10 13:14:40 +05:30
David Rowley	d7c04db27a	Update wording in optimizer/README for EquivalenceClasses `d69d45a5a` changed how em_is_child members are stored in EquivalenceClasses. Children are no longer stored in the ec_members list. optimizer/README mentioned that most operations "should ignore child members", but that felt a little untrue now since child members are now stored in a separate place, they simply won't be found by the normal means of looking (a foreach loop over ec_members), and if you don't find them, there's technically no need to "ignore" them. Here we tweak the wording slightly to reflect the new storage location for child members. Reported-by: Amit Langote <amitlangote09@gmail.com> Author: Amit Langote <amitlangote09@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CA+HiwqE8v=EuAP_3F_A2xn8zWx+nG_etW_Fe_DvKO-Fkx=+DdQ@mail.gmail.com	2025-04-10 17:33:58 +12:00
Tomas Vondra	3887d0cfeb	Cleanup of pg_numa.c This moves/renames some of the functions defined in pg_numa.c: * pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and moved to src/backend/storage/ipc/shmem.c. The new name better reflects that the page size is not related to NUMA, and it's specifically about the page size used for the main shared memory segment. * move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into the backend (which is more appropriate for functions callable from SQL). While at it, improve the comment to explain what page size it returns. * remove unnecessary includes from src/port/pg_numa.c, adding unnecessary dependencies (src/port should be suitable for frontend). These were either leftovers or unnecessary thanks to the other changes in this commit. This eliminates unnecessary dependencies on backend symbols, which we don't want in src/port. Reported-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com	2025-04-09 21:50:17 +02:00
Heikki Linnakangas	0f1433f053	Fix a few oversights in the longer cancel keys patch Change MyCancelKeyLength's type from uint8 to int. While it always fits in a uint8, plain int is less surprising, as there's no particular reason for it to be uint8. Fix one ProcSignalInit caller that passed 'false' instead of NULL for the pointer argument. Author: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/61be9e31-7b7d-49d5-bc11-721800d89d64@eisentraut.org	2025-04-09 13:11:42 +03:00
Tom Lane	dd496eedea	Doc: note that two examples in optimizer/README are oversimplified. These examples fail to account for join clauses generated by EquivalenceClasses, but since we haven't mentioned EquivalenceClasses yet it seems like it'd just add confusion to make them fully accurate. Instead, parenthetically note that they're oversimplified. Reported-by: Zeyuan Hu <ferrishu3886@gmail.com> Co-authored-by: David Rowley <dgrowleyml@gmail.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACvHWmYFo+60yMqKJajDDvKN5EM41YHrCT3oxukwXmGAqpWvyw@mail.gmail.com	2025-04-08 23:03:33 -04:00
Amit Kapila	12eece5fd5	Fix uninitialized index information access during apply. The issue happens when building conflict information during apply of INSERT or UPDATE operations that violate unique constraints on leaf partitions. The problem was introduced in commit `9ff68679b5`, which removed the redundant calls to ExecOpenIndices/ExecCloseIndices. The previous code was relying on the redundant ExecOpenIndices call in apply_handle_tuple_routing() to build the index information required for unique key conflict detection. The fix is to delay building the index information until a conflict is detected instead of relying on ExecOpenIndices to do the same. The additional benefit of this approach is that it avoids building index information when there is no conflict. Author: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/TYAPR01MB57244ADA33DDA57119B9D26494A62@TYAPR01MB5724.jpnprd01.prod.outlook.com	2025-04-08 15:35:42 +05:30
Thomas Munro	f78ca6f3eb	Introduce file_copy_method setting. It can be set to either COPY (the default) or CLONE if the system supports it. CLONE causes callers of copydir(), currently CREATE DATABASE ... STRATEGY=FILE_COPY and ALTER DATABASE ... SET TABLESPACE = ..., to use copy_file_range (Linux, FreeBSD) or copyfile (macOS) to copy files instead of a read-write loop over the contents. CLONE gives the kernel the opportunity to share block ranges on copy-on-write file systems and push copying down to storage on others, depending on configuration. On some systems CLONE can be used to clone large databases quickly with CREATE DATABASE ... TEMPLATE=source STRATEGY=FILE_COPY. Other operating systems could be supported; patches welcome. Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Ranier Vilela <ranier.vf@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com	2025-04-08 21:35:38 +12:00
Daniel Gustafsson	042a66291b	Add function to get memory context stats for processes This adds a function for retrieving memory context statistics and information from backends as well as auxiliary processes. The intended use case is cluster debugging when under memory pressure or unanticipated memory usage characteristics. When calling the function it sends a signal to the specified process to submit statistics regarding its memory contexts into dynamic shared memory. Each memory context is returned in detail, followed by a cumulative total in case the number of contexts exceeds the max allocated amount of shared memory. Each process is limited to use at most 1MB memory for this. A summary can also be explicitly requested by the user, this will return the TopMemoryContext and a cumulative total of all lower contexts. In order to not block on busy processes the caller specifies the number of seconds during which to retry before timing out. In the case where no statistics are published within the set timeout, the last known statistics are returned, or NULL if no previously published statistics exist. This allows dashboard type queries to continually publish even if the target process is temporarily congested. Context records contain a timestamp to indicate when they were submitted. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Discussion: https://postgr.es/m/CAH2L28v8mc9HDt8QoSJ8TRmKau_8FM_HKS41NeO9-6ZAkuZKXw@mail.gmail.com	2025-04-08 11:06:56 +02:00
Andres Freund	15f0cb26b5	Increase BAS_BULKREAD based on effective_io_concurrency Before, BAS_BULKREAD was always of size 256kB. With the default io_combine_limit of 16, that only allowed 1-2 IOs to be in flight - insufficient even on very low latency storage. We don't just want to increase the size to a much larger hardcoded value, as very large rings (10s of MBs of buffers) appear to have negative performance effects when reading in data that the OS has cached (but not when actually needing to do IO). To address this, increase the size of BAS_BULKREAD to allow for io_combine_limit * effective_io_concurrency buffers getting read in. To prevent the ring being much larger than useful, limit the increased size with GetPinLimit(). The formula outlined above keeps the ring size to sizes for which we have not observed performance regressions, unless very large effective_io_concurrency values are used together with large shared_buffers setting. Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6 Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt	2025-04-08 02:41:03 -04:00
Andres Freund	dcf7e1697b	Add pg_buffercache_evict_{relation,all} functions In addition to the added functions, the pg_buffercache_evict() function now shows whether the buffer was flushed. pg_buffercache_evict_relation(): Evicts all shared buffers in a relation at once. pg_buffercache_evict_all(): Evicts all shared buffers at once. Both functions provide a mechanism to evict multiple shared buffers at once. They are designed to address the inefficiency of repeatedly calling pg_buffercache_evict() for each individual buffer, which can be time-consuming when dealing with large shared buffer pools (e.g., ~477ms vs. ~2576ms for 16GB of fully populated shared buffers). These functions are intended for developer testing and debugging purposes and are available to superusers only. Minimal tests for the new functions are included. Also, there was no test for pg_buffercache_evict(), so a test for it is added too. No new extension version is needed, as it was already increased this release by `ba2a3c2302`. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Aidar Imamov <a.imamov@postgrespro.ru> Reviewed-by: Joseph Koshakow <koshy44@gmail.com> Discussion: https://postgr.es/m/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com	2025-04-08 02:19:32 -04:00
David Rowley	d69d45a5a9	Speedup child EquivalenceMember lookup in planner When planning queries to partitioned tables, we clone all EquivalenceMembers belonging to the partitioned table into em_is_child EquivalenceMembers for each non-pruned partition. For partitioned tables with large numbers of partitions, this meant the ec_members list could become large and code searching that list would become slow. Effectively, the more partitions which were present, the more searches needed to be performed for operations such as find_ec_member_matching_expr() during create_plan() and the more partitions present, the longer these searches would take, i.e., a quadratic slowdown. To fix this, here we adjust how we store EquivalenceMembers for em_is_child members. Instead of storing these directly in ec_members, these are now stored in a new array of Lists in the EquivalenceClass, which is indexed by the relid. When we want to find EquivalenceMembers belonging to a certain child relation, we can narrow the search to the array element for that relation. To make EquivalenceMember lookup easier and to reduce the amount of code change, this commit provides a pair of functions to allow iteration over the EquivalenceMembers of an EC which also handles finding the child members, if required. Callers that never need to look at child members can remain using the foreach loop over ec_members, which will now often be faster due to only parent-level members being stored there. The actual performance increases here are highly dependent on the number of partitions and the query being planned. Performance increases can be visible with as few as 8 partitions, but the speedup is marginal for such low numbers of partitions. The speedups become much more visible with a few dozen to hundreds of partitions. With some tested queries using 56 partitions, the planner was around 3x faster than before. For use cases with thousands of partitions, these are likely to become significantly faster. 
Some testing has shown planner speedups of 60x or more with 8192 partitions. Author: Yuya Watari <watari.yuya@gmail.com> Co-authored-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrey Lepikhov <a.lepikhov@postgrespro.ru> Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Tested-by: Thom Brown <thom@linux.com> Tested-by: newtglobal postgresql_contributors <postgresql_contributors@newtglobalcorp.com> Discussion: https://postgr.es/m/CAJ2pMkZNCgoUKSE%2B_5LthD%2BKbXKvq6h2hQN8Esxpxd%2Bcxmgomg%40mail.gmail.com	2025-04-08 18:09:57 +12:00
Amit Kapila	105b2cb336	Stabilize 035_standby_logical_decoding.pl. Some tests try to invalidate logical slots on the standby server by running VACUUM on the primary. The problem is that xl_running_xacts was getting generated and replayed before the VACUUM command, leading to the advancement of the active slot's catalog_xmin. Due to this, active slots were not getting invalidated, leading to test failures. We fix it by skipping the generation of xl_running_xacts for the required tests with the help of injection points. As the required interface for injection points was not present in back branches, we fixed the failing tests in them by disallowing the slot to become active for the required cases (where rows_removed conflict could be generated). Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 16, where it was introduced Discussion: https://postgr.es/m/Z6oQXc8LmiTLfwLA@ip-10-97-1-34.eu-west-3.compute.internal	2025-04-08 09:38:02 +05:30
Bruce Momjian	46b4ba533c	Fix PG 17 [NOT] NULL optimization bug for domains A PG 17 optimization allowed columns with NOT NULL constraints to skip table scans for IS NULL queries, and to skip IS NOT NULL checks for IS NOT NULL queries. This didn't work for domain types, since domain types don't follow the IS NULL/IS NOT NULL constraint logic. To fix, disable this optimization for domains for PG 17+. Reported-by: Jan Behrens Diagnosed-by: Tom Lane Discussion: https://postgr.es/m/Z37p0paENWWUarj-@momjian.us Backpatch-through: 17	2025-04-07 21:33:42 -04:00
Michael Paquier	039549d70f	Flush the IO statistics of active WAL senders more frequently WAL senders do not flush their statistics until they exit, limiting the monitoring possible for live processes. This is penalizing when WAL senders are running for a long time, like in streaming or logical replication setups, because it is not possible to know the amount of IO they generate while running. This commit makes WAL senders more aggressive with their statistics flush, using an interval of 1 second, with the flush timing calculated based on the existing GetCurrentTimestamp() done before the sleeps done to wait for some activity. Note that the sleep done for logical and physical WAL senders happens in two different code paths, so the stats flushes need to happen in these two places. One test is added for the physical WAL sender case, and one for the logical WAL sender case. This can be done in a stable fashion by relying on the WAL generated by the TAP tests in combination with a stats reset while a server is running, but only on HEAD as WAL data has been added to pg_stat_io in `a051e71e28`. This issue exists since `a9c70b46db` and the introduction of pg_stat_io, so backpatch down to v16. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/Z73IsKBceoVd4t55@ip-10-97-1-34.eu-west-3.compute.internal Backpatch-through: 16	2025-04-08 07:57:19 +09:00
Tomas Vondra	8cc139bec3	Introduce pg_shmem_allocations_numa view Introduce new pg_shmem_allocations_numa view with information about how shared memory is distributed across NUMA nodes. For each shared memory segment, the view returns one row for each NUMA node backing it, with the total amount of memory allocated from that node. The view may be relatively expensive, especially when executed for the first time in a backend, as it has to touch all memory pages to get reliable information about the NUMA node. This may also force allocation of the shared memory. Unlike pg_shmem_allocations, the view does not show anonymous shared memory allocations. It also does not show memory allocated using the dynamic shared memory infrastructure. Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com	2025-04-07 23:08:17 +02:00
Tomas Vondra	65c298f61f	Add support for basic NUMA awareness Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c portability wrapper and an optional build dependency, enabled by --with-libnuma configure option. For now this is Linux-only, other platforms may be supported later. A built-in SQL function pg_numa_available() allows checking NUMA support, i.e. that the server was built/linked with the NUMA library. The main function introduced is pg_numa_query_pages(), which allows determining the NUMA node for individual memory pages. Internally the function uses move_pages(2) syscall, as it allows batching, and is more efficient than get_mempolicy(2). Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com	2025-04-07 23:08:17 +02:00
Andres Freund	8e293e689b	aio: Make AIO more compatible with valgrind In some edge cases valgrind flags issues with the memory referenced by IOs. All of the cases addressed in this change are false positives. Most of the false positives are caused by UnpinBuffer[NoOwner] marking buffer data as inaccessible. This happens even though the AIO subsystem still holds a pin. That's good, there shouldn't be accesses to the buffer outside of AIO related code until it is pinned by "user" code again. But it requires some explicit work - if the buffer is not pinned by the current backend, we need to explicitly mark the buffer data accessible/inaccessible while executing completion callbacks. That however causes a cascading issue in IO workers: After the completion callbacks for a buffer are executed, the page is marked as inaccessible. If subsequently the same worker is executing IO targeting the same buffer, we would get an error, as the memory is still marked inaccessible. To avoid that, we need to explicitly mark the memory as accessible in IO workers. Another issue is that IO executed in workers or via io_uring will not mark memory as DEFINED. In the case of workers that is because valgrind does not track memory definedness across processes. For io_uring that is because valgrind does not understand io_uring, and therefore its IOs never mark memory as defined, whether the completions are processed in the defining process or in another context. It's not entirely clear how to best solve that. The current user of AIO is not affected, as it explicitly marks buffers as DEFINED & NOACCESS anyway. Defer solving this issue until we have a user with different needs. Per buildfarm animal skink. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6	2025-04-07 15:20:30 -04:00
Andres Freund	8ab4241b9f	localbuf: Add Valgrind buffer access instrumentation This mirrors `1e0dfd166b` (+ `46ef520b95`), for temporary table buffers. This is mainly interesting right now because the AIO work currently triggers spurious valgrind errors, and the fix for that is cleaner if temp buffers behave the same as shared buffers. This requires one change beyond the annotations themselves, namely to pin local buffers while writing them out in FlushRelationBuffers(). Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6	2025-04-07 15:20:30 -04:00
Tom Lane	b73e6d71a8	Fix erroneous construction of functions' dependencies on transforms. The list of transform objects that a function should use is specified in CREATE FUNCTION's TRANSFORM clause, and then represented indirectly in pg_proc.protrftypes. However, ProcedureCreate completely ignored that for purposes of constructing pg_depend entries, and instead made the function depend on any transforms that exist for its parameter or return data types. This is bad in both directions: the function could be made dependent on a transform it does not actually use, or it could try to use a transform that's since been dropped. (The latter scenario would require use of a transform that's not for any of the parameter or return types, but that seems legit for cases where the function performs SQL operations internally.) To fix, pass in the list of transform objects that CreateFunction identified, and build pg_depend entries from that not from the parameter/return types. This results in changes in the expected test outputs in contrib/bool_plperl, which I guess are due to different ordering of pg_depend entries -- that test case is surely not exercising either of the problem scenarios. This fix is not back-patchable as-is: changing the signature of ProcedureCreate seems too risky in stable branches. We could do something like making ProcedureCreate a wrapper around ProcedureCreateExt or so. However, I'm more inclined to do nothing in the back branches. We had no field complaints up to now, so the hazards don't seem to be a big issue in practice. And we couldn't do anything about existing pg_depend entries, so a back-patched fix would result in a mishmash of dependencies created according to different rules. That cure could be worse than the disease, perhaps. I bumped catversion just to lay down a marker that the expected contents of pg_depend are a bit different than before. 
Reported-by: Chapman Flack <jcflack@acm.org> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/3112950.1743984111@sss.pgh.pa.us	2025-04-07 13:31:37 -04:00
Álvaro Herrera	a379061a22	Allow NOT NULL constraints to be added as NOT VALID This allows them to be added without scanning the table, and validating them afterwards without holding access exclusive lock on the table after any violating rows have been deleted or fixed. Doing ALTER TABLE ... SET NOT NULL for a column that has an invalid not-null constraint validates that constraint. ALTER TABLE .. VALIDATE CONSTRAINT is also supported. There are various checks on whether an invalid constraint is allowed in a child table when the parent table has a valid constraint; this should match what we do for enforced/not enforced constraints. pg_attribute.attnotnull is now only an indicator for whether a not-null constraint exists for the column; whether it's valid or invalid must be queried in pg_constraint. Applications can continue to query pg_attribute.attnotnull as before, but now it's possible that NULL rows are present in the column even when that's set to true. For backend internal purposes, we cache the nullability status in CompactAttribute->attnullability that each tuple descriptor carries (replacing CompactAttribute.attnotnull, which was a mirror of Form_pg_attribute.attnotnull). During the initial tuple descriptor creation, based on the pg_attribute scan, we set this to UNRESTRICTED if pg_attribute.attnotnull is false, or to UNKNOWN if it's true; then we update the latter to VALID or INVALID depending on the pg_constraint scan. This flag is also copied when tupledescs are copied. Comparing tuple descs for equality must also compare the CompactAttribute.attnullability flag and return false in case of a mismatch. pg_dump deals with these constraints by storing the OIDs of invalid not-null constraints in a separate array, and running a query to obtain their properties. The regular table creation SQL omits them entirely. They are then dealt with in the same way as "separate" CHECK constraints, and dumped after the data has been loaded. 
Because no additional pg_dump infrastructure was required, we don't bump its version number. I decided not to bump catversion either, because the old catalog state works perfectly in the new world. (Trying to run with new catalog state and the old server version would likely run into issues, however.) System catalogs do not support invalid not-null constraints (because commit `14e87ffa5c` didn't allow them to have pg_constraint rows anyway.) Author: Rushabh Lathia <rushabh.lathia@gmail.com> Author: Jian He <jian.universality@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Tested-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CAGPqQf0KitkNack4F5CFkFi-9Dqvp29Ro=EpcWt=4_hs-Rt+bQ@mail.gmail.com	2025-04-07 19:19:50 +02:00
Tom Lane	3516ea768c	Add local-address escape "%L" to log_line_prefix. This escape shows the numeric server IP address that the client has connected to. Unix-socket connections will show "[local]". Non-client processes (e.g. background processes) will show "[none]". We expect that this option will be of interest to only a fairly small number of users. Therefore the implementation is optimized for the case where it's not used (that is, we don't do the string conversion until we have to), and we've not added the field to csvlog or jsonlog formats. Author: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Cary Huang <cary.huang@highgo.ca> Reviewed-by: David Steele <david@pgmasters.net> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAKAnmmK-U+UicE-qbNU23K--Q5XTLdM6bj+gbkZBZkjyjrd3Ow@mail.gmail.com	2025-04-07 11:06:05 -04:00
Andres Freund	8ce79483dc	read_stream: Fix overflow hazard with large shared buffers If the limit returned by GetAdditionalPinLimit() is large, the buffer_limit variable in read_stream_start_pending_read() can overflow. While the code is careful to limit buffer_limit to PG_INT16_MAX, we subsequently add the number of forwarded buffers. The overflow can lead to assertion failures, crashes or wrong query results when using large shared buffers. It seems easier to avoid this if we make the buffer_limit variable an int, instead of an int16. Do so, and clamp buffer_limit after adding the number of forwarded buffers. It's possible we might want to address this and related issues more widely by changing to int instead of int16, but since the consequences of this bug can be confusing, it seems better to fix it now. This bug was introduced in `ed0b87caac`. Discussion: https://postgr.es/m/ewvz3cbtlhrwqk7h6ca6cctiqh7r64ol3pzb3iyjycn2r5nxk5@tnhw3a5zatlr	2025-04-07 09:45:00 -04:00
Alexander Korotkov	717d0e8dd9	Remove GUC_NOT_IN_SAMPLE from enable_self_join_elimination `fc069a3a63` implements Self-Join Elimination (SJE) and provides a new GUC variable: enable_self_join_elimination. This new GUC variable was marked as GUC_NOT_IN_SAMPLE. However, enable_self_join_elimination is documented and is not different from any other enable_* GUCs. Thus, remove GUC_NOT_IN_SAMPLE from it and add it to the postgresql.conf.sample. Discussion: https://postgr.es/m/CAPpHfdsqMTEsmxk3aQwt6xPz%2BKpUELO%3D6fzmER9ZRGrbs4uMfA%40mail.gmail.com Author: Tender Wang <tndrwang@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>	2025-04-07 16:28:54 +03:00
Michael Paquier	c36eda2591	Clarify comment for worst-case allocation in quote_literal_cstr() palloc() is invoked with a specific formula for its allocation size in quote_literal_cstr(). This wastes some memory, but the size is large enough to cover even the worst-case scenarios. No explanations were given about the reasons behind these numbers. This commit adds more documentation about all that. Author: Steve Chavez <steve@supabase.io> Discussion: https://postgr.es/m/CAGRrpzZ9bToRWS+fAnjxDJrxwZN1QcJ-y1Pn2yg=Hst6rydLtw@mail.gmail.com	2025-04-07 10:02:12 +09:00
Michael Paquier	3191a593d6	Fix use-after-free in pgstat_fetch_stat_backend_by_pid() stats_fetch_consistency set to "snapshot" causes the backend entry "beentry" retrieved by pgstat_get_beentry_by_proc_number() to be reset at the beginning of pgstat_fetch_stat_backend() when fetching the backend pgstats entry. As coded, "beentry" was being accessed after being freed. This commit moves all the accesses to "beentry" to happen before calling pgstat_fetch_stat_backend(), fixing the problem. This problem could be reached by calling the SQL functions pg_stat_get_backend_io() or pg_stat_get_backend_wal(). Issue caught by valgrind. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/f1788cc0-253a-4a3a-aee0-1b8ab9538736@gmail.com	2025-04-07 09:51:40 +09:00
Fujii Masao	173c97812f	Use XLOG_CONTROL_FILE macro consistently for control file name. The XLOG_CONTROL_FILE macro (defined in access/xlog_internal.h) represents the control file name. While some parts of the codebase already use this macro, others previously hardcoded the file name as a string. This commit replaces those hardcoded strings with the macro, ensuring consistent usage throughout the code. This makes future maintenance easier and improves searchability, for example when grepping for control file usage. Author: Anton A. Melnikov <a.melnikov@postgrespro.ru> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Masao Fujii <masao.fujii@gmail.com> Discussion: https://postgr.es/m/0841ec77-47e5-452a-adb4-c6fa55d605fc@postgrespro.ru	2025-04-07 09:27:33 +09:00
Andres Freund	57dec20fd4	aio: Avoid spurious coverity warning PgAioResult.result is never accessed in the relevant path, but coverity complains about an uninitialized access anyway. So just zero-initialize the whole thing. While at it, reduce the scope of the variable. Reported-by: Ranier Vilela <ranier.vf@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/CAEudQApsKqd-s+fsUQ0OmxJAMHmBSXxrAz3dCs+uvqb3iRtjSw@mail.gmail.com	2025-04-06 12:07:02 -04:00
Tom Lane	2e4ccf1b45	Use "(void)" to mark pgstat_lock_entry(..., false) calls. This should silence Coverity's complaints about the result being sometimes ignored. I'm inclined to think that these routines are simply misdesigned, because sometimes it's okay to ignore the result and sometimes it isn't, and we have no way to enforce the latter. But for now I just added a comment.	2025-04-06 11:37:09 -04:00
Peter Eisentraut	a8025f5448	Relax ordering-related hardcoded btree requirements in planning There were several places in ordering-related planning where a requirement for btree was hardcoded but an amcanorder index could suffice. This fixes that. We just need to do the necessary mapping between strategy numbers and compare types and adjust some related APIs so that this works independent of btree strategy numbers. For instance, non-btree amcanorder indexes can now be used to support sorting and merge joins. Also, predtest.c works independent of btree strategy numbers now. To avoid performance regressions, some details on btree and other built-in index types are still hardcoded as shortcuts, but other index types now have access to the same features by providing the required flags and callbacks. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-04-06 14:43:51 +02:00
Alexander Korotkov	3a1a7c5a70	Revert "Put enable_self_join_elimination into postgresql.conf.sample" This reverts commit `c2d329260c`. Reported-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/D292EB44-806E-439A-82A4-491A1BA59E7A%40yesql.se	2025-04-06 14:30:20 +03:00
Alexander Korotkov	c2d329260c	Put enable_self_join_elimination into postgresql.conf.sample `fc069a3a63` implements Self-Join Elimination (SJE) and provides a new GUC variable: enable_self_join_elimination. This commit adds enable_self_join_elimination to the postgresql.conf.sample, as it was forgotten in the original commit. Discussion: https://postgr.es/m/CAHewXN%3D%2Bghd6O6im46q7j2u6c3H6vkXtXmF%3D_v4CfGSnjje8PA%40mail.gmail.com Author: Tender Wang <tndrwang@gmail.com>	2025-04-06 13:24:16 +03:00
Tom Lane	691836405f	Fix parse_cte.c's failure to examine sub-WITHs in DML statements. makeDependencyGraphWalker thought that only SelectStmt nodes could contain a WithClause. Which was true in our original implementation of WITH, but astonishingly we missed updating this code when we added the ability to attach WITH to INSERT/UPDATE/DELETE (and later MERGE). Moreover, since it was coded to deliberately block recursion to a WithClause, even updating raw_expression_tree_walker didn't save it. The upshot of this was that we didn't see references to outer CTE names appearing within an inner WITH, and would neither complain about disallowed recursion nor account for such references when sorting CTEs into a usable order. The lack of complaints about this is perhaps not so surprising, because typical usage of WITH wouldn't hit either case. Still, it's pretty broken; failing to detect recursion here leads to assert failures or worse later on. Fix by factoring out the processing of sub-WITHs into a new function WalkInnerWith, and invoking that for all the statement types that can have WITH. Bug: #18878 Reported-by: Yu Liang <luy70@psu.edu> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18878-a26fa5ab6be2f2cf@postgresql.org Backpatch-through: 13	2025-04-05 15:01:48 -04:00
Tom Lane	e33f2335a9	Avoid double transformation of json_array()'s subquery. transformJsonArrayQueryConstructor() applied transformStmt() to the same subquery tree twice. While this causes no issue in many cases, there are some where it causes a coredump, thanks to the parser's habit of scribbling on its input. Fix by making a copy before the first transformation (compare `0f43083d1`). This is quite brute-force, but then so is the whole business of transforming the input twice. Per discussion in the bug thread, this implementation of json_array() parsing should be replaced completely. But that will take some work and will surely not be back-patchable, so for the moment let's take the easy way out. Oversight in `7081ac46a`. Back-patch to v16 where that came in. Bug: #18877 Reported-by: Yu Liang <luy70@psu.edu> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18877-c3c3ad75845833bb@postgresql.org Backpatch-through: 16	2025-04-05 12:13:35 -04:00
Tom Lane	43b8e6c4ab	Repair misbehavior with duplicate entries in FK SET column lists. Since v15 we've had an option to apply a foreign key constraint's ON DELETE SET DEFAULT or SET NULL action to just some of the referencing columns. There was not a check for duplicate entries in the list of columns-to-set, though. That caused a potential memory stomp in CreateConstraintEntry(), which incautiously assumed that the list of columns-to-set couldn't be longer than the number of key columns. Even after fixing that, the case doesn't work because you get an error like "multiple assignments to same column" from the SQL command that is generated to do the update. We could either raise an error for duplicate columns or silently suppress the dups, and after a bit of thought I chose to do the latter. This is motivated by the fact that duplicates in the FK column list are legal, so it's not real clear why duplicates in the columns-to-set list shouldn't be. Of course there's no need to actually set the column more than once. I left in the fix in CreateConstraintEntry() too, just because it didn't seem like such low-level code ought to be making assumptions about what it's handed. Bug: #18879 Reported-by: Yu Liang <luy70@psu.edu> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18879-259fc59d072bd4d7@postgresql.org Backpatch-through: 15	2025-04-04 20:11:48 -04:00
Tom Lane	0f43083d16	functions.c: copy trees from source_list before parse analysis etc. This is yet another bit of fallout from the fact that backend/parser (like other code) feels free to scribble on the parse tree it's handed. In this case that resulted in modifying the relatively-short-lived copy in the cached function's source_list. That would be fine since we only need each source_list tree once ... except that if the parser fails after making some changes, the function cache entry remains as-is and will still be there if the user tries to execute the function again. Then we have problems because we're feeding a non-pristine tree to the parser. The most expedient fix is a quick copyObject(). I considered other answers like somehow marking the cache entry invalid temporarily, but that would add complexity and I'm not sure it's worth it. In typical scenarios we'd only do this once per function query per session. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/6d442183-102c-498a-81d1-eeeb086cdc5a@gmail.com	2025-04-04 18:26:51 -04:00
Peter Geoghegan	b3f1a13f22	Avoid extra index searches through preprocessing. Transform low_compare and high_compare nbtree skip array inequalities (with opclasses that offer skip support) in such a way as to allow _bt_first to consistently apply later keys when it descends the tree. This can lower the number of index searches for multi-column scans that use a ">" key on one of the index's prefix columns (or use a "<" key, when scanning backwards) when it precedes some later lower-order key. For example, an index qual "WHERE a > 5 AND b = 2" will now be converted to "WHERE a >= 6 AND b = 2" by a new preprocessing step that takes place after low_compare and high_compare have been finalized. That way, the initial call to _bt_first can use "WHERE a >= 6 AND b = 2" to find an initial position, rather than just using "WHERE a > 5" -- "b = 2" can be applied during every _bt_first call. There's a decent chance that this will allow such a scan to avoid the extra search that might otherwise be needed to determine the lowest "a" value still satisfying "WHERE a > 5". The transformation process can only lower the total number of index pages read when the use of a more restrictive set of initial positioning keys in _bt_first actually allows the scan to land on some later leaf page directly, relative to the unoptimized case (or on an earlier leaf page directly, when scanning backwards). But the savings can really add up in cases where an affected skip array comes after some other array. For example, a scan indexqual "WHERE x IN (1, 2, 3) AND y > 5 AND z = 2" can save as many as 3 _bt_first calls by applying the new transformation to its "y" array (up to 1 extra search can be avoided per "x" element). Follow-up to commit `92fe23d9`, which added nbtree skip scan. 
Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=FJ78K3WsF3iWNxWnUCY9f=Jdg3QPxaXE=uYUbmuRz5Q@mail.gmail.com	2025-04-04 14:14:08 -04:00
Peter Geoghegan	21a152b37f	Improve nbtree skip scan primitive scan scheduling. Don't allow nbtree scans with skip arrays to end any primitive scan on its first leaf page without giving some consideration to how many times the scan's arrays advanced while changing at least one skip array (though continue not caring about the number of array advancements that only affected SAOP arrays, even during skip scans with SAOP arrays). Now when a scan performs more than 3 such array advancements in the course of reading a single leaf page, it is taken as a signal that the next page is unlikely to be skippable. We'll therefore continue the ongoing primitive index scan, at least until we can perform a recheck against the next page's finaltup. Testing has shown that this new heuristic occasionally makes all the difference with skip scans that were expected to rely on the "passed first page" heuristic added by commit `9a2e2a28`. Without it, there is a remaining risk that certain kinds of skip scans will never quite manage to clear the initial hurdle of performing a primitive scan that lasts beyond its first leaf page (or that such a skip scan will only clear that initial hurdle when it has already wasted noticeably-many cycles due to inefficient primitive scan scheduling). Follow-up to commits `92fe23d9` and `9a2e2a28`. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=RVdG3zWytFWBsyW7fWH7zveFvTHed5JKEsuTT0RCO_A@mail.gmail.com	2025-04-04 13:58:05 -04:00
Peter Geoghegan	8a510275dd	Further optimize nbtree search scan key comparisons. Postgres 17 commit `e0b1ee17` added two complementary optimizations to nbtree: the "prechecked" and "firstmatch" optimizations. _bt_readpage was made to avoid needlessly evaluating keys that are guaranteed to be satisfied by applying page-level context. "prechecked" did this for keys required in the current scan direction, while "firstmatch" did it for keys required in the opposite-to-scan direction only. The "prechecked" design had a number of notable issues. It didn't account for the fact that an = array scan key's sk_argument field might need to advance at the point of the page precheck (it didn't check the precheck tuple against the key's array, only the key's sk_argument, which needlessly made it ineffective in cases involving stepping to a page having advanced the scan's arrays using a truncated high key). "prechecked" was also completely ineffective when only one scan key wasn't guaranteed to be satisfied by every tuple (it didn't recognize that it was still safe to avoid evaluating other, earlier keys). The "firstmatch" optimization had similar limitations. It could only be applied after _bt_readpage found its first matching tuple, regardless of why any earlier tuples failed to satisfy the scan's index quals. This allowed unsatisfied non-required scan keys to impede the optimization. Replace both optimizations with a new optimization, without any of these limitations: the "startikey" optimization. Affected _bt_readpage calls generate a page-level key offset ("startikey"), that their _bt_checkkeys calls can then start at. This is an offset to the first key that isn't known to be satisfied by every tuple on the page. Although this is independently useful work, its main goal is to avoid performance regressions with index scans that use skip arrays, but still never manage to skip over irrelevant leaf pages. 
We must avoid wasting CPU cycles on overly granular skip array maintenance in these cases. The new "startikey" optimization helps with this by selectively disabling array maintenance for the duration of a _bt_readpage call. This has no lasting consequences for the scan's array keys (they'll still reliably track the scan's progress through the index's key space whenever the scan is "between pages"). Skip scan adds skip arrays during preprocessing using simple, static rules, and decides how best to navigate/apply the scan's skip arrays dynamically, at runtime. The "startikey" optimization enables this approach. As a result of all this, the planner doesn't need to generate distinct, competing index paths (one path for skip scan, another for an equivalent traditional full index scan). The overall effect is to make scan runtime close to optimal, even when the planner works off an incorrect cardinality estimate. Scans will also perform well given a skipped column with data skew: individual groups of pages with many distinct values (in respect of a skipped column) can be read about as efficiently as before -- without the scan being forced to give up on skipping over other groups of pages that are provably irrelevant. Many scans that cannot possibly skip will still benefit from the use of skip arrays, since they'll allow the "startikey" optimization to be as effective as possible (by allowing preprocessing to mark all the scan's keys as required). A scan that uses a skip array on "a" for a qual "WHERE a BETWEEN 0 AND 1_000_000 AND b = 42" is often much faster now, even when every tuple read by the scan has its own distinct "a" value. However, there are still some remaining regressions, affecting certain trickier cases. Scans whose index quals have several range skip arrays, each on some high cardinality column, can still be slower than they were before the introduction of skip scan -- even with the new "startikey" optimization. 
There are also known regressions affecting very selective index scans that use a skip array. The underlying issue with such selective scans is that they never get as far as reading a second leaf page, and so will never get a chance to consider applying the "startikey" optimization. In principle, all regressions could be avoided by teaching preprocessing to not add skip arrays whenever they aren't expected to help, but it seems best to err on the side of robust performance. Follow-up to commit `92fe23d9`, which added nbtree skip scan. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com	2025-04-04 12:27:52 -04:00
Peter Geoghegan	92fe23d93a	Add nbtree skip scan optimization. Teach nbtree multi-column index scans to opportunistically skip over irrelevant sections of the index given a query with no "=" conditions on one or more prefix index columns. When nbtree is passed input scan keys derived from a predicate "WHERE b = 5", new nbtree preprocessing steps output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys. That is, preprocessing generates a "skip array" (and an output scan key) for the omitted prefix column "a", which makes it safe to mark the scan key on "b" as required to continue the scan. The scan is therefore able to repeatedly reposition itself by applying both the "a" and "b" keys. A skip array has "elements" that are generated procedurally and on demand, but otherwise works just like a regular ScalarArrayOp array. Preprocessing can freely add a skip array before or after any input ScalarArrayOp arrays. Index scans with a skip array decide when and where to reposition the scan using the same approach as any other scan with array keys. This design builds on the design for array advancement and primitive scan scheduling added to Postgres 17 by commit `5bf748b8`. Testing has shown that skip scans of an index with a low cardinality skipped prefix column can be multiple orders of magnitude faster than an equivalent full index scan (or sequential scan). In general, the cardinality of the scan's skipped column(s) limits the number of leaf pages that can be skipped over. The core B-Tree operator classes on most discrete types generate their array elements with the help of their own custom skip support routine. This infrastructure gives nbtree a way to generate the next required array element by incrementing (or decrementing) the current array value. It can reduce the number of index descents in cases where the next possible indexable value frequently turns out to be the next value stored in the index. 
Opclasses that lack a skip support routine fall back on having nbtree "increment" (or "decrement") a skip array's current element by setting the NEXT (or PRIOR) scan key flag, without directly changing the scan key's sk_argument. These sentinel values behave just like any other value from an array -- though they can never locate equal index tuples (they can only locate the next group of index tuples containing the next set of non-sentinel values that the scan's arrays need to advance to). A skip array's range is constrained by "contradictory" inequality keys. For example, a skip array on "x" will only generate the values 1 and 2 given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip array qual usually has near-identical performance characteristics to a comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However, improved performance isn't guaranteed. Much depends on physical index characteristics. B-Tree preprocessing is optimistic about skipping working out: it applies static, generic rules when determining where to generate skip arrays, which assumes that the runtime overhead of maintaining skip arrays will pay for itself -- or lead to only a modest performance loss. As things stand, these assumptions are much too optimistic: skip array maintenance will lead to unacceptable regressions with unsympathetic queries (queries whose scan can't skip over many irrelevant leaf pages). An upcoming commit will address the problems in this area by enhancing _bt_readpage's approach to saving cycles on scan key evaluation, making it work in a way that directly considers the needs of = array keys (particularly = skip array keys). 
Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Aleksander Alekseev <aleksander@timescale.com> Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru> Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com	2025-04-04 12:27:04 -04:00
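The skip scan idea described in this commit message can be illustrated with a toy model. The sketch below is not nbtree code: it assumes the index is a sorted Python list of integer `(a, b)` tuples, uses `bisect` in place of an index descent, and steps to the next prefix by probing for `a + 1`, loosely imitating a skip support routine that increments the current array element. The names `skip_scan` and `probes` are made up for illustration.

```python
from bisect import bisect_left


def skip_scan(index, b_value):
    """Find all (a, b) entries with b == b_value in a sorted list of
    integer (a, b) tuples, without reading every entry.

    Returns (matches, probes), where probes counts the binary searches
    that stand in for index descents in this toy model.
    """
    matches = []
    probes = 0
    pos = 0
    n = len(index)
    while pos < n:
        a = index[pos][0]  # next distinct prefix value present in the index
        # Reposition using both keys: the generated "a" element and "b".
        lo = bisect_left(index, (a, b_value), pos)
        probes += 1
        while lo < n and index[lo] == (a, b_value):
            matches.append(index[lo])
            lo += 1
        # Skip the rest of this prefix group. (a + 1,) sorts before every
        # (a + 1, b) tuple, so this lands on the first entry of the next group.
        pos = bisect_left(index, (a + 1,), lo)
        probes += 1
    return matches, probes


# Low-cardinality prefix column: 3 distinct "a" values, 1000 "b" values each.
index = sorted((a, b) for a in range(3) for b in range(1000))
matches, probes = skip_scan(index, 5)
full_scan = [t for t in index if t[1] == 5]
```

With only three distinct prefix values, the toy scan needs a handful of probes where a full scan would touch all 3000 entries, which matches the commit's point that the cardinality of the skipped column bounds how much can be skipped.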
Nathan Bossart	e1a8b1ad58	Re-pgindent pg_largeobject.c after commit `0d6c477664`.	2025-04-04 09:38:22 -05:00
Alexander Korotkov	c0962a113d	Convert 'x IN (VALUES ...)' to 'x = ANY ...' when appropriate This commit implements the automatic conversion of 'x IN (VALUES ...)' into ScalarArrayOpExpr. That simplifies the query tree, eliminating the appearance of an unnecessary join. Since VALUES describes a relational table, and the value of such a list is a table row, the optimizer will likely face an underestimation problem due to the inability to estimate cardinality through MCV statistics. The cardinality evaluation mechanism can work with the array inclusion check operation. If the array is small enough (< 100 elements), it will perform a statistical evaluation element by element. We perform the transformation in the convert_ANY_sublink_to_join() if VALUES RTE is proper and the transformation is convertible. The conversion is only possible for operations on scalar values, not rows. Also, we currently support the transformation only when it ends up with a constant array. Otherwise, the evaluation of non-hashed SAOP might be slower than the corresponding Hash Join with VALUES. Discussion: https://postgr.es/m/0184212d-1248-4f1f-a42d-f5cb1c1976d2%40tantorlabs.com Author: Alena Rybakina <a.rybakina@postgrespro.ru> Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Ivan Kush <ivan.kush@tantorlabs.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-04-04 16:01:50 +03:00
Alexander Korotkov	d48d2e2dc8	Extract make_SAOP_expr() function from match_orclause_to_indexcol() This commit extracts the code to generate ScalarArrayOpExpr on top of the list of expressions from match_orclause_to_indexcol() into a separate function make_SAOP_expr(). This function was extracted to be used in optimization for conversion of 'x IN (VALUES ...)' to 'x = ANY ...'. make_SAOP_expr() is placed in clauses.c file as only two additional headers were needed there compared with other places. Discussion: https://postgr.es/m/0184212d-1248-4f1f-a42d-f5cb1c1976d2%40tantorlabs.com Author: Alena Rybakina <a.rybakina@postgrespro.ru> Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Ivan Kush <ivan.kush@tantorlabs.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-04-04 16:01:28 +03:00
Peter Eisentraut	ee1ae8b99f	Fix crash/valgrind error Fix for commit `9ef1851685`: We have to skip indexes where sortopfamily is NULL. This takes the place of the previous btree check. Detected by valgrind on the buildfarm.	2025-04-04 14:45:53 +02:00
Heikki Linnakangas	7afca7edef	Relax assertion in finding correct GiST parent Commit `28d3c2ddcf` introduced an assertion that if the memorized downlink location in the insertion stack isn't valid, the parent's LSN should've changed too. Turns out that was too strict. In gistFindCorrectParent(), if we walk right, we update the parent's block number and clear its memorized 'downlinkoffnum'. That triggered the assertion on next call to gistFindCorrectParent(), if the parent needed to be split too. Relax the assertion, so that it's OK if downlinkOffnum is InvalidOffsetNumber. Backpatch to v13-, all supported versions. The assertion was added in commit `28d3c2ddcf` in v12. Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://www.postgresql.org/message-id/18396-03cac9beb2f7aac3@postgresql.org	2025-04-04 13:49:00 +03:00
Fujii Masao	534874fac0	Allow "COPY table TO" command to copy rows from materialized views. Previously, "COPY table TO" command worked only with plain tables and did not support materialized views, even when they were populated and had physical storage. To copy rows from materialized views, "COPY (query) TO" command had to be used, instead. This commit extends "COPY table TO" to support populated materialized views directly, improving usability and performance, as "COPY table TO" is generally faster than "COPY (query) TO". Note that copying from unpopulated materialized views will still result in an error. Author: jian he <jian.universality@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Reviewed-by: Vignesh C <vignesh21@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CACJufxHVxnyRYy67hiPePNCPwVBMzhTQ6FaL9_Te5On9udG=yg@mail.gmail.com	2025-04-04 19:32:00 +09:00
Peter Eisentraut	9ef1851685	Support non-btree indexes in get_actual_variable_range() This was previously not supported because the btree strategy numbers were hardcoded. Now we can support this for any index that has the required strategy mapping support and the required operators. If an index scan used for get_actual_variable_range() requires recheck, we now just ignore it instead of erroring out. With btree we knew this couldn't happen, but now it might. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-04-04 12:21:34 +02:00
Fujii Masao	0d6c477664	Extend ALTER DEFAULT PRIVILEGES to define default privileges for large objects. Previously, ALTER DEFAULT PRIVILEGES did not support large objects. This meant that to grant privileges to users other than the owner, permissions had to be manually assigned each time a large object was created, which was inconvenient. This commit extends ALTER DEFAULT PRIVILEGES to allow defining default access privileges for large objects. With this change, specified privileges will automatically apply to newly created large objects, making privilege management more efficient. As a side effect, this commit introduces the new keyword OBJECTS since it's used in the syntax of ALTER DEFAULT PRIVILEGES. Original patch by Haruka Takatsuka, with some fixes and tests by Yugo Nagata, and rebased by Laurenz Albe. Author: Takatsuka Haruka <harukat@sraoss.co.jp> Co-authored-by: Yugo Nagata <nagata@sraoss.co.jp> Co-authored-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Masao Fujii <masao.fujii@gmail.com> Discussion: https://postgr.es/m/20240424115242.236b499b2bed5b7a27f7a418@sraoss.co.jp	2025-04-04 19:02:17 +09:00
Heikki Linnakangas	6e9c81836e	Use standard die() signal handler in walreceiver This gets rid of the bespoke ProcessWalRcvInterrupts() function, which lets walreceiver terminate at any CHECK_FOR_INTERRUPTS() call. And it's less code anyway. We can now use the standard libpqsrv_connect_params() libpq wrapper from libpq-be-fe-helpers.h, removing more code. We attempted to do that earlier already in commit `728f86fec6`, but that was reverted because it didn't call ProcessWalRcvInterrupts() and therefore didn't react to shutdown requests. Now that ProcessWalRcvInterrupts() is gone, it works. As stated in that commit, this also leads to libpqwalreceiver reserving file descriptors for libpq connections, which is nice. Author: Andres Freund <andres@anarazel.de> (the earlier commit) Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Yura Sokolov <y.sokolov@postgrespro.ru>	2025-04-04 12:38:32 +03:00
Peter Eisentraut	8123e91f5a	Convert PathKey to use CompareType Change the PathKey struct to use CompareType to record the sort direction instead of hardcoding btree strategy numbers. The CompareType is then converted to the index-type-specific strategy when the plan is created. This reduces the number of places btree strategy numbers are hardcoded, and it's a self-contained subset of a larger effort to allow non-btree indexes to behave like btrees. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-04-04 11:22:20 +02:00
Tomas Vondra	1aff1dc8df	Revert "Improve accounting for memory used by shared hash tables" This reverts commit `f5930f9a98`. This broke the expansion of private hash tables, which reallocates the directory. But that's impossible when it's allocated together with the other fields, and dir_realloc() failed with BogusFree. Clearly, this needs rethinking. Discussion: https://postgr.es/m/CAApHDvriCiNkm=v521AP6PKPfyWkJ++jqZ9eqX4cXnhxLv8w-A@mail.gmail.com	2025-04-04 04:43:50 +02:00
Amit Langote	88f55bc976	Make derived clause lookup in EquivalenceClass more efficient Derived clauses are stored in ec_derives, a List of RestrictInfos. These clauses are later looked up by matching the left and right EquivalenceMembers along with the clause's parent EC. This linear search becomes expensive in queries with many joins or partitions, where ec_derives may contain thousands of entries. In particular, create_join_clause() can spend significant time scanning this list. To improve performance, introduce a hash table (ec_derives_hash) that is built when the list reaches 32 entries -- the same threshold used for join_rel_hash. The original list is retained alongside the hash table to support EC merging and serialization (_outEquivalenceClass()). Each clause is stored in the hash table using a canonicalized key: the EquivalenceMember with the lower memory address is placed in the key before the one with the higher memory address. This avoids storing or searching for both permutations of the same clause. For clauses involving a constant EM, the key places NULL in the first slot and the non-constant EM in the second. The hash table is initialized using list_length(ec_derives_list) as the size hint. simplehash internally adjusts this to the next power of two after dividing by the fillfactor, so this typically results in at least 64 buckets near the threshold -- avoiding immediate resizing while adapting to the actual number of entries. The lookup logic for derived clauses is now centralized in ec_search_derived_clause_for_ems(), which consults the hash table when available and falls back to the list otherwise. The new ec_clear_derived_clauses() always frees ec_derives_list, even though some of the original code paths that cleared the old ec_derives field did not. This ensures consistent cleanup and avoids leaking memory when large lists are discarded. 
An assertion originally placed in find_derived_clause_for_ec_member() is moved into ec_search_derived_clause_for_ems() so that it is enforced consistently, regardless of whether the hash table or list is used for lookup. This design incorporates suggestions by David Rowley, who proposed both the key canonicalization and the initial sizing approach to balance memory usage and CPU efficiency. Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> Tested-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Tested-by: Amit Langote <amitlangote09@gmail.com> Tested-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAExHW5vZiQtWU6moszLP5iZ8gLX_ZAUbgEX0DxGLx9PGWCtqUg@mail.gmail.com	2025-04-04 10:45:05 +09:00
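The lookup scheme this commit describes (a list that gains a hash table once it reaches 32 entries, keyed by a canonicalized pair of members) can be sketched as follows. This is an illustrative Python model, not the planner's C code: `EM`, `DerivedClauses`, and `canonical_key` are hypothetical names, `id()` stands in for a memory address, and the parent EC that the real lookup also matches on is left out for brevity.

```python
HASH_THRESHOLD = 32  # threshold stated in the commit message


class EM:
    """Hypothetical stand-in for an EquivalenceMember."""

    def __init__(self, name, is_const=False):
        self.name = name
        self.is_const = is_const


def canonical_key(em1, em2):
    """Order the pair so both permutations map to the same key.

    A constant member is represented by None in the first slot, with the
    non-constant member in the second, as the commit message describes.
    """
    if em1.is_const:
        return (None, id(em2))
    if em2.is_const:
        return (None, id(em1))
    a, b = id(em1), id(em2)
    return (a, b) if a < b else (b, a)


class DerivedClauses:
    def __init__(self):
        self.clauses = []   # the list is always retained
        self.by_key = None  # built lazily once the list is long enough

    def add(self, left, right, clause):
        self.clauses.append((left, right, clause))
        if self.by_key is not None:
            self.by_key[canonical_key(left, right)] = clause
        elif len(self.clauses) >= HASH_THRESHOLD:
            self.by_key = {canonical_key(l, r): c for l, r, c in self.clauses}

    def find(self, em1, em2):
        key = canonical_key(em1, em2)
        if self.by_key is not None:
            return self.by_key.get(key)   # hash lookup when available
        for l, r, c in self.clauses:      # otherwise fall back to the list
            if canonical_key(l, r) == key:
                return c
        return None


members = [EM(f"col{i}") for i in range(41)]
derived = DerivedClauses()
for i in range(40):
    derived.add(members[i], members[i + 1], f"col{i} = col{i + 1}")
```

Because the key is canonicalized, `derived.find(members[3], members[4])` and `derived.find(members[4], members[3])` resolve to the same entry, so only one permutation ever needs to be stored.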
Amit Langote	887160d1be	Add assertion to verify derived clause has constant RHS find_derived_clause_for_ec_member() searches for a previously-derived clause that equates a non-constant EquivalenceMember to a constant. It is only called for EquivalenceClasses with ec_has_const set, and with a non-constant member as the EquivalenceMember to search for. The matched clause is expected to have the non-constant member on the left-hand side and the constant EquivalenceMember on the right. Assert that the RHS is indeed a constant, to catch violations of this structure and enforce assumptions made by generate_base_implied_equalities_const(). Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Discussion: https://postgr.es/m/CAExHW5scMxyFRqOFE6ODmBiW2rnVBEmeEcA-p4W_CyuEikURdA@mail.gmail.com	2025-04-04 10:45:05 +09:00
Melanie Plageman	67be093562	Use AIO batchmode for bitmap heap scans Previously bitmap heap scan was not AIO batchmode safe because of the visibility map reads potentially done for the "skip fetch" optimization (which skipped fetching tuples from the heap if the pages were all visible and none of the columns were used in the query). The skip fetch optimization implementation was found to have bugs and was removed in `459e7bf8e2`, so we can safely enable batchmode for bitmap heap scans.	2025-04-03 18:23:02 -04:00
Melanie Plageman	54a3615f15	Remove misleading read stream asserts in a few users Several read stream users asserted that the read stream was exhausted after looping on that very condition. It was pointed out in a review of an as-of-yet uncommitted read stream user [1] that this was confusing and could lead the reader to think there was a possibility of some kind of race condition. Remove these asserts. [1] https://postgr.es/m/F9ACE8D0-B807-4A17-B6BD-87EF0717983D%40yesql.se	2025-04-03 18:22:37 -04:00
Tom Lane	dbd437e670	Fix oversight in commit `0dca5d68d`. As coded, fmgr_sql() would get an assertion failure for a SQL function that has an empty body and is declared to return some type other than VOID. Typically you'd never get that far because fmgr_sql_validator() would reject such a definition (I suspect that's how come I managed to miss the bug). But if check_function_bodies is off or the function is polymorphic, the validation check wouldn't get made. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/0fde377a-3870-4d18-946a-ce008ee5bb88@gmail.com	2025-04-03 16:03:12 -04:00
Masahiko Sawada	fd09c1316b	Restrict copying of invalidated replication slots. Previously, invalidated logical and physical replication slots could be copied using the pg_copy_logical_replication_slot and pg_copy_physical_replication_slot functions. Replication slots that were invalidated for reasons other than WAL removal retained their restart_lsn. This meant that a new slot copied from an invalidated slot could have a restart_lsn pointing to a WAL segment that might have already been removed. This commit restricts the copying of invalidated replication slots. Backpatch to v16, where slots could retain their restart_lsn when invalidated for reasons other than WAL removal. For v15 and earlier, this check is not required since slots can only be invalidated due to WAL removal, and existing checks already handle this issue. Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CANhcyEU65aH0VYnLiu%3DOhNNxhnhNhwcXBeT-jvRe1OiJTo_Ayg%40mail.gmail.com Backpatch-through: 16	2025-04-03 10:30:00 -07:00
Richard Guo	ea5d3f5233	Remove duplicated comment in get_relation_constraints The check for non-inheritable constraints is performed later, and the same comment is included at that point. While we're here, remove one extraneous blank line. Author: jian he <jian.universality@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CACJufxETi6x86S8EkH8mRfOcm2AenoE9t1pyCFVMpU34gVhF3w@mail.gmail.com	2025-04-03 16:43:53 +09:00
Amit Kapila	4868c96bc8	Fix slot synchronization for two_phase enabled slots. The issue is that the transactions prepared before two-phase decoding is enabled can fail to replicate to the subscriber after being committed on a promoted standby following a failover. This is because the two_phase_at field of a slot, which tracks the LSN from which two-phase decoding starts, is not synchronized to standby servers. Without two_phase_at, the logical decoding might incorrectly identify prepared transactions as already replicated to the subscriber after promotion of standby server, causing them to be skipped. To address the issue on HEAD, the two_phase_at field of the slot is exposed by the pg_replication_slots view and allows the slot synchronization to copy this value to the corresponding synced slot on the standby server. This bug is likely to occur if the user toggles the two_phase option to true after initial slot creation. Given that altering the two_phase option of a replication slot is not allowed in PostgreSQL 17, this bug is less likely to occur. We can't change the view/function definition in backbranch so we can't push the same fix but we are brainstorming an appropriate solution for PG17. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/TYAPR01MB5724CC7C288535BBCEEE65DA94A72@TYAPR01MB5724.jpnprd01.prod.outlook.com	2025-04-03 12:26:54 +05:30
Tom Lane	a7187c3723	Remove unnecessary type violation in tsvectorrecv(). compareentry() is declared to work on WordEntryIN structs, but tsvectorrecv() is using it in two places to work on WordEntry structs. This is almost okay, since WordEntry is the first field of WordEntryIN. But on machines with 8-byte pointers, WordEntryIN will have a larger alignment spec than WordEntry, and it's at least theoretically possible that the compiler could generate code that depends on the larger alignment. Given the lack of field reports, this may be just a hypothetical bug that upsets nothing except sanitizer tools. Or it may be real on certain hardware but nobody's tried to use tsvectorrecv() on such hardware. In any case we should fix it, and the fix is trivial: just change compareentry() so that it works on WordEntry without any mention of WordEntryIN. We can also get rid of the quite-useless intermediate function WordEntryCMP. Bug: #18875 Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18875-07a29c49c825a608@postgresql.org Backpatch-through: 13	2025-04-02 16:17:43 -04:00
Andres Freund	459e7bf8e2	Remove HeapBitmapScan's skip_fetch optimization The optimization does not take the removal of TIDs by a concurrent vacuum into account. The concurrent vacuum can remove dead TIDs and make pages ALL_VISIBLE while those dead TIDs are referenced in the bitmap. This can lead to a skip_fetch scan returning too many tuples. It likely would be possible to implement this optimization safely, but we don't have the necessary infrastructure in place. Nor is it clear that it's worth building that infrastructure, given how limited the skip_fetch optimization is. In the backbranches we just disable the optimization by always passing need_tuples=true to table_beginscan_bm(). We can't perform API/ABI changes in the backbranches and we want to make the change as minimal as possible. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reported-By: Konstantin Knizhnik <knizhnik@garret.ru> Discussion: https://postgr.es/m/CAEze2Wg3gXXZTr6_rwC+s4-o2ZVFB5F985uUSgJTsECx6AmGcQ@mail.gmail.com Backpatch-through: 13	2025-04-02 14:54:20 -04:00
Tom Lane	0dca5d68d7	Change SQL-language functions to use the plan cache. In the historical implementation of SQL functions (if they don't get inlined), we built plans for all the contained queries at first call within an outer query, and then re-used those plans for the duration of the outer query, and then forgot everything. This was not ideal, not least because the plans could not be customized to specific values of the function's parameters. Our plancache infrastructure seems mature enough to be used here. That will solve both the problem with not being able to build custom plans and the problem with not being able to share work across successive outer queries. Aside from those performance concerns, this change fixes a longstanding bugaboo with SQL functions: you could not write DDL that would affect later statements in the same function. That's mostly still true with new-style SQL functions, since the results of parse analysis are baked into the stored query trees (and protected by dependency records). But for old-style SQL functions, it will now work much as it does with PL/pgSQL functions, because we delay parse analysis and planning of each query until we're ready to run it. Some edge cases that require replanning are now handled better too; see for example the new rowsecurity test, where we now detect an RLS context change that was previously missed. One other edge-case change that might be worthy of a release note is that we now insist that a SQL function's result be generated by the physically-last query within it. Previously, if the last original query was deleted by a DO INSTEAD NOTHING rule, we'd be willing to take the result from the preceding query instead. This behavior was undocumented except in source-code comments, and it seems hard to believe that anyone's relying on it. 
Along the way to this feature, we needed a few infrastructure changes: * The plancache can now take either a raw parse tree or an analyzed-but-not-rewritten Query as the starting point for a CachedPlanSource. If given a Query, it is caller's responsibility that nothing will happen to invalidate that form of the query. We use this for new-style SQL functions, where what's in pg_proc is serialized Query(s) and we trust the dependency mechanism to disallow DDL that would break those. * The plancache now offers a way to invoke a post-rewrite callback to examine/modify the rewritten parse tree when it is rebuilding the parse trees after a cache invalidation. We need this because SQL functions sometimes adjust the parse tree to make its output exactly match the declared result type; if the plan gets rebuilt, that has to be re-done. * There is a new backend module utils/cache/funccache.c that abstracts the idea of caching data about a specific function usage (a particular function and set of input data types). The code in it is moved almost verbatim from PL/pgSQL, which has done that for a long time. We use that logic now for SQL-language functions too, and maybe other PLs will have use for it in the future. Author: Alexander Pyhalov <a.pyhalov@postgrespro.ru> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com> Discussion: https://postgr.es/m/8216639.NyiUUSuA9g@aivenlaptop	2025-04-02 14:06:02 -04:00
Heikki Linnakangas	e9e7b66044	Add GiST and btree sortsupport routines for range types For GiST, having a sortsupport function allows building the index using the "sorted build" method, which is much faster. For b-tree, the sortsupport routine doesn't give any new functionality, but speeds up sorting a tiny bit. The difference is not very significant, about 2% in cursory testing on my laptop, because the range type comparison function has quite a lot of overhead from detoasting. In any case, since we have the function for GiST anyway, we might as well register it for the btree opfamily too. Author: Bernd Helmle <mailings@oopsware.de> Discussion: https://www.postgresql.org/message-id/64d324ce2a6d535d3f0f3baeeea7b25beff82ce4.camel@oopsware.de	2025-04-02 19:51:28 +03:00
Tomas Vondra	46df9487d9	Improve accounting for PredXactList, RWConflictPool and PGPROC Various places allocated shared memory by first allocating a small chunk using ShmemInitStruct(), followed by ShmemAlloc() calls to allocate more memory. Unfortunately, ShmemAlloc() does not update ShmemIndex, so this affected pg_shmem_allocations - it only showed the initial chunk. This commit modifies the following allocations, to allocate everything as a single chunk, and then split it internally. - PredXactList - RWConflictPool - PGPROC structures - Fast-Path Lock Array The fast-path lock array is allocated separately, not as a part of the PGPROC structures allocation. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com	2025-04-02 17:14:28 +02:00
Tomas Vondra	f5930f9a98	Improve accounting for memory used by shared hash tables pg_shmem_allocations tracks the memory allocated by ShmemInitStruct(), but for shared hash tables that covered only the header and hash directory. The remaining parts (segments and buckets) were allocated later using ShmemAlloc(), which does not update the shmem accounting. Thus, these allocations were not shown in pg_shmem_allocations. This commit improves the situation by allocating all the hash table parts at once, using a single ShmemInitStruct() call. This way the ShmemIndex entries (and thus pg_shmem_allocations) better reflect the proper size of the hash table. This affects allocations for private (non-shared) hash tables too, as the hash_create() code is shared. For non-shared tables this however makes no practical difference. This changes the alignment a bit. ShmemAlloc() aligns the chunks using CACHELINEALIGN(), which means some parts (header, directory, segments) were aligned this way. Allocating all parts as a single chunk removes this (implicit) alignment. We've considered adding explicit alignment, but we've decided not to - it seems to be merely a coincidence due to using the ShmemAlloc() API, not due to necessity. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com	2025-04-02 17:14:28 +02:00
Tom Lane	bd178960c6	Need to do CommandCounterIncrement after StoreAttrMissingVal. Without this, an additional change to the same pg_attribute row within the same command will fail. This is possible at least with ALTER TABLE ADD COLUMN on a multiple-inheritance-pathway structure. (Another potential hazard is that immediately-following operations might not see the missingval.) Introduced by `95f650674`, which split the former coding that used a single pg_attribute update to change both atthasdef and atthasmissing/attmissingval into two updates, but missed that this should entail two CommandCounterIncrements as well. Like that fix, back-patch through v13. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tender Wang <tndrwang@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/025a3ffa-5eff-4a88-97fb-8f583b015965@gmail.com Backpatch-through: 13	2025-04-02 11:13:01 -04:00
Heikki Linnakangas	a460251f0a	Make cancel request keys longer Currently, the cancel request key is a 32-bit token, which isn't very much entropy. If you want to cancel another session's query, you can brute-force it. In most environments, an unauthorized cancellation of a query isn't very serious, but it nevertheless would be nice to have more protection from it. Hence make the key longer, to make it harder to guess. The longer cancellation keys are generated when using the new protocol version 3.2. For connections using version 3.0, short 4-byte keys are still used. The new longer key length is not hardcoded in the protocol anymore, the client is expected to deal with variable length keys, up to 256 bytes. This flexibility allows e.g. a connection pooler to add more information to the cancel key, which might be useful for finding the connection. Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions) Discussion: https://www.postgresql.org/message-id/508d0505-8b7a-4864-a681-e7e5edfe32aa@iki.fi	2025-04-02 16:41:48 +03:00
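To put the entropy argument in numbers, here is a small arithmetic sketch. It is not how the server generates keys: `secrets.token_bytes` is simply Python's CSPRNG helper, and the 32-byte length is an assumed example, since the commit message only states that keys are variable length up to 256 bytes.

```python
import secrets


def keyspace(nbytes):
    """Number of distinct keys of the given length in bytes."""
    return 2 ** (8 * nbytes)


def expected_guesses(nbytes):
    """Rough average number of guesses to hit a uniformly random key:
    about half the keyspace."""
    return keyspace(nbytes) // 2


short_key = secrets.token_bytes(4)   # 32-bit key, as with protocol 3.0
long_key = secrets.token_bytes(32)   # assumed longer length, for illustration

short_cost = expected_guesses(len(short_key))
long_cost = expected_guesses(len(long_key))
```

A 4-byte key falls to roughly two billion guesses on average, which is within reach of a determined attacker, while every additional byte multiplies that cost by 256.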
Peter Eisentraut	eec0040c4b	Add support for NOT ENFORCED in foreign key constraints This expands the NOT ENFORCED constraint flag, previously only supported for CHECK constraints (commit `ca87c415e2`), to foreign key constraints. Normally, when a foreign key constraint is created on a table, action and check triggers are added to maintain data integrity. With this patch, if a constraint is marked as NOT ENFORCED, integrity checks are no longer required, making these triggers unnecessary. Consequently, when creating a NOT ENFORCED foreign key constraint, triggers will not be created, and the constraint will be marked as NOT VALID. Similarly, if an existing foreign key constraint is changed to NOT ENFORCED, the associated triggers will be dropped, and the constraint will also be marked as NOT VALID. Conversely, if a NOT ENFORCED foreign key constraint is changed to ENFORCED, the necessary triggers will be created, and the constraint will be changed to VALID by performing necessary validation. Since not-enforced foreign key constraints have no triggers, the shortcut used for example in psql and pg_dump to skip looking for foreign keys if the relation is known not to have triggers no longer applies. (It already didn't work for partitioned tables.) Author: Amul Sul <sulamul@gmail.com> Reviewed-by: Joel Jacobson <joel@compiler.org> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: jian he <jian.universality@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Isaac Morland <isaac.morland@gmail.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Tested-by: Triveni N <triveni.n@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com	2025-04-02 13:36:44 +02:00
Alexander Korotkov	bc22dc0e0d	Get rid of WALBufMappingLock Allow multiple backends to initialize WAL buffers concurrently. This way `MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in parallel without taking a single LWLock in exclusive mode. The new algorithm works as follows: * reserve a page for initialization using XLogCtl->InitializeReserved, * ensure the page is written out, * once the page is initialized, try to advance XLogCtl->InitializedUpTo and signal to waiters using XLogCtl->InitializedUpToCondVar condition variable, * repeat previous steps until we reserve initialization up to the target WAL position, * wait until concurrent initialization finishes using a XLogCtl->InitializedUpToCondVar. Now, multiple backends can, in parallel, concurrently reserve pages, initialize them, and advance XLogCtl->InitializedUpTo to point to the latest initialized page. Author: Yura Sokolov <y.sokolov@postgrespro.ru> Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Tested-by: Michael Paquier <michael@paquier.xyz>	2025-04-02 12:44:24 +03:00
Fujii Masao	b53b88109f	Improve error message when standby does not accept connections. Even after reaching the minimum recovery point, if there are long-lived write transactions with more than 64 subtransactions on the primary, the recovery snapshot may not yet be ready for hot standby, delaying read-only connections on the standby. Previously, when read-only connections were not accepted due to this condition, the following error message was logged: FATAL: the database system is not yet accepting connections DETAIL: Consistent recovery state has not been yet reached. This DETAIL message was misleading because the following message was already logged in this case: LOG: consistent recovery state reached This contradiction, i.e., indicating that the recovery state was consistent while also stating it wasn’t, caused confusion. This commit improves the error message to better reflect the actual state: FATAL: the database system is not yet accepting connections DETAIL: Recovery snapshot is not yet ready for hot standby. HINT: To enable hot standby, close write transactions with more than 64 subtransactions on the primary server. To implement this, the commit introduces a new postmaster signal, PMSIGNAL_RECOVERY_CONSISTENT. When the startup process reaches a consistent recovery state, it sends this signal to the postmaster, allowing it to correctly recognize that state. Since this is not a clear bug, the change is applied only to the master branch and is not back-patched. Author: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp> Discussion: https://postgr.es/m/02db8cd8e1f527a8b999b94a4bee3165@oss.nttdata.com	2025-04-02 15:13:01 +09:00
Melanie Plageman	b3219c69fc	aio: Add errcontext for processing I/Os for another backend Push an ErrorContextCallback adding additional detail about the process performing the I/O and the owner of the I/O when those are not the same. For io_method worker, this adds context specifying which process owns the I/O that the I/O worker is processing. For io_method io_uring, this adds context only when a backend is completing I/O for another backend. It specifies the pid of the owning process. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/rdml3fpukrqnas7qc5uimtl2fyytrnu6ymc2vjf2zuflbsjuul%40hyizyjsexwmm	2025-04-01 19:53:07 -04:00
David Rowley	b136db07c6	Fix planner's failure to identify multiple hashable ScalarArrayOpExprs `50e17ad28` (v14) and `29f45e299` (v15) made it so the planner could identify IN and NOT IN clauses which have Const lists as right-hand arguments and when an appropriate hash function is available for the data types, mark the ScalarArrayOpExpr as hashable so the executor could execute it more optimally by building and probing a hash table during expression evaluation. These commits both worked correctly when there was only a single ScalarArrayOpExpr in the given expression being processed by the planner, but when there were multiple, only the first was checked and any subsequent ones were not identified, which resulted in less optimal expression evaluation during query execution for all but the first found ScalarArrayOpExpr. Backpatch to 14, where `50e17ad28` was introduced. Author: David Geier <geidav.pg@gmail.com> Discussion: https://postgr.es/m/29a76f51-97b0-4c07-87b7-ec8e3b5345c9@gmail.com Backpatch-through: 14	2025-04-02 11:56:29 +13:00
Tom Lane	6c12ae09f5	Introduce a SQL-callable function array_sort(anyarray). Create a function that will sort the elements of an array according to the element type's sort order. If the array has more than one dimension, the sub-arrays of the first dimension are sorted per normal array-comparison rules, leaving their contents alone. In support of this, add pg_type.typarray to the set of fields cached by the typcache. Author: Junwang Zhao <zhjwpku@gmail.com> Co-authored-by: Jian He <jian.universality@gmail.com> Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://postgr.es/m/CAEG8a3J41a4dpw_-F94fF-JPRXYxw-GfsgoGotKcjs9LVfEEvw@mail.gmail.com	2025-04-01 18:03:55 -04:00
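The multi-dimensional behaviour described for array_sort() above can be pictured with a rough analogy. The sketch below is plain Python, not PostgreSQL code, and assumes that Python's lexicographic list comparison is close enough to array comparison for same-length sub-arrays. The variable names are made up for illustration.

```python
# One-dimensional case: elements are sorted by the element type's order.
flat = [3, 1, 2]
flat_sorted = sorted(flat)

# Two-dimensional case: only the sub-arrays of the first dimension are
# reordered; the contents of each sub-array are left alone.
nested = [[2, 4], [2, 1], [6, 5]]
nested_sorted = sorted(nested)

print(flat_sorted)
print(nested_sorted)
```

Note that `[2, 4]` keeps its internal order; only its position relative to the other sub-arrays changes.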
Andres Freund	e19dc74491	aio: Minor comment improvements Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/usbwzckj7q3jhfx3ann3nrfnukmupbs35axvq5zfyeo6nvrzrm@onjhxs2du4st	2025-04-01 16:06:48 -04:00
Andres Freund	fdd146a8ef	aio: Add README.md explaining higher level design Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-04-01 16:06:48 -04:00
Andres Freund	00066aa173	md: Add comment & assert to buffer-zeroing path in md[start]readv() mdreadv() has a codepath to zero out buffers when a read returns zero bytes, guarded by a check for zero_damaged_pages \|\| InRecovery. The InRecovery codepath to zero out buffers in mdreadv() appears to be unreachable. The only known paths to reach mdreadv()/mdstartreadv() in recovery are XLogReadBufferExtended(), vm_readbuf(), and fsm_readbuf(), each of which takes care to extend the relation if necessary. This looks to either have been the case for a long time, or the code was never reachable. The zero_damaged_pages path is incomplete, as missing segments are not created. Putting blocks into the buffer-pool that do not exist on disk is rather problematic, as such blocks will, at least initially, not be found by scans that rely on smgrnblocks(), as they are beyond EOF. It also can cause weird problems with relation extension, as relation extension does not expect blocks beyond EOF to exist. Therefore we would like to remove that path. mdstartreadv(), which I added in e5fe570b51c, does not implement this zeroing logic. I had started a discussion about that a while ago (linked below), but forgot to act on the conclusion of the discussion, namely to disable the in-memory-zeroing behavior. We could certainly implement equivalent zeroing logic in mdstartreadv(), but it would have to be more complicated due to potential differences in the zero_damaged_pages setting between the definer and completor of IO. Given that we want to remove the logic, that does not seem worth implementing the necessary logic. For now, put an Assert(false) and comments documenting this choice into mdreadv() and comments documenting the deprecation of the path in mdreadv() and the non-implementation of it in mdstartreadv(). If we, during testing, discover that we do need the path, we can implement it at that time. 
Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/postgr.es/m/20250330024513.ac.nmisch@google.com Discussion: https://postgr.es/m/postgr.es/m/3qxxsnciyffyf3wyguiz4besdp5t5uxvv3utg75cbcszojlz7p@uibfzmnukkbd	2025-04-01 13:50:39 -04:00
Andres Freund	93bc3d75d8	aio: Add test_aio module To make the tests possible, a few functions from bufmgr.c/localbuf.c had to be exported, via buf_internals.h. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Andres Freund <andres@anarazel.de> Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-04-01 13:47:46 -04:00
Andres Freund	60f566b4f2	aio: Add pg_aios view The new view lists all IO handles that are currently in use and is mainly useful for PG developers, but may also be useful when tuning PG. Bumps catversion. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-04-01 13:30:33 -04:00
Peter Eisentraut	764d501d24	Remove a stray "pgrminclude" annotation We don't use those anymore. Fix for commit `8492feb98f`.	2025-04-01 15:28:22 +02:00
Peter Eisentraut	113ecf1f8c	Fix minor C type confusion Returning false instead of NULL gets a compiler error under gcc-14 -std=gnu23, and it appears to have been unintentional. Fix for commit `8492feb98f`.	2025-04-01 15:28:22 +02:00
Heikki Linnakangas	2904324a88	heapam: Only set tuple's block once per page in pagemode Due to splitting the block id into two 16 bit integers, BlockIdSet() is more expensive than one might think. Doing it once per returned tuple shows up as a small but reliably reproducible cost. It's simple enough to set the block number just once per block in pagemode, so do so. Author: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/lxzj26ga6ippdeunz6kuncectr5gfuugmm2ry22qu6hcx6oid6@lzx3sjsqhmt6	2025-04-01 13:24:27 +03:00
Andres Freund	2a5e709e72	Enable IO concurrency on all systems Previously effective_io_concurrency and maintenance_io_concurrency could not be set above 0 on machines without fadvise support. AIO enables IO concurrency without such support, via io_method=worker. Currently only subsystems using the read stream API will take advantage of this. Other users of maintenance_io_concurrency (like recovery prefetching) which leverage OS advice directly will not benefit from this change. In those cases, maintenance_io_concurrency will have no effect on I/O behavior. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/CAAKRu_atGgZePo=_g6T3cNtfMf0QxpvoUh5OUqa_cnPdhLd=gw@mail.gmail.com	2025-03-30 19:16:47 -04:00
Andres Freund	ae3df4b341	read_stream: Introduce and use optional batchmode support Submitting IO in larger batches can be more efficient than doing so one-by-one, particularly for many small reads. It does, however, require the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not: a) block without first calling pgaio_submit_staged(), unless a to-be-waited-on lock cannot be part of a deadlock, e.g. because it is never held while waiting for IO. b) directly or indirectly start another batch pgaio_enter_batchmode() As this requires care and is nontrivial in some cases, batching is only used with explicit opt-in. This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and uses it where appropriate. There are two cases where batching would likely be beneficial, but where we aren't using it yet: 1) bitmap heap scans, because the callback reads the VM This should soon be solved, because we are planning to remove the use of the VM, due to that not being sound. 2) The first phase of heap vacuum This could be made to support batchmode, but would require some care. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-03-30 18:36:41 -04:00
Andres Freund	f4d0730bbc	aio: Basic read_stream adjustments for real AIO Adapt the read stream logic for real AIO: - If AIO is enabled, we shouldn't issue advice, but if it isn't, we should continue issuing advice - AIO benefits from reading ahead with direct IO - If effective_io_concurrency=0, pass READ_BUFFERS_SYNCHRONOUSLY to StartReadBuffers() to ensure synchronous IO execution There are further improvements we should consider: - While in read_stream_look_ahead(), we can use AIO batch submission mode for increased efficiency. That however requires care to avoid deadlocks and thus done separately. - It can be beneficial to defer starting new IOs until we can issue multiple IOs at once. That however requires non-trivial heuristics to decide when to do so. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Andres Freund <andres@anarazel.de> Co-authored-by: Thomas Munro <thomas.munro@gmail.com>	2025-03-30 18:26:44 -04:00
Andres Freund	12ce89fd07	bufmgr: Use AIO in StartReadBuffers() This finally introduces the first actual use of AIO. StartReadBuffers() now uses the AIO routines to issue IO. As the implementation of StartReadBuffers() is also used by the functions for reading individual blocks (StartReadBuffer() and through that ReadBufferExtended()) this means all buffered read IO passes through the AIO paths. However, as those are synchronous reads, actually performing the IO asynchronously would be rarely beneficial. Instead such IOs are flagged to always be executed synchronously. This way we don't have to duplicate a fair bit of code. When io_method=sync is used, the IO patterns generated after this change are the same as before, i.e. actual reads are only issued in WaitReadBuffers() and StartReadBuffers() may issue prefetch requests. This allows to bypass most of the actual asynchronicity, which is important to make a change as big as this less risky. One thing worth calling out is that, if IO is actually executed asynchronously, the precise meaning of what track_io_timing is measuring has changed. Previously it tracked the time for each IO, but that does not make sense when multiple IOs are executed concurrently. Now it only measures the time actually spent waiting for IO. A subsequent commit will adjust the docs for this. While AIO is now actually used, the logic in read_stream.c will often prevent using sufficiently many concurrent IOs. That will be addressed in the next commit. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Andres Freund <andres@anarazel.de> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-30 18:02:23 -04:00
Andres Freund	047cba7fa0	bufmgr: Implement AIO read support This commit implements the infrastructure to perform asynchronous reads into the buffer pool. To do so, it: - Adds readv AIO callbacks for shared and local buffers It may be worth calling out that shared buffer completions may be run in a different backend than where the IO started. - Adds an AIO wait reference to BufferDesc, to allow backends to wait for in-progress asynchronous IOs - Adapts StartBufferIO(), WaitIO(), TerminateBufferIO(), and their localbuf.c equivalents, to be able to deal with AIO - Moves the code to handle BM_PIN_COUNT_WAITER into a helper function, as it now also needs to be called on IO completion As of this commit, nothing issues AIO on shared/local buffers. A future commit will update StartReadBuffers() to do so. Buffer reads executed through this infrastructure will report invalid page / checksum errors / warnings differently than before: In the error case the error message will cover all the blocks that were included in the read, rather than just reporting the first invalid block. If more than one block is invalid, the error will include information about the range of the read, the first invalid block and the number of invalid pages, with a HINT towards the server log for per-block details. For the warning case (i.e. zero_damaged_pages) we would previously emit one warning message for each buffer in a multi-block read. Now there is only a single warning message for the entire read, again referring to the server log for more details in case of multiple checksum failures within a single larger read. 
Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-30 17:28:03 -04:00
Andres Freund	ef64fe26ba	aio: Add WARNING result status If an IO succeeds, but issues a warning, e.g. due to a page verification failure with zero_damaged_pages, we want to issue that warning in the context of the issuer of the IO, not the process that executes the completion (always the case for worker). It's already possible for a completion callback to report a custom error message, we just didn't have a result status that allowed a user of AIO to know that a warning should be emitted even though the IO request succeeded. All that's needed for that is a dedicated PGAIO_RS_ value. Previously there were not enough bits in PgAioResult.id for the new value. Increase. While at that, add defines for the amount of bits and static asserts to check that the widths are appropriate. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250329212929.a6.nmisch@google.com	2025-03-30 16:27:10 -04:00
Andres Freund	d445990adc	Let caller of PageIsVerified() control ignore_checksum_failure For AIO the completion of a read into shared buffers (i.e. verifying the page including the checksum, updating the BufferDesc to reflect the IO) can happen in a different backend than the backend that started the IO. As ignore_checksum_failure can differ between backends, we need to allow the caller of PageIsVerified() control whether to ignore checksum failures. The commit leaves a gap in the PIV_* values, as an upcoming commit, which depends on this commit, will add PIV_LOG_LOG, which better fits just after PIV_LOG_WARNING. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250329212929.a6.nmisch@google.com	2025-03-30 16:27:10 -04:00
Andres Freund	b96d3c3897	pgstat: Allow checksum errors to be reported in critical sections For AIO we execute completion callbacks in critical sections (to ensure that AIO can in the future be used for WAL, which in turn requires that we can call completion callbacks in critical sections, to get the resources for WAL io). To report checksum errors a backend now has to call pgstat_prepare_report_checksum_failure(), before entering a critical section, which guarantees the relevant pgstats entry is in shared memory, the relevant DSM segment is mapped into the backend's memory and the address is known via a PgStat_EntryRef. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/wkjj4p2rmkevutkwc6tewoovdqznj6c6nvjmvii4oo5wmbh5sr@retq7d6uqs4j	2025-03-30 16:12:04 -04:00
Andres Freund	4244cf6876	Add errhint_internal() We have errmsg_internal(), errdetail_internal(), but not errhint_internal(). Sometimes it is useful to output a hint with already translated format string (e.g. because there are different messages depending on the condition). For message/detail we do that with the _internal() variants, but we can't do that with hint today. It's possible to work around that by using something like str = psprintf(translated_format, args); ereport(... errhint("%s", str); but that's not exactly pretty and makes it harder to avoid memory leaks. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/ym3dqpa4xcvoeknewcw63x77vnqdosbqcetjinb2zfoh65k55m@m4ozmwhr6lk6	2025-03-30 16:10:51 -04:00
Andres Freund	d6d8054dc7	localbuf: Track pincount in BufferDesc as well For AIO on temporary table buffers the AIO subsystem needs to be able to ensure a pin on a buffer while AIO is going on, even if the IO issuing query errors out. Tracking the buffer in LocalRefCount does not work, as it would cause CheckForLocalBufferLeaks() to assert out. Instead, also track the refcount in BufferDesc.state, not just LocalRefCount. This also makes local buffers behave a bit more akin to shared buffers. Note that we still don't need locking, AIO completion callbacks for local buffers are executed in the issuing session (i.e. nobody else has access to the BufferDesc). Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-03-29 16:36:51 -04:00
Andres Freund	08ccd56ac7	aio, bufmgr: Comment fixes/improvements Some of these comments have been wrong for a while (`12f3867f55`), some I recently introduced (`da7226993f`, `55b454d0e1`). This includes an update to a comment in FlushBuffer(), which will be copied in a future commit. These changes seem big enough to be worth doing in separate commits. Suggested-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250319212530.80.nmisch@google.com	2025-03-29 14:45:42 -04:00
Andres Freund	50cb7505b3	aio: Implement support for reads in smgr/md/fd This implements the following: 1) An smgr AIO target, for AIO on smgr files. This should be usable not just for md.c but also other SMGR implementation if we ever get them. 2) readv support in fd.c, which requires a small bit of infrastructure work in fd.c 3) smgr.c and md.c support for readv There still is nothing performing AIO, but as of this commit it would be possible. As part of this change FileGetRawDesc() actually ensures that the file is opened - previously it was basically not usable. It's used to reopen a file in IO workers. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-29 13:38:35 -04:00
Andres Freund	dee8002468	Fix mis-attribution of checksum failure stats to the wrong database Checksum failure stats could be attributed to the wrong database in two cases: - when a read of a shared relation encountered a checksum error, it would be attributed to the current database, instead of the "database" representing shared relations - when using CREATE DATABASE ... STRATEGY WAL_LOG checksum errors in the source database would be attributed to the current database The checksum stats reporting via PageIsVerifiedExtended(PIV_REPORT_STAT) does not have access to the information about what database a page belongs to. This fixes the issue by removing PIV_REPORT_STAT and delegating the responsibility to report stats to the caller, which now can learn about the number of stats via a new optional argument. As this changes the signature of PageIsVerifiedExtended() and all callers should adapt to the new signature, use the occasion to rename the function to PageIsVerified() and remove the compatibility macro. We could instead have fixed this by adding information about the database to the args of PageIsVerified(), but there are soon-to-be-applied patches that need to separate the stats reporting from the PageIsVerified() call anyway. Those patches also include testing for the failure paths, something we inexplicably have not had. As there is no caller of pgstat_report_checksum_failure() left, remove it. It'd be possible, but awkward to fix this in the back branches. We considered doing the work not quite worth it, as mis-attributed stats should still elicit concern. The emitted error messages do allow to attribute the errors correctly. Discussion: https://postgr.es/m/5tyic6epvdlmd6eddgelv47syg2b5cpwffjam54axp25xyq2ga@ptwkinxqo3az Discussion: https://postgr.es/m/mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg@e4gt7npsohuz	2025-03-29 13:38:35 -04:00
Tomas Vondra	fb9dff7663	Fix grammar in GIN README Author: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/CALdSSPgu9uAhVYojQ0yjG%3Dq5MaqmiSLUJPhz%2B-u7cA6K6Mc9UA%40mail.gmail.com	2025-03-29 15:14:25 +01:00
Dean Rasheed	8b6a0e2392	Fix MERGE with DO NOTHING actions into a partitioned table. ExecInitPartitionInfo() duplicates much of the logic in ExecInitMerge(), except that it failed to handle DO NOTHING actions. This would cause an "unknown action in MERGE WHEN clause" error if a MERGE with any DO NOTHING actions attempted to insert into a partition not already initialised by ExecInitModifyTable(). Bug: #18871 Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tender Wang <tndrwang@gmail.com> Reviewed-by: Gurjeet Singh <gurjeet@singh.im> Discussion: https://postgr.es/m/18871-b44e3c96de3bd2e8%40postgresql.org Backpatch-through: 15	2025-03-29 09:58:40 +00:00
Peter Eisentraut	a0ed19e0a9	Use PRI?64 instead of "ll?" in format strings (continued). Continuation of work started in commit `15a79c73`, after initial trial. Author: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/b936d2fb-590d-49c3-a615-92c3a88c6c19%40eisentraut.org	2025-03-29 10:43:57 +01:00
Alexander Korotkov	775a06d44c	Make group_similar_or_args() reorder clause list as little as possible Currently, group_similar_or_args() permutes original positions of clauses regardless of whether it manages to find any groups of similar clauses. While we are not providing any strict guarantees about preserving the original order of OR-clauses, it is preferred that the original order be modified as little as possible. This commit changes the reordering algorithm of group_similar_or_args() in the following way. We reorder each group of similar clauses so that the first item of the group stays in place, but all the other items are moved after it. So, if there are no similar clauses, the order of clauses stays the same. When there are some groups, only required reordering happens while the rest of the clauses remain in their places. Reported-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/3ac7c436-81e1-4191-9caf-b0dd70b51511%40gmail.com Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Alena Rybakina <a.rybakina@postgrespro.ru>	2025-03-28 23:37:49 +02:00
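The reordering rule in the group_similar_or_args() commit above can be sketched as follows. This is an illustrative Python model, not the planner's C code. The function name, the tuple representation of a clause, and the use of a column name as the "similarity" key are all assumptions made for the example.

```python
def group_keeping_first_in_place(clauses, key):
    """Pull later members of each group up behind the group's first member.

    Groups appear in the order of their first member, and members within a
    group keep their original relative order.
    """
    groups = {}
    for clause in clauses:
        groups.setdefault(key(clause), []).append(clause)

    result = []
    emitted = set()
    for clause in clauses:
        k = key(clause)
        if k not in emitted:  # first member of its group: emit the whole group
            emitted.add(k)
            result.extend(groups[k])
    return result


# Hypothetical OR arguments, modelled as (column, constant) pairs.
or_args = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
regrouped = group_keeping_first_in_place(or_args, key=lambda c: c[0])

# With no similar clauses, nothing moves.
distinct_args = [("x", 1), ("y", 2), ("z", 3)]
unchanged = group_keeping_first_in_place(distinct_args, key=lambda c: c[0])
```

Here `("a", 3)` moves up behind `("a", 1)` and `("b", 5)` behind `("b", 2)`, while a list without any similar clauses comes back in its original order.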
Heikki Linnakangas	51a0382e8d	Fix crash if LockErrorCleanup() is called twice The refactoring in commit `3c0fd64fec` removed the clearing of awaitedLock from LockErrorCleanup(). It's still needed, otherwise LockErrorCleanup() during abort processing will try to update the LOCALLOCK struct even after the lock has already been released. Put it back. Reported-by: Richard Guo <guofenglinux@gmail.com> Reported-by: Robins Tharakan <tharakan@gmail.com> Reported-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAMbWs4_dNX1SzBmvFdoY-LxJh_4W_BjtVd5i008ihfU-wFF=eg@mail.gmail.com Discussion: https://www.postgresql.org/message-id/18832-38e5575b1bbd7277@postgresql.org Discussion: https://www.postgresql.org/message-id/e11a30e5-c0d8-491d-8546-3a1b50c10ad4@gmail.com	2025-03-28 20:19:17 +02:00
Masahiko Sawada	a5419bc72e	Fix timestamp overflow in UUIDv7 implementation. The uuidv7_interval() function previously converted a shifted microsecond-precision timestamp (64-bit integer) to another 64-bit integer representing a timestamp with nanosecond precision. This conversion caused overflow for dates beyond the year 2262. The millisecond and sub-millisecond parts were then extracted from this nanosecond-precision timestamp and stored in UUIDv7 values. With this commit, the millisecond and sub-millisecond parts are stored directly into the UUIDv7 value without being converted back to a nanosecond precision timestamp. Following RFC 9562, the timestamp is stored as an unsigned integer, enabling support for dates up to the year 10889. Reported and fixed by Andrey Borodin, with cosmetic changes and regression tests by me. Reported-by: Andrey Borodin <x4mmm@yandex-team.ru> Author: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/96DEC2D9-659A-40E8-B7BA-AF5D162A9E21@yandex-team.ru	2025-03-28 09:39:11 -07:00
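The year limits quoted in the UUIDv7 commit above can be checked with integer arithmetic. The sketch below is not PostgreSQL code. It assumes a Unix epoch of 1970, an average Gregorian year of 31,556,952 seconds, and the 48-bit unsigned millisecond timestamp field that RFC 9562 defines for UUIDv7. The helper name is made up for illustration.

```python
SECONDS_PER_YEAR = 31_556_952  # average Gregorian year (365.2425 days)
EPOCH_YEAR = 1970


def last_reachable_year(max_ticks: int, ticks_per_second: int) -> int:
    """Calendar year in which a tick counter since the Unix epoch hits max_ticks."""
    return EPOCH_YEAR + max_ticks // (ticks_per_second * SECONDS_PER_YEAR)


# Signed 64-bit count of nanoseconds: the intermediate value that overflowed.
int64_ns_limit = last_reachable_year(2**63 - 1, 1_000_000_000)

# Unsigned 48-bit count of milliseconds: the UUIDv7 timestamp field.
uint48_ms_limit = last_reachable_year(2**48 - 1, 1_000)

print(int64_ns_limit, uint48_ms_limit)
```

Under these assumptions the two limits come out as 2262 and 10889, matching the years named in the commit message.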
Peter Eisentraut	cdc168ad4b	Add support for not-null constraints on virtual generated columns This was left out of the original patch for virtual generated columns (commit `83ea6c5402`). This just involves a bit of extra work in the executor to expand the generation expressions and run a "IS NOT NULL" test against them. There is also a bit of work to make sure that not-null constraints are checked during a table rewrite. Author: jian he <jian.universality@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Navneet Kumar <thanit3111@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/CACJufxHArQysbDkWFmvK+D1TPHQWWTxWN15cMuUaTYX3xhQXgg@mail.gmail.com	2025-03-28 13:53:37 +01:00
Peter Eisentraut	747ddd38cb	Modernize some code a bit Modernize code in ExecRelCheck() and ExecConstraints() a bit, preparing the way for some new code. Co-authored-by: jian he <jian.universality@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Navneet Kumar <thanit3111@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/CACJufxHArQysbDkWFmvK+D1TPHQWWTxWN15cMuUaTYX3xhQXgg@mail.gmail.com	2025-03-28 10:49:15 +01:00
Peter Eisentraut	9a9ead1105	Rename a node field for clarity Rename ResultRelInfo.ri_ConstraintExprs to ri_CheckConstraintExprs. This reflects its specific purpose better and avoids confusion with adjacent fields with similar but distinct purposes. Discussion: https://postgr.es/m/CACJufxHArQysbDkWFmvK+D1TPHQWWTxWN15cMuUaTYX3xhQXgg@mail.gmail.com	2025-03-28 09:50:01 +01:00
Peter Eisentraut	890fc826c9	Use thread-safe strftime_l() instead of strftime(). This removes some setlocale() calls and a lot of commentary about how dangerous that is. strftime_l() is from POSIX 2008, and on Windows we use _wcsftime_l(). While here, adjust error message for strftime_l() failure: it does not in practice set errno (even though POSIX says it could), so no %m. Author: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/CA%2BhUKGJqVe0%2BPv9dvC9dSums_PXxGo9SWcxYAMBguWJUGbWz-A%40mail.gmail.com	2025-03-28 07:13:43 +01:00
Daniel Gustafsson	058b5152f0	Fix guc_malloc calls for consistency and OOM checks check_createrole_self_grant and check_synchronized_standby_slots were allocating memory on a LOG elevel without checking if the allocation succeeded or not, which would have led to a segfault on allocation failure. On top of that, a number of callsites were using the ERROR level, relying on erroring out rather than returning false to allow the GUC machinery to handle it gracefully. Other callsites used WARNING instead of LOG. While neither is wrong, this changes all check_ functions to do it consistently with LOG. init_custom_variable gets a promoted elevel to FATAL to keep the guc_malloc error handling in line with the rest of the error handling in that function which already call FATAL. If we encounter an OOM in this callsite there is no graceful handling to be had, better to error out hard. Backpatch the fix to check_createrole_self_grant down to v16 and the fix to check_synchronized_standby_slots down to v17 where they were introduced. Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Nikita <pm91.arapov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Bug: #18845 Discussion: https://postgr.es/m/18845-582c6e10247377ec@postgresql.org Backpatch-through: 16	2025-03-27 22:57:34 +01:00
Álvaro Herrera	9fbd53dea5	Remove the query_id_squash_values GUC Commit `62d712ecfd` introduced the capability to calculate the same queryId for queries with different lengths of constants in a list for an IN clause. This behavior was originally enabled with a GUC query_id_squash_values. After a discussion about the value of such a GUC, it was decided to back out of the use of a GUC and make the squashing behavior the only available option. Author: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/Z-LZyygkkNyA8-kR@msg.df7cb.de Discussion: https://postgr.es/m/CA+q6zcVTK-3C-8NWV1oY2NZrvtnMCDqnyYYyk1T7WMUG65MeOQ@mail.gmail.com	2025-03-27 13:33:37 +01:00
Peter Eisentraut	b98be8a2a2	Provide thread-safe pg_localeconv_r(). This involves four different implementation strategies: 1. For Windows, we now require _configthreadlocale() to be available and work (commit `f1da075d9a`), and the documentation says that the object returned by localeconv() is in thread-local memory. 2. For glibc, we translate to nl_langinfo_l() calls, because it offers the same information that way as an extension, and that API is thread-safe. 3. For macOS/*BSD, use localeconv_l(), which is thread-safe. 4. For everything else, use uselocale() to set the locale for the thread, and use a big ugly lock to defend against the returned object being concurrently clobbered. In practice this currently means only Solaris. The new call is used in pg_locale.c, replacing calls to setlocale() and localeconv(). Author: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/CA%2BhUKGJqVe0%2BPv9dvC9dSums_PXxGo9SWcxYAMBguWJUGbWz-A%40mail.gmail.com	2025-03-27 10:54:28 +01:00
Álvaro Herrera	4a02af8b1a	Simplify syntax for ALTER TABLE ALTER CONSTRAINT NO INHERIT Commit `d45597f72f` introduced the ability to change a not-null constraint from NO INHERIT to INHERIT and vice versa, but we included the SET noise word in the syntax for it. The SET turns out not to be necessary and goes against what the SQL standard says for other ALTER TABLE subcommands, so remove it. This changes the way this command is processed for constraint types other than not-null, so there are some error message changes. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Suraj Kharage <suraj.kharage@enterprisedb.com> Discussion: https://postgr.es/m/202503251602.vsxaehsyaoac@alvherre.pgsql	2025-03-27 09:24:52 +01:00
David Rowley	ad9a23bc4f	Optimize Query jumble `f31aad9b0` adjusted query jumbling so it no longer ignores NULL nodes during the jumble. This added some overhead. Here we tune a few things to make jumbling faster again. This makes jumbling perform similarly to, or even slightly faster than, before that change. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAApHDvreP04nhTKuYsPw0F-YN+4nr4f=L72SPeFb81jfv+2c7w@mail.gmail.com	2025-03-27 18:34:34 +13:00
David Rowley	f31aad9b07	Fix query jumbling to account for NULL nodes Previously NULL nodes were ignored. This could cause the computed query ID to match for different queries when, of two fields adjacent in their Node struct, one was NULL and the other non-NULL. For example, the Query struct had distinctClause and sortClause next to each other. If someone wrote: SELECT DISTINCT c1 FROM t; and then: SELECT c1 FROM t ORDER BY c1; these would produce the same query ID since, in the first query, we ignored the NULL sortClause and appended the jumble bytes for the distinctClause. In the latter query, we did nothing for the NULL distinctClause and then jumbled the non-NULL sortClause; since the node representation stored is the same in both cases, the query IDs were identical. Here we fix this by always accounting for NULL nodes by recording that we saw a NULL in the jumble buffer. This fixes the issue as the position at which the NULL is recorded isn't the same in the above two queries. Author: Bykov Ivan <i.bykov@modernsys.ru> Author: Michael Paquier <michael@paquier.xyz> Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/aafce7966e234372b2ba876c0193f1e9%40localhost.localdomain	2025-03-27 18:23:00 +13:00
Michael Paquier	44fe6ceb51	doc: Correct description of values used in FSM for indexes The implementation of FSM for indexes is simpler than heap, where 0 is used to track if a page is in-use and (BLCKSZ - 1) if a page is free. One comment in indexfsm.c and one description in the documentation of pg_freespacemap were incorrect about that. Author: Alex Friedman <alexf01@gmail.com> Discussion: https://postgr.es/m/71eef655-c192-453f-ac45-2772fec2cb04@gmail.com Backpatch-through: 13	2025-03-27 10:20:41 +09:00
Andres Freund	c325a7633f	aio: Add io_method=io_uring Performing AIO using io_uring can be considerably faster than io_method=worker, particularly when lots of small IOs are issued, as a) the context-switch overhead for worker-based AIO becomes more significant and b) the number of IO workers can become limiting. io_uring, however, is Linux-specific and requires an additional compile-time dependency (liburing). This implementation is fairly simple and there are substantial optimization opportunities. The description of the existing AIO_IO_COMPLETION wait event is updated to make the difference between it and the new AIO_IO_URING_EXECUTION clearer. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-26 19:49:13 -04:00
Andres Freund	8eadd5c73c	aio: Add liburing dependency Will be used in a subsequent commit, to implement io_method=io_uring. Kept separate for easier review. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-03-26 19:45:32 -04:00
Andres Freund	9469d7fdd2	aio: Rename pgaio_io_prep_* to pgaio_io_start_* The old naming pattern (mirroring liburing's naming) was inconsistent with the (not yet introduced) callers. It seems better to get rid of the inconsistency now than to grow more users of the odd naming. Reported-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250326001915.bc.nmisch@google.com	2025-03-26 16:10:29 -04:00
Andres Freund	f321ec237a	aio: Pass result of local callbacks to ->report_return Otherwise the results of e.g. temp table buffer verification errors will not reach bufmgr.c. Obviously that's not right. Found while expanding the tests for invalid buffer contents. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250326001915.bc.nmisch@google.com	2025-03-26 16:06:54 -04:00
Andres Freund	96da9050a5	aio: Be more paranoid about interrupts As reported by Noah, it's possible, although practically very unlikely, that interrupts could be processed in between pgaio_io_reopen() and pgaio_io_perform_synchronously(). Prevent that by explicitly holding interrupts. It also seems good to add an assertion to pgaio_io_before_prep() to ensure that interrupts are held, as otherwise FDs referenced by the IO could be closed during interrupt processing. All code in the aio series currently runs the code with interrupts held, but it seems better to be paranoid. Reviewed-by: Noah Misch <noah@leadboat.com> Reported-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250324002939.5c.nmisch@google.com	2025-03-26 16:06:54 -04:00
Tomas Vondra	818245506c	Keep the decompressed filter in brin_bloom_union The brin_bloom_union() function combines two BRIN summaries, by merging one filter into the other. With bloom, we have to decompress the filters first, but the function failed to update the summary to store the merged filter. As a consequence, the index may be missing some of the data, and return false negatives. This issue exists since BRIN bloom indexes were introduced in Postgres 14, but at that point the union function was called only when two sessions happened to summarize a range concurrently, which is rare. It got much easier to hit in 17, as parallel builds use the union function to merge summaries built by workers. Fixed by storing a pointer to the decompressed filter, and freeing the original one. Free the second filter too, if it was decompressed. The freeing is not strictly necessary, because the union is called in short-lived contexts, but it's tidy. Backpatch to 14, where BRIN bloom indexes were introduced. Reported by Arseniy Mukhin, investigation and fix by me. Reported-by: Arseniy Mukhin Discussion: https://postgr.es/m/18855-1cf1c8bcc22150e6%40postgresql.org Backpatch-through: 14	2025-03-26 17:01:41 +01:00
Tom Lane	55527368bd	Use PG_MODULE_MAGIC_EXT in our installable shared libraries. It seems potentially useful to label our shared libraries with version information, now that a facility exists for retrieving that. This patch labels them with the PG_VERSION string. There was some discussion about using semantic versioning conventions, but that doesn't seem terribly helpful for modules with no SQL-level presence; and for those that do have SQL objects, we typically expect them to support multiple revisions of the SQL definitions, so it'd still not be very helpful. I did not label any of src/test/modules/. It seems unnecessary since we don't install those, and besides there ought to be someplace that still provides test coverage for the original PG_MODULE_MAGIC macro. Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/dd4d1b59-d0fe-49d5-b28f-1e463b68fa32@gmail.com	2025-03-26 11:11:02 -04:00
Tom Lane	9324c8c580	Introduce PG_MODULE_MAGIC_EXT macro. This macro allows dynamically loaded shared libraries (modules) to provide a wired-in module name and version, and possibly other compile-time-constant fields in future. This information can be retrieved with the new pg_get_loaded_modules() function. This feature is expected to be particularly useful for modules that do not have any exposed SQL functionality and thus are not associated with a SQL-level extension object. But even for modules that do belong to extensions, being able to verify the actual code version can be useful. Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Yurii Rashkovskii <yrashk@omnigres.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/dd4d1b59-d0fe-49d5-b28f-1e463b68fa32@gmail.com	2025-03-26 11:06:12 -04:00
Dean Rasheed	a3b6dfd410	Add support for gamma() and lgamma() functions. These are useful general-purpose math functions which are included in POSIX and C99, and are commonly included in other math libraries, so expose them as SQL-callable functions. Author: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Stepan Neretin <sncfmgg@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Dmitry Koval <d.koval@postgrespro.ru> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: https://postgr.es/m/CAEZATCXpGyfjXCirFk9au+FvM0y2Ah+2-0WSJx7MO368ysNUPA@mail.gmail.com	2025-03-26 09:35:53 +00:00
Michael Paquier	787514b30b	Use relation name instead of OID in query jumbling for RangeTblEntry custom_query_jumble (introduced in `5ac462e2b7` as a node field attribute) is now assigned to the expanded reference name "eref" of RangeTblEntry, adding in the query jumble computation the non-qualified aliased relation name, without the list of column names. The relation OID is removed from the query jumbling. The effects of this change can be seen in the tests added by `3430215fe3`, where pg_stat_statements (PGSS) entries are now grouped using the relation name, ignoring the relation search_path may point at. For example, these two relations are different, but are now grouped in a single PGSS entry as they are assigned the same query ID: CREATE TABLE foo1.tab (a int); CREATE TABLE foo2.tab (b int); SET search_path = 'foo1'; SELECT count(*) FROM tab; SET search_path = 'foo2'; SELECT count(*) FROM tab; SELECT count(*) FROM foo1.tab; SELECT count(*) FROM foo2.tab; SELECT query, calls FROM pg_stat_statements WHERE query ~ 'FROM tab'; query | calls --------------------------+------- SELECT count(*) FROM tab | 4 (1 row) It is still possible to use an alias in the FROM clause to split these. This behavior is useful for relations re-created with the same name, where queries based on such relations would be grouped in the same PGSS entry. For permanent schemas, it should not really matter in practice. The main benefit is for workloads that use a lot of temporary relations, which are usually re-created with the same name continuously. These can be a heavy source of bloat in PGSS depending on the workload. Such entries can now be grouped together, improving the user experience. The original idea from Christoph Berg used catalog lookups to find temporary relations, something that the query jumble has never done, and it could cause some performance regressions. The idea to use RangeTblEntry.eref and the relation name, applying the same rules for all relations, temporary and not temporary, has been proposed by Tom Lane. The documentation additions have been suggested by Sami Imseih. Author: Michael Paquier <michael@paquier.xyz> Co-authored-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Christoph Berg <myon@debian.org> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/Z9iWXKGwkm8RAC93@msg.df7cb.de	2025-03-26 15:21:05 +09:00
Michael Paquier	27ee6ede6b	Fix two issues with custom_query_jumble in gen_node_support.pl A node field marked with custom_query_jumble and query_jumble_ignore would generate some code of a custom routine. The script is changed so as custom_query_jumble behaves like the other options in this case, query_jumble_ignore taking priority, with no code generated. A comment related to the code generated for node types was misplaced. Thinkos introduced in `5ac462e2b7`. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1324036.1742945060@sss.pgh.pa.us	2025-03-26 09:06:36 +09:00
Jeff Davis	650ab8aaf1	Stats: use schemaname/relname instead of regclass. For import and export, use schemaname/relname rather than regclass. This is more natural during export, fits with the other arguments better, and it gives better control over error handling in case we need to downgrade more errors to warnings. Also, use text for the argument types for schemaname, relname, and attname so that casts to "name" are not required. Author: Corey Huinker <corey.huinker@gmail.com> Discussion: https://postgr.es/m/CADkLM=ceOSsx_=oe73QQ-BxUFR2Cwqum7-UP_fPe22DBY0NerA@mail.gmail.com	2025-03-25 11:16:06 -07:00
Peter Eisentraut	ef7a5af77d	refactor: Pass relation OID instead of Relation to createForeignKeyCheckTriggers() Currently, createForeignKeyCheckTriggers() takes a Relation type as its first argument, but it doesn't use that argument directly. Instead, it fetches the relation OID by calling RelationGetRelid(). Therefore, it would be more consistent with other functions (e.g., createForeignKeyCheckTriggers()) to pass the relation OID directly instead of the whole Relation. Author: Amul Sul <amul.sul@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com	2025-03-25 17:04:12 +01:00
Peter Eisentraut	639238b978	refactor: Split ATExecAlterConstraintInternal() Split ATExecAlterConstraintInternal() into two functions: ATExecAlterConstrDeferrability() and ATExecAlterConstrInheritability(). This simplifies the code and avoids unnecessary confusion caused by recursive code, which isn't needed for ATExecAlterConstrInheritability(). (This also takes over the changes in commit `64224a834c`, as the new AlterConstrDeferrabilityRecurse() is essentially the old ATExecAlterChildConstr().) Author: Amul Sul <amul.sul@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com	2025-03-25 16:18:00 +01:00
Peter Eisentraut	a3280e2a49	refactor: Move some code that updates pg_constraint to a separate function This extracts common/duplicate code for different ALTER CONSTRAINT variants into a common function. We plan to add more variants that would use the same code. Author: Amul Sul <amul.sul@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com	2025-03-25 14:37:22 +01:00
Peter Eisentraut	f4b2a62ae3	Small fixes for Add ALTER TABLE ... ALTER CONSTRAINT ... SET [NO] INHERIT Small fixes for commit `f4e53e10b6`: Add missing calls to InvokeObjectPostAlterHook() and also CacheInvalidateRelcache(). The former change could have a user-visible effect. The latter omission might have caused other bugs, but it is not clear whether one actually existed. With these changes, the code is now more consistent with similar ALTER CONSTRAINT variants, especially the ones that set the deferrability. Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/CAF1DzPVfOW6Kk=7SSh7LbneQDJWh=PbJrEC_Wkzc24tHOyQWGg@mail.gmail.com	2025-03-25 13:40:24 +01:00
Peter Eisentraut	be1cc9aaf5	Generalize index support in network support function The network (inet) support functions currently only supported a hardcoded btree operator family. With the generalized compare type facility, we can generalize this to support any operator family from any index type that supports the required operators. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-03-25 07:11:56 +01:00
Michael Paquier	5ac462e2b7	Add support for custom_query_jumble as a node field attribute This option gives the possibility for query jumble to define a custom routine for the field of a Node, extending support for custom_query_jumble as a node field attribute. When dealing with complex node structures, this can be simpler than having to enforce a custom function across a full node. Custom functions need to be defined in queryjumblefuncs.c, named as _jumble${node}_${field}(), and take as input the JumbleState, the node and its field. The field is not really required if we have the Node, but it makes custom implementations somewhat easier to think about. The code generated by gen_node_support.pl uses a macro called JUMBLE_CUSTOM(), hiding the internals of the logic inside queryjumblefuncs.c. This will be used by an upcoming patch adding a custom routine for a field of RangeTblEntry, but this facility can become useful in more cases. Reviewed-by: Christoph Berg <myon@debian.org> Discussion: https://postgr.es/m/Z9y43-dRvb4EtxQ0@paquier.xyz	2025-03-25 14:18:00 +09:00
Jeff Davis	626df47ad9	Remove 'additional' pointer from TupleHashEntryData. Reduces memory required for hash aggregation by avoiding an allocation and a pointer in the TupleHashEntryData structure. That structure is used for all buckets, whether occupied or not, so the savings is substantial. Discussion: https://postgr.es/m/AApHDvpN4v3t_sdz4dvrv1Fx_ZPw=twSnxuTEytRYP7LFz5K9A@mail.gmail.com Reviewed-by: David Rowley <dgrowleyml@gmail.com>	2025-03-24 22:06:02 -07:00
Jeff Davis	a0942f441e	Add ExecCopySlotMinimalTupleExtra(). Allows an "extra" argument that allocates extra memory at the end of the MinimalTuple. This is important for callers that need to store additional data, but do not want to perform an additional allocation. Suggested-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvppeqw2pNM-+ahBOJwq2QmC0hOAGsmCpC89QVmEoOvsdg@mail.gmail.com	2025-03-24 22:05:53 -07:00
Jeff Davis	4d143509cb	Create accessor functions for TupleHashEntry. Refactor for upcoming optimizations. Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/1cc3b400a0e8eead18ff967436fa9e42c0c14cfb.camel@j-davis.com	2025-03-24 22:05:41 -07:00
Jeff Davis	cc721c459d	HashAgg: use Bump allocator for hash TupleHashTable entries. The entries aren't freed until the entire hash table is destroyed, so use the Bump allocator to improve allocation speed, avoid wasting space on the chunk header, and avoid wasting space due to the power-of-two allocations. Discussion: https://postgr.es/m/CAApHDvqv1aNB4cM36FzRwivXrEvBO_LsG_eQ3nqDXTjECaatOQ@mail.gmail.com Reviewed-by: David Rowley	2025-03-24 22:05:33 -07:00
Amit Kapila	b87ced747d	Fix an oversight in `3abe9dc188`. Forgot to update the comment atop one of the functions. Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Discussion: https://postgr.es/m/OSCPR01MB1496623BE1125B44614494E7AF5A72@OSCPR01MB14966.jpnprd01.prod.outlook.com	2025-03-25 09:26:23 +05:30
Andres Freund	adb5f85fa5	Redefine max_files_per_process to control additionally opened files Until now max_files_per_process=N limited each backend to open N files in total (minus a safety factor), even if there were already more files opened in postmaster and inherited by backends. Change max_files_per_process to control how many additional files each process is allowed to open. The main motivation for this is the patch to add io_method=io_uring, which needs to open one file for each backend. Without this patch, even if RLIMIT_NOFILE is high enough, postmaster will fail in set_max_safe_fds() if started with a high max_connections. The cause of the failure is that, until now, set_max_safe_fds() subtracted the already open files from max_files_per_process. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/w6uiicyou7hzq47mbyejubtcyb2rngkkf45fk4q7inue5kfbeo@bbfad3qyubvs Discussion: https://postgr.es/m/CAGECzQQh6VSy3KG4pN1d=h9J=D1rStFCMR+t7yh_Kwj-g87aLQ@mail.gmail.com	2025-03-24 18:20:18 -04:00
Melanie Plageman	aea916fe55	Fix bitmapheapscan incorrect recheck of NULL tuples The bitmap heap scan skip fetch optimization skips fetching the heap block when a page is set all-visible in the visibility map and no columns from the table are needed to satisfy the query. `2b73a8cd33` and `c3953226a0` changed the control flow of bitmap heap scan to use the read stream API. The read stream API returns buffers containing blocks to the user. To make this work with the skip fetch optimization, we keep a count of the empty tuples we need to emit for all the blocks skipped and only emit the empty tuples after processing the next block fetched from the heap or at the end of the scan. It's incorrect to recheck NULL tuples, so we must set `recheck` to false before yielding control back to BitmapHeapNext(). This was done before emitting any remaining empty tuples at the end of the scan but not for empty tuples emitted during the scan. This meant that if a page fetched from the heap did require recheck and set `recheck` to true and then we emitted empty tuples for subsequent blocks, we would get wrong results. Fix this by always setting `recheck` to false before emitting empty tuples. Reported-by: Alexander Lakhin <exclusion@gmail.com> Tested-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/496f7acd-881c-4df3-9bd3-8f8534dfec26%40gmail.com	2025-03-24 16:40:59 -04:00
Amit Kapila	73eba5004a	Detect and Log multiple_unique_conflicts type conflict. Introduce a new conflict type, multiple_unique_conflicts, to handle cases where an incoming row during logical replication violates multiple UNIQUE constraints. Previously, the apply worker detected and reported only the first encountered key conflict (insert_exists/update_exists), causing repeated failures as each constraint violation needs to be handled one by one making the process slow and error-prone. With this patch, the apply worker checks all unique constraints upfront once the first key conflict is detected and reports multiple_unique_conflicts if multiple violations exist. This allows users to resolve all conflicts at once by deleting all conflicting tuples rather than dealing with them individually or skipping the transaction. In the future, this will also allow us to specify different resolution handlers for such a conflict type. Add the stats for this conflict type in pg_stat_subscription_stats. Author: Nisha Moond <nisha.moond412@gmail.com> Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Discussion: https://postgr.es/m/CABdArM7FW-_dnthGkg2s0fy1HhUB8C3ELA0gZX1kkbs1ZZoV3Q@mail.gmail.com	2025-03-24 12:30:44 +05:30
Michael Paquier	2a0cd38da5	Allow plugins to set a 64-bit plan identifier in PlannedStmt This field can be optionally set in a PlannedStmt through the planner hook, giving extensions the possibility to assign an identifier related to a computed plan. The backend is changed to report it in the backend entry of a process running (including the extended query protocol), with semantics and APIs to set or get it similar to what is used for the existing query ID (introduced in the backend via `4f0b0966c8`). The plan ID is reset at the same timing as the query ID. Currently, this information is not added to the system view pg_stat_activity; extensions can access it through PgBackendStatus. Some patches have been proposed to provide some features in the planning area, where a plan identifier is used as a key to know the plan involved (for statistics, plan storage and manipulations, etc.), and the point of this commit is to provide an anchor in the backend that extensions can rely on for future work. The reset of the plan identifier is controlled by core and follows the same pattern as the query identifier added in `4f0b0966c8`. The contents of this commit are extracted from a larger set proposed originally by Lukas Fittl, that Sami Imseih has proposed as an independent change, with a few tweaks sprinkled by me. Author: Lukas Fittl <lukas@fittl.com> Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAP53Pkyow59ajFMHGpmb1BK9WHDypaWtUsS_5DoYUEfsa_Hktg@mail.gmail.com Discussion: https://postgr.es/m/CAA5RZ0vyWd4r35uUBUmhngv8XqeiJUkJDDKkLf5LCoWxv-t_pw@mail.gmail.com	2025-03-24 13:23:42 +09:00
Heikki Linnakangas	2817525f0d	Fix rare assertion failure in standby, if primary is restarted During hot standby, ExpireAllKnownAssignedTransactionIds() and ExpireOldKnownAssignedTransactionIds() functions mark old transactions as no-longer running, but they failed to update xactCompletionCount and latestCompletedXid. AFAICS it would not lead to incorrect query results, because those functions effectively turn in-progress transactions into aborted transactions and an MVCC snapshot considers both as "not visible". But it could surprise GetSnapshotDataReuse() and trigger the "TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin))" assertion in it, if the apparent xmin in a backend would move backwards. We saw this happen when GetCatalogSnapshot() would reuse an older catalog snapshot, when GetTransactionSnapshot() had already advanced TransactionXmin. The bug goes back all the way to commit `623a9ba79b` in v14 that introduced the snapshot reuse mechanism, but it started to happen more frequently with commit `952365cded` which removed a GetTransactionSnapshot() call from backend startup. That made it more likely for ExpireOldKnownAssignedTransactionIds() to be called between GetCatalogSnapshot() and the first GetTransactionSnapshot() in a backend. Andres Freund first spotted this assertion failure on buildfarm member 'skink'. Reproduction and analysis by Tomas Vondra. Backpatch-through: 14 Discussion: https://www.postgresql.org/message-id/oey246mcw43cy4qw2hqjmurbd62lfdpcuxyqiu7botx3typpax%40h7o7mfg5zmdj	2025-03-23 20:41:16 +02:00
Andres Freund	ca3067cc57	aio: Change prefix of PgAioResultStatus values to PGAIO_RS_ The previous prefix wasn't consistent with the naming of other AIO related enum values. It seems best to rename it before the users are introduced. Reported-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_Yb+JzQpNsgUxCB0gBi+sE-mi_HmcJF6ALnmO4W+UgwpA@mail.gmail.com	2025-03-22 17:30:44 -04:00
Peter Geoghegan	9a2e2a285a	Improve nbtree array primitive scan scheduling. Add a new scheduling heuristic: don't end the ongoing primitive index scan immediately (at the point where _bt_advance_array_keys notices that the next set of matching tuples must be on a later page) if the primscan already managed to step right/left from its first leaf page. Schedule a recheck against the next sibling leaf page's finaltup instead. The new heuristic tends to avoid scenarios where the top-level scan repeatedly starts and ends primitive index scans that each read only one leaf page from a group of neighboring leaf pages. Affected top-level scans will now tend to step forward (or backward) through the index instead, without wasting cycles on descending the index anew. The recheck mechanism isn't exactly new. But up until now it has only been used to deal with edge cases involving high key finaltups with one or more truncated -inf attributes that _bt_advance_array_keys deemed "provisionally satisfied" (satisfied for the purposes of allowing the scan to step onto the next page, subject to recheck once on that page). The mechanism was added by commit `5bf748b8`, which invented the general concept of primitive scan scheduling. It was later enhanced by commit `79fa7b3b`, which taught it about cases involving -inf attributes that satisfy inequality scan keys required in the opposite-to-scan direction only (arguably, they should have been covered by the earliest version). Now the recheck mechanism can be applied based on scan-level heuristics, which have nothing to do with truncated high keys. Now rechecks might be performed by _bt_readpage when scanning in _either_ scan direction. The theory behind the new heuristic is that any primitive scan that makes it past its first leaf page is one that is already likely to have arrays whose key values match index tuples that are closely clustered together in the index. The rules that determine whether we ever get past the first page are still conservative (that'll still only happen when pstate.finaltup strongly suggests that it's the right thing to do). Surviving past the first leaf page is a strong signal in itself. Preparation for an upcoming patch that will add skip scan optimizations to nbtree. That'll work by adding skip arrays, which behave similarly to SAOP arrays, but generate their elements procedurally and on-demand. Note that this commit isn't specifically concerned with skip arrays; the scheduling logic doesn't (and won't) condition anything on whether the scan uses skip arrays, SAOP arrays, or some combination of the two (which seems like a good general principle for _bt_advance_array_keys). While the problems that this commit ameliorates are more likely with skip arrays (at least in practice), SAOP arrays (or those with very dense, contiguous array elements) are also affected. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wzkz0wPe6+02kr+hC+JJNKfGtjGTzpG3CFVTQmKwWNrXNw@mail.gmail.com	2025-03-22 13:02:18 -04:00
Melanie Plageman	e215166c9c	Use streaming read I/O in SP-GiST vacuuming Like `69273b818b` did for GiST vacuuming, make SP-GiST vacuum use the read stream API for vacuuming physically contiguous index pages. Concurrent insertions may cause SP-GiST index tuples to be redirected. While vacuuming, these are added to a pending list which is later processed to ensure no dead tuples are left behind. Pages containing such tuples are still read by directly calling ReadBuffer() and do not use the read stream API. Author: Andrey M. Borodin <x4mmm@yandex-team.ru> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/37432403-8657-403B-9CDF-5A642BECDD81%40yandex-team.ru	2025-03-21 17:51:22 -04:00
Thomas Munro	e51ca405ed	Fix ps display for IO workers. This code must have missed a memo about the backend type description being supplied automatically these days, and was duplicating that information. Before: "io worker io worker: N" After: "io worker N"	2025-03-22 10:13:23 +13:00
Masahiko Sawada	04ff636cbc	Add GUC option to control maximum active replication origins. This commit introduces a new GUC option max_active_replication_origins to control the maximum number of active replication origins. Previously, this was controlled by 'max_replication_slots'. Having a separate GUC option provides better flexibility for setting up subscribers, as they may not require replication slots (for cascading replication) but always require replication origins. Author: Euler Taveira <euler@eulerto.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: vignesh C <vignesh21@gmail.com> Discussion: https://postgr.es/m/b81db436-8262-4575-b7c4-bc0c1551000b@app.fastmail.com	2025-03-21 12:20:15 -07:00
Tom Lane	cd72c1b76e	Label the contents of pg_*_d.h files a little better. Make genbki.pl emit some boilerplate comments identifying the sections of the pg_*_d.h files that it generates. This is in hopes of making them slightly more readable, in case people look at those files and not the pg_*.h/pg_*.dat originals. Discussion: https://postgr.es/m/1134562.1742507765@sss.pgh.pa.us	2025-03-21 15:09:46 -04:00
Melanie Plageman	69273b818b	Use streaming read I/O in GiST vacuuming Like `c5c239e26e` did for btree vacuuming, make GiST vacuum use the read stream API for sequentially processed pages. Because it is possible for concurrent insertions to relocate unprocessed index entries to already vacuumed pages, GiST vacuum must backtrack and reprocess those pages. These pages are still read with explicit ReadBuffer() calls. Author: Andrey M. Borodin <x4mmm@yandex-team.ru> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/EFEBED92-18D1-4C0F-A4EB-CD47072EF071%40yandex-team.ru	2025-03-21 14:06:45 -04:00
Melanie Plageman	3f850c3fc5	Assorted trivial cleanup of `c5c239e26e` `c5c239e26e` made btree vacuum use the read stream API. Though it used functions declared in read_stream.h, it relied on transitively including it. Explicitly include that file. Also remove an extraneous newline and decrease the scope of one of the local variables in btvacuumscan().	2025-03-21 14:06:40 -04:00
Melanie Plageman	c5c239e26e	Use streaming read I/O in btree vacuuming Btree vacuum processes all index pages in physical order. Now it uses the read stream API to get the next buffer instead of explicitly invoking ReadBuffer(). It is possible for concurrent insertions to cause page splits during index vacuuming. This can lead to index entries that have yet to be vacuumed being moved to pages that have already been vacuumed. Btree vacuum code handles this by backtracking to reprocess those pages. So, while sequentially encountered pages are now read through the read stream API, backtracked pages are still read with explicit ReadBuffer() calls. Author: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_bW1UOyup%3DjdFw%2BkOF9bCaAm%3D9UpiyZtbPMn8n_vnP%2Big%40mail.gmail.com#3b3a84132fc683b3ee5b40bc4c2ea2a5	2025-03-21 09:09:39 -04:00
Álvaro Herrera	1d617a2028	Change one loop in ATRewriteTable to use 1-based attnums All TupleDescAttr() calls in tablecmds.c that aren't in loops across all attributes use AttrNumber-style indexes (1-based); there was only one place in ATRewriteTable that was stashing 0-based indexes in a list for later processing. Switch that to use attnums for consistency. Author: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/CACJufxEoYA5ScUr2=CmA1xcpaS_1ixneDbEkVU77X1ctGxY2mA@mail.gmail.com	2025-03-21 10:55:06 +01:00
Thomas Munro	ce1a75c4fe	Support buffer forwarding in StartReadBuffers(). StartReadBuffers() reports a short read when it finds a cached block that ends a range needing I/O by updating the caller's nblocks. It doesn't want to have to unpin the trailing hit that it knows the caller wants, so the v17 version used sleight of hand in the name of simplicity: it included it in nblocks as if it were part of the I/O, but internally tracked the shorter real I/O size in io_buffers_len (now removed). This API change "forwards" the delimiting buffer to the next call. It's still pinned, and still stored in the caller's array, but *nblocks no longer includes stray buffers that are not really part of the operation. The expectation is that the caller still wants the rest of the blocks and will call again starting from that point, and now it can pass the already pinned buffer back in (or choose not to and release it). The change is needed for the coming asynchronous I/O version's larger version of the problem: by definition it must move BM_IO_IN_PROGRESS negotiation from WaitReadBuffers() to StartReadBuffers(), but it might already have many buffers pinned before it discovers a need to split an I/O. (The current synchronous I/O version hides that detail from callers by looping over smaller reads if required to make all covered buffers valid in WaitReadBuffers(), so it looks like one operation but it might occasionally be several under the covers.) Aside from avoiding unnecessary pin traffic, this will also be important for later work on out-of-order streams: you can't prioritize data that is already available right now if that fact is hidden from you. The new API is natural for read_stream.c (see `ed0b87ca`). After a short read it leaves forwarded buffers where they fell in its circular queue for the continuing call to pick up. Single-block StartReadBuffer() and traditional ReadBuffer() share code but are not affected by the change. They don't do multi-block I/O. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-21 20:43:59 +13:00
Thomas Munro	ed0b87caac	Support buffer forwarding in read_stream.c. In preparation for a follow-up change to the buffer manager, teach read_stream.c to manage buffers "forwarded" from one StartReadBuffers() call to the next after a short read. This involves a small amount of extra book-keeping, and opens the way for lower levels to split I/O operations without having to drop pins, as required for efficient handling of various edge cases. Concretely, the "buffers" argument will change from an out parameter to an in/out parameter. Buffer queue elements must be initialized on first use and cleared after they're consumed, but forwarded buffers are left where they fall ahead of the current pending read in the queue, ready for use by the operation that continues where a short read left off. The stream also needs to count them for pin limit management and release them on reset/early end. Tested-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-21 18:44:47 +13:00
David Rowley	00b52c3db6	Simplify EXPLAIN code for Memoize This removes a needless special case for Memoize's FORMAT TEXT EXPLAIN output. ExplainPropertyText() outputs the same thing in text mode as the special-case code was doing, so removing the special-case code results in the same EXPLAIN output, just with less code. It seems like a good idea to fix this to help prevent future changes in this area from copying the same pattern. Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> Reported-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/88a71bcd-0b5c-4d0b-8107-757e96f402d5@tantorlabs.com	2025-03-21 13:40:05 +13:00
Andres Freund	202b12774d	bufmgr: Improve stats when a buffer is read in concurrently Previously we would have the following inaccuracies when a backend tried to read in a buffer, but that buffer was read in concurrently by another backend: - the read IO was double-counted in the global buffer access stats (pgBufferUsage) - the buffer hit was not accounted for in: - global buffer access statistics - pg_stat_io - relation level IO stats - vacuum cost balancing While trying to read in a buffer that is concurrently read in by another backend is not a common occurrence, it's also not that rare, e.g. due to concurrent sequential scans on the same relation. This scenario has become more likely in PG 17, due to the introduction of read streams, which can pin multiple buffers before calling StartBufferIO() for all the buffers. This behaviour has historically grown, but there doesn't seem to be any reason to continue with the wrong accounting. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_Zk-B08AzPsO-6680LUHLOCGaNJYofaxTFseLa=OepV1g@mail.gmail.com	2025-03-20 19:58:22 -04:00
Andres Freund	fc51a60dd4	smgr: Hold interrupts in most smgr functions We need to hold interrupts across most of the smgr.c/md.c functions, as otherwise interrupt processing, e.g. due to a < ERROR elog/ereport, can trigger procsignal processing, which in turn can trigger smgrreleaseall(). As the relevant code is not reentrant, we quickly end up in a bad situation. The only reason we haven't noticed this before is that there is only one non-error ereport called in affected routines, in register_dirty_segments(), and that one is extremely rarely reached. If one enables fd.c's FDDEBUG it's easy to reproduce crashes. It seems better to put the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() in smgr.c, instead of trying to push them down to md.c where possible: For one, every smgr implementation would be vulnerable, for another, a good bit of smgr.c code itself is affected too. Eventually we might want a more targeted solution, allowing e.g. a networked smgr implementation to be interrupted, but many other, more complicated, problems would need to be fixed for that to be viable (e.g. smgr.c is often called with interrupts already held). One could argue this should be backpatched, but the existing < ERROR elog/ereports that can be reached with unmodified sources are unlikely to be reached. On balance the risk of backpatching seems higher than the gain - at least for now. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl	2025-03-20 17:33:57 -04:00
Robert Haas	50ba65e733	Add an additional hook for EXPLAIN option validation. Commit `c65bc2e1d1` made it possible for loadable modules to add EXPLAIN options. Normally, any necessary validation can be performed by the hook function passed to RegisterExtensionExplainOption, but if a loadable module wants to sanity check options against each other, that needs to be done after the entire options list has been processed. So, add an additional hook for that purpose. Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: http://postgr.es/m/CAA5RZ0vOcJF91O2e5AQN+V6guMNLMhJx83dxALf-iUZ-hLGO_Q@mail.gmail.com	2025-03-20 13:47:55 -04:00
Nathan Bossart	0164a0f9ee	Add vacuum_truncate configuration parameter. This new parameter works just like the storage parameter of the same name: if set to true (which is the default), autovacuum and VACUUM attempt to truncate any empty pages at the end of the table. It is primarily intended to help users avoid locking issues on hot standbys. The setting can be overridden with the storage parameter or VACUUM's TRUNCATE option. Since there's presently no way to determine whether a Boolean storage parameter is explicitly set or has just picked up the default value, this commit also introduces an isset_offset member to relopt_parse_elt. Suggested-by: Will Storey <will@summercat.com> Author: Nathan Bossart <nathandbossart@gmail.com> Co-authored-by: Gurjeet Singh <gurjeet@singh.im> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Robert Treat <rob@xzilla.net> Discussion: https://postgr.es/m/Z2DE4lDX4tHqNGZt%40dev.null	2025-03-20 10:16:50 -05:00
Peter Eisentraut	618c64ffd3	Revert workarounds for -Wmissing-braces false positives on old GCC We have collected several instances of a workaround for GCC bug 53119, which caused false-positive compiler warnings. This bug has long been fixed, but was still seen on the buildfarm, most recently on lapwing with gcc (Debian 4.7.2-5). (The GCC bug tracker mentions that a fix was backported to 4.7.4 and 4.8.3.) That compiler no longer runs warning-free since commit `6fdd5d9563`, so we don't need to keep these workarounds. And furthermore, the consensus appears to be that we don't want to keep supporting that era of platform anymore at all. This reverts the following commits: `d937904cce` `506428d091` `b449afb582` `6392f2a096` `bad0763a4d` `5e0c761d0a` and makes a few similar fixes to newer code. Discussion: https://www.postgresql.org/message-id/flat/e170d61f-01ab-4cf9-ab68-91cd1fac62c5%40eisentraut.org Discussion: https://www.postgresql.org/message-id/flat/CA%2BTgmoYEAm-KKZibAP3hSqbTFTjUd47XtVcf3xSFDpyecXX9uQ%40mail.gmail.com	2025-03-20 11:25:58 +01:00
Peter Eisentraut	47929324c5	Fix typo in comment	2025-03-20 10:44:12 +01:00
Peter Eisentraut	190dc27998	Update a code comment The comment explained that ALTER TABLE ADD CONSTRAINT USING INDEX is only supported with a btree index. (This is not being changed.) The reason is to keep upgrades robust, as explained there. The other part of the comment, that btree is the only unique index kind anyway, is somewhat less true as we're trying to enable unique indexes other than btree, and it's irrelevant to this check. There is a check for indisunique earlier already. So just remove this part of the comment. Author: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-03-19 10:39:06 +01:00
Peter Eisentraut	4f7f7b0375	extension_control_path The new GUC extension_control_path specifies a path to look for extension control files. The default value is $system, which looks in the compiled-in location, as before. The path search uses the same code and works in the same way as dynamic_library_path. Some use cases of this are: (1) testing extensions during package builds, (2) installing extensions outside security-restricted containers like Python.app (on macOS), (3) adding extensions to PostgreSQL running in a Kubernetes environment using operators such as CloudNativePG without having to rebuild the base image for each new extension. There is also a tweak in Makefile.global so that it is possible to install extensions using PGXS into a different directory than the default, using 'make install prefix=/else/where'. This previously only worked when specifying the subdirectories, like 'make install datadir=/else/where/share pkglibdir=/else/where/lib', for purely implementation reasons. (Of course, without the path feature, installing elsewhere was rarely useful.) Author: Peter Eisentraut <peter@eisentraut.org> Co-authored-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: David E. Wheeler <david@justatheory.com> Reviewed-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com> Reviewed-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com> Reviewed-by: Niccolò Fei <niccolo.fei@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E7C7BFFB-8857-48D4-A71F-88B359FADCFD@justatheory.com	2025-03-19 07:03:20 +01:00
Amit Langote	28317de723	Ensure first ModifyTable rel initialized if all are pruned Commit `cbc127917e` introduced tracking of unpruned relids to avoid processing pruned relations, and changed ExecInitModifyTable() to initialize only unpruned result relations. As a result, MERGE statements that prune all target partitions can now lead to crashes or incorrect behavior during execution. The crash occurs because some executor code paths rely on ModifyTableState.resultRelInfo[0] being present and initialized, even when no result relations remain after pruning. For example, ExecMerge() and ExecMergeNotMatched() use the first resultRelInfo to determine the appropriate action. Similarly, ExecInitPartitionInfo() assumes that at least one result relation exists. To preserve these assumptions, ExecInitModifyTable() now includes the first result relation in the initialized result relation list if all result relations for that ModifyTable were pruned. To enable that, ExecDoInitialPruning() ensures the first relation is locked if it was pruned and locking is necessary. To support this exception to the pruning logic, PlannedStmt now includes a list of RT indexes identifying the first result relation of each ModifyTable node in the plan. This allows ExecDoInitialPruning() to check whether each such relation was pruned and, if so, lock it if necessary. Bug: #18830 Reported-by: Robins Tharakan <tharakan@gmail.com> Diagnozed-by: Tender Wang <tndrwang@gmail.com> Diagnozed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Co-authored-by: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/18830-1f31ea1dc930d444%40postgresql.org	2025-03-19 12:14:24 +09:00
Thomas Munro	06fb5612c9	Increase io_combine_limit range to 1MB. The default of 128kB is unchanged, but the upper limit is changed from 32 blocks to 128 blocks, unless the operating system's IOV_MAX is too low. Some other RDBMSes seem to cap their multi-block buffer pool I/O around this number, and it seems useful to allow experimentation. The concrete change is to our definition of PG_IOV_MAX, which provides the maximum for io_combine_limit and io_max_combine_limit. It also affects a couple of other places that work with arrays of struct iovec or smaller objects on the stack, so we still don't want to use the system IOV_MAX directly without a clamp: it is not under our control and likely to be 1024. 128 seems acceptable for our current usage. For Windows, we can't use real scatter/gather yet, so we continue to define our own IOV_MAX value of 16 and emulate preadv()/pwritev() with loops. Someone would need to research the trade-offs of raising that number. NB if trying to see this working: you might temporarily need to hack BAS_BULKREAD to be bigger, since otherwise the obvious way of "a very big SELECT" is limited by that for now. Suggested-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-19 15:40:35 +13:00
Thomas Munro	10f6646847	Introduce io_max_combine_limit. The existing io_combine_limit can be changed by users. The new io_max_combine_limit is fixed at server startup time, and functions as a silent clamp on the user setting. That in itself is probably quite useful, but the primary motivation is: aio_init.c allocates shared memory for all asynchronous IOs including some per-block data, and we didn't want to waste memory you'd never use by assuming they could be up to PG_IOV_MAX. This commit already halves the size of 'AioHandleIov' and 'AioHandleData'. A follow-up commit can now expand PG_IOV_MAX without affecting that. Since our GUC system doesn't support dependencies or cross-checks between GUCs, the user-settable one now assigns a "raw" value to io_combine_limit_guc, and the lower of io_combine_limit_guc and io_max_combine_limit is maintained in io_combine_limit. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-19 15:23:54 +13:00
Michael Paquier	17d8bba6da	Fix copy-paste error related to the autovacuum launcher in pgstat_io.c Autovacuum launchers perform no WAL IO reads, but pgstat_tracks_io_op() was tracking them as an allowed combination for the "init" and "normal" contexts. This caused the "read", "read_bytes" and "read_time" attributes of pg_stat_io to show zeros for the autovacuum launcher rather than NULL. NULL means that a combination of IO object, IO context and IO operation has no meaning for a backend type. Zero is the same as telling that a combination is relevant, and that WAL reads are possible in an autovacuum launcher, but it is not relevant. Copy-pasto introduced in `a051e71e28`. Author: Ranier Vilela <ranier.vf@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CAEudQAopEMAPiUqE7BvDV+x2fUPmKmb9RrsaoDR+hhQzLKg4PQ@mail.gmail.com	2025-03-19 08:52:10 +09:00
Masahiko Sawada	f4290f20dd	Fix assertion failure in parallel vacuum with minimal maintenance_work_mem setting. `bbf668d66f` lowered the minimum value of maintenance_work_mem to 64kB. However, in parallel vacuum cases, since the initial underlying DSA size is 256kB, it attempts to perform a cycle of index vacuuming and table vacuuming with an empty TID store, resulting in an assertion failure. This commit ensures that at least one page is processed before index vacuuming and table vacuuming begins. Backpatch to 17, where the minimum maintenance_work_mem value was lowered. Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAD21AoCEAmbkkXSKbj4dB+5pJDRL4ZHxrCiLBgES_g_g8mVi1Q@mail.gmail.com Backpatch-through: 17	2025-03-18 16:37:02 -07:00
Michael Paquier	6d3ea48ff1	Optimize check for pending backend IO stats This commit changes the backend stats code so as we rely on a single boolean rather than a repeated check based on pg_memory_is_all_zeros() in the code, making it cheaper should PgStat_PendingIO get bigger in size. The frequency of backend stats reports is not a bottleneck, but there is no reason to not make that cheaper, and the logic is simple as the only entry points updating backend IO stats are pgstat_count_backend_io_op() and pgstat_count_backend_io_op_time(). Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/Z8WYf1jyy4MwOveQ@ip-10-97-1-34.eu-west-3.compute.internal	2025-03-19 08:03:06 +09:00
Andres Freund	499faf9063	smgr: Make SMgrRelation initialization safer against errors In case the smgr_open callback failed, the ->pincount field would not be initialized and the relation would not be put onto the unpinned_relns list. This buglet was introduced in `21d9c3ee4e`, in 17. Discussion: https://postgr.es/m/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl Backpatch-through: 17	2025-03-18 14:04:44 -04:00
Álvaro Herrera	62d712ecfd	Introduce squashing of constant lists in query jumbling pg_stat_statements produces multiple entries for queries like SELECT something FROM table WHERE col IN (1, 2, 3, ...) depending on the number of parameters, because every element of ArrayExpr is individually jumbled. Most of the time that's undesirable, especially if the list becomes too large. Fix this by introducing a new GUC query_id_squash_values which modifies the node jumbling code to only consider the first and last element of a list of constants, rather than each list element individually. This affects both the query_id generated by query jumbling, as well as pg_stat_statements query normalization so that it suppresses printing of the individual elements of such a list. The default value is off, meaning the previous behavior is maintained. Author: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Sergey Dudoladov (mysterious, off-list) Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Sutou Kouhei <kou@clear-code.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Marcos Pegoraro <marcos@f10.com.br> Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Reviewed-by: Zhihong Yu <zyu@yugabyte.com> Tested-by: Yasuo Honda <yasuo.honda@gmail.com> Tested-by: Sergei Kornilov <sk@zsrv.org> Tested-by: Maciek Sakrejda <m.sakrejda@gmail.com> Tested-by: Chengxi Sun <sunchengxi@highgo.com> Tested-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Discussion: https://postgr.es/m/CA+q6zcWtUbT_Sxj0V6HY6EZ89uv5wuG5aefpe_9n0Jr3VwntFg@mail.gmail.com	2025-03-18 18:56:11 +01:00
Andres Freund	247ce06b88	aio: Add io_method=worker The previous commit introduced the infrastructure to start io_workers. This commit actually makes the workers execute IOs. IO workers consume IOs from a shared memory submission queue, run traditional synchronous system calls, and perform the shared completion handling immediately. Client code submits most requests by pushing IOs into the submission queue, and waits (if necessary) using condition variables. Some IOs cannot be performed in another process due to lack of infrastructure for reopening the file, and must be processed synchronously by the client code when submitted. For now the default io_method is changed to "worker". We should re-evaluate that around beta1, we might want to be careful and set the default to "sync" for 18. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-18 11:54:01 -04:00
Andres Freund	55b454d0e1	aio: Infrastructure for io_method=worker This commit contains the basic, system-wide, infrastructure for io_method=worker. It does not yet actually execute IO, this commit just provides the infrastructure for running IO workers, kept separate for easier review. The number of IO workers can be adjusted with a PGC_SIGHUP GUC. Eventually we'd like to make the number of workers dynamically scale up/down based on the current "IO load". To allow the number of IO workers to be increased without a restart, we need to reserve PGPROC entries for the workers unconditionally. This has been judged to be worth the cost. If it turns out to be problematic, we can introduce a PGC_POSTMASTER GUC to control the maximum number. As io workers might be needed during shutdown, e.g. for AIO during the shutdown checkpoint, a new PMState phase is added. IO workers are shut down after the shutdown checkpoint has been performed and walsender/archiver have shut down, but before the checkpointer itself shuts down. See also `87a6690cc6`. Updates PGSTAT_FILE_FORMAT_ID due to the addition of a new BackendType. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-18 11:54:01 -04:00
Daniel Gustafsson	daa02c6bd9	Add X25519 to the default set of curves Since many clients default to the X25519 curve in the TLS handshake, the fact that the server by default doesn't support it causes an extra roundtrip for each TLS connection. By adding multiple curves, which is supported since `3d1ef3a15c`, we can reduce the risk of extra roundtrips. Author: Daniel Gustafsson <daniel@yesql.se> Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com> Reported-by: Andres Freund <andres@anarazel.de> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Discussion: https://postgr.es/m/20240616234612.6cslu7nqexquvwj7@awork3.anarazel.de	2025-03-18 15:26:27 +01:00
Robert Haas	4fd02bf7cf	Add some new hooks so extensions can add details to EXPLAIN. Specifically, add a per-node hook that is called after the per-node information has been displayed but before we display children, and a per-query hook that is called after existing query-level information is printed. This assumes that extension-added information should always go at the end rather than the beginning or the middle, but that seems like an acceptable limitation for simplicity. It also assumes that extensions will only want to add information, not remove or reformat existing details; those also seem like acceptable restrictions, at least for now. If multiple EXPLAIN extensions are used, the order in which any additional details are printed is likely to depend on the order in which the modules are loaded. That seems OK, since the user may have opinions about the order in which output should appear, and the extension author can't really know whether their stuff is more or less important to a particular user than some other extension. Discussion: http://postgr.es/m/CA+TgmoYSzg58hPuBmei46o8D3SKX+SZoO4K_aGQGwiRzvRApLg@mail.gmail.com Reviewed-by: Srinath Reddy <srinath2133@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Sami Imseih <samimseih@gmail.com>	2025-03-18 09:28:01 -04:00
Melanie Plageman	cc6be07ebd	Increase default maintenance_io_concurrency to 16 Since its introduction in `fc34b0d9de`, the default maintenance_io_concurrency has been larger than the default effective_io_concurrency. maintenance_io_concurrency primarily controlled prefetching done on behalf of the whole system, for operations like recovery. Therefore it makes sense for it to have a value equal to or greater than effective_io_concurrency, which controls I/O concurrency for reading a relation in a bitmap heap scan. `ff79b5b2ab` increased effective_io_concurrency to 16, so we'll increase maintenance_io_concurrency as well. For now, though, we'll keep the defaults of effective_io_concurrency and maintenance_io_concurrency equal to one another (16). On fast, high IOPs systems, significantly higher values of maintenance_io_concurrency are observably beneficial [1]. However, such values would flood low IOPs systems and increase overall system I/O latency. It is worth mentioning that since `9256822608` and `c3e775e608`, maintenance_io_concurrency also controls the I/O concurrency of each vacuum worker. Since many autovacuum workers may be simultaneously issuing I/Os, we want to keep maintenance_io_concurrency appropriately conservative. [1] https://postgr.es/m/c5d52837-6256-0556-ac8c-d6d3d558820a%40enterprisedb.com Suggested-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Discussion: https://postgr.es/m/CAKZiRmxdHQaU%2B2Zpe6d%3Dx%3D0vigJ1sfWwwVYLJAf%3Dud_wQ_VcUw%40mail.gmail.com	2025-03-18 09:08:10 -04:00
Robert Haas	c65bc2e1d1	Make it possible for loadable modules to add EXPLAIN options. Modules can use RegisterExtensionExplainOption to register new EXPLAIN options, and GetExplainExtensionId, GetExplainExtensionState, and SetExplainExtensionState to store related state inside the ExplainState object. Since this substantially increases the amount of code that needs to handle ExplainState-related tasks, move a few bits of existing code to a new file explain_state.c and add the rest of this infrastructure there. See the comments at the top of explain_state.c for further explanation of how this mechanism works. This does not yet provide a way for such options to do anything useful. The intention is that we'll add hooks for that purpose in a separate commit. Discussion: http://postgr.es/m/CA+TgmoYSzg58hPuBmei46o8D3SKX+SZoO4K_aGQGwiRzvRApLg@mail.gmail.com Reviewed-by: Srinath Reddy <srinath2133@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Sami Imseih <samimseih@gmail.com>	2025-03-18 08:41:12 -04:00
Peter Eisentraut	9d6db8bec1	Allow non-btree unique indexes for matviews We were rejecting non-btree indexes in some cases owing to the inability to determine the equality operators for other index AMs; that problem no longer exists, because we can look up the equality operator using COMPARE_EQ. Stop rejecting these indexes, but instead rely on all unique indexes having equality operators. Unique indexes must have equality operators. Author: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-03-18 11:29:15 +01:00
Peter Eisentraut	f278e1fe30	Allow non-btree unique indexes for partition keys We were rejecting non-btree indexes in some cases owing to the inability to determine the equality operators for other index AMs; that problem no longer exists, because we can look up the equality operator using COMPARE_EQ. The problem of not knowing the strategy number for equality in other index AMs is already resolved. Stop rejecting the indexes upfront, and instead reject any for which the equality operator lookup fails. Author: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-03-18 11:25:36 +01:00
Peter Eisentraut	7317e64126	Add some opfamily support functions to lsyscache.c Add get_opfamily_method() and get_opfamily_member_for_cmptype() in lsyscache.c. No callers yet, but we'll add some soon. This is part of generalizing some parts of the code away from having btree hardcoded and use CompareType instead. Author: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-03-18 11:17:43 +01:00
Amit Kapila	122a9af5de	Fix typo. Author: vignesh C <vignesh21@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CALDaNm1KqJ0VFfDJRPbfYi9Shz6LHFEE-Ckn+eqsePfKhebv9w@mail.gmail.com	2025-03-18 14:18:09 +05:30
Amit Kapila	01e27aab05	Use correct variable name in publicationcmds.c. subid was used at few places for publicationid in publicationcmds.c/.h. Author: vignesh C <vignesh21@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CALDaNm1KqJ0VFfDJRPbfYi9Shz6LHFEE-Ckn+eqsePfKhebv9w@mail.gmail.com	2025-03-18 14:06:51 +05:30
Andres Freund	da7226993f	aio: Add core asynchronous I/O infrastructure The main motivations to use AIO in PostgreSQL are: a) Reduce the time spent waiting for IO by issuing IO sufficiently early. In a few places we have approximated this using posix_fadvise() based prefetching, but that is fairly limited (no completion feedback, double the syscalls, only works with buffered IO, only works on some OSs). b) Allow to use Direct-I/O (DIO). DIO can offload most of the work for IO to hardware and thus increase throughput / decrease CPU utilization, as well as reduce latency. While we have gained the ability to configure DIO in `d4e71df6`, it is not yet usable for real world workloads, as every IO is executed synchronously. For portability, the new AIO infrastructure allows to implement AIO using different methods. The choice of the AIO method is controlled by the new io_method GUC. As of this commit, the only implemented method is "sync", i.e. AIO is not actually executed asynchronously. The "sync" method exists to allow to bypass most of the new code initially. Subsequent commits will introduce additional IO methods, including a cross-platform method implemented using worker processes and a linux specific method using io_uring. To allow different parts of postgres to use AIO, the core AIO infrastructure does not need to know what kind of files it is operating on. The necessary behavioral differences for different files are abstracted as "AIO Targets". One example target would be smgr. For boring portability reasons, all targets currently need to be added to an array in aio_target.c. This commit does not implement any AIO targets, just the infrastructure for them. The smgr target will be added in a later commit. Completion (and other events) of IOs for one type of file (i.e. one AIO target) need to be reacted to differently, based on the IO operation and the callsite. This is made possible by callbacks that can be registered on IOs. E.g. 
an smgr read into a local buffer does not need to update the corresponding BufferDesc (as there is none), but a read into shared buffers does. This commit does not contain any callbacks, they will be added in subsequent commits. For now the AIO infrastructure only understands READV and WRITEV operations, but it is expected that more operations will be added. E.g. fsync/fdatasync, flush_range and network operations like send/recv. As of this commit, nothing uses the AIO infrastructure. Later commits will add an smgr target, md.c and bufmgr.c callbacks and then finally use AIO for read_stream.c IO, which, in one fell swoop, will convert all read stream users to AIO. The goal is to use AIO in many more places. There are patches to use AIO for checkpointer and bgwriter that are reasonably close to being ready. There also are prototypes to use it for WAL, relation extension, backend writes and many more. Those prototypes were important to ensure the design of the AIO subsystem is not too limiting (e.g. WAL writes need to happen in critical sections, which influenced a lot of the design). A future commit will add an AIO README explaining the AIO architecture and how to use the AIO subsystem. The README is added later, as it references details only added in later commits. Many many more people than the folks named below have contributed with feedback, work on semi-independent patches etc. E.g. various folks have contributed patches to use the read stream infrastructure (added by Thomas in `b5a9b18cd0`) in more places. Similarly, a lot of folks have contributed to the CI infrastructure, which I had started to work on to make adding AIO feasible. Some of the work by contributors has gone into the "v1" prototype of AIO, which heavily influenced the current design of the AIO subsystem. None of the code from that directly survives, but without the prototype, the current version of the AIO infrastructure would not exist. 
Similarly, the reviewers below have not necessarily looked at the current design or the whole infrastructure, but have provided very valuable input. I am to blame for problems, not they. Author: Andres Freund <andres@anarazel.de> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Antonin Houska <ah@cybertec.at> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-17 18:51:33 -04:00
Andres Freund	02844012b3	aio: Basic subsystem initialization This commit just does the minimal wiring up of the AIO subsystem, added in the next commit, to the rest of the system. The next commit contains more details about motivation and architecture. This commit is kept separate to make it easier to review, separating the changes across the tree, from the implementation of the new subsystem. We discussed squashing this commit with the main commit before merging AIO, but there has been a mild preference for keeping it separate. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-03-17 18:51:33 -04:00
Michael Paquier	5721e5453e	Revert "Add redo LSN to pgstats files" This reverts commit `b860848232`, that was added as a prerequisite for the support of pgstats data flush across checkpoints, linking a pgstats file to a specific checkpoint redo LSN. As reported, this is proving to be currently problematic when going through a pg_upgrade, that does direct manipulations of the control file in the new cluster. The LSN stored in the pgstats file is not able to cope with any changes done in the control file by pg_upgrade yet, causing the pgstats file to be discarded when starting the new cluster after overriding its redo LSN (one is a `pg_resetwal -l` where the new cluster's start LSN is bumped by a hardcoded value of 8 segments, see copy_xact_xlog_xid). The least painful path going forward is likely going to be a refactor of the pgstats code so as it is possible to read and write some of its data with some routines in src/common/, so as pg_upgrade or pg_resetwal are able to update its data. The main point is that we are going to need a LSN in the stats file should we make it written at checkpoint time and not only as part of a shutdown sequence. It is too late to dive into these details for v18, so let's revert the change, and let's try to figure out all the details in the next release cycle. The pgstats file is currently only written as part of a shutdown sequence, and its contents are still lost on crash, same as older releases. Bump PGSTAT_FILE_FORMAT_ID. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/2563883.1741826489@sss.pgh.pa.us	2025-03-17 08:35:12 +09:00
Andres Freund	771ba90298	localbuf: Introduce StartLocalBufferIO() To initiate IO on a shared buffer we have StartBufferIO(). For temporary table buffers no similar function exists - likely because the code for that currently is very simple due to the lack of concurrency. However, the upcoming AIO support will make it possible to re-encounter a local buffer, while the buffer already is the target of IO. In that case we need to wait for already in-progress IO to complete. This commit makes it easier to add the necessary code, by introducing StartLocalBufferIO(). Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com	2025-03-15 22:07:48 -04:00
Andres Freund	4b4d33b9ea	localbuf: Introduce FlushLocalBuffer() Previously we had two paths implementing writing out temporary table buffers. For shared buffers, the logic for that is centralized in FlushBuffer(). Introduce FlushLocalBuffer() to do the same for local buffers. Besides being a nice cleanup on its own, it also makes an upcoming change slightly easier. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com	2025-03-15 22:07:48 -04:00
Andres Freund	dd6f2618f6	localbuf: Introduce TerminateLocalBufferIO() Previously TerminateLocalBufferIO() was open-coded in multiple places, which doesn't seem like a great idea. While TerminateLocalBufferIO() currently is rather simple, an upcoming patch requires additional code to be added to TerminateLocalBufferIO(), making this modification particularly worthwhile. For some reason FlushRelationBuffers() previously cleared BM_JUST_DIRTIED, even though that's never set for temporary buffers. This is not carried over as part of this change. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com	2025-03-15 22:07:48 -04:00
Andres Freund	0762a151b0	localbuf: Introduce InvalidateLocalBuffer() Previously, there were three copies of this code, two of them identical. There's no good reason for that. This change is nice on its own, but the main motivation is the AIO patchset, which needs to add extra checks to the deduplicated code, which of course is easier if there is only one version. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com	2025-03-15 22:07:48 -04:00
Andres Freund	fa6af9b25e	localbuf: Fix dangerous coding pattern in GetLocalVictimBuffer() If PinLocalBuffer() were to modify the buf_state, the buf_state in GetLocalVictimBuffer() would be out of date. Currently that does not happen, as PinLocalBuffer() only modifies the buf_state if adjust_usagecount=true and GetLocalVictimBuffer() passes false. However, it's easy to make this not the case anymore - it cost me a few hours to debug the consequences. The minimal fix would be to just refetch the buf_state after calling PinLocalBuffer(), but the same danger exists in later parts of the function. Instead, declare buf_state in the narrower scopes and re-read the state in conditional branches. Besides being safer, it also fits well with an upcoming set of cleanup patches that move the contents of the conditional branches in GetLocalVictimBuffer() into helper functions. I "broke" this in `794f259447`. Arguably this should be backpatched, but as the relevant functions are not exported and there is no actual misbehaviour, I chose to not backpatch, at least for now. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com	2025-03-15 22:07:48 -04:00
Melanie Plageman	c3953226a0	Remove table AM callback scan_bitmap_next_block After pushing the bitmap iterator into table-AM specific code (as part of making bitmap heap scan use the read stream API in `2b73a8cd33`), scan_bitmap_next_block() no longer returns the current block number. Since scan_bitmap_next_block() isn't returning any relevant information to bitmap table scan code, it makes more sense to get rid of it. Now, bitmap table scan code only calls table_scan_bitmap_next_tuple(), and the heap AM implementation of scan_bitmap_next_block() is a local helper in heapam_handler.c. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com	2025-03-15 10:37:46 -04:00
Melanie Plageman	2b73a8cd33	BitmapHeapScan uses the read stream API Make Bitmap Heap Scan use the read stream API instead of invoking ReadBuffer() for each block indicated by the bitmap. The read stream API handles prefetching, so remove all of the explicit prefetching from bitmap heap scan code. Now, heap table AM implements a read stream callback which uses the bitmap iterator to return the next required block to the read stream code. Tomas Vondra conducted extensive regression testing of this feature. Andres Freund, Thomas Munro, and I analyzed regressions and Thomas Munro patched the read stream API. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Tested-by: Tomas Vondra <tomas@vondra.me> Tested-by: Andres Freund <andres@anarazel.de> Tested-by: Thomas Munro <thomas.munro@gmail.com> Tested-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com	2025-03-15 10:34:42 -04:00
Melanie Plageman	944e81bf99	Separate TBM[Shared|Private]Iterator and TBMIterateResult Remove the TBMIterateResult member from the TBMPrivateIterator and TBMSharedIterator and make tbm_[shared|private_]iterate() take a TBMIterateResult as a parameter. This allows tidbitmap API users to manage multiple TBMIterateResults per scan. This is required for bitmap heap scan to use the read stream API, with which there may be multiple I/Os in flight at once, each one with a TBMIterateResult. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/d4bb26c9-fe07-439e-ac53-c0e244387e01%40vondra.me	2025-03-15 10:11:19 -04:00
Thomas Munro	799959dc7c	Simplify distance heuristics in read_stream.c. Make the distance control heuristics simpler and more aggressive in preparation for asynchronous I/O. The v17 version of read_stream.c made a conservative choice to limit the look-ahead distance when streaming sequential blocks, because it couldn't benefit very much from looking ahead further yet. It had a three-behavior model where only random I/O would rapidly increase the look-ahead distance, to support read-ahead advice. Sequential I/O would move it towards the io_combine_limit setting, just enough to build one full-sized synchronous I/O at a time, and then expect kernel read-ahead to avoid I/O stalls. That already left I/O performance on the table with advice-based I/O concurrency, since sequential blocks could be followed by random jumps, eg with the proposed streaming Bitmap Heap Scan patch. It is time to delete the cautious middle option and adjust the distance based on recent I/O needs only, since asynchronous reads will need to be started ahead of time whether random or sequential. It is still limited by io_combine_limit, *_io_concurrency, buffer availability and strategy ring size, as before. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Tested-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-16 03:05:07 +13:00
Thomas Munro	7ea8cd1566	Improve read_stream.c advice for dense streams. read_stream.c tries not to issue read-ahead advice when it thinks the kernel's own read-ahead should be active, ie when using buffered I/O and reading sequential blocks. It previously gave up too easily, and issued advice only for the first read of up to io_combine_limit blocks in a larger range of sequential blocks after random jump. The following read could suffer an avoidable I/O stall. Fix, by continuing to issue advice until the corresponding preadv() calls catch up with the start of the region we're currently issuing advice for, if ever. That's when the kernel actually sees the sequential pattern. Advice is now disabled only when the stream is entirely sequential as far as we can see in the look-ahead window, or in other words, when a sequential region is larger than we can cover with the current io_concurrency and io_combine_limit settings. While refactoring the advice control logic, also get rid of the "suppress_advice" argument that was passed around between functions to skip useless posix_fadvise() calls immediately followed by preadv(). read_stream_start_pending_read() can figure that out, so let's concentrate knowledge of advice heuristics in fewer places (our goal being to make advice-based I/O concurrency a legacy mode soon). The problem cases were revealed by Tomas Vondra's extensive regression testing with many different disk access patterns using Melanie Plageman's streaming Bitmap Heap Scan patch, in a battle against the venerable always-issue-advice-and-always-one-block-at-a-time code. 
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Reported-by: Melanie Plageman <melanieplageman@gmail.com> Reported-by: Tomas Vondra <tomas@vondra.me> Reported-by: Andres Freund <andres@anarazel.de> Tested-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com Discussion: https://postgr.es/m/CA%2BhUKGJ3HSWciQCz8ekP1Zn7N213RfA4nbuotQawfpq23%2Bw-5Q%40mail.gmail.com	2025-03-15 19:04:54 +13:00
Fujii Masao	6d376c3b0d	Add GUC option to log lock acquisition failures. This commit introduces a new GUC, log_lock_failure, which controls whether a detailed log message is produced when a lock acquisition fails. Currently, it only supports logging lock failures caused by SELECT ... NOWAIT. The log message includes information about all processes holding or waiting for the lock that couldn't be acquired, helping users analyze and diagnose the causes of lock failures. Currently, this option does not log failures from SELECT ... SKIP LOCKED, as that could generate excessive log messages if many locks are skipped, causing unnecessary noise. This mechanism can be extended in the future to support logging lock failures from other commands, such as LOCK TABLE ... NOWAIT. Author: Yuki Seino <seinoyu@oss.nttdata.com> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://postgr.es/m/411280a186cc26ef7034e0f2dfe54131@oss.nttdata.com	2025-03-14 23:14:12 +09:00
Fujii Masao	e80171d57c	Optimize iteration over PGPROC for fast-path lock searches. This commit improves efficiency in FastPathTransferRelationLocks() and GetLockConflicts(), which iterate over PGPROCs to search for fast-path locks. Previously, these functions recalculated the fast-path group during every loop iteration, even though it remained constant. This update optimizes the process by calculating the group once and reusing it throughout the loop. The functions also now skip empty fast-path groups, avoiding unnecessary scans of their slots. Additionally, groups belonging to inactive backends (with pid=0) are always empty, so checking the group is sufficient to bypass these backends, further enhancing performance. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/07d5fd6a-71f1-4ce8-8602-4cc6883f4bd1@oss.nttdata.com	2025-03-14 22:49:29 +09:00
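The optimization described in `e80171d57c` (compute the fast-path group once, then skip backends whose group is empty) can be illustrated with a minimal sketch. Everything here is hypothetical and chosen only for illustration: `DemoProc`, `DEMO_GROUPS`, `DEMO_SLOTS_PER_GROUP`, the modulo-based group choice, and `demo_count_holders()` are not PostgreSQL's actual structures, constants, or functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes, chosen only for this illustration. */
#define DEMO_GROUPS 4
#define DEMO_SLOTS_PER_GROUP 16

/* Hypothetical stand-in for a per-backend structure. */
typedef struct DemoProc
{
	/* One bitmask of occupied slots per group; 0 means the group is empty. */
	uint32_t	group_bits[DEMO_GROUPS];
	/* Relation id stored in each slot. */
	uint32_t	slot_relid[DEMO_GROUPS * DEMO_SLOTS_PER_GROUP];
} DemoProc;

/* Count the backends that hold a fast-path entry for relid. */
static size_t
demo_count_holders(const DemoProc *procs, size_t nprocs, uint32_t relid)
{
	/* The group depends only on relid, so compute it once, outside the loop. */
	size_t		group = relid % DEMO_GROUPS;
	size_t		found = 0;

	for (size_t i = 0; i < nprocs; i++)
	{
		const DemoProc *proc = &procs[i];

		/* An empty group cannot contain a match: skip scanning its slots. */
		if (proc->group_bits[group] == 0)
			continue;

		for (size_t s = 0; s < DEMO_SLOTS_PER_GROUP; s++)
		{
			size_t		idx = group * DEMO_SLOTS_PER_GROUP + s;

			if (((proc->group_bits[group] >> s) & 1) != 0 &&
				proc->slot_relid[idx] == relid)
				found++;
		}
	}
	return found;
}
```

In this sketch a zero-initialized `DemoProc` plays the role of an inactive backend: all of its groups are empty, so the single emptiness check is enough to bypass it.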
Peter Eisentraut	a359d37019	Simplify and generalize PrepareSortSupportFromIndexRel() PrepareSortSupportFromIndexRel() was accepting btree strategy numbers purely for the purpose of comparing it later against btree strategies to determine if the sort direction was forward or reverse. Change that. Instead, pass a bool directly, to indicate the same without an unfortunate assumption that a strategy number refers specifically to a btree strategy. (This is similar in spirit to commits `0d2aa4d493` and c594f1ad2ba.) (This could arguably be simplified further by having the callers fill in ssup_reverse directly. But this way, it preserves consistency by having all PrepareSortSupport*() variants be responsible for filling in ssup_reverse.) Moreover, remove the hardcoded check against BTREE_AM_OID, and check against amcanorder instead, which is the actual requirement. Co-authored-by: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-03-14 10:34:08 +01:00
Álvaro Herrera	1548c3a304	Remove direct handling of reloptions for toast tables It doesn't actually work, even with allow_system_table_mods turned on: the ALTER TABLE operation is rejected by ATSimplePermissions(), so even the error message we're adding in this commit is unreachable. Add a test case for it. Author: Nikolay Shaplov <dhyan@nataraj.su> Discussion: https://postgr.es/m/1913854.tdWV9SEqCh@thinkpad-pgpro	2025-03-14 09:28:51 +01:00
Thomas Munro	92fc6856cb	Respect changing pin limits in read_stream.c. To avoid pinning too much of the buffer pool at once, read_stream.c previously used LimitAdditionalPins(). The coding was naive, and only considered the available buffers at stream construction time. This commit checks before each StartReadBuffers() call with GetAdditionalPinLimit(). The result might change over time due to pins acquired outside this stream by the same backend. No extra CPU cycles are added to the all-buffered fast-path code, but the I/O-starting path now considers the up-to-date remaining buffer limit. In practice it was quite difficult to exceed limits and cause any real problems in v17, so no back-patch for now, but proposed changes will make it easier. Per code review from Andres, in the course of testing his AIO patches. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-14 21:21:09 +13:00
Thomas Munro	01261fb078	Improve buffer manager API for backend pin limits. Previously the support functions assumed that the caller needed one pin to make progress, and could optionally use some more, allowing enough for every connection to do the same. Add a couple more functions for callers that want to know: * what the maximum possible number could be, irrespective of currently held pins, for space planning purposes * how many additional pins they could acquire right now, without the special case allowing one pin, for callers that already hold pins and could already make progress even if no extra pins are available The pin limit logic began in commit `31966b15`. This refactoring is better suited to read_stream.c, which will be adjusted to respect the remaining limit as it changes over time in a follow-up commit. It also computes MaxProportionalPins up front, to avoid performing divisions whenever a caller needs to check the balance. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-14 17:13:09 +13:00
Amit Kapila	7c99dc587a	Fix ALTER SUBSCRIPTION ... SET PUBLICATION ... command. The problem is that ALTER SUBSCRIPTION ... SET PUBLICATION ... will lead to restarting of apply worker and after the restart, the apply worker will use the existing slot and replication origin corresponding to the subscription. Now, it is possible that before the restart, the origin has not been updated, and the WAL start location points to a location before where PUBLICATION pointed to by SET PUBLICATION doesn't exist, and that can lead to an error like: "ERROR: publication "pub1" does not exist". Once this error occurs, apply worker will never be able to proceed and will always return the same error. We decided to skip loading the publication if the publication does not exist. The publication is loaded later and updates the relation entry when the publication gets created. We decided not to backpatch this as this is a behaviour change, and we don't see field reports. This problem has been found by intermittent buildfarm failures. Author: vignesh C <vignesh21@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/flat/CALDaNm0-n8FGAorM%2BbTxkzn%2BAOUyx5%3DL_XmnvOP6T24%2B-NcBKg%40mail.gmail.com Discussion: https://postgr.es/m/CAA4eK1+T-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q@mail.gmail.com	2025-03-14 08:57:40 +05:30
Tom Lane	4618045bee	Fix ARRAY_SUBLINK and ARRAY[] for int2vector and oidvector input. If the given input_type yields valid results from both get_element_type and get_array_type, initArrayResultAny believed the former and treated the input as an array type. However this is inconsistent with what get_promoted_array_type does, leading to situations where the output of an ARRAY() subquery is labeled with the wrong type: it's labeled as oidvector[] but is really a 2-D array of OID. That at least results in strange output, and can result in crashes if further processing such as unnest() is applied. AFAIK this is only possible with the int2vector and oidvector types, which are special-cased to be treated mostly as true arrays even though they aren't quite. Fix by switching the logic to match get_promoted_array_type by testing get_array_type not get_element_type, and remove an Assert thereby made pointless. (We need not introduce a symmetrical check for get_element_type in the other if-branch, because initArrayResultArr will check it.) This restores the behavior that existed before `bac27394a` introduced initArrayResultAny: the output really is int2vector[] or oidvector[]. Comparable confusion exists when an input of an ARRAY[] construct is int2vector or oidvector: transformArrayExpr decides it's dealing with a multidimensional array constructor, and we end up with something that's a multidimensional OID array but is alleged to be of type oidvector. I have not found a crashing case here, but it's easy to demonstrate totally-wrong results. Adjust that code so that what you get is an oidvector[] instead, for consistency with ARRAY() subqueries. (This change also makes these types work like domains-over-arrays in this context, which seems correct.) Bug: #18840 Reported-by: yang lei <ylshiyu@126.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18840-fbc9505f066e50d6@postgresql.org Backpatch-through: 13	2025-03-13 16:07:55 -04:00
Álvaro Herrera	c7fc8808a9	ATExecSetRelOptions: Reduce scope of 'isnull' variable Author: Nikolay Shaplov <dhyan@nataraj.su> Reviewed-by: Timur Magomedov <t.magomedov@postgrespro.ru> Discussion: https://postgr.es/m/1913854.tdWV9SEqCh@thinkpad-pgpro	2025-03-13 18:15:59 +01:00
Álvaro Herrera	da0f0582e8	Make lwlocknames.h generated file less ugly We can make the output look a bit better by aligning each lock's definition, so add some padding space to achieve that. This change makes no practical difference, but casual onlookers will be less distracted by (lack of) whitespace. Author: Gurjeet Singh <gurjeet@singh.im> Discussion: https://postgr.es/m/CABwTF4VxfwDtRV-H22_XK4XeDogaV-Vaobu+af5U=8ZAZn9ZZQ@mail.gmail.com	2025-03-13 17:38:21 +01:00
Nathan Bossart	0697b23906	Add reverse(bytea). This commit introduces a function for reversing the order of the bytes in binary strings. Bumps catversion. Author: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://postgr.es/m/CAJ7c6TMe0QVRuNssUArbMi0bJJK32%2BzNA3at5m3osrBQ25MHuw%40mail.gmail.com	2025-03-13 11:20:53 -05:00
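As a rough illustration of what "reversing the order of the bytes in binary strings" means in `0697b23906`, here is a minimal standalone sketch. `demo_reverse_bytes()` is a hypothetical helper written for this example; it is not the function added by the commit and does not use PostgreSQL's `bytea` representation.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not PostgreSQL code): write the len bytes of src
 * into dst in reverse order. src and dst must not overlap.
 */
static void
demo_reverse_bytes(const unsigned char *src, unsigned char *dst, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] = src[len - 1 - i];
}
```

Because the operation works on raw bytes rather than characters, a zero byte in the input is reversed like any other byte.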
Peter Eisentraut	bb25276205	Fix copy-and-paste mistake in error message Introduced in commit `a68159ff2b`.	2025-03-13 15:17:08 +01:00
Peter Eisentraut	3691edfab9	pg_noreturn to replace pg_attribute_noreturn() We want to support a "noreturn" decoration on more compilers besides just GCC-compatible ones, but for that we need to move the decoration in front of the function declaration instead of either behind it or wherever, which is the current style afforded by GCC-style attributes. Also rename the macro to "pg_noreturn" to be similar to the C11 standard "noreturn". pg_noreturn is now supported on all compilers that support C11 (using _Noreturn), as well as GCC-compatible ones (using __attribute__, as before), as well as MSVC (using __declspec). (When PostgreSQL requires C11, the latter two variants can be dropped.) Now, all supported compilers effectively support pg_noreturn, so the extra code for !HAVE_PG_ATTRIBUTE_NORETURN can be dropped. This also fixes a possible problem if third-party code includes stdnoreturn.h, because then the current definition of #define pg_attribute_noreturn() __attribute__((noreturn)) would cause an error. Note that the C standard does not support a noreturn attribute on function pointer types. So we have to drop these here. There are only two instances at this time, so it's not a big loss. In one case, we can make up for it by adding the pg_noreturn to a wrapper function and adding a pg_unreachable(), in the other case, the latter was already done before. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw	2025-03-13 12:37:26 +01:00
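The three variants named in `3691edfab9` (C11 `_Noreturn`, GCC-style `__attribute__`, MSVC `__declspec`) and the requirement that the decoration come in front of the declaration can be sketched as follows. This is an assumption-laden illustration, not the definition from the PostgreSQL headers: the macro name `demo_noreturn`, the order of the preprocessor checks, and the helpers `demo_fail()` and `demo_checked_div()` are all hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical macro; the preprocessor checks are an assumption. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#define demo_noreturn _Noreturn
#elif defined(__GNUC__)
#define demo_noreturn __attribute__((noreturn))
#elif defined(_MSC_VER)
#define demo_noreturn __declspec(noreturn)
#else
#define demo_noreturn
#endif

/* The decoration goes in front of the declaration, where all variants accept it. */
demo_noreturn static void demo_fail(const char *msg);

static void
demo_fail(const char *msg)
{
	fprintf(stderr, "fatal: %s\n", msg);
	exit(EXIT_FAILURE);
}

/* Since demo_fail() never returns, no return is needed after calling it. */
static int
demo_checked_div(int a, int b)
{
	if (b == 0)
		demo_fail("division by zero");
	return a / b;
}
```

A function-like macro placed after the declarator, as in the old `pg_attribute_noreturn()` style, only works with GCC-style attributes, which is why the commit moves the decoration to the front.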
Richard Guo	cc5d98525d	Fix incorrect handling of subquery pullup When pulling up a subquery, if the subquery's target list items are used in grouping set columns, we need to wrap them in PlaceHolderVars. This ensures that expressions retain their separate identity so that they will match grouping set columns when appropriate. In `90947674f`, we decided to wrap subquery outputs that are non-var expressions in PlaceHolderVars. This prevents const-simplification from merging them into the surrounding expressions after subquery pullup, which could otherwise lead to failing to match those subexpressions to grouping set columns, with the effect that they'd not go to null when expected. However, that left some loose ends. If the subquery's target list contains two or more identical Var expressions, we can still fail to match the Var expression to the expected grouping set expression. This is not related to const-simplification, but rather to how we match expressions to lower target items in setrefs.c. For sort/group expressions, we use ressortgroupref matching, which works well. For other expressions, we primarily rely on comparing the expressions to determine if they are the same. Therefore, we need a way to prevent setrefs.c from matching the expression to some other identical ones. To fix, wrap all subquery outputs in PlaceHolderVars if the parent query uses grouping sets, ensuring that they preserve their separate identity throughout the whole planning process. Reported-by: Dean Rasheed <dean.a.rasheed@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-meSahaanKskpBn0KKxdHAXC1_EJCVWHxEodqirrGJnw@mail.gmail.com	2025-03-13 16:36:03 +09:00
Richard Guo	4c49611715	Remove code setting wrap_non_vars to true for UNION ALL subqueries In pull_up_simple_subquery and pull_up_constant_function, there is code that sets wrap_non_vars to true when dealing with an appendrel member. The goal is to wrap subquery outputs that are not simple Vars in PlaceHolderVars, ensuring that what we pull up doesn't get merged into a surrounding expression during later processing, which could cause it to fail to match the expression actually available from the appendrel. However, this is unnecessary. When pulling up an appendrel child subquery, the only part of the upper query that could reference the appendrel child yet is the translated_vars list of the associated AppendRelInfo that we just made for this child. Furthermore, we do not want to force use of PHVs in the AppendRelInfo, as there is no outer join between. In fact, perform_pullup_replace_vars always sets wrap_non_vars to false before performing pullup_replace_vars on the AppendRelInfo. This patch simply removes the code that sets wrap_non_vars to true for UNION ALL subqueries. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-VXDEi1v+hZYLxpOv0riJxHsCkCH1f46tLnhonEAyGCQ@mail.gmail.com	2025-03-13 16:34:28 +09:00
Amit Kapila	3abe9dc188	Avoid invalidating all RelationSyncCache entries on publication rename. On Publication rename, we need to only invalidate the RelationSyncCache entries corresponding to relations that are part of the publication being renamed. As part of this patch, we introduce a new invalidation message to invalidate the cache maintained by the logical decoding output plugin. We can't use existing relcache invalidation for this purpose, as that would unnecessarily cause relcache invalidations in other backends. This will improve performance by building fewer relation cache entries during logical replication. Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OSCPR01MB14966C09AA201EFFA706576A7F5C92@OSCPR01MB14966.jpnprd01.prod.outlook.com	2025-03-13 09:16:33 +05:30
Thomas Munro	75da2bece6	Fix read_stream.c for changing io_combine_limit. In a couple of places, read_stream.c assumed that io_combine_limit would be stable during the lifetime of a stream. That is not true in at least one unusual case: streams held by CURSORs where you could change the GUC between FETCH commands, with unpredictable results. Fix, by storing stream->io_combine_limit and referring only to that after construction. This mirrors the treatment of the other important setting {effective,maintenance}_io_concurrency, which is stored in stream->max_ios. One of the cases was the queue overflow space, which was sized for io_combine_limit and could be overrun if the GUC was increased. Since that coding was a little hard to follow, also introduce a variable for better readability instead of open-coding the arithmetic. Doing so revealed an off-by-one thinko while clamping max_pinned_buffers to INT16_MAX, though that wasn't a live bug due to the current limits on GUC values. Back-patch to 17. Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-13 15:43:34 +13:00
Amit Langote	d4f79865d4	Fix copy-paste error in datum_to_jsonb_internal() Commit `3c152a27b0` mistakenly repeated JSONTYPE_JSON in a condition, omitting JSONTYPE_CAST. As a result, datum_to_jsonb_internal() failed to reject inputs that were casts (e.g., from an enum to json as in the example below) when used as keys in JSON constructors. This led to a crash in cases like: SELECT JSON_OBJECT('happy'::mood: '123'::jsonb); where 'happy'::mood is implicitly cast to json. The missing check meant such casted values weren’t properly rejected as invalid (non-scalar) JSON keys. Reported-by: Maciek Sakrejda <maciek@pganalyze.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Maciek Sakrejda <maciek@pganalyze.com> Discussion: https://postgr.es/m/CADXhmgTJtJZK9A3Na_ry+Xrq-ghjcejBRhcRMzWZvbd__QdgJA@mail.gmail.com Backpatch-through: 17	2025-03-13 09:56:36 +09:00
Heikki Linnakangas	ac4494646d	Rename alloc/free functions in reorderbuffer.c There used to be bespoke pools for these structs to reduce the palloc/pfree overhead, but that was ripped out a long time ago and replaced with the generic, cheaper generational memory allocator (commit `a4ccc1cef5`). The Get/Return terminology made sense with the pools, as you "got" an object from the pool and "returned" it later, but now it just looks weird. Rename to Alloc/Free. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/c9e43d2d-8e83-444f-b111-430377368989@iki.fi	2025-03-12 22:03:39 +02:00
Nathan Bossart	025e7e1eb4	Remove count_one_bits() in acl.c. The only caller, select_best_grantor(), can instead use pg_popcount64(). This isn't performance-critical code, but we might as well use the centralized implementation. While at it, add some test coverage for this part of select_best_grantor(). Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/Z9GtL7Nm6hsYyJnF%40nathan	2025-03-12 15:01:52 -05:00
Melanie Plageman	ff79b5b2ab	Increase default effective_io_concurrency to 16 The default effective_io_concurrency has been 1 since it was introduced in `b7b8f0b609`. Referencing the associated discussion [1], it seems 1 was chosen as a conservative value that seemed unlikely to cause regressions. Experimentation on high latency cloud storage as well as fast, local nvme storage (see Discussion link) shows that even slightly higher values improve query timings substantially. 1 actually performs worse than 0 [2]. With effective_io_concurrency 1, we are not prefetching enough to avoid I/O stalls, but we are issuing extra syscalls. The new default is 16, which should be more appropriate for common hardware while still avoiding flooding low IOPs devices with I/O requests. [1] https://www.postgresql.org/message-id/flat/FDDBA24E-FF4D-4654-BA75-692B3BA71B97%40enterprisedb.com [2] https://www.postgresql.org/message-id/CAAKRu_Zv08Cic%3DqdCfzrQabpEXGrd9Z9UOW5svEVkCM6%3DFXA9g%40mail.gmail.com Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAAKRu_Z%2BJa-mwXebOoOERMMUMvJeRhzTjad4dSThxG0JLXESxw%40mail.gmail.com	2025-03-12 15:57:44 -04:00
Heikki Linnakangas	af717317a0	Handle interrupts while waiting on Append's async subplans We did not wake up on interrupts while waiting on async events on an async-capable append node. For example, if you tried to cancel the query, nothing would happen until one of the async subplans becomes readable. To fix, add WL_LATCH_SET to the WaitEventSet. Backpatch down to v14 where async Append execution was introduced. Discussion: https://www.postgresql.org/message-id/37a40570-f558-40d3-b5ea-5c2079b3b30b@iki.fi	2025-03-12 20:53:09 +02:00
Tom Lane	f4e7756ef9	Build whole-row Vars the same way during parsing and planning. makeWholeRowVar() has different rules for constructing a whole-row Var depending on the kind of RTE it's representing. This turns out to be problematic because the rewriter and planner can convert view RTEs and set-returning-function RTEs into subquery RTEs; so a whole-row Var made during planning might look different from one made by the parser. In isolation this doesn't cause any problem, but if a query contains Vars made both ways for the same varno, there are cross-checks in the executor that will complain. This manifests for UPDATE, DELETE, and MERGE queries that use whole-row table references. To fix, we need makeWholeRowVar() to produce the same result from an inlined RTE as it would have for the original. For an inlined view, we can use RangeTblEntry.relid to detect that this had been a view RTE. For inlined SRFs, make a data structure definition change akin to commit `47bb9db75`, and say that we won't clear RangeTblEntry.functions until the end of planning. That allows makeWholeRowVar() to repeat what it would have done with the unmodified RTE. Reported-by: Duncan Sands <duncan.sands@deepbluecap.com> Reported-by: Dean Rasheed <dean.a.rasheed@gmail.com> Diagnosed-by: Tender Wang <tndrwang@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/3518c50a-ab18-482f-b916-a37263622501@deepbluecap.com Backpatch-through: 13	2025-03-12 11:47:38 -04:00
Melanie Plageman	18cd15e706	Add connection establishment duration logging Add log_connections option 'setup_durations' which logs durations of several key parts of connection establishment and backend setup. For an incoming connection, starting from when the postmaster gets a socket from accept() and ending when the forked child backend is first ready for query, there are multiple steps that could each take longer than expected due to external factors. This logging provides visibility into authentication and fork duration as well as the end-to-end connection establishment and backend initialization time. To make this portable, the timings captured in the postmaster (socket creation time, fork initiation time) are passed through the BackendStartupData. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Guillaume Lelarge <guillaume.lelarge@dalibo.com> Discussion: https://postgr.es/m/flat/CAAKRu_b_smAHK0ZjrnL5GRxnAVWujEXQWpLXYzGbmpcZd3nLYw%40mail.gmail.com	2025-03-12 11:35:27 -04:00
Melanie Plageman	9219093cab	Modularize log_connections output Convert the boolean log_connections GUC into a list GUC comprised of the connection aspects to log. This gives users more control over the volume and kind of connection logging. The current log_connections options are 'receipt', 'authentication', and 'authorization'. The empty string disables all connection logging. 'all' enables all available connection logging. For backwards compatibility, the most common values for the log_connections boolean are still supported (on, off, 1, 0, true, false, yes, no). Note that previously supported substrings of on, off, true, false, yes, and no are no longer supported. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/flat/CAAKRu_b_smAHK0ZjrnL5GRxnAVWujEXQWpLXYzGbmpcZd3nLYw%40mail.gmail.com	2025-03-12 11:35:21 -04:00
Michael Paquier	f554a95379	Remove initialization from PendingBackendStats `9a8dd2c5a6` has added an initialization to PendingBackendStats, which has been causing compilation warnings in the buildfarm. This code does not strictly require it as PendingBackendStats is always initialized with memset(0), so let's remove it. Per report from multiple buildfarm members, like ayu and batfish, via Tom Lane. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/1870853.1741749264@sss.pgh.pa.us	2025-03-12 20:37:43 +09:00
Heikki Linnakangas	043745c3a0	Improve snapmgr.c comment Add more details on the different kinds of snapshots, how to use them, and how the active snapshot stack works. Discussion: https://www.postgresql.org/message-id/7c56f180-b9e1-481e-8c1d-efa63de3ecbb@iki.fi	2025-03-11 23:28:38 +02:00
Heikki Linnakangas	8076c00592	Assert that a snapshot is active or registered before it's used The comment in GetTransactionSnapshot() said that you "should call RegisterSnapshot or PushActiveSnapshot on the returned snap if it is to be used very long". That felt too unclear to me. Make the comment more strongly worded. To enforce that rule and to catch potential bugs where a snapshot might get invalidated while it's still in use, add an assertion to HeapTupleSatisfiesMVCC() to check that the snapshot is registered or pushed to active stack. No new bugs were found by this, but it seems like good future-proofing. It's not a great place for the check; HeapTupleSatisfiesMVCC() is in fact safe to call with an unregistered snapshot, and the assertion won't catch other unsafe uses. But it goes a long way in practice. Fix a few cases that were playing fast and loose with that and just assumed that the snapshot cannot be invalidated during a scan. Those assumptions were not wrong, but they're not performance critical, so let's drop the excuses and just register the snapshot. These were false positives found by the new assertion. Discussion: https://www.postgresql.org/message-id/7c56f180-b9e1-481e-8c1d-efa63de3ecbb@iki.fi	2025-03-11 23:20:34 +02:00
Masahiko Sawada	bd65cb3cd4	pg_logicalinspect: Fix possible crash when passing a directory path. Previously, pg_logicalinspect functions were too trusting of their input and blindly passed it to SnapBuildRestoreSnapshot(). If the input pointed to a directory, the server could raise a PANIC error while attempting to fsync_fname() with isdir=false on a directory. This commit adds validation checks for input filenames and passes the LSN extracted from the filename to SnapBuildRestoreSnapshot() instead of the filename itself. It also adds regression tests for various input patterns and permission checks. Bug: #18828 Reported-by: Robins Tharakan <tharakan@gmail.com> Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/18828-0f4701c635064211@postgresql.org	2025-03-11 09:56:40 -07:00
Tom Lane	8b1b342544	Improve EXPLAIN's display of window functions. Up to now we just punted on showing the window definitions used in a plan, with window function calls represented as "OVER (?)". To improve that, show the window definition implemented by each WindowAgg plan node, and reference their window names in OVER. For nameless window clauses generated by "OVER (...)", assign unique names w1, w2, etc. In passing, re-order the properties shown for a WindowAgg node so that the Run Condition (if any) appears after the Window property and before the Filter (if any). This seems more sensible since the Run Condition is associated with the Window and acts before the Filter. Thanks to David G. Johnston and Álvaro Herrera for design suggestions. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/144530.1741469955@sss.pgh.pa.us	2025-03-11 11:19:54 -04:00
Peter Geoghegan	426ea61117	nbtree: Make BTMaxItemSize into object-like macro. Make nbtree's "1/3 of a page limit" BTMaxItemSize function-like macro (which accepts a "page" argument) into an object-like macro that can be used from code that doesn't have convenient access to an nbtree page. Preparation for an upcoming patch that adds skip scan to nbtree. Parallel index scans that use skip scan will serialize datums (not just SAOP array subscripts) when scheduling primitive scans. BTMaxItemSize will be used by btestimateparallelscan to determine how much DSM to request. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wz=H_RG5weNGeUG_TkK87tRBnH9mGCQj6WpM4V4FNWKv2g@mail.gmail.com	2025-03-11 10:35:56 -04:00
Peter Geoghegan	0fbceae841	Show index search count in EXPLAIN ANALYZE, take 2. Expose the count of index searches/index descents in EXPLAIN ANALYZE's output for index scan/index-only scan/bitmap index scan nodes. This information is particularly useful with scans that use ScalarArrayOp quals, where the number of index searches can be unpredictable due to implementation details that interact with physical index characteristics (at least with nbtree SAOP scans, since Postgres 17 commit `5bf748b8`). The information shown also provides useful context when EXPLAIN ANALYZE runs a plan with an index scan node that successfully applied the skip scan optimization (set to be added to nbtree by an upcoming patch). The instrumentation works by teaching all index AMs to increment a new nsearches counter whenever a new index search begins. The counter is incremented at exactly the same point that index AMs already increment the pg_stat_*_indexes.idx_scan counter (we're counting the same event, but at the scan level rather than the relation level). Parallel queries have workers copy their local counter struct into shared memory when an index scan node ends -- even when it isn't a parallel aware scan node. An earlier version of this patch that only worked with parallel aware scans became commit `5ead85fb` (though that was quickly reverted by commit `d00107cd` following "debug_parallel_query=regress" buildfarm failures). Our approach doesn't match the approach used when tracking other index scan related costs (e.g., "Rows Removed by Filter:"). It is comparable to the approach used in similar cases involving costs that are only readily accessible inside an access method, not from the executor proper (e.g., "Heap Blocks:" output for a Bitmap Heap Scan, which was recently enhanced to show per-worker costs by commit `5a1e6df3`, using essentially the same scheme as the one used here). It is necessary for index AMs to have direct responsibility for maintaining the new counter, since the counter might need to be incremented multiple times per amgettuple call (or per amgetbitmap call). But it is also necessary for the executor proper to manage the shared memory now used to transfer each worker's counter struct to the leader. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com	2025-03-11 09:20:50 -04:00
Álvaro Herrera	17ce344f86	BRIN: be more strict about required support procs With improperly defined operator classes, it's possible to get a Postgres crash because we'd try to invoke a procedure that doesn't exist. This is because the code is being a bit too trusting that the opclass is correctly defined. Add some ereport(ERROR)s for cases where mandatory support procedures are not defined, transforming the crashes into errors. The particular case that was reported is an incomplete opclass in PostGIS. Backpatch all the way down to 13. Reported-by: Tobias Wendorff <tobias.wendorff@tu-dortmund.de> Diagnosed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/fb6d9a35-6c8e-4869-af80-0a4944a793a4@tu-dortmund.de	2025-03-11 12:50:35 +01:00
Daniel Gustafsson	d35d32d711	Add special case fast-paths for strict functions Many STRICT function calls will have one or two arguments, in which case we can speed up checking for NULL input by avoiding setting up a loop over the arguments. This adds EEOP_FUNCEXPR_STRICT_1 and the corresponding EEOP_FUNCEXPR_STRICT_2 for functions with one and two arguments respectively. Author: Andres Freund <andres@anarazel.de> Co-authored-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Discussion: https://postgr.es/m/415721CE-7D2E-4B74-B5D9-1950083BA03E@yesql.se Discussion: https://postgr.es/m/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de	2025-03-11 12:02:42 +01:00
Daniel Gustafsson	8dd7c7cd0a	Replace EEOP_DONE with special steps for return/no return Knowing when the side-effects of an expression are the intended result of the execution, rather than the return value, is important for being able to generate more efficient JITed code. This replaces EEOP_DONE with two new steps: EEOP_DONE_RETURN and EEOP_DONE_NO_RETURN. Expressions which return a value should use the former step; expressions used for their side-effects which don't return a value should use the latter. Author: Andres Freund <andres@anarazel.de> Co-authored-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Discussion: https://postgr.es/m/415721CE-7D2E-4B74-B5D9-1950083BA03E@yesql.se Discussion: https://postgr.es/m/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de	2025-03-11 12:02:38 +01:00
Peter Eisentraut	dabccf4513	Move RemoveInheritedConstraint() call slightly earlier This change is harmless and does not affect the existing intended operation. It is necessary for a subsequent patch operation (NOT ENFORCED foreign keys), where we may need to change the child constraint to enforced. In this case, we would create the necessary triggers and queue the constraint for validation, so it is important to remove any unnecessary constraints before proceeding. This is a small change that could have been included in the previous "split tryAttachPartitionForeignKey" refactoring patch (commit `1d26c2d2c4`), but was kept separate to highlight the changes. Author: Amul Sul <amul.sul@enterprisedb.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA%40mail.gmail.com	2025-03-11 10:43:48 +01:00
Peter Eisentraut	1d26c2d2c4	refactor: Split tryAttachPartitionForeignKey() Split tryAttachPartitionForeignKey() into three functions: AttachPartitionForeignKey(), RemoveInheritedConstraint(), and DropForeignKeyConstraintTriggers(), so they can be reused in some subsequent patches for the NOT ENFORCED feature. Author: Amul Sul <amul.sul@enterprisedb.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA%40mail.gmail.com	2025-03-11 09:35:24 +01:00
Peter Eisentraut	64224a834c	refactor: re-add ATExecAlterChildConstr() ATExecAlterChildConstr() was removed in commit `80d7f99049`, but it is needed in some subsequent patches for the NOT ENFORCED feature, to recurse over child constraints. This adds it back in slightly altered form. Author: Amul Sul <amul.sul@enterprisedb.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA%40mail.gmail.com	2025-03-11 08:43:35 +01:00
Michael Paquier	76def4cdd7	Add WAL data to backend statistics This commit adds per-backend WAL statistics, providing the same information as pg_stat_wal, except that it is now possible to know how much WAL activity is happening in each backend rather than an overall aggregate of all the activity. Like pg_stat_wal, the implementation relies on pgWalUsage, tracking the difference of activity between two reports to pgstats. This data can be retrieved with a new system function called pg_stat_get_backend_wal(), that returns one tuple based on the PID provided in input. Like pg_stat_get_backend_io(), this is useful when joined with pg_stat_activity to get a live picture of the WAL generated for each running backend, showing how the activity is [un]balanced. pgstat_flush_backend() gains a new flag value, able to control the flush of the WAL stats. This commit relies mostly on the infrastructure provided by `9aea73fc61`, that has introduced backend statistics. Bump catalog version. A bump of PGSTAT_FILE_FORMAT_ID is not required, as backend stats do not persist on disk. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal	2025-03-11 09:04:11 +09:00
Tom Lane	29d6808ede	CREATE INDEX: do update index stats if autovacuum=off. This fixes a thinko from commit `d611f8b15`. The intent was to prevent updating the stats of the pre-existing heap if autovacuum is off, but it also disabled updating the stats of the just-created index. There is AFAICS no good reason to do the latter, since there could not be any pre-existing stats to refrain from overwriting, and the zeroed stats that are there to begin with are very unlikely to be useful. Moreover, the change broke our cross-version upgrade tests again. Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1116282.1741374848@sss.pgh.pa.us	2025-03-10 17:49:27 -04:00
Heikki Linnakangas	f7c566a1a2	Fix a few more redundant calls of GetLatestSnapshot() Commit `2367503177` fixed this in RelationFindReplTupleByIndex(), but I missed two other similar cases. Per report from Ranier Vilela. Discussion: https://www.postgresql.org/message-id/CAEudQArUT1dE45WN87F-Gb7XMy_hW6x1DFd3sqdhhxP-RMDa0Q@mail.gmail.com Backpatch-through: 13	2025-03-10 18:58:10 +02:00
Heikki Linnakangas	2367503177	Fix snapshot used in logical replication index lookup The function calls GetLatestSnapshot() to acquire a fresh snapshot, makes it active, and was meant to pass it to table_tuple_lock(), but instead called GetLatestSnapshot() again to acquire yet another snapshot. It was harmless because the heap AM and all other known table AMs ignore the 'snapshot' argument anyway, but let's be tidy. In the long run, this perhaps should be redesigned so that snapshot was not needed in the first place. The table AM API uses TID + snapshot as the unique identifier for the row version, which is questionable when the row came from an index scan with a Dirty snapshot. You might lock a different row version when you use a different snapshot in the table_tuple_lock() call (a fresh MVCC snapshot) than in the index scan (DirtySnapshot). However, in the heap AM and other AMs where the TID alone identifies the row version, it doesn't matter. So for now, just fix the obvious albeit harmless bug. This has been wrong ever since the table AM API was introduced in commit `5db6df0c01`, so backpatch to all supported versions. Discussion: https://www.postgresql.org/message-id/83d243d6-ad8d-4307-8b51-2ee5844f6230@iki.fi Backpatch-through: 13	2025-03-10 17:07:38 +02:00
Alexander Korotkov	6bb6a62f3c	Use extended stats for precise estimation of bucket size in hash join Recognizing the real-life complexity where columns in the table often have functional dependencies, PostgreSQL's estimation of the number of distinct values over a set of columns can be underestimated (or, much more rarely, overestimated) when dealing with a multi-clause JOIN. In the case of hash join, it can end up with a small number of predicted hash buckets and, as a result, picking a non-optimal merge join. To improve the situation, we introduce one additional stage of bucket size estimation: when there are two or more join clauses, the estimator looks up extended statistics and uses them for multicolumn estimation. Clauses are grouped into lists, each containing expressions referencing the same relation. The result of the multicolumn estimation made over such a list is combined with others according to the caller's logic. Clauses that are not estimated are returned to the caller for further estimation. Discussion: https://postgr.es/m/52257607-57f6-850d-399a-ec33a654457b%40postgrespro.ru Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Andy Fan <zhihui.fan1213@gmail.com> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-03-10 13:42:01 +02:00
Alexander Korotkov	fae535da0a	Teach Append to consider tuple_fraction when accumulating subpaths. This change is dedicated to more active usage of IndexScan and parameterized NestLoop paths in partitioned cases under an Append node, as it already works with plain tables. As newly added regression tests demonstrate, it should provide more smartness to the partitionwise technique. With an indication of how many tuples are needed, it may be more meaningful to use the 'fractional branch' subpaths of the Append path list, which are more optimal for this specific number of tuples. Planning on a higher level, if the optimizer needs all the tuples, it will choose non-fractional paths. In the case when, during execution, Append needs to return fewer tuples than declared by tuple_fraction, it would not be harmful to use the 'intermediate' variant of paths. However, it will earn a considerable profit if a sensible set of tuples is selected. The change of the existing regression test demonstrates the positive outcome of this feature: instead of scanning the whole table, the optimizer prefers to use a parameterized scan, being aware of the only single tuple the join has to produce to perform the query. Discussion: https://www.postgresql.org/message-id/flat/CAN-LCVPxnWB39CUBTgOQ9O7Dd8DrA_tpT1EY3LNVnUuvAX1NjA%40mail.gmail.com Author: Nikita Malakhov <hukutoc@gmail.com> Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Andy Fan <zhihuifan1213@163.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-03-10 13:38:39 +02:00
Peter Eisentraut	b83e8a2ca2	Remove support for temporal RESTRICT foreign keys It isn't clear how these should behave, so let's wait to implement them until we are sure how to do it. This feature was initially added by commit `89f908a6d0`, so it hasn't been released yet. Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Discussion: https://postgr.es/m/e773bc11-4ac1-40de-bb91-814e02f05b6d%40eisentraut.org	2025-03-10 11:31:01 +01:00
Heikki Linnakangas	03f8e9a7fe	Fix incorrect assertion in libpqwalreceiver Was supposed to check the length of the array, but was checking its size in bytes. Author: Jacob Brazeal <jacob.brazeal@gmail.com> Discussion: https://www.postgresql.org/message-id/CA%2BCOZaA_9afJxj9ZuO73U5P7WXP%2BZM9NGnZvTDCmBFz0FGP%2BwA@mail.gmail.com	2025-03-09 20:40:45 +02:00
Tom Lane	fedfcf6650	Don't try to parallelize array_agg() on an anonymous record type. This doesn't work because record_recv requires the typmod that identifies the specific record type (in our session) and array_agg_deserialize has no convenient way to get that information. The result is an "input of anonymous composite types is not implemented" error. We could probably make this work if we had to, but it does not seem worth the trouble, given that it took this long to get a field report. Just shut off parallelization, as though record_recv didn't exist. Oversight in commit `16fd03e95`. Back-patch to v16 where that came in. Reported-by: Kirill Zdornyy <kirill@dineserve.com> Diagnosed-by: Richard Guo <guofenglinux@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/atLI5Kce2ie1zcYjU0w_kjtVaxiYbYGTihrkLDmGZQnRDD4pnXukIATaABbnIj9pUnelC4ESvCXMm4HAyHg-v61XABaKpERj0A2IXzJZM7g=@dineserve.com Backpatch-through: 16	2025-03-09 13:11:20 -04:00
Tom Lane	7fb8801021	Clear errno before calling strtol() in spell.c. Per POSIX, a caller of strtol() that wishes to check for errors must set errno to 0 beforehand. Several places in spell.c neglected that, so that they risked delivering a false overflow error in case errno had been ERANGE already. Given the lack of field reports, this case may be unreachable at present --- but it's surely trouble waiting to happen, so fix it. Author: Jacob Brazeal <jacob.brazeal@gmail.com> Discussion: https://postgr.es/m/CA+COZaBhsq6EromFm+knMJfzK6nTpG23zJ+K2=nfUQQXcj_xcQ@mail.gmail.com Backpatch-through: 13	2025-03-08 11:24:25 -05:00
Peter Geoghegan	67fc4c9fd7	Make parallel nbtree index scans use an LWLock. Teach parallel nbtree index scans to use an LWLock (not a spinlock) to protect the scan's shared descriptor state. Preparation for an upcoming patch that will add skip scan optimizations to nbtree. That patch will create the need to occasionally allocate memory while the scan descriptor is locked, while copying datums that were serialized by another backend. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com	2025-03-08 11:10:14 -05:00
Peter Eisentraut	8021c77769	Make amcanorder independent of amconsistentordering Follow-up to commit `af4002b381`: Make amconsistentordering not depend on amcanorder. Although they are related, they are independent properties. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/E1tngY6-0000UL-2n%40gemulon.postgresql.org	2025-03-08 09:37:06 +01:00
Peter Eisentraut	661781f3a3	Fix typo Duplicate assignment in commit `af4002b381` should have been a different field. (But it didn't affect the outcome.) Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/E1tngY6-0000UL-2n%40gemulon.postgresql.org	2025-03-08 08:06:30 +01:00
Michael Paquier	9a8dd2c5a6	Improve check for detection of pending data in backend statistics The callback pgstat_backend_have_pending_cb() is used as a way for pg_stat_report() to detect if there is any pending data for backend statistics. It did not include a check based on pgstat_tracks_backend_bktype(), that discards processes whose backend types do not support backend statistics. The logic is not a problem on HEAD, as processes that do not support backend statistics cannot touch PendingBackendStats, so the callback would always report that there is no pending data in this case. However, we would run into trouble once backend statistics include portions of pending stats that are not always zeroed, like pgWalUsage. There is no reason for pgstat_backend_have_pending_cb() to not check for pgstat_tracks_backend_bktype(), anyway, and this pattern is safer in the long run, so let's update the code to do so. While on it, this commit adds a proper initialization to PendingBackendStats. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/Z8l6EMM4ImVoWRkg@ip-10-97-1-34.eu-west-3.compute.internal	2025-03-08 10:56:30 +09:00
Peter Geoghegan	8e167e6188	nbtree: refine _bt_readnextpage contract comments. Another minor follow-up commit for commit `1bd4bc85`, which changed the _bt_readnextpage contract.	2025-03-07 18:35:13 -05:00
Tom Lane	34c3c5ce1c	Include column name in build_attrmap_by_position's error reports. Formerly we only provided the column number, but it's frequently more useful to mention the column name. The input tupdesc often doesn't have useful column names, but the output tupdesc usually contains user-supplied names, so report that one. Author: Marcos Pegoraro <marcos@f10.com.br> Co-authored-by: jian he <jian.universality@gmail.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Co-authored-by: Erik Wienhold <ewie@ewie.name> Reviewed-by: Vladlen Popolitov <v.popolitov@postgrespro.ru> Discussion: https://postgr.es/m/CAB-JLwanky28gjAMdnMh1CjyO1b2zLdr6UOA1-oY9G7PVL9KKQ@mail.gmail.com	2025-03-07 13:24:20 -05:00
Peter Eisentraut	7f24c02743	Improve possible performance regression Commit `ce62f2f2a0` introduced calls to GetIndexAmRoutineByAmId() in lsyscache.c functions. This call is a bit more expensive than a simple syscache lookup. So rearrange the nesting so that we call that one last and do the cheaper checks first. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/E1tngY6-0000UL-2n%40gemulon.postgresql.org	2025-03-07 11:46:33 +01:00
Peter Eisentraut	af4002b381	Rename amcancrosscompare After more discussion about commit `ce62f2f2a0`, rename the index AM property amcancrosscompare to two separate properties amconsistentequality and amconsistentordering. Also improve the documentation and update some comments that were previously missed. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/E1tngY6-0000UL-2n%40gemulon.postgresql.org	2025-03-07 11:46:33 +01:00
Dean Rasheed	6da469bada	Allow casting between bytea and integer types. This allows smallint, integer, and bigint values to be cast to and from bytea. The bytea value is the two's complement representation of the integer, with the most significant byte first. For example: 1234::bytea -> \x000004d2 (-1234)::bytea -> \xfffffb2e Author: Aleksander Alekseev <aleksander@timescale.com> Reviewed-by: Joel Jacobson <joel@compiler.org> Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAJ7c6TPtOp6%2BkFX5QX3fH1SVr7v65uHr-7yEJ%3DGMGQi5uhGtcA%40mail.gmail.com	2025-03-07 09:31:18 +00:00
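The byte layout described in the commit above (two's complement, most significant byte first) can be sketched in C. This is an illustration of the representation only, not the server's cast implementation; the function name `int32_to_be_bytes` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: write the two's complement representation of a 32-bit
 * integer into buf, most significant byte first.
 */
void
int32_to_be_bytes(int32_t value, unsigned char buf[4])
{
	/* Converting to unsigned is well defined and yields the two's complement bits. */
	uint32_t	u = (uint32_t) value;

	for (size_t i = 0; i < 4; i++)
		buf[i] = (unsigned char) ((u >> (8 * (3 - i))) & 0xFF);
}
```

For the two values quoted in the commit message, 1234 yields the bytes 00 00 04 d2 and -1234 yields ff ff fb 2e.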
Jeff Davis	d611f8b158	CREATE INDEX: don't update table stats if autovacuum=off. We previously fixed this for binary upgrade in `71b66171d0`, but a similar problem remained when dumping statistics without data. Fix by not opportunistically updating table stats during CREATE INDEX when autovacuum is disabled. For stats to be stable at all, the server needs to be aware that it should not take every opportunity to update stats. Per discussion, autovacuum=off is a signal that the user expects stats to be stable; though if necessary, we could create a more specific mode in the future. Reported-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CAExHW5vf9D+8-a5_BEX3y=2y_xY9hiCxV1=C+FnxDvfprWvkng@mail.gmail.com Discussion: https://postgr.es/m/ca81cbf6e6ea2af838df972801ad4da52640a503.camel%40j-davis.com	2025-03-06 19:39:14 -08:00
Tom Lane	0f21db36d6	Fix some performance issues in GIN query startup. If a GIN index search had a lot of search keys (for example, "jsonbcol ?| array[]" with tens of thousands of array elements), both ginFillScanKey() and startScanKey() took O(N^2) time. Worse, those loops were uncancelable for lack of CHECK_FOR_INTERRUPTS. The problem in ginFillScanKey() is the brute-force search key de-duplication done in ginFillScanEntry(). The most expedient solution seems to be to just stop trying to de-duplicate once there are "too many" search keys. We could imagine working harder, say by using a sort-and-unique algorithm instead of brute force compare-all-the-keys. But it seems unlikely to be worth the trouble. There is no correctness issue here, since the code already allowed duplicate keys if any extra_data is present. The problem in startScanKey() is the loop that attempts to identify the first non-required search key. In the submitted test case, that vainly tests all the key positions, and each iteration takes O(N) time. One part of that is that it's reinitializing the entryRes[] array from scratch each time, which is entirely unnecessary given that the triConsistentFn isn't supposed to scribble on its input. We can easily adjust the array contents incrementally instead. The other part of it is that the triConsistentFn may itself take O(N) time (and does in this test case). This is all extremely brute force: in simple cases with AND or OR semantics, we could know without any looping whatever that all or none of the keys are required. But GIN opclasses don't have any API for exposing that knowledge, so at least in the short run there is little to be done about that. Put in a CHECK_FOR_INTERRUPTS so that at least the loop is cancelable. These two changes together resolve the primary complaint that the test query doesn't respond promptly to cancel interrupts.
Also, while they don't completely eliminate the O(N^2) behavior, they do provide quite a nice speedup for mid-sized examples. Bug: #18831 Reported-by: Niek <niek.brasa@hitachienergy.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18831-e845ac44ebc5dd36@postgresql.org Backpatch-through: 13	2025-03-06 11:54:31 -05:00
Amit Kapila	588acf6d0e	Avoid invalidating all RelationSyncCache entries on publication change. On change of publication via ALTER PUBLICATION ... SET/ADD/DROP commands, we were invalidating all the relations present in relation sync cache maintained by pgoutput. We need to invalidate only the relation entries that are changed as part of publication DDL. We have ensured that the publication DDL execution generated the invalidations required to invalidate impacted relation sync entries in RelationSyncCache. This improves the performance by avoiding building the cache entries for the cases where a publication has many tables but only one of them is dropped. Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OSCPR01MB14966C09AA201EFFA706576A7F5C92@OSCPR01MB14966.jpnprd01.prod.outlook.com	2025-03-06 14:19:38 +05:30
Jeff Davis	298944e8d8	Address stats import review comments. Reported-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/CACJufxHG9MBQozbJQ4JRBcRbUO+t+sx4qLZX092rS_9b4SR_EA@mail.gmail.com	2025-03-05 23:07:25 -08:00
Michael Paquier	7f7f324eb5	Add more monitoring data for WAL writes in the WAL receiver This commit adds two improvements related to the monitoring of WAL writes for the WAL receiver. First, write counts and timings are now counted in pg_stat_io for the WAL receiver. These have been discarded from pg_stat_wal in `ff99918c62` due to performance concerns, related to the fact that we still relied on an on-disk file for the stats back then, even with track_wal_io_timing to avoid the overhead of the timestamp calculations. This implementation is simpler than the original proposal as it is possible to rely on the APIs of pgstat_io.c to do the job. Like the fsync and read data, track_wal_io_timing needs to be enabled to track the timings. Second, a wait event is added around the pg_pwrite() call in charge of the writes, using the existing WAIT_EVENT_WAL_WRITE. This is useful as the WAL receiver data is tracked in pg_stat_activity. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/Z8gFnH4o3jBm5BRz@ip-10-97-1-34.eu-west-3.compute.internal	2025-03-06 09:41:37 +09:00
Heikki Linnakangas	393e0d2314	Split WaitEventSet functions to separate source file latch.c now only contains the Latch related functions, which build on the WaitEventSet abstraction. Most of the platform-dependent stuff is now in waiteventset.c. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/8a507fb6-df28-49d3-81a5-ede180d7f0fb@iki.fi	2025-03-06 01:26:16 +02:00
Heikki Linnakangas	84e5b2f07a	Use ModifyWaitEvent to update exit_on_postmaster_death This is in preparation for splitting WaitEventSet related functions to a separate source file. That will hide the details of WaitEventSet from WaitLatch, so it must use an exposed function instead of modifying WaitEventSet->exit_on_postmaster_death directly. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/8a507fb6-df28-49d3-81a5-ede180d7f0fb@iki.fi	2025-03-06 01:26:12 +02:00
Heikki Linnakangas	a98e4dee63	Remove unused ShutdownLatchSupport() function The only caller was removed in commit `80a8f95b3b`. I don't foresee needing it any time soon, and I'm working on some big changes in this area, so let's remove it out of the way. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/8a507fb6-df28-49d3-81a5-ede180d7f0fb@iki.fi	2025-03-05 23:52:04 +02:00
Peter Geoghegan	d00107cd63	Revert "Show index search count in EXPLAIN ANALYZE." This reverts commit `5ead85fbc8`. This commit shows test failures with debug_parallel_query=regress. The underlying issue needs to be debugged, so revert for now.	2025-03-05 10:27:31 -05:00
Andrew Dunstan	4603903d29	Allow json{b}_strip_nulls to remove null array elements An additional parameter ("strip_in_arrays") is added to these functions. It defaults to false. If true, then null array elements are removed as well as null valued object fields. JSON that just consists of a single null is not affected. Author: Florents Tselai <florents.tselai@gmail.com> Discussion: https://postgr.es/m/4BCECCD5-4F40-4313-9E98-9E16BEB0B01D@gmail.com	2025-03-05 10:04:02 -05:00
Peter Geoghegan	5ead85fbc8	Show index search count in EXPLAIN ANALYZE. Expose the count of index searches/index descents in EXPLAIN ANALYZE's output for index scan nodes. This information is particularly useful with scans that use ScalarArrayOp quals, where the number of index scans isn't predictable in advance (at least not with optimizations like the one added to nbtree by Postgres 17 commit `5bf748b8`). It will also be useful when EXPLAIN ANALYZE shows details of an nbtree index scan that uses skip scan optimizations set to be introduced by an upcoming patch. The instrumentation works by teaching index AMs to increment a new nsearches counter whenever a new index search begins. The counter is incremented at exactly the same point that index AMs must already increment the index's pg_stat_*_indexes.idx_scan counter (we're counting the same event, but at the scan level rather than the relation level). The new counter is stored in the scan descriptor (IndexScanDescData), which explain.c reaches by going through the scan node's PlanState. This approach doesn't match the approach used when tracking other index scan specific costs (e.g., "Rows Removed by Filter:"). It is similar to the approach used in other cases where we must track costs that are only readily accessible inside an access method, and not from the executor (e.g., "Heap Blocks:" output for a Bitmap Heap Scan). It is inherently necessary to maintain a counter that can be incremented multiple times during a single amgettuple call (or amgetbitmap call), and directly exposing PlanState.instrument to index access methods seems unappealing. 
Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com	2025-03-05 09:36:48 -05:00
Heikki Linnakangas	635f580120	Rename some signal and interrupt handling functions for consistency The usual pattern for handling a signal is that the signal handler sets a flag and calls SetLatch(MyLatch), and CHECK_FOR_INTERRUPTS() or other code that is part of a wait loop calls another function to deal with it. The naming of the functions involved was a bit inconsistent, however. CHECK_FOR_INTERRUPTS() calls ProcessInterrupts() to do the heavy-lifting, but the analogous functions in aux processes were called HandleMainLoopInterrupts(), HandleStartupProcInterrupts(), etc. Similarly, most subroutines of ProcessInterrupts() were called Process(), but some were called Handle(). To make things less confusing, rename all the functions that are part of the overall signal/interrupt handling system but are not executed in a signal handler to e.g. ProcessSomething(), rather than HandleSomething(). The "Process" prefix is now consistently used in the non-signal-handler functions, and the "Handle" prefix in functions that are part of signal handlers, except for some completely unrelated functions that clearly have nothing to do with signal or interrupt handling. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://www.postgresql.org/message-id/8a384b26-1499-41f6-be33-64b801fb98b8@iki.fi	2025-03-05 16:22:26 +02:00
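The pattern described in the commit above (a signal handler that only sets a flag, and a separate function that acts on it outside the handler) can be sketched in plain C. This is a simplified illustration, not PostgreSQL code; the names `HandleExampleInterrupt`, `ProcessExampleInterrupt`, `example_pending` and `example_processed_count` are hypothetical and merely follow the naming convention the commit describes.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical state for the sketch; not PostgreSQL variables. */
volatile sig_atomic_t example_pending = 0;
int			example_processed_count = 0;

/*
 * "Handle" prefix: runs inside the signal handler, so it only sets a flag.
 * Per the commit message, PostgreSQL's handlers also call SetLatch(MyLatch)
 * at this point to wake up the wait loop; that is omitted here.
 */
void
HandleExampleInterrupt(int signo)
{
	(void) signo;
	example_pending = 1;
}

/*
 * "Process" prefix: runs outside the signal handler, for example from a
 * wait loop, and does the real work.  Returns true if an interrupt was
 * pending and has now been dealt with.
 */
bool
ProcessExampleInterrupt(void)
{
	if (!example_pending)
		return false;

	example_pending = 0;
	example_processed_count++;
	return true;
}
```

A caller would install the handler with `signal(SIGINT, HandleExampleInterrupt)` and call `ProcessExampleInterrupt()` at safe points in its main loop.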
Álvaro Herrera	f4e53e10b6	Add ALTER TABLE ... ALTER CONSTRAINT ... SET [NO] INHERIT This allows redefining an existing non-inheritable constraint to be inheritable, which makes it possible to straighten up situations with NO INHERIT constraints so that they can become normal constraints without having to re-verify existing data. For existing inheritance children this may require creating additional constraints, if they don't exist already. It also allows doing the opposite, if only for symmetry. Author: Suraj Kharage <suraj.kharage@enterprisedb.com> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/CAF1DzPVfOW6Kk=7SSh7LbneQDJWh=PbJrEC_Wkzc24tHOyQWGg@mail.gmail.com	2025-03-05 13:50:22 +01:00
Michael Paquier	f4694e0f35	Fix some gaps in pg_stat_io with WAL receiver and WAL summarizer The WAL receiver and WAL summarizer processes each gain a call to pgstat_report_wal(), to make sure that they report their WAL statistics to pgstats, gathering data for pg_stat_io. In the WAL receiver, the stats reports are timed with status updates sent to the primary, which depend on wal_receiver_status_interval and wal_receiver_timeout. This is a conservative choice, but perhaps we could be more aggressive with the frequency of the stats reports. An interesting historical fact is that the WAL receiver does writes and syncs of WAL, but it has never reported its statistics to pgstats in pg_stat_wal. In the WAL summarizer, the stats reports are done each time the process waits for WAL. While on it, pg_stat_io is adjusted so that these two processes do not report any rows when IOObject is not WAL, making the view easier to use with fewer rows. Two tests are added in TAP, checking statistics for the WAL summarizer and the WAL receiver. Status updates in the WAL receiver are currently possible in the recovery test 001_stream_rep.pl. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/Z8UKZyVSHUUQJHNb@paquier.xyz	2025-03-05 10:17:39 +09:00
Tomas Vondra	b229c10164	Enforce memory limit during parallel GIN builds Index builds are expected to respect maintenance_work_mem, just like other maintenance operations. For serial builds this is done simply by flushing the buffer in ginBuildCallback() into the index. But with parallel builds it's more complicated, because there are multiple places that can allocate memory. ginBuildCallbackParallel() does the same thing as ginBuildCallback(), except that the accumulated items are written into tuplesort. Then the entries with the same key get merged - first in the worker, then in the leader - and the TID lists may get (arbitrarily) long. It's unlikely it would exceed the memory limit, but it's possible. We address this by evicting some of the data if the list gets too long. We can't simply dump the whole in-memory TID list. The GIN index bulk insert code expects to see TIDs in monotonic order; it may fail if the TIDs go backwards. If the TID lists overlap, evicting the whole current TID list would break this (a later entry might add "old" TID values into the already-written part). In the workers this is not an issue, because the lists never overlap. But the leader may see overlapping lists produced by the workers. We can however derive a safe "horizon" TID - the entries (for a given key) are sorted by (key, first TID), which means no future list can add values before the last "first TID" we've seen. This patch tracks the "frozen" part of the TID list, which we know can't change by merging additional TID lists. If needed, we can evict this part of the list. We don't want to do this too often - the smaller lists we evict, the more expensive it'll be to merge them in the next step (especially in the leader). Therefore we only trim the list if we have at least 1024 frozen items, and if the whole list is at least 64kB large. These thresholds are somewhat arbitrary and conservative. 
We might calculate the values from maintenance_work_mem, but tests show that does not really improve anything (time, compression ratio, ...). So we stick to these conservative values to release memory faster. Author: Tomas Vondra Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com	2025-03-04 20:41:13 +01:00
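The trimming rule in the commit above can be sketched as two small helpers. This is a simplified model under stated assumptions, not the GIN build code: TIDs are modeled as plain 64-bit integers, and the names `count_frozen`, `should_trim`, `EXAMPLE_MIN_FROZEN_ITEMS` and `EXAMPLE_MIN_LIST_BYTES` are hypothetical. Only the two threshold values (1024 items, 64kB) come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Thresholds quoted in the commit message; the macro names are made up. */
#define EXAMPLE_MIN_FROZEN_ITEMS	1024
#define EXAMPLE_MIN_LIST_BYTES		(64 * 1024)

/*
 * Count the leading entries of a sorted TID list that lie strictly below
 * the horizon, i.e. the "first TID" of the most recent list seen for this
 * key.  Since lists arrive sorted by (key, first TID), no later list can
 * add values below that horizon, so these entries can no longer change.
 */
size_t
count_frozen(const uint64_t *tids, size_t ntids, uint64_t horizon)
{
	size_t		n = 0;

	while (n < ntids && tids[n] < horizon)
		n++;
	return n;
}

/*
 * Decide whether evicting the frozen prefix is worthwhile: trimming lists
 * that are too small makes the later merge more expensive.
 */
bool
should_trim(size_t nfrozen, size_t list_bytes)
{
	return nfrozen >= EXAMPLE_MIN_FROZEN_ITEMS &&
		list_bytes >= EXAMPLE_MIN_LIST_BYTES;
}
```

Both conditions must hold at once, which matches the commit's aim of not trimming too often.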
Álvaro Herrera	7bbc46213d	Fix ALTER TABLE error message This bogus error message was introduced in 2013 by commit `f177cbfe67`, because of misunderstanding the processCASbits() API; at the time, no test cases were added that would be affected by this change. Only in `ca87c415e2` was one added (along with a couple of typos), with an XXX note that the error message was bogus. Fix the whole, add some test cases. Backpatch all the way back. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/202503041822.aobpqke3igvb@alvherre.pgsql	2025-03-04 20:07:30 +01:00
Masahiko Sawada	bacbc4863b	Refactor Copy{From|To}GetRoutine() to use pass-by-reference argument. The change improves efficiency by eliminating unnecessary copying of CopyFormatOptions. Coverity also complained about inefficiencies caused by pass-by-value. Oversight in `7717f6300` and `2e4127b6d`. Reported-by: Junwang Zhao <zhjwpku@gmail.com> Reported-by: Tom Lane <tgl@sss.pgh.pa.us> (per reports from coverity) Author: Sutou Kouhei <kou@clear-code.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAEG8a3L6YCpPksTQMzjD_CvwDEhW3D_t=5md9BvvdOs5k+TA=Q@mail.gmail.com	2025-03-04 10:38:41 -08:00
Tomas Vondra	0b2a45a5d1	Compress TID lists when writing GIN tuples to disk When serializing GIN tuples to tuplesorts during parallel index builds, we can significantly reduce the amount of data by compressing the TID lists. The GIN opclasses may produce a lot of data (depending on how many keys are extracted from each row), and the TID compression is very efficient and effective. If the number of distinct keys is high, the first worker pass (reading data from the table and writing them into a private tuplesort) may not benefit from the compression very much. It is likely to spill data to disk before the TID lists get long enough for the compression to help. The second pass (writing the merged data into the shared tuplesort) is more likely to benefit from compression. The compression can be seen as a way to reduce the amount of disk space needed by the parallel builds, because the data is written twice. First into the per-worker tuplesorts, then into the shared tuplesort. Author: Tomas Vondra Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com	2025-03-04 19:02:05 +01:00
Tomas Vondra	c878de1db4	Make FP_LOCK_SLOTS_PER_BACKEND look like a function The FP_LOCK_SLOTS_PER_BACKEND macro looks like a constant, but it depends on the max_locks_per_transaction GUC, and thus can change. This is non-obvious and confusing, so make it look more like a function by renaming it to FastPathLockSlotsPerBackend(). While at it, use the macro when initializing fast-path shared memory, instead of using the formula. Reported-by: Andres Freund Discussion: https://postgr.es/m/ffiwtzc6vedo6wb4gbwelon5nefqg675t5c7an2ta7pcz646cg%40qwmkdb3l4ett	2025-03-04 18:33:12 +01:00
Heikki Linnakangas	d2e7068392	Fix outdated comment Commit `bc971f4025` replaced the latch-setting mechanism that the comment talked about with a condition variable. And before that, commit `2258e76f90` moved the code so that the comment got detached from the loop that it talked about, so move the comment closer to the loop.	2025-03-04 15:33:19 +02:00
Peter Eisentraut	3abbd8dbeb	Fix accidental use of = instead of == Fix for commit `630f9a43ce`. It used = instead of ==. The result would be an incorrect error message. Author: Jacob Brazeal <jacob.brazeal@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://www.postgresql.org/message-id/flat/CA%2BCOZaC-JMbhQ4O0Q8V1Bxa0R%2BNex_RN9D6UyuLPiEx_CK4Heg%40mail.gmail.com	2025-03-04 09:45:01 +01:00
Peter Eisentraut	f011acdd61	Fix ALTER TABLE ADD VIRTUAL GENERATED COLUMN when table rewrite. Demo: CREATE TABLE gtest20a (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 2) VIRTUAL); ALTER TABLE gtest20a ADD COLUMN c float8 DEFAULT RANDOM() CHECK (b < 60); ERROR: no generation expression found for column number 2 of table "pg_temp_17306". In ATRewriteTable, the pg_attrdef default expression entries for the relation identified by OIDNewHeap (if valid) have not been populated, so OIDNewHeap cannot be used to call expand_generated_columns_in_expr or build_generation_expression. Therefore in ATRewriteTable, we can only use the existing relation to expand the generated expression. Author: jian he <jian.universality@gmail.com> Reviewed-by: Srinath Reddy <srinath2133@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CACJufxEJ%3DFoajabWXjszo_yrQeKSxdZ87KJqBW373rSbajKGAA%40mail.gmail.com	2025-03-04 09:18:32 +01:00
Richard Guo	716a051aac	Avoid NullTest deduction for clone clauses In commit `b262ad440`, we introduced an optimization that reduces an IS NOT NULL qual on a column defined as NOT NULL to constant true, and an IS NULL qual on a NOT NULL column to constant false, provided we can prove that the input expression of the NullTest is not nullable by any outer join. This deduction happens after we have generated multiple clones of the same qual condition to cope with commuted-left-join cases. However, performing the NullTest deduction for clone clauses can be unsafe, because we don't have a reliable way to determine if the input expression of a NullTest is non-nullable: nullingrel bits in clone clauses may not reflect reality, so we dare not draw conclusions from clones about whether Vars are guaranteed not-null. To fix, we check whether the given RestrictInfo is a clone clause in restriction_is_always_true and restriction_is_always_false, and avoid performing any reduction if it is. There are several ensuing plan changes in predicate.out, and we have to modify the tests to ensure that they continue to test what they are intended to. Additionally, this fix causes the test case added in `f00ab1fd1` to no longer trigger the bug that commit fixed, so we also remove that test case. Back-patch to v17 where this bug crept in. Reported-by: Ronald Cruz <cruz@rentec.com> Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/f5320d3d-77af-4ce8-b9c3-4715ff33f213@rentec.com Backpatch-through: 17	2025-03-04 16:11:03 +09:00
Michael Paquier	c76db55c90	Split pgstat_bestart() into three different routines pgstat_bestart(), used post-authentication to set up a backend entry in the PgBackendStatus array so that its data becomes visible in pg_stat_activity and related catalogs, has its logic divided into three routines with this commit, called in order at different steps of the backend initialization: * pgstat_bestart_initial() sets up the backend entry with a minimal amount of information, reporting it with a new BackendState called STATE_STARTING while waiting for backend initialization and client authentication to complete. The main benefit that this offers is observability, so that it is possible to monitor the backend activity during authentication. This step happens earlier than in the logic prior to this commit. pgstat_beinit() happens earlier as well, before authentication. * pgstat_bestart_security() reports the SSL/GSS status of the connection, once authentication completes. Auxiliary processes, for example, do not need to call this step, hence it is optional. This step is called after performing authentication, same as previously. * pgstat_bestart_final() reports the user and database IDs, takes the entry out of STATE_STARTING, and reports its application_name. This is called as the last step of the three, once authentication completes. An injection point is added, with a test checking that the "starting" phase of a backend entry is visible in pg_stat_activity. Some follow-up patches are planned to take advantage of this refactoring with more information provided in backend entries during authentication (LDAP hanging was a problem for the author, initially). Author: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAOYmi+=60deN20WDyCoHCiecgivJxr=98s7s7-C8SkXwrCfHXg@mail.gmail.com	2025-03-04 14:09:44 +09:00
Michael Paquier	40d3f82744	Add more assertions in palloc0() and palloc_extended() palloc() includes an assertion checking that an alloc() implementation never returns NULL for all MemoryContextMethods. This commit adds a similar assertion in palloc0(). In palloc_extended(), a different assertion is added, checking that MCXT_ALLOC_NO_OOM is set when an alloc() routine returns NULL. These additions can be useful to catch errors when implementing a new set of MemoryContextMethods routines. Author: Andreas Karlsson <andreas@proxel.se> Discussion: https://postgr.es/m/507e8eba-2035-4a12-a777-98199a66beb8@proxel.se	2025-03-04 10:53:10 +09:00
Melanie Plageman	06eae9e621	Trigger more frequent autovacuums with relallfrozen Calculate the insert threshold for triggering an autovacuum of a relation based on the number of unfrozen pages. By only considering the unfrozen portion of the table when calculating how many tuples to add to the insert threshold, we can trigger more frequent vacuums of insert-heavy tables. This increases the chances of vacuuming those pages when they still reside in shared buffers. This also increases the number of autovacuums triggered by tuples inserted and not by wraparound risk. We prefer to freeze these pages during insert-triggered autovacuums, as anti-wraparound vacuums are not automatically canceled by conflicting lock requests. We calculate the unfrozen percentage of the table using the recently added (`99f8f3fbbc`) relallfrozen column of pg_class. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_aj-P7YyBz_cPNwztz6ohP%2BvWis%3Diz3YcomkB3NpYA--w%40mail.gmail.com	2025-03-03 14:42:00 -05:00
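The threshold calculation described in the commit above can be sketched as follows. The exact formula is an assumption made for illustration: a fixed base plus a scale factor applied to the tuple count, with the scaled part multiplied by the unfrozen fraction of the table's pages. The function and parameter names are hypothetical; only relallfrozen, and the idea of scaling by the unfrozen portion, come from the commit message.

```c
/*
 * Hypothetical sketch: number of inserted tuples needed to trigger an
 * insert-driven autovacuum, counting only the unfrozen share of the table.
 *
 * Assumed formula (not confirmed by the commit message):
 *   threshold = base_thresh + scale_factor * reltuples * unfrozen_fraction
 * where unfrozen_fraction = 1 - relallfrozen / relpages.
 */
double
insert_vacuum_threshold(double base_thresh, double scale_factor,
						double reltuples, double relpages,
						double relallfrozen)
{
	double		unfrozen_fraction = 1.0;

	if (relpages > 0 && relallfrozen > 0)
	{
		/* relallfrozen is only an estimate, so clamp it to relpages. */
		if (relallfrozen > relpages)
			relallfrozen = relpages;
		unfrozen_fraction = 1.0 - relallfrozen / relpages;
	}

	return base_thresh + scale_factor * reltuples * unfrozen_fraction;
}
```

Under this assumed formula, a table whose pages are mostly frozen needs far fewer newly inserted tuples to cross the threshold than a table of the same size with no frozen pages, which is how the change yields more frequent vacuums of insert-heavy tables.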
Tom Lane	35c8dd9e11	Simplify some logic around setting pg_attribute.atthasdef. DefineRelation was of the opinion that it could usefully pre-fill atthasdef flags to eliminate work for StoreAttrDefault. This is not the case, however: the tupledesc that it's filling is not the one that InsertPgAttributeTuples will work from. The tupledesc used there is made by RelationBuildLocalRelation, which deliberately doesn't copy atthasdef. Moreover, if this did happen as the code thinks, it would be wrong for the case of plain "DEFAULT NULL" clauses, since we detect and ignore simple-null-Const defaults later on. Hence, remove the useless code. It also emerges that it's not really worth a special-case path in StoreAttrDefault() for atthasdef already being set, because as far as we can see that never happens: cases where an existing default gets updated always do RemoveAttrDefault first, so as to clean up possibly-no-longer-correct dependency entries. If it were the case the code would still work, anyway. Also remove a nearby comment made moot by `5eaa0e92e`. Author: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/CACJufxHFssPvkP1we7WMhPD_1kwgbG52o=kQgL+TnVoX5LOyCQ@mail.gmail.com	2025-03-03 13:35:48 -05:00
Tom Lane	4528768d98	Remove now-dead code in StoreAttrDefault(). StoreAttrDefault() is no longer responsible for filling attmissingval, so remove the code for that. Get rid of RawColumnDefault.missingMode, too, as we no longer need that to pass information around. While here, clean up some sloppy coding in StoreAttrDefault(), such as failure to use XXXGetDatum macros. These aren't bugs but they're not good code either. Reported-by: jian he <jian.universality@gmail.com> Author: jian he <jian.universality@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxHFssPvkP1we7WMhPD_1kwgbG52o=kQgL+TnVoX5LOyCQ@mail.gmail.com	2025-03-03 13:09:20 -05:00
Tom Lane	95f650674d	Fix broken handling of domains in atthasmissing logic. If a domain type has a default, adding a column of that type (without any explicit DEFAULT clause) failed to install the domain's default value in existing rows, instead leaving the new column null. This is unexpected, and it used to work correctly before v11. The cause is confusion in the atthasmissing mechanism about which default value to install: we'd only consider installing an explicitly-specified default, and then we'd decide that no table rewrite is needed. To fix, take the responsibility for filling attmissingval out of StoreAttrDefault, and instead put it into ATExecAddColumn's existing logic that derives the correct value to fill the new column with. Also, centralize the logic that determines the need for default-related table rewriting there, instead of spreading it over four or five places. In the back branches, we'll leave the attmissingval-filling code in StoreAttrDefault even though it's now dead, for fear that some extension may be depending on that functionality to exist there. A separate HEAD-only patch will clean up the now-useless code. Reported-by: jian he <jian.universality@gmail.com> Author: jian he <jian.universality@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxHFssPvkP1we7WMhPD_1kwgbG52o=kQgL+TnVoX5LOyCQ@mail.gmail.com Backpatch-through: 13	2025-03-03 12:43:44 -05:00
Melanie Plageman	99f8f3fbbc	Add relallfrozen to pg_class Add relallfrozen, an estimate of the number of pages marked all-frozen in the visibility map. pg_class already has relallvisible, an estimate of the number of pages in the relation marked all-visible in the visibility map. This is used primarily for planning. relallfrozen, together with relallvisible, is useful for estimating the outstanding number of all-visible but not all-frozen pages in the relation for the purposes of scheduling manual VACUUMs and tuning vacuum freeze parameters. A future commit will use relallfrozen to trigger more frequent vacuums on insert-focused workloads with significant volume of frozen data. Bump catalog version Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_aj-P7YyBz_cPNwztz6ohP%2BvWis%3Diz3YcomkB3NpYA--w%40mail.gmail.com	2025-03-03 11:18:05 -05:00
Tomas Vondra	8492feb98f	Allow parallel CREATE INDEX for GIN indexes Allow using parallel workers to build a GIN index, similarly to BTREE and BRIN. For large tables this may result in significant speedup when the build is CPU-bound. The work is divided so that each worker builds index entries on a subset of the table, determined by the regular parallel scan used to read the data. Each worker uses a local tuplesort to sort and merge the entries for the same key. The TID lists do not overlap (for a given key), which means the merge sort simply concatenates the two lists. The merged entries are written into a shared tuplesort for the leader. The leader needs to merge the sorted entries again, before writing them into the index. But this way a significant part of the work happens in the workers, and the leader is left with merging fewer large entries, which is more efficient. Most of the parallelism infrastructure is a simplified copy of the code used by BTREE indexes, omitting the parts irrelevant for GIN indexes (e.g. uniqueness checks). Original patch by me, with reviews and substantial improvements by Matthias van de Meent, certainly enough to make him a co-author. Author: Tomas Vondra, Matthias van de Meent Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com	2025-03-03 16:53:06 +01:00
Michael Paquier	3f1db99bfa	Handle auxiliary processes in SQL functions of backend statistics This commit impacts the following SQL functions, authorizing the access to the PGPROC entries of auxiliary processes when attempting to fetch or reset backend-level pgstats entries: - pg_stat_reset_backend_stats() - pg_stat_get_backend_io() This is relevant since `a051e71e28` for at least the WAL summarizer, WAL receiver and WAL writer processes, that has changed the backend statistics to authorize these three following the addition of WAL I/O statistics in pg_stat_io and backend statistics. The code is more flexible with future changes written this way, adapting automatically to any updates done in pgstat_tracks_backend_bktype(). While on it, pgstat_report_wal() gains a call to pgstat_flush_backend(), making sure that backend I/O statistics are updated when calling this routine. This makes the statistics report correctly for the WAL writer. WAL receiver and WAL summarizer do not call pgstat_report_wal() yet (spoiler: both should). It should be possible to lift some of the existing restrictions for other auxiliary processes, as well, but this is left as future work. Reported-by: Rahila Syed <rahilasyed90@gmail.com> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/CAH2L28v9BwN8_y0k6FQ591=0g2Hj_esHLGj3bP38c9nmVykoiA@mail.gmail.com	2025-03-03 09:57:48 +09:00
Peter Eisentraut	56ba0463d3	Set amcancrosscompare to true for hash This was missed in the refactoring in patch `ce62f2f2a0`, which thus created a regression. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E1tngY6-0000UL-2n%40gemulon.postgresql.org	2025-03-01 09:15:27 +01:00
Masahiko Sawada	8a1012b35d	Re-export NextCopyFromRawFields() to copy.h. Commit `7717f63006` removed NextCopyFromRawFields() from copy.h. While it was hoped that NextCopyFrom() could serve as an alternative, certain use cases still require NextCopyFromRawFields(). For instance, extensions like file_text_array_fdw, which process source data with an unknown number of columns, rely on this function. Per buildfarm member crake. Reported-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Sutou Kouhei <kou@clear-code.com> Discussion: https://postgr.es/m/5c7e1ac8-5083-4c08-af19-cb9ade2f16ce@dunslane.net	2025-02-28 15:11:41 -08:00
Masahiko Sawada	7717f63006	Refactor COPY FROM to use format callback functions. This commit introduces a new CopyFromRoutine struct, which is a set of callback routines to read tuples in a specific format. It also makes COPY FROM with the existing formats (text, CSV, and binary) utilize these format callbacks. This change is a preliminary step towards making the COPY FROM command extensible in terms of input formats. Similar to `2e4127b6d2`, this refactoring contributes to a performance improvement by reducing the number of "if" branches that need to be checked on a per-row basis when sending field representations in text or CSV mode. The performance benchmark results showed ~5% performance gain in text or CSV mode. Author: Sutou Kouhei <kou@clear-code.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com	2025-02-28 10:29:36 -08:00
Robert Haas	77cb08be51	Avoid including explain.h in explain_format.h and explain_dr.h As per a suggestion from Tom Lane, we do this by declaring "struct ExplainState" here and refer to that rather than "ExplainState". Also per Tom, CreateExplainSerializeDestReceiver was still defined in explain.h in addition to explain_dr.h. Remove leftover prototype. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: http://postgr.es/m/CA+TgmoYtaad3i21V0jqua-fbr+CR0ix6uBvEX8_s6BG96abd=g@mail.gmail.com	2025-02-28 13:17:29 -05:00
Robert Haas	51d3e279c3	Fix missing space in EXPLAIN ANALYZE output. Commit `ddb17e387a` introduced this regression. Ideally, the regression tests would have caught this mistake, but apparently they don't test with timing enabled, presumably because that would make the output vary. Author: Thom Brown <thom@linux.com> Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> Discussion: http://postgr.es/m/CAA-aLv6nq=UeiyvM7_Mxgo9TVBzs2oh46b9vfyLzuyVEz3j1-g@mail.gmail.com	2025-02-28 13:04:12 -05:00
Michael Paquier	c2a50ac678	Invent pgstat_fetch_stat_backend_by_pid() This code is extracted from pg_stat_get_backend_io() in pgstatfuncs.c, so as it can be shared with other areas that need backend pgstats entries while having the benefits of the various sanity checks refactored here. As per its name, this retrieves backend statistics based on a PID, with the option of retrieving a BackendType if given in input. Currently, this is used for the backend-level IO statistics. The next move would be to reuse that for the backend-level WAL statistics. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal	2025-02-28 11:20:31 +09:00
Masahiko Sawada	2e4127b6d2	Refactor COPY TO to use format callback functions. This commit introduces a new CopyToRoutine struct, which is a set of callback routines to copy tuples in a specific format. It also makes the existing formats (text, CSV, and binary) utilize these format callbacks. This change is a preliminary step towards making the COPY TO command extensible in terms of output formats. Additionally, this refactoring contributes to a performance improvement by reducing the number of "if" branches that need to be checked on a per-row basis when sending field representations in text or CSV mode. The performance benchmark results showed ~5% performance gain in text or CSV mode. Author: Sutou Kouhei <kou@clear-code.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com	2025-02-27 15:03:52 -08:00
Robert Haas	555960a0fb	Create explain_dr.c and move DestReceiver-related code there. explain.c has grown rather large, and the code that deals with the DestReceiver that supports the SERIALIZE option is pretty easily severable from the rest of explain.c; hence, move it to a separate file. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: http://postgr.es/m/CA+TgmoYutMw1Jgo8BWUmB3TqnOhsEAJiYO=rOQufF4gPLWmkLQ@mail.gmail.com	2025-02-27 13:14:16 -05:00
Robert Haas	9173e8b604	Create explain_format.c and move relevant code there. explain.c has grown rather large, so move various functions that are principally concerned with output generation to a new source file, explain_format.c, instead of lumping them in with everything else that is part of explain.c Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: http://postgr.es/m/CA+TgmoYutMw1Jgo8BWUmB3TqnOhsEAJiYO=rOQufF4gPLWmkLQ@mail.gmail.com	2025-02-27 12:37:10 -05:00
Robert Haas	95dbd827f2	EXPLAIN: Always use two fractional digits for row counts. Commit `ddb17e387a` attempted to avoid confusing users by displaying digits after the decimal point only when nloops > 1, since it's impossible to have a fractional row count after a single iteration. However, this made the regression tests unstable since parallel queries will have nloops>1 for all nodes below the Gather or Gather Merge in normal cases, but if the workers don't start in time and the leader finishes all the work, they will suddenly have nloops==1, making it unpredictable whether the digits after the decimal point would be displayed or not. Although `44cbba9a7f` seemed to fix the immediate failures, it may still be the case that there are lower-probability failures elsewhere in the regression tests. Various fixes are possible here. For example, it has previously been proposed that we should try to display the digits after the decimal point only if rows/nloops is an integer, but currently rows is stored as a float so it's not theoretically an exact quantity -- precision could be lost in extreme cases. It has also been proposed that we should try to display the digits after the decimal point only if we're under some sort of construct that could potentially cause looping regardless of whether it actually does. While such ideas are not without merit, this patch adopts the much simpler solution of always displaying two decimal digits. If that approach stands up to scrutiny from the buildfarm and human users, it spares us the trouble of doing anything more complex; if not, we can reassess. This commit incidentally reverts `44cbba9a7f`, which should no longer be needed. Author: Robert Haas <robertmhaas@gmail.com> Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> Discussion: http://postgr.es/m/CA+TgmoazzVHn8sFOMFAEwoqBTDxKT45D7mvkyeHgqtoD2cn58Q@mail.gmail.com	2025-02-27 11:27:16 -05:00
Peter Eisentraut	ce62f2f2a0	Generalize hash and ordering support in amapi Stop comparing access method OID values against HASH_AM_OID and BTREE_AM_OID, and instead check the IndexAmRoutine for an index to see if it advertises its ability to perform the necessary ordering, hashing, or cross-type comparing functionality. A field amcanorder already existed, this uses it more widely. Fields amcanhash and amcancrosscompare are added for the other purposes. Author: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-02-27 17:03:31 +01:00
Alexander Korotkov	e167191dc1	Get rid of ojrelid local variable in remove_rel_from_query() As spotted by Coverity, the calculation of ojrelid mixes signed and unsigned types, which can cause overflow and undefined behavior. Instead of trying to fix the expression, this commit eliminates the local variable. Explicit branching is used to replace the -1 value. That, in turn, requires changing the signature of the remove_rel_from_eclass() function. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/914330.1740330169%40sss.pgh.pa.us Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>	2025-02-27 11:22:01 +02:00
Thomas Munro	55918f798b	Remove arbitrary cap on read_stream.c buffer queue. Previously the internal queue of buffers was capped at max_ios * 4, though not less than io_combine_limit, at allocation time. That was done in the first version based on conservative theories about resource usage and heuristics pending later work. The configured I/O depth could not always be reached with dense random streams generated by ANALYZE, VACUUM, the proposed Bitmap Heap Scan patch, and also sequential streams with the proposed AIO subsystem to name some examples. The new formula is (max_ios + 1) * io_combine_limit, enough buffers for the full configured I/O concurrency level using the full configured I/O combine size, plus the buffers from one finished but not yet consumed full-sized I/O. Significantly more memory would be needed for high GUC values if the client code requests a large per-buffer data size, but that is discouraged (existing and proposed stream users try to keep it under a few words, if not zero). With this new formula, an intermediate variable could have overflowed under maximum GUC values, so its data type is adjusted to cope. Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-02-27 20:49:48 +13:00
Amit Kapila	8709dccc79	Fix the race condition in ReplicationSlotAcquire(). After commit `f41d8468dd`, a process could acquire and use a replication slot that had just been invalidated, leading to failures while accessing WAL. To ensure that we don't accidentally start using invalid slots, we must perform the invalidation check after acquiring the slot or under the spinlock where we associate the slot with a particular process. We choose the earlier method to keep the code simple. Reported-by: Hou Zhijie <houzj.fnst@fujitsu.com> Author: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CABdArM7J-LbGoMPGUPiFiLOyB_TZ5+YaZb=HMES0mQqzVTn8Gg@mail.gmail.com	2025-02-27 09:47:04 +05:30
Michael Paquier	495864a4cf	Refactor code of pg_stat_get_wal() building result tuple This commit adds to pgstatfuncs.c a new routine called pg_stat_wal_build_tuple(), helper routine for pg_stat_get_wal(). This is in charge of filling one tuple based on the contents of PgStat_WalStats retrieved from pgstats. This refactoring will be used by an upcoming patch introducing backend-level WAL statistics, simplifying the main patch. Note that it is not possible for stats_reset to be NULL in pg_stat_wal; backend statistics need to be able to handle this case. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal	2025-02-27 11:54:36 +09:00
Michael Paquier	62ec3e1f67	Fix possible double-release of spinlock in procsignal.c `9d9b9d46f3` has added spinlocks to protect the fields in ProcSignal flags, introducing a code path in ProcSignalInit() where a spinlock could be released twice if the pss_pid field of a ProcSignalSlot is found as already set. Multiple spinlock releases have no effect with most spinlock implementations, but this could cause the code to run into issues when the spinlock is acquired concurrently by a different process. This sanity check on pss_pid generates a LOG that can be delayed until after the spinlock is released as, like older versions up to v17, the code expects the initialization of the ProcSignalSlot to happen even if pss_pid is found incorrect. The code is changed so as the old pss_pid is read while holding the slot's spinlock, with the LOG from the sanity check generated after releasing the spinlock, preventing the double release. Author: Maksim Melnikov <m.melnikov@postgrespro.ru> Co-authored-by: Maxim Orlov <orlovmg@gmail.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/dca47527-2d8b-4e3b-b5a0-e2deb73371a4@postgrespro.ru	2025-02-27 09:43:06 +09:00
Tom Lane	40e27d04b4	Use attnum to identify index columns in pg_restore_attribute_stats(). Previously we used attname for both table and index columns, but that is problematic for indexes because their attnames are assigned by internal rules that don't guarantee to preserve the names across dump and reload. (This is what's causing the remaining buildfarm failures in cross-version-upgrade tests.) Fortunately we can use attnum instead, since there's no such thing as adding or dropping columns in an existing index. We met this same problem previously with ALTER INDEX ... SET STATISTICS, and solved it the same way, cf commit `5b6d13eec`. In pg_restore_attribute_stats() itself, we accept either attnum or attname, but the policy used by pg_dump is to always use attname for tables and attnum for indexes. Author: Tom Lane <tgl@sss.pgh.pa.us> Author: Corey Huinker <corey.huinker@gmail.com> Discussion: https://postgr.es/m/1457469.1740419458@sss.pgh.pa.us	2025-02-26 16:36:20 -05:00
Michael Paquier	0e42d31b0b	Adding new PgStat_WalCounters structure in pgstat.h This new structure contains the counters and the data related to the WAL activity statistics gathered from WalUsage, separated into its own structure so as it can be shared across more than one Stats structure in pg_stat.h. This refactoring will be used by an upcoming patch introducing backend-level WAL statistics. Bump PGSTAT_FILE_FORMAT_ID. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal	2025-02-26 16:48:54 +09:00
Michael Paquier	d7cbeaf261	Remove pgstat_flush_wal() All the processes that generate WAL should call pgstat_report_wal() to report all their statistics related to WAL, and this is already what happens in the tree. Keeping pgstat_flush_wal() is confusing while the other routine is encouraged. This routine is not required since `fc415edf8c`, where it was last used in pgstat_report_stat() before an equivalent callback existed. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/Z71oPkJJICrRB5Ws@paquier.xyz	2025-02-26 15:37:28 +09:00
Michael Paquier	adc6032fa8	Improve FATAL message for invalid TLI history at recovery The original message did not mention where the checkpoint record LSN was found, a control file or a backup_label file. A couple of LOG messages are generated before this FATAL check is reached, providing more details about the way recovery is set up. However, knowing this information in this specific message is useful for debugging. This is also useful for instances where log_min_messages is set to FATAL or more, where LOG messages do not show up. Author: Benoit Lobréau <benoit.lobreau@dalibo.com> Reviewed-by: David Steele <david@pgbackrest.org> Discussion: https://postgr.es/m/4ed10bc8-5513-4d8e-8643-8abcaa08336d@dalibo.com	2025-02-26 14:26:16 +09:00
Michael Paquier	6c349d83b6	Re-add GUC track_wal_io_timing This commit is a rework of `2421e9a51d`, about which Andres Freund has raised some concerns as it is valuable to have both track_io_timing and track_wal_io_timing in some cases, as the WAL write and fsync paths can be a major bottleneck for some workloads. Hence, it can be relevant to not calculate the WAL timings in environments where pg_test_timing performs poorly while capturing some IO data under track_io_timing for the non-WAL IO paths. The opposite can be also true: it should be possible to disable the non-WAL timings and enable the WAL timings (the previous GUC setups allowed this possibility). track_wal_io_timing is added back in this commit, controlling if WAL timings should be calculated in pg_stat_io for the read, fsync and write paths, as done previously with pg_stat_wal. pg_stat_wal previously tracked only the sync and write parts (now removed), read stats is new data tracked in pg_stat_io, all three are aggregated if track_wal_io_timing is enabled. The read part matters during recovery or if a XLogReader is used. Extra note: more control over if the types of timings calculated in pg_stat_io could be done with a GUC that lists pairs of (IOObject,IOOp). Reported-by: Andres Freund <andres@anarazel.de> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/3opf2wh2oljco6ldyqf7ukabw3jijnnhno6fjb4mlu6civ5h24@fcwmhsgmlmzu	2025-02-26 09:49:59 +09:00
Jeff Davis	a5cbdeb98a	Remove redundant pg_set_*_stats() variants. After commit `f3dae2ae58`, the primary purpose of separating the pg_set_*_stats() from the pg_restore_*_stats() variants was eliminated. Leave pg_restore_relation_stats() and pg_restore_attribute_stats(), which satisfy both purposes, and remove pg_set_relation_stats() and pg_set_attribute_stats(). Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Discussion: https://postgr.es/m/1457469.1740419458@sss.pgh.pa.us	2025-02-25 16:15:47 -08:00
Andres Freund	ecbff4378b	Change _mdfd_segpath() to return paths by value This basically mirrors the changes done in the predecessor commit. While there isn't currently a need to get these paths in critical sections, it seems a shame to unnecessarily allocate memory in these paths now that relpath() doesn't allocate anymore. Discussion: https://postgr.es/m/xeri5mla4b5syjd5a25nok5iez2kr3bm26j2qn4u7okzof2bmf@kwdh2vf7npra	2025-02-25 09:02:07 -05:00

28616 commits