postgresql

mirror of https://github.com/postgres/postgres.git synced 2026-07-15 12:51:05 -04:00

Author	SHA1	Message	Date
Nathan Bossart	3b88e50d6c	Add more columns to pg_stats, pg_stats_ext, and pg_stats_ext_exprs. This commit adds table OID and attribute number columns to pg_stats, and it adds table OID and statistics object OID columns to pg_stats_ext and pg_stats_ext_exprs. A proposed follow-up commit would use pg_stats.tableid to simplify a query in pg_dump. The others have no immediate purpose but may be useful later. Bumps catversion. Author: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CADkLM%3DcoCVy92QkVUUTLdo5eO2bMDtwMrzRn_8miAhX%2BuPaqXg%40mail.gmail.com	2026-03-17 09:26:27 -05:00
Peter Eisentraut	c9babbc881	Dump labels in reproducible order In pg_get_propgraphdef(), sort the labels before writing out, for a consistent dump order. Also, since we now have a list, we can get rid of the separate table scan to get the count. Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org	2026-03-17 14:07:29 +01:00
Michael Paquier	233e6ae953	gen_guc_tables.pl: Improve detection of inconsistent data This commit adds two improvements to gen_guc_tables.pl: 1) When finding two entries with the same name, the script complained about these being not in alphabetical order, which was confusing. Duplicated entries are now reported as their own error. 2) While the presence of the required fields is checked for all the parameters, the script did not perform any checks on the non-required fields. A check is added to check that any field defined matches with what can be accepted. Previously, a typo in the name of a required field would cause the field to be reported as missing. Non-mandatory fields would be silently ignored, which was problematic as we could lose some information. Author: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAN4CZFP=3xUoXb9jpn5OWwicg+rbyrca8-tVmgJsQAa4+OExkw@mail.gmail.com	2026-03-17 17:38:55 +09:00
Michael Paquier	1a7ccd2b33	Refactor some code around ALTER TABLE [NO] INHERIT [NO] INHERIT is not supported for partitioned tables, but this portion of tablecmds.c did not apply the same rules as the other sub-commands, checking the relkind in the execution phase, not the preparation phase. This commit refactors the code to centralize the relkind and other checks in the preparation phase for both command patterns, getting rid of one translatable string on the way. ATT_PARTITIONED_TABLE is removed from ATSimplePermissions(), and the child relation is checked the same way for both sub-commands. The ALTER TABLE patterns that now fail at preparation failed already at execution, hence there should be no changes from the user perspective except more consistent error messages generated. Some comments at the top of ATPrepAddInherit() were incorrect, CreateInheritance() being the routine checking the columns and constraints between the parent and its to-be-child. Author: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/CAEoWx2kggo1N2kDH6OSfXHL_5gKg3DqQ0PdNuL4LH4XSTKJ3-g@mail.gmail.com	2026-03-17 14:34:29 +09:00
David Rowley	d8a859d22b	Reduce size of CompactAttribute struct to 8 bytes Previously, this was 16 bytes. With the use of some bitflags and by reducing the attcacheoff field size to a 16-bit type, we can halve the size of the struct. It's unlikely that caching the offsets for offsets larger than what will fit in a 16-bit int will help much as the tuple is very likely to have some non-fixed-width types anyway, the offsets of which we cannot cache. Shrinking this down to 8 bytes helps by accessing fewer cachelines when performing tuple deformation. The fields used there are all fully fledged fields, which don't require any bitmasking to extract the value of. It also helps to more efficiently calculate the address of a compact_attrs[] element in TupleDesc as the x86 LEA instruction can work with 8 byte offsets, which allows the element address to be calculated from the TupleDesc's address in a single instruction using LEA's concurrent shift and add. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com	2026-03-17 15:06:31 +13:00
Fujii Masao	d927b4bd97	Fix WAL flush LSN used by logical walsender during shutdown Commit `6eedb2a5fd` made the logical walsender call XLogFlush(GetXLogInsertRecPtr()) to ensure that all pending WAL is flushed, fixing a publisher shutdown hang. However, if the last WAL record ends at a page boundary, GetXLogInsertRecPtr() can return an LSN pointing past the page header, which can cause XLogFlush() to report an error. A similar issue previously existed in the GiST code. Commit `b1f14c9672` introduced GetXLogInsertEndRecPtr(), which returns a safe WAL insertion end location (returning the start of the page when the last record ends at a page boundary), and updated the GiST code to use it with XLogFlush(). This commit fixes the issue by making the logical walsender use XLogFlush(GetXLogInsertEndRecPtr()) when flushing pending WAL during shutdown. Backpatch to all supported versions. Reported-by: Andres Freund <andres@anarazel.de> Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/vzguaguldbcyfbyuq76qj7hx5qdr5kmh67gqkncyb2yhsygrdt@dfhcpteqifux Backpatch-through: 14	2026-03-17 08:10:20 +09:00
David Rowley	7a2ab122a1	Fix thinko in nocachegetattr() and nocache_index_getattr() This code was recently adjusted by `c456e3911`, but that commit didn't get the logic correct when finding the attnum to start walking the tuple in. If there is a NULL, we need to start walking the tuple before it. Author: David Rowley <dgrowleyml@gmail.com> Reported-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/CAHewXNnb-s_=VdVUZ9h7dPA0u3hxV8x2aU3obZytnqQZ_MiROA@mail.gmail.com	2026-03-17 09:00:39 +13:00
Álvaro Herrera	fba4233c83	Reduce header inclusions via execnodes.h Remove a bunch of #include lines from execnodes.h. Most of these require suitable typedefs to be added, so that it still compiles standalone. In one case, the fix is to move a struct definition to the one .c file where it is needed. Also some light clean up in plannodes.h and genam.h, though not as extensive as in execnodes.h. Author: Álvaro Herrera <alvherre@kurilemu.de> Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/202603131240.ihwqdxnj7w2o@alvherre.pgsql	2026-03-16 14:34:57 +01:00
Peter Eisentraut	5c2a8d272b	Use C11 alignas in typedef definitions They were already using pg_attribute_aligned. This replaces that with alignas and moves that into the required syntactic position. Suggested-by: Peter Eisentraut <peter@eisentraut.org> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/d7a788fa-e609-4894-a8be-2f70e135424f%40eisentraut.org	2026-03-16 11:35:51 +01:00
Peter Eisentraut	2f094e7ac6	SQL Property Graph Queries (SQL/PGQ) Implementation of SQL property graph queries, according to SQL/PGQ standard (ISO/IEC 9075-16:2023). This adds: - GRAPH_TABLE table function for graph pattern matching - DDL commands CREATE/ALTER/DROP PROPERTY GRAPH - several new system catalogs and information schema views - psql \dG command - pg_get_propgraphdef() function for pg_dump and psql A property graph is a relation with a new relkind RELKIND_PROPGRAPH. It acts like a view in many ways. It is rewritten to a standard relational query in the rewriter. Access privileges act similar to a security invoker view. (The security definer variant is not currently implemented.) Starting documentation can be found in doc/src/sgml/ddl.sgml and doc/src/sgml/queries.sgml. Author: Peter Eisentraut <peter@eisentraut.org> Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Ajay Pal <ajay.pal.k@gmail.com> Reviewed-by: Henson Choi <assam258@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org	2026-03-16 10:14:18 +01:00
Fujii Masao	fd6ecbfa75	Ensure "still waiting on lock" message is logged only once per wait. When log_lock_waits is enabled, the "still waiting on lock" message is normally emitted only once while a session continues waiting. However, if the wait is interrupted, for example by wakeups from client_connection_check_interval, SIGHUP for configuration reloads, or similar events, the message could be emitted again each time the wait resumes. For example, with very small client_connection_check_interval values (e.g., 100 ms), this behavior could flood the logs with repeated messages, making them difficult to use. To prevent this, this commit guards the "still waiting on lock" message so it is reported at most once during a lock wait, even if the wait is interrupted. This preserves the intended behavior when no interrupts occur. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Hüseyin Demir <huseyin.d3r@gmail.com> Discussion: https://postgr.es/m/CAHGQGwHZUmg+r4kMcPYt_Z-txxVX+CJJhfra+qemxKXvAxYbpw@mail.gmail.com	2026-03-16 18:10:57 +09:00
Michael Paquier	c336133c65	Reject ALTER TABLE .. CLUSTER earlier for partitioned tables ALTER TABLE .. CLUSTER ON and SET WITHOUT CLUSTER are not supported for partitioned tables and already fail with a check happening when the sub-command is executed, not when it is prepared. This commit moves the relkind check for partitioned tables to happen when the sub-command is prepared in ATSimplePermissions(). This matches with the practice of the other sub-commands of ALTER TABLE, shaving one translatable string. mark_index_clustered() can be a bit simplified, switching one elog(ERROR) to an assertion. Note that mark_index_clustered() can also be called through a CLUSTER command, but it cannot be reached for a partitioned table, per the assertion based on the relkind in cluster_rel(), and there is only one caller of rebuild_relation(). Author: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/CAEoWx2kggo1N2kDH6OSfXHL_5gKg3DqQ0PdNuL4LH4XSTKJ3-g@mail.gmail.com	2026-03-16 17:48:39 +09:00
Fujii Masao	8fe315f18d	Add stats_reset column to pg_statio_all_sequences pg_statio_all_sequences lacked a stats_reset column, unlike the other pg_statio_* views that already expose it. This commit adds the column so users can see when the statistics in this view were last reset. Also this commit updates the documentation for pg_stat_reset_single_table_counters() to clarify that it can reset statistics for sequences and materialized views as well. Catalog version bumped. Author: Sami Imseih <samimseih@gmail.com> Co-authored-by: Shihao Zhong <zhong950419@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0v0OPGyDpwxkX81CtTt9xsj9-TNxhm=8JdOvEKPsVVFNg@mail.gmail.com	2026-03-16 17:24:08 +09:00
Peter Eisentraut	a41bc38439	Fix accidentally casting away const Recently introduced in commit `8c2b30487c`.	2026-03-16 07:37:03 +01:00
Amit Kapila	5f39698c90	Remove obsolete speculative insert cleanup in ReorderBuffer. Commit `4daa140a2f` introduced proper decoding for speculative aborts. As a result, the internal state is guaranteed to be clean when a new speculative insert is encountered. This patch removes the defensive cleanup code that is no longer reachable. Author: Antonin Houska <ah@cybertec.at> Discussion: https://postgr.es/m/23256.1772702981@localhost	2026-03-16 10:14:22 +05:30
Michael Paquier	bfa3c4f106	Optimize hash index bulk-deletion with streaming read This commit refactors hashbulkdelete() to use streaming reads, improving the efficiency of the operation by prefetching upcoming buckets while processing a current bucket. There are some specific changes required to make sure that the cleanup work happens in accordance with the data pushed to the stream read callback. When the cached metadata page is refreshed to be able to process the next set of buckets, the stream is reset and the data fed to the stream read callback has to be updated. The reset needs to happen in two code paths, when _hash_getcachedmetap() is called. The author has seen better performance numbers than myself on this one (with tweaks similar to `6c228755ad`). The numbers are good enough for both of us that this change is worth doing, in terms of IO and runtime. Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com	2026-03-16 09:22:09 +09:00
Tom Lane	82ff54377e	Move -ffast-math defense to float.c and remove the configure check. We had defenses against -ffast-math in timestamp-related files, which is a pretty obsolete place for them since we've not supported floating-point timestamps in a long time. Remove those and instead put one in float.c, which is still broken by using this switch. Add some commentary to put more color on why it's a bad idea. Also remove the check from configure. That was just there to fail faster, but it doesn't really seem necessary anymore, and besides we have no corresponding check in meson.build. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Suggested-by: Andres Freund <andres@anarazel.de> Suggested-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/abFXfKC8zR0Oclon%40ip-10-97-1-34.eu-west-3.compute.internal	2026-03-15 19:34:52 -04:00
David Rowley	c456e39113	Optimize tuple deformation This commit includes various optimizations to improve the performance of tuple deformation. We now precalculate CompactAttribute's attcacheoff, which allows us to remove the code from the deform routines which was setting the attcacheoff. Setting the attcacheoff is now handled by TupleDescFinalize(), which must be called before the TupleDesc is used for anything. Having TupleDescFinalize() means we can store the first attribute in the TupleDesc which does not have an offset cached. That allows us to add a dedicated deforming loop to deform all attributes up to the final one with an attcacheoff set, or up to the first NULL attribute, whichever comes first. Here we also improve tuple deformation performance of tuples with NULLs. Previously, if the HEAP_HASNULL bit was set in the tuple's t_infomask, deforming would, one-by-one, check each and every bit in the NULL bitmap to see if it was zero. Now, we process the NULL bitmap 1 byte at a time rather than 1 bit at a time to find the attnum with the first NULL. We can now deform the tuple without checking for NULLs up to just before that attribute. We also record the maximum attribute number which is guaranteed to exist in the tuple, that is, has a NOT NULL constraint and isn't an atthasmissing attribute. When deforming only attributes prior to the guaranteed attnum, we've no need to access the tuple's natt count. As an additional optimization, we only count fixed-width columns when calculating the maximum guaranteed column, as this eliminates the need to emit code to fetch byref types in the deformation loop for guaranteed attributes. Some locations in the code deform tuples that have yet to go through NOT NULL constraint validation. We're unable to perform the guaranteed attribute optimization when that's the case. This optimization is opt-in via the TupleTableSlot using the TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS flag. 
This commit also adds a more efficient way of populating the isnull array by using a bit-wise SWAR trick which performs multiplication on the inverse of the tuple's bitmap byte and masking out all but the lower bit of each of the boolean's byte. This results in much more optimal code when compared to determining the NULLness via att_isnull(). 8 isnull elements are processed at once using this method, which means we need to round the tts_isnull array size up to the next 8 bytes. The palloc code does this anyway, but the round-up needed to be formalized so as not to overwrite the sentinel byte in MEMORY_CONTEXT_CHECKING builds. Doing this also allows the NULL-checking deforming loop to more efficiently check the isnull array, rather than doing the bit-wise processing for each attribute that att_isnull() does. The level of performance improvement from these changes seems to vary depending on the CPU architecture. Apple's M chips seem particularly fond of the changes, with some of the tested deform-heavy queries going over twice as fast as before. With x86-64, the speedups aren't quite as large. With tables containing only a small number of columns, the speedups will be less. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com	2026-03-16 11:46:00 +13:00
David Rowley	503620311e	Add all required calls to TupleDescFinalize() As of this commit all TupleDescs must have TupleDescFinalize() called on them once the TupleDesc is set up and before BlessTupleDesc() is called. In this commit, TupleDescFinalize() does nothing. This change has only been separated out from the commit that properly implements this function to make the change more obvious. Any extension which makes its own TupleDesc will need to be modified to call the new function. The follow-up commit which properly implements TupleDescFinalize() will cause any code which forgets to do this to fail in assert-enabled builds in BlessTupleDesc(). It may still be worth mentioning this change in the release notes so that extension authors update their code. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com	2026-03-16 11:45:49 +13:00
Tom Lane	e5a77d876d	Save a few bytes per CatCTup. CatalogCacheCreateEntry() computed the space needed for a CatCTup as sizeof(CatCTup) + MAXIMUM_ALIGNOF. That's not our usual style, and it wastes memory by allocating more padding than necessary. On 64-bit machines sizeof(CatCTup) would be maxaligned already since it contains pointer fields, therefore this code is wasting 8 bytes compared to the more usual MAXALIGN(sizeof(CatCTup)). While at it, we don't really need to do MemoryContextSwitchTo() when we're only allocating one block. Author: ChangAo Chen <cca5507@qq.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/tencent_A42E0544C6184FE940CD8E3B14A3F0A39605@qq.com	2026-03-15 18:05:38 -04:00
Melanie Plageman	99bf1f8aa6	Save vmbuffer in heap-specific scan descriptors for on-access pruning Future commits will use the visibility map in on-access pruning to fix VM corruption and set the VM if the page is all-visible. Saving the vmbuffer in the scan descriptor reduces the number of times it would need to be pinned and unpinned, making the overhead of doing so negligible. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/C3AB3F5B-626E-4AAA-9529-23E9A20C727F%40gmail.com	2026-03-15 11:09:10 -04:00
Melanie Plageman	8d2c1df4f4	Avoid BufferGetPage() calls in heap_update() BufferGetPage() isn't cheap and heap_update() calls it multiple times when it could just save the page from a single call. Do that. While we are at it, make separate variables for old and new page in heap_xlog_update(). It's confusing to reuse "page" for both pages. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_a%2BhO4PCptyaPR7AMZd7FjcHfOFKKJT8ouU3KedMud0tQ%40mail.gmail.com	2026-03-15 10:42:34 -04:00
Melanie Plageman	a3511443e5	Initialize missing fields in CreateExecutorState() `d47cbf474e` and `cbc127917e` forgot to initialize a few fields they introduced in the EState, so do that now. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com	2026-03-15 10:13:14 -04:00
Tom Lane	2eb87345e1	Fix aclitemout() to work during early bootstrap. "initdb -d" has been broken since commit `f95d73ed4`, because I changed aclitemin to work in bootstrap mode but failed to consider aclitemout. That routine isn't reached by default, but it is if the elog message level is high enough, so it needs to work without catalog access too. This patch just makes it use its existing code paths to print role OIDs numerically. We could alternatively invent an inverse of boot_get_role_oid() and print them symbolically, but that would take more code and it's not apparent that it'd be any better for debugging purposes. Reported-by: Greg Burd <greg@burd.me> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/4416.1773328045@sss.pgh.pa.us	2026-03-14 13:46:54 -04:00
Tomas Vondra	02eecead86	Tighten asserts on ParallelWorkerNumber The comment about ParallelWorkerNumber in parallel.c says: In parallel workers, it will be set to a value >= 0 and < the number of workers before any user code is invoked; each parallel worker will get a different parallel worker number. However, asserts in various places collecting instrumentation allowed (ParallelWorkerNumber == num_workers). That would be a bug, as the value is used as index into an array with num_workers entries. Fixed by adjusting the asserts accordingly. Backpatch to all supported versions. Discussion: https://postgr.es/m/5db067a1-2cdf-4afb-a577-a04f30b69167@vondra.me Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Backpatch-through: 14	2026-03-14 15:26:39 +01:00
David Rowley	4deecb52af	Allow sibling call optimization in slot_getsomeattrs_int() This changes the TupleTableSlotOps contract to make it so the getsomeattrs() function is in charge of calling slot_getmissingattrs(). Since this removes all code from slot_getsomeattrs_int() aside from the getsomeattrs() call itself, we may as well adjust slot_getsomeattrs() so that it calls getsomeattrs() directly. We leave slot_getsomeattrs_int() intact as this is still called from the JIT code. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com	2026-03-14 13:52:09 +13:00
Peter Geoghegan	8a879119a1	Use fake LSNs to improve nbtree dropPin behavior. Use fake LSNs in all nbtree critical sections that write a WAL record. That way we can safely apply the _bt_killitems LSN trick with logged and unlogged indexes alike. This brings the same benefits to plain scans of unlogged relations that commit `2ed5b87f` brought to plain scans of logged relations: scans will drop their leaf page pin eagerly (by applying the "dropPin" optimization), which avoids blocking progress by VACUUM. This is particularly helpful with applications that allow a scrollable cursor to remain idle for long periods. Preparation for an upcoming commit that will add the amgetbatch interface, and switch nbtree over to it (from amgettuple) to enable I/O prefetching. The index prefetching read stream's effective prefetch distance is adversely affected by any buffer pins held by the index AM. At the same time, it can be useful for prefetching to read dozens of leaf pages ahead of the scan to maintain an adequate prefetch distance. The index prefetching patch avoids this tension by always eagerly dropping index page pins of the kind traditionally held as an interlock against unsafe concurrent TID recycling by VACUUM (essentially the same way that amgetbitmap routines have always avoided holding onto pins). The work from this commit makes that possible during scans of nbtree unlogged indexes -- without our having to give up on setting LP_DEAD bits on index tuples altogether. Follow-up to commit `d774072f`, which moved the fake LSN infrastructure out of GiST so that it could be used by other index AMs. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com	2026-03-13 20:37:39 -04:00
Peter Geoghegan	d774072f00	Move fake LSN infrastructure out of GiST. Move utility functions used by GiST to generate fake LSNs into xlog.c and xloginsert.c, so that other index AMs can also generate fake LSNs. Preparation for an upcoming commit that will add support for fake LSNs to nbtree, allowing its dropPin optimization to be used during scans of unlogged relations. That commit is itself preparation for another upcoming commit that will add a new amgetbatch/btgetbatch interface to enable I/O prefetching. Bump XLOG_PAGE_MAGIC due to XLOG_GIST_ASSIGN_LSN becoming XLOG_ASSIGN_LSN. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com	2026-03-13 19:38:17 -04:00
Jeff Davis	9b860373da	Add error code to user-visible message. Reported-by: Alexander Lakhin <exclusion@gmail.com>	2026-03-13 16:07:54 -07:00
Tomas Vondra	b1f14c9672	Use GetXLogInsertEndRecPtr in gistGetFakeLSN The function used GetXLogInsertRecPtr() to generate the fake LSN. Most of the time this is the same as what XLogInsert() would return, and so it works fine with the XLogFlush() call. But if the last record ends at a page boundary, GetXLogInsertRecPtr() returns LSN pointing after the page header. In such case XLogFlush() fails with errors like this: ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000 Such failures are very hard to trigger, particularly outside aggressive test scenarios. Fixed by introducing GetXLogInsertEndRecPtr(), returning the correct LSN without skipping the header. This is the same as GetXLogInsertRecPtr(), except that it calls XLogBytePosToEndRecPtr(). Initial investigation by me, root cause identified by Andres Freund. This is a long-standing bug in gistGetFakeLSN(), probably introduced by `c6b92041d3` in PG13. Backpatch to all supported versions. Reported-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/vf4hbwrotvhbgcnknrqmfbqlu75oyjkmausvy66ic7x7vuhafx@e4rvwavtjswo Backpatch-through: 14	2026-03-13 23:25:24 +01:00
Heikki Linnakangas	311a851436	Free memory allocated for unrecognized_protocol_options Since `4966bd3ed9` Valgrind started to warn about a small amount of memory being leaked in ProcessStartupPacket(). This is not critical but the warnings may distract from real issues. Fix it by freeing the list after use. Author: Aleksander Alekseev <aleksander@tigerdata.com> Discussion: https://www.postgresql.org/message-id/CAJ7c6TN3Hbb5p=UHx0SPVN+h_JwPAV6rxoqOm7gHBMFKfnGK-Q@mail.gmail.com	2026-03-13 23:37:19 +02:00
Andres Freund	ce5d489166	Fix bug due to confusion about what IsMVCCSnapshot means In `0b96e734c5` I (Andres) relied on page_collect_tuples() being called only with an MVCC snapshot, and added assertions to that end, but did not realize that IsMVCCSnapshot() allows both proper MVCC snapshots and historical snapshots, which behave quite similarly to MVCC snapshots. Unfortunately that can lead to incorrect visibility results during logical decoding, as a historical snapshot is interpreted as a plain MVCC snapshot. The only reason this wasn't noticed earlier is that it's hard to reach as most of the time there are no sequential scans during logical decoding. To fix the bug and avoid issues like this in the future, split IsMVCCSnapshot() into IsMVCCSnapshot() and IsMVCCLikeSnapshot(), where now only the latter includes historic snapshots. One effect of this is that during logical decoding no page-at-a-time snapshots are used, as otherwise runtime branches to handle historic snapshots would be needed in some performance critical paths. Given how uncommon sequential scans are during logical decoding, that seems acceptable. Author: Antonin Houska <ah@cybertec.at> Reported-by: Antonin Houska <ah@cybertec.at> Discussion: https://postgr.es/m/61812.1770637345@localhost	2026-03-13 13:53:19 -04:00
Nathan Bossart	e0a3a3fd53	Optimize COPY FROM (FORMAT {text,csv}) using SIMD. Presently, such commands scan the input buffer one byte at a time looking for special characters. This commit adds a new path that uses SIMD instructions to skip over chunks of data without any special characters. This can be much faster. To avoid regressions, SIMD processing is disabled for the remainder of the COPY FROM command as soon as we encounter a short line or a special character (except for end-of-line characters, else we'd always disable it after the first line). This is perhaps too conservative, but it could probably be made more lenient in the future via fine-tuned heuristics. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Ayoub Kazar <ma_kazar@esi.dz> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Neil Conway <neil.conway@gmail.com> Reviewed-by: Greg Burd <greg@burd.me> Tested-by: Manni Wood <manni.wood@enterprisedb.com> Tested-by: Mark Wong <markwkm@gmail.com> Discussion: https://postgr.es/m/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig%40mail.gmail.com	2026-03-13 11:07:32 -05:00
Peter Eisentraut	8c2b30487c	Factor out constructSetOpTargetlist() from transformSetOperationTree() This would be used separately by a future patch. It also makes transformSetOperationTree() a little smaller. Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org	2026-03-13 16:16:40 +01:00
Heikki Linnakangas	f9de9bf302	Add callback for I/O error messages in SLRUs Historically, all SLRUs were addressed by transaction IDs, but that hasn't been true for a long time. However, the error message on I/O error still always talked about accessing a transaction ID. This commit adds a callback that allows subsystems to construct their own error messages, which can then correctly refer to a transaction ID, multixid or whatever else is used to address the particular SLRU. Author: Maxim Orlov <orlovmg@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://www.postgresql.org/message-id/CACG=ezZZfurhYV+66ceubxQAyWqv9vaUi0yoO4-t48OE5xc0DQ@mail.gmail.com	2026-03-13 16:21:06 +02:00
Fujii Masao	723619eaa3	Add stats_reset column to pg_stat_database_conflicts. This commit adds a stats_reset column to pg_stat_database_conflicts, allowing users to see when the statistics in this view were last reset. This makes the view consistent with pg_stat_database and other statistics views. Catalog version bumped. Author: Shihao Zhong <zhong950419@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAGRkXqS98OebEWjax99_LVAECsxCB8i=BfsdAL34i-5QHfwyOQ@mail.gmail.com	2026-03-13 22:17:14 +09:00
Heikki Linnakangas	2e1dcf8c54	Check for interrupts during non-fast-update GIN insertion ginExtractEntries() can produce a lot of entries for a single item. During index build, we check for interrupts between entries, and the fast-update codepath does it as part of vacuum_delay_point(), but the non-fast update insertion codepath was uninterruptible. Add CHECK_FOR_INTERRUPTS() between entries in the non-fast update codepath too. Author: Vinod Sridharan <vsridh90@gmail.com> Discussion: https://www.postgresql.org/message-id/CAFMdLD6mQvAuStiOGvBJxAEfo6wdjZhj3+JveTLxOX8MVn4zmA@mail.gmail.com	2026-03-13 15:12:32 +02:00
Alexander Korotkov	fa6f2f624c	Rework ginScanToDelete() to pass Buffers instead of BlockNumbers. Previously, ginScanToDelete() and ginDeletePage() passed BlockNumbers and re-read pages that were already pinned and locked during the tree walk. The caller (ginVacuumPostingTree()) held a cleanup-locked root buffer, yet ginScanToDelete() re-read it by block number with special-case code to skip re-locking. First, this commit gives both functions more appropriate names, ginScanPostingTreeToDelete() and ginDeletePostingPage(), indicating they deal with posting trees/pages. This is more descriptive and similar to the way we name other GIN functions, for instance, ginVacuumPostingTree() and ginVacuumPostingTreeLeaves(). Then rework both functions to pass Buffers directly. DataPageDeleteStack now carries buffer, myoff (downlink offset in parent), and isRoot per level, so ginScanPostingTreeToDelete() takes only GinVacuumState and DataPageDeleteStack pointers. Also, ginDeletePostingPage() receives the three Buffers directly, and no longer reads or releases them itself. The caller reads and locks child pages before recursing, and manages buffer lifecycle afterward. This eliminates the confusing isRoot special cases in buffer management, including the apparent (but unreachable) double release of the root buffer identified by Andres Freund. Add comments explaining the locking protocol and the DataPageDeleteStack structure. Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/utrlxij43fbguzw4kldte2spc4btoldizutcqyrfakqnbrp3ir@ph3sphpj4asz Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Jinbinge <jinbinge@126.com>	2026-03-13 13:50:13 +02:00
Heikki Linnakangas	f30cebb954	Fix pointer type of ShmemAllocatorData->index This went unnoticed in commit `e2362eb2bd` because the pointer is cast to/from a void pointer.	2026-03-13 11:00:15 +02:00
Andrew Dunstan	a0b6ef29a5	Enable fast default for domains with non-volatile constraints Previously, ALTER TABLE ADD COLUMN always forced a table rewrite when the column type was a domain with constraints (CHECK or NOT NULL), even if the default value satisfied those constraints. This was because contain_volatile_functions() considers CoerceToDomain immutable, so the code conservatively assumed any constrained domain might fail. Improve this by using soft error handling (ErrorSaveContext) to evaluate the CoerceToDomain expression at ALTER TABLE time. If the default value passes the domain's constraints, the value is stored as a "missing" attribute default and no table rewrite is needed. If the constraint check fails, we fall back to a table rewrite, preserving the historical behavior that constraint violations are only raised when the table actually contains rows. Domains with volatile constraint expressions always require a table rewrite since the constraint result could differ per evaluation and cannot be cached. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io> Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com	2026-03-12 18:05:01 -04:00
Andrew Dunstan	487cf2cbd2	Extend DomainHasConstraints() to optionally check constraint volatility Add an optional bool *has_volatile output parameter to DomainHasConstraints(). When non-NULL, the function checks whether any CHECK constraint contains a volatile expression. Callers that don't need this information pass NULL and get the same behavior as before. This is needed by a subsequent commit that enables the fast default optimization for domains with non-volatile constraints: we can safely evaluate such constraints once at ALTER TABLE time, but volatile constraints require a full table rewrite. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io> Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com	2026-03-12 18:04:16 -04:00
Peter Geoghegan	a367c433ad	Use simplehash for backend-private buffer pin refcounts. Replace dynahash with simplehash for the per-backend PrivateRefCountHash overflow table. Simplehash generates inlined, open-addressed lookup code, avoiding the per-call overhead of dynahash that becomes noticeable when many buffers are pinned with a CPU-bound workload. Motivated by testing of the index prefetching patch, which pins many more buffers concurrently than typical index scans. Author: Peter Geoghegan <pg@bowt.ie> Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com	2026-03-12 13:26:16 -04:00
Peter Geoghegan	d071e1cfec	nbtree: Avoid allocating _bt_search stack. Avoid allocating memory for an nbtree descent stack during index scans. We only require a descent stack during inserts, when it is used to determine where to insert a new pivot tuple/downlink into the target leaf page's parent page in the event of a page split. (Page deletion's first phase also performs a _bt_search that requires a descent stack.) This optimization improves performance by minimizing palloc churn. It speeds up index scans that call _bt_search frequently/descend the index many times, especially when the cost of scanning the index dominates (e.g., with index-only skip scans). Testing has shown that the underlying issue causes performance problems for an upcoming patch that will replace btgettuple with a new btgetbatch interface to enable I/O prefetching. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com	2026-03-12 13:22:36 -04:00
Michael Paquier	6c228755ad	Use streaming read for VACUUM cleanup of GIN This commit replaces the synchronous ReadBufferExtended() loop done in ginvacuumcleanup() with the streaming read equivalent, to improve I/O efficiency during GIN index vacuum cleanup operations. With dm_delay to emulate some latency and debug_io_direct=data to force synchronous writes and force the read path to be exercised, the author has noticed a 5x improvement in runtime, with a substantial reduction in IO stats numbers. I have reproduced similar numbers while running similar tests, with improvements becoming better with more tuples and more pages manipulated. Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com	2026-03-12 11:48:31 +09:00
Richard Guo	383eb21ebf	Convert NOT IN sublinks to anti-joins when safe The planner has historically been unable to convert "x NOT IN (SELECT y ...)" sublinks into anti-joins. This is because standard SQL semantics for NOT IN require that if the comparison "x = y" returns NULL, the "NOT IN" expression evaluates to NULL (effectively false), causing the row to be discarded. In contrast, an anti-join preserves the row if no match is found. Due to this semantic mismatch regarding NULL handling, the conversion was previously considered unsafe. However, if we can prove that neither side of the comparison can yield NULL values, and further that the operator itself cannot return NULL for non-null inputs, the behavior of NOT IN and anti-join becomes identical. Enabling this conversion allows the planner to treat the sublink as a first-class relation rather than an opaque SubPlan filter. This unlocks global join ordering optimization and permits the selection of the most efficient join algorithm based on cost, often yielding significant performance improvements for large datasets. This patch verifies that neither side of the comparison can be NULL and that the operator is safe regarding NULL results before performing the conversion. To verify operator safety, we require that the operator be a member of a B-tree or Hash operator family. This serves as a proxy for standard boolean behavior, ensuring the operator does not return NULL on valid non-null inputs, as doing so would break index integrity. For operand non-nullability, this patch makes use of several existing mechanisms. It leverages the outer-join-aware-Var infrastructure to verify that a Var does not come from the nullable side of an outer join, and consults the NOT-NULL-attnums hash table to efficiently verify schema-level NOT NULL constraints. 
Additionally, it employs find_nonnullable_vars to identify Vars forced non-nullable by qual clauses, and expr_is_nonnullable to deduce non-nullability for other expression types. The logic for verifying the non-nullability of the subquery outputs was adapted from prior work by David Rowley and Tom Lane. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Discussion: https://postgr.es/m/CAMbWs495eF=-fSa5CwJS6B-BaEi3ARp0UNb4Lt3EkgUGZJwkAQ@mail.gmail.com	2026-03-12 09:45:18 +09:00
Andres Freund	6322a028fa	bufmgr: Fix use of wrong variable in GetPrivateRefCountEntrySlow() Unfortunately, in `30df61990c`, I made GetPrivateRefCountEntrySlow() set a wrong cache hint when moving entries from the hash table to the faster array. There are no correctness concerns due to this, just an unnecessary loss of performance. Noticed while testing the index prefetching patch. Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com	2026-03-11 17:52:21 -04:00
Jeff Davis	547c15f9f8	Fix use of volatile. Commit `8185bb5347` misused volatile. Fix it. See also `6307b096e2`. Reported-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/1bb21c7d-885f-4f07-a3ed-21b60d7c92c6@eisentraut.org	2026-03-11 14:27:58 -07:00
Andrew Dunstan	342051d73b	Add support for altering CHECK constraint enforceability This commit adds support for ALTER TABLE ALTER CONSTRAINT ... [NOT] ENFORCED for CHECK constraints. Previously, only foreign key constraints could have their enforceability altered. When changing from NOT ENFORCED to ENFORCED, the operation not only updates catalog information but also performs a full table scan in Phase 3 to validate that existing data satisfies the constraint. For partitioned tables and inheritance hierarchies, the operation recurses to all child tables. When changing to NOT ENFORCED, we must recurse even if the parent is already NOT ENFORCED, since child constraints may still be ENFORCED. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@cybertec.at> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Discussion: https://postgr.es/m/CACJufxHCh_FU-FsEwsCvg9mN6-5tzR6H9ntn+0KUgTCaerDOmg@mail.gmail.com	2026-03-11 16:15:35 -04:00
Andrew Dunstan	a9747153e1	rename alter constraint enforceability related functions The functions AlterConstrEnforceabilityRecurse and ATExecAlterConstrEnforceability are being renamed to AlterFKConstrEnforceabilityRecurse and ATExecAlterFKConstrEnforceability, respectively. The current alter constraint functions only handle Foreign Key constraints. Renaming them to be more explicit about the constraint type is necessary; otherwise, it will cause confusion when we later introduce the ability to alter the enforceability of other constraints. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Discussion: https://postgr.es/m/CACJufxHCh_FU-FsEwsCvg9mN6-5tzR6H9ntn+0KUgTCaerDOmg@mail.gmail.com	2026-03-11 16:14:58 -04:00
Andres Freund	a766125efd	bufmgr: Switch to standard order in MarkBufferDirtyHint() When we were updating hint bits with just a share lock MarkBufferDirtyHint() had to use a non-standard order of operations, i.e. WAL log the buffer before marking the buffer dirty. This was required because the lock level used to set hints did not conflict with the lock level that was used to flush pages, which would have allowed flushing the page out before the WAL record. The non-standard order in turn required preventing the checkpoint from starting between writing the WAL record and flushing out the page. Now that setting hints and writing out buffers use share-exclusive, we can revert back to the normal order of operations. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d	2026-03-11 14:58:29 -04:00

28174 commits