postgresql

mirror of https://github.com/postgres/postgres.git synced 2026-04-15 22:10:45 -04:00

Author	SHA1	Message	Date
Peter Eisentraut	ca87c415e2	Add support for NOT ENFORCED in CHECK constraints This adds support for the NOT ENFORCED/ENFORCED flag for constraints, with support for check constraints. The plan is to eventually support this for foreign key constraints, where it is typically more useful. Note that CHECK constraints do not currently support ALTER operations, so changing the enforceability of an existing constraint isn't possible without dropping and recreating it. This could be added later. Author: Amul Sul <amul.sul@enterprisedb.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: jian he <jian.universality@gmail.com> Tested-by: Triveni N <triveni.n@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com	2025-01-11 10:52:30 +01:00
Jeff Davis	e0ece2a981	TupleHashTable: store additional data along with tuple. Previously, the caller needed to allocate the memory and the TupleHashTable would store a pointer to it. That wastes space for the palloc overhead as well as the size of the pointer itself. Now, the TupleHashTable relies on the caller to correctly specify the additionalsize, and allocates that amount of space. The caller can then request a pointer into that space. Discussion: https://postgr.es/m/b9cbf0219a9859dc8d240311643ff4362fd9602c.camel@j-davis.com Reviewed-by: Heikki Linnakangas	2025-01-10 17:14:37 -08:00
David Rowley	34c6e65242	Make verify_compact_attribute available in non-assert builds `6f3820f37` adjusted the assert-enabled validation of the CompactAttribute to call a new external function to perform the validation. That commit made it so the function was only available when building with USE_ASSERT_CHECKING, and because TupleDescCompactAttr() is a static inline function, the call to verify_compact_attribute() was compiled into any extension which uses TupleDescCompactAttr(). This caused issues for such extensions when loading the assert-enabled extension into PostgreSQL versions without asserts enabled due to that function being unavailable in core. To fix this, make verify_compact_attribute() available unconditionally, but make it do nothing unless building with USE_ASSERT_CHECKING. Author: Andrew Kane <andrew@ankane.org> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAOdR5yHfMEMW00XGo=v1zCVUS6Huq2UehXdvKnwtXPTcZwXhmg@mail.gmail.com	2025-01-11 13:45:54 +13:00
Tatsuo Ishii	a9dcbb4d5c	Add new StringInfo APIs to allow callers to specify the buffer size. Previously StringInfo APIs allocated buffers with fixed initial allocation size of 1024 bytes. This may be too large and inappropriate for some callers that can do with smaller memory buffers. To fix this, introduce new APIs that allow callers to specify initial buffer size. extern StringInfo makeStringInfoExt(int initsize); extern void initStringInfoExt(StringInfo str, int initsize); Existing APIs (makeStringInfo() and initStringInfo()) are changed to call makeStringInfoExt and initStringInfoExt respectively (via inline helper functions makeStringInfoInternal and initStringInfoInternal), with the default buffer size of 1024. Reviewed-by: Nathan Bossart, David Rowley, Michael Paquier, Gurjeet Singh Discussion: https://postgr.es/m/20241225.123704.1194662271286702010.ishii%40postgresql.org	2025-01-11 08:23:46 +09:00
Nathan Bossart	3d0b4b1068	Use a non-locking initial test in TAS_SPIN on AArch64. Our testing showed that this is helpful at sufficiently high contention levels and doesn't hurt performance on smaller machines. The new TAS_SPIN macro for AArch64 is identical to the ones added for PPC and x86_64 (see commits `bc2a050d40` and `b03d196be0`). Reported-by: Salvatore Dipietro Reviewed-by: Jingtang Zhang, Andres Freund Tested-by: Tom Lane Discussion: https://postgr.es/m/ZxgDEb_VpWyNZKB_%40nathan	2025-01-10 13:18:04 -06:00
Álvaro Herrera	cc811f92ba	Adjust signature of cluster_rel() and its subroutines cluster_rel() receives the OID of the relation to process, which it opens and locks; but then its subroutine copy_table_data() also receives the relation OID and opens it by itself. This is a bit wasteful. It's better to have cluster_rel() receive the relation already open, and pass it down to its subroutines as necessary; then cluster_rel closes the rel before returning. This simplifies things. But a better motivation to make this change is that a future command to do logical-decoding-based "concurrent VACUUM FULL" will need to release all locks on the relation (and possibly on the clustering index) at some point. Since it makes little sense to keep the relation reference without the lock, the cluster_rel() function will also close it (and the index). With this arrangement, neither the function nor its subroutines need open extra references, which, again, makes things simpler. Author: Antonin Houska <ah@cybertec.at> Discussion: https://postgr.es/m/82651.1720540558@antos	2025-01-10 13:09:38 +01:00
Michael Paquier	f0bf7857be	Merge pgstat_count_io_op_n() and pgstat_count_io_op() The pgstat_count_io_op() function, which counts a single I/O operation, wraps pgstat_count_io_op_n() with a counter value of 1. The latter is declared in pgstat.h and used nowhere in the code, so let's remove it in favor of the former. This change makes also the code more symmetric with pgstat_count_io_op_time(), that already uses a similar set of arguments, except that it counts also the I/O time. This will ease a bit the integration of a follow-up patch that adds byte-level tracking in pg_stat_io for some of its attributes, lifting the current restriction based on BLCKSZ as all I/O operations are assumed to be block-based. Author: Nazir Bilal Yavuz Reviewed-by: Bertrand Drouvot Discussion: https://postgr.es/m/CAN55FZ32ze812=yjyZg1QeXhKvACUM_Nu0_gyPQcUKKuVHL5xA@mail.gmail.com	2025-01-10 09:57:27 +09:00
Michael Paquier	2c14037bb5	Refactor some code related to backend statistics This commit changes the way pending backend statistics are tracked by moving them into a new structure called PgStat_BackendPending, removing PgStat_BackendPendingIO. PgStat_BackendPending currently only includes PgStat_PendingIO for the pending I/O stats. pgstat_flush_backend() is extended with a "flags" argument to control which parts of the stats of a backend should be flushed. With this refactoring, it becomes easier to plug into backend statistics more data. A patch to add information related to WAL in this stats kind is under discussion. Author: Bertrand Drouvot Discussion: https://postgr.es/m/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal	2025-01-10 09:00:48 +09:00
Álvaro Herrera	69ab446514	Fix SLRU bank selection code The originally submitted code (using bit masking) was correct when the number of slots was restricted to be a power of two -- but that limitation was removed during development that led to commit `53c2a97a92`, which made the bank selection code incorrect. This led to always using a smaller number of banks than available. Change said code to use integer modulo instead, which works correctly with an arbitrary number of banks. It's likely that we could improve on this to avoid runtime use of integer division. But with this change we're, at least, not wasting memory on unused banks, and more banks mean less contention, which is likely to have a much higher performance impact than a single instruction's latency. Author: Yura Sokolov <y.sokolov@postgrespro.ru> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/9444dc46-ca47-43ed-9058-89c456316306@postgrespro.ru	2025-01-09 07:39:05 +01:00
Thomas Munro	026762dae3	Provide 64-bit ftruncate() and lseek() on Windows. Change our ftruncate() macro to use the 64-bit variant of chsize(), and add a new macro to redirect lseek() to _lseeki64(). Back-patch to all supported releases, in preparation for a bug fix. Tested-by: Davinder Singh <davinder.singh@enterprisedb.com> Discussion: https://postgr.es/m/CAKZiRmyM4YnokK6Oenw5JKwAQ3rhP0YTz2T-tiw5dAQjGRXE3Q%40mail.gmail.com	2025-01-09 15:00:58 +13:00
Jeff Davis	229e7793d9	Fix duplicate typedef from commit `a2f17f004d`. Reported-by: Thomas Munro	2025-01-08 15:25:05 -08:00
Jeff Davis	a2f17f004d	Control collation behavior with a method table. Previously, behavior branched based on the provider. A method table is less error-prone and more flexible. The ctype behavior will be addressed in an upcoming commit. Reviewed-by: Andreas Karlsson Discussion: https://postgr.es/m/2830211e1b6e6a2e26d845780b03e125281ea17b.camel%40j-davis.com	2025-01-08 14:26:46 -08:00
Jeff Davis	8a96faedc4	Remove unused TupleHashTableData->entrysize. Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com	2025-01-07 14:49:18 -08:00
Jeff Davis	834c9e807c	Add missing typedefs.list entry for AggStatePerGroupData. Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com	2025-01-07 14:33:21 -08:00
Nathan Bossart	c758119e5b	Allow changing autovacuum_max_workers without restarting. This commit introduces a new parameter named autovacuum_worker_slots that controls how many autovacuum worker slots to reserve during server startup. Modifying this new parameter's value does require a server restart, but it should typically be set to the upper bound of what you might realistically need to set autovacuum_max_workers. With that new parameter in place, autovacuum_max_workers can now be changed with a SIGHUP (e.g., pg_ctl reload). If autovacuum_max_workers is set higher than autovacuum_worker_slots, a WARNING is emitted, and the server will only start up to autovacuum_worker_slots workers at a given time. If autovacuum_max_workers is set to a value less than the number of currently-running autovacuum workers, the existing workers will continue running, but no new workers will be started until the number of running autovacuum workers drops below autovacuum_max_workers. Reviewed-by: Sami Imseih, Justin Pryzby, Robert Haas, Andres Freund, Yogesh Sharma Discussion: https://postgr.es/m/20240410212344.GA1824549%40nathanxps13	2025-01-06 15:01:22 -06:00
Heikki Linnakangas	5e68f61192	Remove duplicate definitions in proc.h These are also present in procnumber.h Reported-by: Peter Eisentraut Discussion: https://www.postgresql.org/message-id/bd04d675-4672-4f87-800a-eb5d470c15fc@eisentraut.org	2025-01-06 11:56:03 +02:00
John Naylor	3e70da2781	Always use the caller-provided context for radix tree leaves Previously, it would not have worked for a caller to pass a slab context, since it would have been used for other things which likely had incompatible size. In an attempt to be helpful and avoid possible space wastage due to aset's power-of-two rounding, RT_CREATE would create an additional slab context if the value type was fixed-length and larger than pointer size. The problem was, we have since added the bump context type, and the generation context was a possibility as well, so silently overriding the caller's choice may actually be worse. Commit `e8a6f1f908` arranged so that the caller-provided context is used only for leaves, so it's safe for the caller to use slab here if they wish. As demonstration, use slab in one of the radix tree regression tests. Reviewed by Masahiko Sawada Discussion: https://postgr.es/m/CANWCAZZDCo4k5oURg_pPxM6+WZ1oiG=sqgjmQiELuyP0Vtrwig@mail.gmail.com	2025-01-06 13:26:02 +07:00
John Naylor	e8a6f1f908	Get rid of radix tree's general purpose memory context Previously, this was notionally used only for the entry point of the tree and as a convenient parent for other contexts. For shared memory, the creator previously allocated the entry point in this context, but attaching backends didn't have access to that, so they just used the caller's context. For the sake of consistency, allocate every instance of an entry point in the caller's context. For local memory, allocate the control object in the caller's context as well. This commit also makes the "leaf context" the notional parent of the child contexts used for nodes, so it's a bit of a misnomer, but a future commit will make the node contexts independent rather than children, so leave it this way for now to avoid code churn. The memory context parameter for RT_CREATE is now unused in the case of shared memory, so remove it and adjust callers to match. In passing, remove unused "context" member from struct TidStore, which seems to have been an oversight. Reviewed by Masahiko Sawada Discussion: https://postgr.es/m/CANWCAZZDCo4k5oURg_pPxM6+WZ1oiG=sqgjmQiELuyP0Vtrwig@mail.gmail.com	2025-01-06 11:21:21 +07:00
John Naylor	960013f2a1	Use caller's memory context for radix tree iteration state Typically only one iterator is present at any time, so it's overkill to devote an entire context for this. Get rid of it and use the caller's context. This is tidy-up work, so no backpatch in this form. However, a hypothetical extension to v17 that tried to start iteration from an attaching backend would result in a crash, so that'll be fixed separately in a way that doesn't change behavior in core. Patch by me, reported and reviewed by Masahiko Sawada Discussion: https://postgr.es/m/CAD21AoBB2U47V=F+wQRB1bERov_of5=BOZGaybjaV8FLQyqG3Q@mail.gmail.com	2025-01-06 09:01:58 +07:00
David Rowley	11012c5037	Fix an assortment of spelling mistakes and typos Author: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/5812a0b9-b0cf-4151-9a14-d9f00e4f2858@gmail.com	2025-01-02 12:42:01 +13:00
Bruce Momjian	50e6eb731d	Update copyright for 2025 Backpatch-through: 13	2025-01-01 11:21:55 -05:00
Michael Paquier	ebf2ab40e5	Remove redundant wording in pg_statistic.h Author: Junwang Zhao Discussion: https://postgr.es/m/CAEG8a3JbMCHna=N5ZSx6huLnTDfW34kw7Pf2n8+3M-9UrrwesA@mail.gmail.com	2024-12-30 12:18:45 +09:00
Tom Lane	508a97ee49	Replace PGPROC.isBackgroundWorker with isRegularBackend. Commit `34486b609` effectively redefined isBackgroundWorker as meaning "not a regular backend", whereas before it had the narrower meaning of AmBackgroundWorkerProcess(). For clarity, rename the field to isRegularBackend and invert its sense. Discussion: https://postgr.es/m/1808397.1735156190@sss.pgh.pa.us	2024-12-28 16:21:54 -05:00
Tom Lane	34486b6092	Exclude parallel workers from connection privilege/limit checks. Cause parallel workers to not check datallowconn, rolcanlogin, and ACL_CONNECT privileges. The leader already checked these things (except for rolcanlogin which might have been checked for a different role). Re-checking can accomplish little except to induce unexpected failures in applications that might not even be aware that their query has been parallelized. We already had the principle that parallel workers rely on their leader to pass a valid set of authorization information, so this change just extends that a bit further. Also, modify the ReservedConnections, datconnlimit and rolconnlimit logic so that these limits are only enforced against regular backends, and only regular backends are counted while checking if the limits were already reached. Previously, background processes that had an assigned database or role were subject to these limits (with rather random exclusions for autovac workers and walsenders), and the set of existing processes that counted against each limit was quite haphazard as well. The point of these limits, AFAICS, is to ensure the availability of PGPROC slots for regular backends. Since all other types of processes have their own separate pools of PGPROC slots, it makes no sense either to enforce these limits against them or to count them while enforcing the limit. While edge-case failures of these sorts have been possible for a long time, the problem got a good deal worse with commit `5a2fed911` (CVE-2024-10978), which caused parallel workers to make some of these checks using the leader's current role where before we had used its AuthenticatedUserId, thus allowing parallel queries to fail after SET ROLE. The previous behavior was fairly accidental and I have no desire to return to it. This patch includes reverting `73c9f91a1`, which was an emergency hack to suppress these same checks in some cases. It wasn't complete, as shown by a recent bug report from Laurenz Albe. We can also revert `fd4d93d26` and `492217301`, which hacked around the same problems in one regression test. In passing, remove the special case for autovac workers in CheckMyDatabase; it seems cleaner to have AutoVacWorkerMain pass the INIT_PG_OVERRIDE_ALLOW_CONNS flag, now that that does what's needed. Like `5a2fed911`, back-patch to supported branches (which sadly no longer includes v12). Discussion: https://postgr.es/m/1808397.1735156190@sss.pgh.pa.us	2024-12-28 16:08:50 -05:00
Tom Lane	2bdf1b2a2e	Reserve a PGPROC slot and semaphore for the slotsync worker process. The need for this was missed in commit `93db6cbda`, with the result being that if we launch a slotsync worker it would consume one of the PGPROCs in the max_connections pool. That could lead to inability to launch the worker, or to subsequent failures of connection requests that should have succeeded according to the configured settings. Rather than create some one-off infrastructure to support this, let's group the slotsync worker with the existing autovac launcher in a new category of "special worker" processes. These are kind of like auxiliary processes, but they cannot use that infrastructure because they need to be able to run transactions. For the moment, make these processes share the PGPROC freelist used for autovac workers (which previously supplied the autovac launcher too). This is partly to avoid an ABI change in v17, and partly because it seems silly to have a freelist with at most two members. This might be worth revisiting if we grow enough workers in this category. Tom Lane and Hou Zhijie. Back-patch to v17. Discussion: https://postgr.es/m/1808397.1735156190@sss.pgh.pa.us	2024-12-28 12:30:42 -05:00
Noah Misch	ff90ee6145	In REASSIGN OWNED of a database, lock the tuple as mandated. Commit `aac2c9b4fd` mandated such locking and attempted to fulfill that mandate, but it missed REASSIGN OWNED. Hence, it remained possible to lose VACUUM's inplace update of datfrozenxid if a REASSIGN OWNED processed that database at the same time. This didn't affect the other inplace-updated catalog, pg_class. For pg_class, REASSIGN OWNED calls ATExecChangeOwner() instead of the generic AlterObjectOwner_internal(), and ATExecChangeOwner() fulfills the locking mandate. Like in GRANT, implement this by following the locking protocol for any catalog subject to the generic AlterObjectOwner_internal(). It would suffice to do this for IsInplaceUpdateOid() catalogs only. Back-patch to v13 (all supported versions). Kirill Reshke. Reported by Alexander Kukushkin. Discussion: https://postgr.es/m/CAFh8B=mpKjAy4Cuun-HP-f_vRzh2HSvYFG3rhVfYbfEBUhBAGg@mail.gmail.com	2024-12-28 07:16:22 -08:00
Peter Eisentraut	db6856c991	syncrep parser: pure parser and reentrant scanner Use the flex %option reentrant and the bison option %pure-parser to make the generated scanner and parser pure, reentrant, and thread-safe. Make the generated scanner use palloc() etc. instead of malloc() etc. Previously, we only used palloc() for the buffer, but flex would still use malloc() for its internal structures. Now, all the memory is under palloc() control. Simplify flex scan buffer management: Instead of constructing the buffer from pieces and then using yy_scan_buffer(), we can just use yy_scan_string(), which does the same thing internally. The previous code was necessary because we allocated the buffer with palloc() and the rest of the state was handled by malloc(). But this is no longer the case; everything is under palloc() now. Use flex yyextra to handle context information, instead of global variables. This complements the other changes to make the scanner reentrant. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Discussion: https://www.postgresql.org/message-id/flat/eb6faeac-2a8a-4b69-9189-c33c520e5b7b@eisentraut.org	2024-12-24 18:05:06 +01:00
Peter Eisentraut	e4a8fb8fef	replication parser: pure parser and reentrant scanner Use the flex %option reentrant and the bison option %pure-parser to make the generated scanner and parser pure, reentrant, and thread-safe. Make the generated scanner use palloc() etc. instead of malloc() etc. Previously, we only used palloc() for the buffer, but flex would still use malloc() for its internal structures. As a result, there could be some small memory leaks in case of uncaught errors. Now, all the memory is under palloc() control, so there are no more such issues. Simplify flex scan buffer management: Instead of constructing the buffer from pieces and then using yy_scan_buffer(), we can just use yy_scan_string(), which does the same thing internally. The previous code was necessary because we allocated the buffer with palloc() and the rest of the state was handled by malloc(). But this is no longer the case; everything is under palloc() now. Use flex yyextra to handle context information, instead of global variables. This complements the other changes to make the scanner reentrant. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Co-authored-by: Andreas Karlsson <andreas@proxel.se> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Discussion: https://www.postgresql.org/message-id/flat/eb6faeac-2a8a-4b69-9189-c33c520e5b7b@eisentraut.org	2024-12-24 16:40:09 +01:00
Peter Eisentraut	1eb7cb21c2	Remove pgrminclude annotations Per git log, the last time someone tried to do something with pgrminclude was around 2011. Many (not all) of the "pgrminclude ignore" annotations are of a newer date but seem to have just been copied around during refactorings and file moves and don't seem to reflect an actual need anymore. There have been some parallel experiments with include-what-you-use (IWYU) annotations, but these don't seem to correspond very strongly to pgrminclude annotations, so there is no value in keeping the existing ones even for that kind of thing. So, wipe them all away. We can always add new ones in the future based on actual needs. Discussion: https://www.postgresql.org/message-id/flat/2d4dc7b2-cb2e-49b1-b8ca-ba5f7024f05b%40eisentraut.org	2024-12-24 11:49:07 +01:00
David Rowley	6f3820f37a	Fix race condition in TupleDescCompactAttr assert code `5983a4cff` added CompactAttribute as an abbreviated alternative to FormData_pg_attribute to allow more cache-friendly processing in tasks related to TupleDescs. That commit contained some assert-only code to check that the CompactAttribute had been populated correctly, however, the method used to do that checking caused the TupleDesc's CompactAttribute to be zeroed before it was repopulated and compared to the snapshot taken before the memset call. This caused issues as the type cache caches TupleDescs in shared memory which can be used by multiple backend processes at the same time. There was a window of time between the zero and repopulation of the CompactAttribute where another process would mistakenly think that the CompactAttribute is invalid due to the memset. To fix this, instead of taking a snapshot of the CompactAttribute and calling populate_compact_attribute() and comparing the snapshot to the freshly populated TupleDesc's CompactAttribute, refactor things so we can just populate a temporary CompactAttribute on the stack. This way we don't touch the TupleDesc's memory. Reported-by: Alexander Lakhin, SQLsmith Discussion: https://postgr.es/m/ca3a256a-5d12-42db-aabe-a75a030d9fb9@gmail.com	2024-12-24 14:54:24 +13:00
David Rowley	7ec4b9ff80	Fix incorrect source filename references Jian He reported the src/include/utility/tcop.h one and the remainder were found by using a script to look for src/* and check that we have a filename or directory of that name. In passing, fix some out-date comments. Reported-by: Jian He <jian.universality@gmail.com> Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CACJufxGoE3H-7VgO02=PrR4SNuVWDVbfTyUnwO0HvS-Lxurnog@mail.gmail.com	2024-12-23 19:41:49 +13:00
David Rowley	db448ce5ad	Optimize alignment calculations in tuple form/deform Here we convert CompactAttribute.attalign from a char, which is directly derived from pg_attribute.attalign into a uint8, which stores the number of bytes to align the column's value by in the tuple. This allows tuple deformation and tuple size calculations to move away from using the inefficient att_align_nominal() macro, which manually checks each TYPALIGN_* char to translate that into the alignment bytes for the given type. Effectively, this commit changes those to TYPEALIGN calls, which are branchless and only perform some simple arithmetic with some bit-twiddling. The removed branches were often mispredicted by CPUs, especially so in real-world tables which often contain a mishmash of different types with different alignment requirements. Author: David Rowley Reviewed-by: Andres Freund, Victor Yegorov Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com	2024-12-21 09:43:26 +13:00
Heikki Linnakangas	1f81b48a9d	Mark CatalogSnapshotData static Like CurrentSnapshotData, it should not be accessed directly outside snapmgr.c.	2024-12-20 19:37:50 +02:00
Thomas Munro	38c579b089	Fix corruption when relation truncation fails. RelationTruncate() does three things, while holding an AccessExclusiveLock and preventing checkpoints: 1. Logs the truncation. 2. Drops buffers, even if they're dirty. 3. Truncates some number of files. Step 2 could previously be canceled if it had to wait for I/O, and step 3 could and still can fail in file APIs. All orderings of these operations have data corruption hazards if interrupted, so we can't give up until the whole operation is done. When dirty pages were discarded but the corresponding blocks were left on disk due to ERROR, old page versions could come back from disk, reviving deleted data (see pgsql-bugs #18146 and several like it). When primary and standby were allowed to disagree on relation size, standbys could panic (see pgsql-bugs #18426) or revive data unknown to visibility management on the primary (theorized). Changes: * WAL is now unconditionally flushed first * smgrtruncate() is now called in a critical section, preventing interrupts and causing PANIC on file API failure * smgrtruncate() has a new parameter for existing fork sizes, because it can't call smgrnblocks() itself inside a critical section The changes apply to RelationTruncate(), smgr_redo() and pg_truncate_visibility_map(). That last is also brought up to date with other evolutions of the truncation protocol. The VACUUM FileTruncate() failure mode had been discussed in older reports than the ones referenced below, with independent analysis from many people, but earlier theories on how to fix it were too complicated to back-patch. The more recently invented cancellation bug was diagnosed by Alexander Lakhin. Other corruption scenarios were spotted by me while iterating on this patch and earlier commit `75818b3a`. Back-patch to all supported releases. Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reported-by: rootcause000@gmail.com Reported-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/18146-04e908c662113ad5%40postgresql.org Discussion: https://postgr.es/m/18426-2d18da6586f152d6%40postgresql.org	2024-12-20 23:57:02 +13:00
David Rowley	02a8d0c452	Remove pg_attribute.attcacheoff column The column is no longer needed as the offset is now cached in the CompactAttribute struct per commit `5983a4cff`. Author: David Rowley Reviewed-by: Andres Freund, Victor Yegorov Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com	2024-12-20 23:22:37 +13:00
David Rowley	5983a4cffc	Introduce CompactAttribute array in TupleDesc, take 2 The new compact_attrs array stores a few select fields from FormData_pg_attribute in a more compact way, using only 16 bytes per column instead of the 104 bytes that FormData_pg_attribute uses. Using CompactAttribute allows performance-critical operations such as tuple deformation to be performed without looking at the FormData_pg_attribute element in TupleDesc which means fewer cacheline accesses. For some workloads, tuple deformation can be the most CPU intensive part of processing the query. Some testing with 16 columns on a table where the first column is variable length showed around a 10% increase in transactions per second for an OLAP type query performing aggregation on the 16th column. However, in certain cases, the increases were much higher, up to ~25% on one AMD Zen4 machine. This also makes pg_attribute.attcacheoff redundant. A follow-on commit will remove it, thus shrinking the FormData_pg_attribute struct by 4 bytes. Author: David Rowley Reviewed-by: Andres Freund, Victor Yegorov Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com	2024-12-20 22:31:26 +13:00
Tom Lane	e0a2721f7c	Get rid of old version of BuildTupleHashTable(). It was reasonable to preserve the old API of BuildTupleHashTable() in the back branches, but in HEAD we should actively discourage use of that version. There are no remaining callers in core, so just get rid of it. Then rename BuildTupleHashTableExt() back to BuildTupleHashTable(). While at it, fix up the miserably-poorly-maintained header comment for BuildTupleHashTable[Ext]. It looks like more than one patch in this area has had the opinion that updating comments is beneath them. Discussion: https://postgr.es/m/538343.1734646986@sss.pgh.pa.us	2024-12-19 18:07:00 -05:00
Tom Lane	8d96f57d5c	Improve planner's handling of SetOp plans. Remove the code for inserting flag columns in the inputs of a SetOp. That was the only reason why there would be resjunk columns in a set-operations plan tree, so we can get rid of some code that supported that, too. Get rid of choose_hashed_setop() in favor of building Paths for the hashed and sorted alternatives, and letting them fight it out within add_path(). Remove set_operation_ordered_results_useful(), which was giving wrong answers due to examining the wrong ancestor node: we need to examine the immediate SetOperationStmt parent not the topmost node. Instead make each caller of recurse_set_operations() pass down the relevant parent node. (This thinko seems to have led only to wasted planning cycles and possibly-inferior plans, not wrong query answers. Perhaps we should back-patch it, but I'm not doing so right now.) Teach generate_nonunion_paths() to consider pre-sorted inputs for sorted SetOps, rather than always generating a Sort node. Patch by me; thanks to Richard Guo and David Rowley for review. Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us	2024-12-19 17:02:25 -05:00
Tom Lane	2762792952	Convert SetOp to read its inputs as outerPlan and innerPlan. The original design for set operations involved appending the two input relations into one and adding a flag column that allows distinguishing which side each row came from. Then the SetOp node pries them apart again based on the flag. This is bizarre. The only apparent reason to do it is that when sorting, we'd only need one Sort node not two. But since sorting is at least O(N log N), sorting all the data is actually worse than sorting each side separately --- plus, we have no chance of taking advantage of presorted input. On top of that, adding the flag column frequently requires an additional projection step that adds cycles, and then the Append node isn't free either. Let's get rid of all of that and make the SetOp node have two separate children, using the existing outerPlan/innerPlan infrastructure. This initial patch re-implements nodeSetop.c and does a bare minimum of work on the planner side to generate correctly-shaped plans. In particular, I've tried not to change the cost estimates here, so that the visible changes in the regression test results will only involve removal of useless projection steps and not any changes in whether to use sorted vs hashed mode. For SORTED mode, we combine successive identical tuples from each input into groups, and then merge-join the groups. The tuple comparisons now use SortSupport instead of simple equality, but the group-formation part should involve roughly the same number of tuple comparisons as before. The cross-comparisons between left and right groups probably add to that, but I'm not sure to quantify how many more comparisons we might need. For HASHED mode, nodeSetop's logic is almost the same as before, just refactored into two separate loops instead of one loop that has an assumption that it will see all the left-hand inputs first. In both modes, I added early-exit logic to not bother reading the right-hand relation if the left-hand input is empty, since neither INTERSECT nor EXCEPT modes can produce any output if the left input is empty. This could have been done before in the hashed mode, but not in sorted mode. Sorted mode can also stop as soon as it exhausts the left input; any remaining right-hand tuples cannot have matches. Also, this patch adds some infrastructure for detecting whether child plan nodes all output the same type of tuple table slot. If they do, the hash table logic can use slightly more efficient code based on assuming that that's the input slot type it will see. We'll make use of that infrastructure in other plan node types later. Patch by me; thanks to Richard Guo and David Rowley for review. Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us	2024-12-19 16:23:45 -05:00
Peter Eisentraut	3e4bacb171	bootstrap: pure parser and reentrant scanner Use the flex %option reentrant and the bison option %pure-parser to make the generated scanner and parser pure, reentrant, and thread-safe. Make the generated scanner use palloc() etc. instead of malloc() etc. For the bootstrap scanner and parser, reentrancy and memory management aren't that important, but we make this change here anyway so that all the scanners and parsers in the backend use a similar set of options and APIs. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Discussion: https://www.postgresql.org/message-id/flat/eb6faeac-2a8a-4b69-9189-c33c520e5b7b@eisentraut.org	2024-12-19 15:37:44 +01:00
Michael Paquier	9aea73fc61	Add backend-level statistics to pgstats This adds a new variable-numbered statistics kind in pgstats, where the object ID key of the stats entries is based on the proc number of the backends. This acts as an upper-bound for the number of stats entries that can exist at once. The entries are created when a backend starts after authentication succeeds, and are removed when the backend exits, making the stats entry exist for as long as their backend is up and running. These are not written to the pgstats file at shutdown (note that write_to_file is disabled, as a safety measure). Currently, these stats include only information about the I/O generated by a backend, using the same layer as pg_stat_io, except that it is now possible to know how much activity is happening in each backend rather than an overall aggregate of all the activity. A function called pg_stat_get_backend_io() is added to access this data depending on the PID of a backend. The existing structure could be expanded in the future to add more information about other statistics related to backends, depending on requirements or ideas. Auxiliary processes are not included in this set of statistics. These are less interesting to have than normal backends as they have dedicated entries in pg_stat_io, and stats kinds of their own. This commit includes also pg_stat_reset_backend_stats(), function able to reset all the stats associated to a single backend. Bump catalog version and PGSTAT_FILE_FORMAT_ID. Author: Bertrand Drouvot Reviewed-by: Álvaro Herrera, Kyotaro Horiguchi, Michael Paquier, Nazir Bilal Yavuz Discussion: https://postgr.es/m/ZtXR+CtkEVVE/LHF@ip-10-97-1-34.eu-west-3.compute.internal	2024-12-19 13:19:22 +09:00
Melanie Plageman	1a0da347a7	Bitmap Table Scans use unified TBMIterator With the repurposing of TBMIterator as an interface for both parallel and serial iteration through TIDBitmaps in commit `7f9d4187e7`, bitmap table scans may now use it. Modify bitmap table scan code to use the TBMIterator. This requires moving around a bit of code, so a few variables are initialized elsewhere. Author: Melanie Plageman Reviewed-by: Tomas Vondra Discussion: https://postgr.es/m/c736f6aa-8b35-4e20-9621-62c7c82e2168%40vondra.me	2024-12-18 18:43:39 -05:00
Melanie Plageman	7f9d4187e7	Add common interface for TBMIterators Add and use TBMPrivateIterator, which replaces the current TBMIterator for serial use cases, and repurpose TBMIterator to be a unified interface for both the serial ("private") and parallel ("shared") TID Bitmap iterator interfaces. This encapsulation simplifies call sites for callers supporting both parallel and serial TID Bitmap access. TBMIterator is not yet used in this commit. Author: Melanie Plageman Reviewed-by: Tomas Vondra, Heikki Linnakangas Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi	2024-12-18 18:19:28 -05:00
Melanie Plageman	68d9662be1	Make rs_cindex and rs_ntuples unsigned HeapScanDescData.rs_cindex and rs_ntuples can't be less than 0. All scan types using the heap scan descriptor expect these values to be >= 0. Make that expectation clear by making rs_cindex and rs_ntuples unsigned. Also remove the test in heapam_scan_bitmap_next_tuple() that checks if rs_cindex < 0. This was never true, but now that rs_cindex is unsigned, it makes even less sense. While we are at it, initialize both rs_cindex and rs_ntuples to 0 in initscan(). Author: Melanie Plageman Reviewed-by: Dilip Kumar Discussion: https://postgr.es/m/CAAKRu_ZxF8cDCM_BFi_L-t%3DRjdCZYP1usd1Gd45mjHfZxm0nZw%40mail.gmail.com	2024-12-18 11:47:38 -05:00
David Rowley	d96d1d5152	Fix incorrect slot type in BuildTupleHashTableExt `0f5738202` adjusted the execGrouping.c code so it made use of ExprStates to generate hash values. That commit made a wrong assumption that the slot type to pass to ExecBuildHash32FromAttrs() is always &TTSOpsMinimalTuple. That's not the case as the slot type depends on the slot type passed to LookupTupleHashEntry(), which for nodeRecursiveunion.c, could be any of the current slot types. Here we fix this by adding a new parameter to BuildTupleHashTableExt() to allow the slot type to be passed in. In the case of nodeSubplan.c and nodeAgg.c the slot type is always &TTSOpsVirtual, so for both of those cases, it's beneficial to pass the known slot type as that allows ExecBuildHash32FromAttrs() to skip adding the tuple deform step to the resulting ExprState. Another possible fix would have been to have ExecBuildHash32FromAttrs() set "fetch.kind" to NULL so that ExecComputeSlotInfo() always determines the EEOP_INNER_FETCHSOME is required, however, that option isn't favorable as slows down aggregation and hashed subplan evaluation due to the extra (needless) deform step. Thanks to Nathan Bossart for bisecting to find the offending commit based on Paul's report. Reported-by: Paul Ramsey <pramsey@cleverelephant.ca> Discussion: https://postgr.es/m/99F064C1-B3EB-4BE7-97D2-D2A0AA487A71@cleverelephant.ca	2024-12-18 12:05:55 +13:00
Melanie Plageman	4b565a198b	Make visibilitymap_set() return previous state of vmbits It can be useful to know the state of a relation page's VM bits before visibilitymap_set(). visibilitymap_set() has the old value on hand, so returning it is simple. This commit does not use visibilitymap_set()'s new return value. Author: Melanie Plageman Reviewed-by: Masahiko Sawada, Andres Freund, Nitin Jadhav, Bilal Yavuz Discussion: https://postgr.es/m/flat/CAAKRu_ZQe26xdvAqo4weHLR%3DivQ8J4xrSfDDD8uXnh-O-6P6Lg%40mail.gmail.com#6d8d2b4219394f774889509bf3bdc13d, https://postgr.es/m/ctdjzroezaxmiyah3gwbwm67defsrwj2b5fpfs4ku6msfpxeia%40mwjyqlhwr2wu	2024-12-17 14:19:03 -05:00
Tom Lane	c91963da13	Set the stack_base_ptr in main(), not in random other places. Previously we did this in PostmasterMain() and InitPostmasterChild(), which meant that stack depth checking was disabled in non-postmaster server processes, for instance in single-user mode. That seems like a fairly bad idea, since there's no a-priori restriction on the complexity of queries we will run in single-user mode. Moreover, this led to not having quite the same stack depth limit in all processes, which likely has no real-world effect but it offends my inner neatnik. Setting the depth in main() guarantees that check_stack_depth() is armed and has a consistent interpretation of stack depth in all forms of server processes. While at it, move the code associated with checking the stack depth out of tcop/postgres.c (which was never a great home for it) into a new file src/backend/utils/misc/stack_depth.c. Discussion: https://postgr.es/m/2081982.1734393311@sss.pgh.pa.us	2024-12-17 12:08:42 -05:00
Peter Eisentraut	fb1a18810f	Remove ts_locale.c's lowerstr() lowerstr() and lowerstr_with_len() in ts_locale.c do the same thing as str_tolower() that the rest of the system uses, except that the former don't use the common locale provider framework but instead use the global libc locale settings. This patch replaces uses of lowerstr*() with str_tolower(..., DEFAULT_COLLATION_OID). For instances that use a libc locale globally, this will result in exactly the same behavior. For instances that use other locale providers, you now get consistent behavior and are no longer dependent on the libc locale settings (for this case; there are others). Most uses of these functions are for processing dictionary and configuration files. In those cases, using the default collation seems appropriate. At least we don't have a more specific collation available. But the code in contrib/pg_trgm should really depend on the collation of the columns being processed. This is not done here, this can be done in a separate patch. (You can probably construct some edge cases where this change would create some locale-related upgrade incompatibility, for example if before you used a combination of ICU and a differently-behaving libc locale. We can document this in the release notes, but I don't think there is anything more we can do about this.) Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://www.postgresql.org/message-id/flat/653f3b84-fc87-45a7-9a0c-bfb4fcab3e7d%40eisentraut.org	2024-12-17 14:04:55 +01:00
Peter Eisentraut	d3aad4ac57	Remove ts_locale.c's t_isdigit(), t_isspace(), t_isprint() These do the same thing as the standard isdigit(), isspace(), and isprint() but with multibyte and encoding support. But all the callers are only interested in analyzing single-byte ASCII characters. So this extra layer is overkill and we can replace the uses with the standard functions. All the t_is() functions in ts_locale.c are under scrutiny because they don't use the common locale provider framework but instead use the global libc locale settings. For the functions being touched by this patch, we don't need all that anyway, as mentioned above, so the simplest solution is to just remove them. The few remaining t_is() functions will need a different treatment in a separate patch. pg_trgm has some compile-time options with macros such as KEEPONLYALNUM. These are not documented, and the non-default variant is not supported by any test cases. As part of this undertaking, I'm removing the non-default variant, as it is in the way of cleanup. So in this case, the not-KEEPONLYALNUM code path is gone. Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://www.postgresql.org/message-id/flat/653f3b84-fc87-45a7-9a0c-bfb4fcab3e7d%40eisentraut.org	2024-12-17 12:52:29 +01:00
Jeff Davis	86a5d6006a	Refactor string case conversion into provider-specific files. Create API entry points pg_strlower(), etc., that work with any provider and give the caller control over the destination buffer. Then, move provider-specific logic into pg_locale_builtin.c, pg_locale_icu.c, and pg_locale_libc.c as appropriate. Discussion: https://postgr.es/m/7aa46d77b377428058403723440862d12a8a129a.camel@j-davis.com	2024-12-16 09:35:18 -08:00

1 2 3 4 5 ...

11896 commits