postgresql

mirror of https://github.com/postgres/postgres.git synced 2026-04-25 16:18:21 -04:00

Author	SHA1	Message	Date
Masahiko Sawada	8030b839d3	Remove an unstable wait from parallel autovacuum regression test. The test 001_parallel_autovacuum.pl verified that vacuum delay parameters are propagated to parallel vacuum workers by using injection points. It previously waited for autovacuum to complete on the test_autovac table. However, since injection points are cluster-wide, an autovacuum worker could be triggered on tables in other databases (e.g., template1) and get stuck at the same injection point. This could lead to a timeout when the test waits for the expected table's autovacuum to finish. This commit removes the wait for autovacuum completion from this specific test case. Since the primary goal is to verify the propagation of parameter updates, which is already confirmed via log messages, waiting for the entire vacuum process to finish is unnecessary and prone to instability in concurrent test environments. Author: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0s+kZZRMSF4HW7tZ9W2jS1o4B+Fg8dr5a-T6mANX+mdQA@mail.gmail.com	2026-04-09 09:13:32 -07:00
Andres Freund	7fc36c5db5	instrumentation: Avoid CPUID 0x15/0x16 for Hypervisor TSC frequency This restricts the retrieval of the TSC frequency whilst under a Hypervisor to either Hypervisor-specific CPUID registers (0x40000010), or TSC calibration. We previously allowed retrieving from the traditional CPUID registers for TSC frequency (0x15/0x16) like on bare metal, but it turns out that they are not trustworthy when virtualized and can report wildly incorrect frequencies, like 7 kHz when the actual calibrated frequency is 2.5 GHz. Per report from buildfarm member drongo. Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/jr4hk2sxhqcfpb67ftz5g4vw33nm67cgf7go3wwmqsafu5aclq%405m67ukuhyszz	2026-04-09 11:50:46 -04:00
Nathan Bossart	60165db6e1	Add LOG_NEVER error level code. This logging level means not to emit the log, which is useful for functions like relation_needs_vacanalyze(). This function accepts a log level argument but not all callers want it to emit logs. Suggested-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/3101163.1775676098%40sss.pgh.pa.us	2026-04-09 10:18:15 -05:00
Richard Guo	8b6c89e377	Fix integer overflow in nodeWindowAgg.c In nodeWindowAgg.c, the calculations for frame start and end positions in ROWS and GROUPS modes were performed using simple integer addition. If a user-supplied offset was sufficiently large (close to INT64_MAX), adding it to the current row or group index could cause a signed integer overflow, wrapping the result to a negative number. This led to incorrect behavior where frame boundaries that should have extended indefinitely (or beyond the partition end) were treated as falling at the first row, or where valid rows were incorrectly marked as out-of-frame. Depending on the specific query and data, these overflows can result in incorrect query results, execution errors, or assertion failures. To fix, use overflow-aware integer addition (ie, pg_add_s64_overflow) to check for overflows during these additions. If an overflow is detected, the boundary is now clamped to INT64_MAX. This ensures the logic correctly treats the boundary as extending to the end of the partition. Bug: #19405 Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/19405-1ecf025dda171555@postgresql.org Backpatch-through: 14	2026-04-09 19:28:33 +09:00
Richard Guo	c1408956e3	Strip PlaceHolderVars from partition pruning operands When pulling up a subquery, its targetlist items may be wrapped in PlaceHolderVars to enforce separate identity or as a result of outer joins. This causes any upper-level WHERE clauses referencing these outputs to contain PlaceHolderVars, which prevents partprune.c from recognizing that they match partition key columns, defeating partition pruning. To fix, strip PlaceHolderVars from operands before comparing them to partition keys. A PlaceHolderVar with empty phnullingrels appearing in a relation-scan-level expression is effectively a no-op, so stripping it is safe. This parallels the existing treatment in indxpath.c for index matching. In passing, rename strip_phvs_in_index_operand() to strip_noop_phvs() and move it from indxpath.c to placeholder.c, since it is now a general-purpose utility used by both index matching and partition pruning code. Back-patch to v18. Although this issue exists before that, changes in that version made it common enough to notice. Given the lack of field reports for older versions, I am not back-patching further. In the v18 back-patch, strip_phvs_in_index_operand() is retained as a thin wrapper around the new strip_noop_phvs() to avoid breaking third-party extensions that may reference it. Reported-by: Cándido Antonio Martínez Descalzo <candido@ninehq.com> Diagnosed-by: David Rowley <dgrowleyml@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAH5YaUwVUWETTyVECTnhs7C=CVwi+uMSQH=cOkwAUqMdvXdwWA@mail.gmail.com Backpatch-through: 18	2026-04-09 16:41:31 +09:00
Amit Langote	e1cc57fabd	Add nkeys parameter to recheck_matched_pk_tuple() The function looped over ii_NumIndexKeyAttrs elements of the skeys array, but one caller (ri_FastPathFlushArray) passes a one-element array since it only handles single-column FKs. The function signature did not communicate this constraint, which static analysis flags as a potential out-of-bounds read. Add an nkeys parameter and assert that it matches ii_NumIndexKeyAttrs, then use it in the loop. The call sites already know the key count. Reported-by: Evan Montgomery-Recht <montge@mianetworks.net> Discussion: https://postgr.es/m/CAEg7pwcKf01FmDqFAf-Hzu_pYnMYScY_Otid-pe9uw3BJ6gq9g@mail.gmail.com	2026-04-09 14:45:31 +09:00
Michael Paquier	e0fa5bd146	Reduce presence of syscache.h in src/include/ `ee642cccc4` has added syscache.h in inval.h and objectaddress.h, enlarging by a lot the footprint of this header, particularly via objectaddress.h. A change in syscache.h would cause a lot more files to be recompiled. This commit reduces the presence of syscache.h by switching to a direct use of syscache_ids.h in inval.h and objectaddress.h, where the enum SysCacheIdentifier is defined. genbki.pl gains an #ifndef block for this header, so as its inclusion is more controlled. Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/vlcexdcimsmvu3aplt2yxpfndkgtuvjsrms2fdl46rbw3k2kug@drspkoxlaije	2026-04-09 08:49:36 +09:00
Álvaro Herrera	2cff363715	Simplify declaration of memcpy target The existing one is understandably failing on (some?) 32-bit platforms. Reported-by: Tomas Vondra <tomas@vondra.me> Suggested-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1c197f2d-49a2-4830-8dde-55867218b62d@vondra.me	2026-04-08 22:58:56 +02:00
Peter Eisentraut	f8eec1ced6	Add missing PGDLLIMPORT markings	2026-04-08 15:49:33 +02:00
Thomas Munro	a1643d40b3	Remove RADIUS support. Our RADIUS implementation supported only the deprecated RADIUS/UDP variant, without the recommended Message-Authenticator attribute to mitigate against the Blast-RADIUS vulnerability. By now, popular RADIUS servers are expected to generate loud warnings or reject our authentication attempts outright. Since there have been no user reports about this, it seems unlikely that there are users. Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Michael Banck <mbanck@gmx.net> Discussion: https://postgr.es/m/CA%2BhUKG%2BSH309V8KECU5%3DxuLP9Dks0v9f9UVS2W74fPAE5O21dg%40mail.gmail.com	2026-04-08 22:38:43 +12:00
Etsuro Fujita	28972b6fc3	Add support for importing statistics from remote servers. Add a new FDW callback routine that allows importing remote statistics for a foreign table directly to the local server, instead of collecting statistics locally. The new callback routine is called at the beginning of the ANALYZE operation on the table, and if the FDW failed to import the statistics, the existing callback routine is called on the table to collect statistics locally. Also implement this for postgres_fdw. It is enabled by "restore_stats" option both at the server and table level. Currently, it is the user's responsibility to ensure remote statistics to import are up-to-date, so the default is false. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Etsuro Fujita <etsuro.fujita@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Etsuro Fujita <etsuro.fujita@gmail.com> Discussion: https://postgr.es/m/CADkLM%3DchrYAx%3DX2KUcDRST4RLaRLivYDohZrkW4LLBa0iBhb5w%40mail.gmail.com	2026-04-08 19:15:00 +09:00
Thomas Munro	d1c01b79d4	aio: Adjust I/O worker pool automatically. The size of the I/O worker pool used to implement io_method=worker was previously controlled by the io_workers setting, defaulting to 3. It was hard to know how to tune it effectively. That is replaced with: io_min_workers=2 io_max_workers=8 (up to 32) io_worker_idle_timeout=60s io_worker_launch_interval=100ms The pool is automatically sized within the configured range according to recent variation in demand. It grows when existing workers detect that latency might be introduced by queuing, and shrinks when the highest-numbered worker is idle for too long. Work was already concentrated into low-numbered workers in anticipation of this logic. The logic for waking extra workers now also tries to measure and reduce the number of spurious wakeups, though they are not entirely eliminated. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com	2026-04-08 19:08:32 +12:00
John Naylor	948ef7cdc4	Exit early from pg_comp_crc32c_pmull for small inputs The vectorized path in commit `fbc57f2bc` had a side effect of putting more branches in the path taken for small inputs. To reduce risk of regressions, only proceed with the vectorized path if we can guarantee that the remaining input after the alignment preamble is greater than 64 bytes. That also allows removing length checks in the alignment preamble. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/CANWCAZZ48GuLYhJCcTy8TXysjrMVJL6n1n7NP94=iG+t80YKPw@mail.gmail.com	2026-04-08 13:52:14 +07:00
Thomas Munro	ce11e63f81	pg_upgrade: Check for unsupported encodings. Since we have dropped MULE_INTERNAL, add a check that all encodings used in the source cluster are still supported according to PG_ENCODING_BE_VALID(). This is done generically, in case we decide to drop another encoding some day. Suggested-by: Jeff Davis <pgsql@j-davis.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CA%2BhUKGKXDXh-FdU0orjfv%2BF08f%3DD91BhV3Ra-4zL-q%2BJmGYqTA%40mail.gmail.com	2026-04-08 17:45:09 +12:00
Thomas Munro	77645d44e3	Remove MULE_INTERNAL encoding. This was useful before widespread Unicode adoption, and was based on the internal encoding Emacs used to mix multiple sub-encodings. Emacs itself has stopped using it, and our implementation hadn't been updated with modern underlying standards. It is thought to be very unlikely that anyone is still using it in the field. Since such a complex encoding comes with costs and risks, we agreed to drop support. Any existing database using this encoding would need to be dumped and restored with a new encoding to upgrade to PostgreSQL 19, most likely UTF8, since pg_upgrade would fail. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Tatsuo Ishii <ishii@postgresql.org> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/CA%2BhUKGKXDXh-FdU0orjfv%2BF08f%3DD91BhV3Ra-4zL-q%2BJmGYqTA%40mail.gmail.com	2026-04-08 17:40:06 +12:00
Andres Freund	2c16deee2f	instrumentation: Allocate query level instrumentation in ExecutorStart Until now extensions that wanted to measure overall query execution could create QueryDesc->totaltime, which the core executor would then start and stop. That's a bit odd and composes badly, e.g. extensions always had to use INSTRUMENT_ALL, because otherwise another extension might not get what they need. Instead this introduces a new field, QueryDesc->query_instr_options, that extensions can use to indicate whether they need query level instrumentation populated, and with which instrumentation options. Extensions should take care to only add options they need, instead of replacing the options of others. The prior name of the field, totaltime, sounded like it would only measure time, but these days the instrumentation infrastructure can track more resources. The secondary benefit is that this will make it obvious to extensions that they may not create the Instrumentation struct themselves anymore (often extensions build only against a postgres build without assertions). Adjust pg_stat_statements and auto_explain to match, and lower the requested instrumentation level for auto_explain to INSTRUMENT_TIMER, since the summary instrumentation it needs is only runtime. The reason to push this now, rather than in the PG 20 cycle, is that `5a79e78501` already required extensions using query level instrumentations to adjust their code, and it seemed undesirable to require them to do so again for 20. Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAP53Pkyqsht+exJQYRsjhSWYKu+vFGHhPub7m6PmFD6Or0=p1g@mail.gmail.com	2026-04-08 00:06:45 -04:00
Fujii Masao	db93032a7c	Fix slotsync worker blocking promotion when stuck in wait Previously, on standby promotion, the startup process sent SIGUSR1 to the slotsync worker (or a backend performing slot synchronization) and waited for it to exit. This worked in most cases, but if the process was blocked waiting for a response from the primary (e.g., due to a network failure), SIGUSR1 would not interrupt the wait. As a result, the process could remain stuck, causing the startup process to wait for a long time and delaying promotion. This commit fixes the issue by introducing a new procsignal reason, PROCSIG_SLOTSYNC_MESSAGE. On promotion, the startup process sends this signal, and the handler sets interrupt flags so the process exits (or errors out) promptly at CHECK_FOR_INTERRUPTS(), allowing promotion to complete without delay. Backpatch to v17, where slotsync was introduced. Author: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com Backpatch-through: 17	2026-04-08 11:22:21 +09:00
Andres Freund	544000288e	instrumentation: Move ExecProcNodeInstr to allow inlining This moves the implementation of ExecProcNodeInstr, the ExecProcNode variant that gets used when instrumentation is on, to be defined in instrument.c instead of execProcNode.c, and marks functions it uses as inline. This allows compilers to generate an optimized implementation, and shows a 4 to 12% reduction in instrumentation overhead for queries that move lots of rows. Author: Lukas Fittl <lukas@fittl.com> Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com	2026-04-07 21:36:49 -04:00
Tomas Vondra	e157fe6f76	Add EXPLAIN (IO) instrumentation for TidRangeScan Adds support for EXPLAIN (IO) instrumentation for TidRange scans. This requires adding shared instrumentation for parallel scans, using the separate DSM approach introduced by `dd78e69cfc`. Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me	2026-04-07 23:25:05 +02:00
Andres Freund	16fca48254	pg_test_timing: Also test RDTSC[P] timing, report time source, TSC frequency This adds support to pg_test_timing for the different timing sources added by `294520c444`. Author: Lukas Fittl <lukas@fittl.com> Author: David Geier <geidav.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> (in an earlier version) Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de	2026-04-07 17:12:08 -04:00
Tomas Vondra	3b1117d6e2	Add EXPLAIN (IO) instrumentation for SeqScan Adds support for EXPLAIN (IO) instrumentation for sequential scans. This requires adding shared instrumentation, using the separate DSM approach introduced by `dd78e69cfc`. Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me	2026-04-07 23:07:03 +02:00
Tom Lane	b268928f93	Suppress unused-variable warning. x86 machines lacking HAVE__CPUIDEX saw a complaint about "unused variable 'reg'", per buildfarm as well as local experience. Oversight in `bcb2cf41f`.	2026-04-07 17:03:20 -04:00
Tomas Vondra	681daed931	Add EXPLAIN (IO) infrastructure with BitmapHeapScan support Allows collecting details about AIO / prefetch for scan nodes backed by a ReadStream. This may be enabled by a new "IO" option in EXPLAIN, and it shows information about the prefetch distance and I/O requests. As of this commit this applies only to BitmapHeapScan, because that's the only scan node using a ReadStream and collecting instrumentation from workers in a parallel query. Support for SeqScan and TidRangeScan, the other scan nodes using ReadStream, will be added in subsequent commits. The stats are collected only when required by EXPLAIN ANALYZE, with the IO option (disabled by default). The amount of collected statistics is very limited, but we don't want to clutter EXPLAIN with too much data. The IOStats struct is stored in the scan descriptor as a field, next to other fields used by table AMs. A pointer to the field is passed to the ReadStream, and updated directly. It's the responsibility of the table AM to allocate the struct (e.g. in ambeginscan) whenever the SO_SCAN_INSTRUMENT flag is passed to the scan, so that the executor and ReadStream have access to it. The collected stats are designed for ReadStream, but are meant to be reasonably generic in case a TAM manages I/Os in different ways. Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me	2026-04-07 22:33:34 +02:00
Tomas Vondra	10d5a12a93	Switch EXPLAIN to unaligned output for json/xml/yaml Use unaligned output for multiple EXPLAIN queries using non-text format in regression tests. With aligned output adding/removing explain fields can be very disruptive, as it often modifies the whole block because of padding. Unaligned output does not have this issue. Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me	2026-04-07 22:12:27 +02:00
Tom Lane	4edd6036d6	Fix WITHOUT OVERLAPS' interaction with domains. UNIQUE/PRIMARY KEY ... WITHOUT OVERLAPS requires the no-overlap column to be a range or multirange, but it should allow a domain over such a type too. This requires minor adjustments in both the parser and executor. In passing, fix a nearby break-instead-of-continue thinko in transformIndexConstraint. This had the effect of disabling parse-time validation of the no-overlap column's type in the context of ALTER TABLE ADD CONSTRAINT, if it follows a dropped column. We'd still complain appropriately at runtime though. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxGoAmN_0iJ=hjTG0vGpOSOyy-vYyfE+-q0AWxrq2_p5XQ@mail.gmail.com Backpatch-through: 18	2026-04-07 14:45:37 -04:00
Andres Freund	294520c444	instrumentation: Use Time-Stamp Counter on x86-64 to lower overhead This allows the direct use of the Time-Stamp Counter (TSC) value retrieved from the CPU using RDTSC/RDTSCP instructions, instead of APIs like clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with ANALYZE and TIMING ON. Tests showed that the overhead on top of actual runtime when instrumenting queries moving lots of rows through the plan can be reduced from 2x as slow to 1.2x as slow compared to the actual runtime. More complex workloads such as TPCH queries have also shown ~20% gains when instrumented compared to before. To control use of the TSC, the new "timing_clock_source" GUC is introduced, whose default ("auto") automatically uses the TSC when reliable, for example when running on modern Intel CPUs, or when running on Linux and the system clocksource is reported as "tsc". The use of the operating system clock source can be enforced by setting "system", or on x86-64 architectures the use of TSC can be enforced by explicitly setting "tsc". In order to use the TSC the frequency is first determined by use of CPUID, and if not available, by running a short calibration loop at program start, falling back to the system clock source if TSC values are not stable. Note, that we split TSC usage into the RDTSC CPU instruction which does not wait for out-of-order execution (faster, less precise) and the RDTSCP instruction, which waits for outstanding instructions to retire. RDTSCP is deemed to have little benefit in the typical InstrStartNode() / InstrStopNode() use case of EXPLAIN, and can be up to twice as slow. To separate these use cases, the new macro INSTR_TIME_SET_CURRENT_FAST() is introduced, which uses RDTSC. The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed to be used when precision is more important than performance. When the system timing clock source is used both of these macros instead utilize the system APIs (clock_gettime / QueryPerformanceCounter) like before. Additional users of interval timing, such as track_io_timing and track_wal_io_timing could also benefit from being converted to use INSTR_TIME_SET_CURRENT_FAST() but are left for future changes. Author: Lukas Fittl <lukas@fittl.com> Author: Andres Freund <andres@anarazel.de> Author: David Geier <geidav.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an earlier version) Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com> (in an earlier version) Reviewed-by: Robert Haas <robertmhaas@gmail.com> (in an earlier version) Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> (in an earlier version) Discussion: https://postgr.es/m/20200612232810.f46nbqkdhbutzqdg@alap3.anarazel.de	2026-04-07 13:00:24 -04:00
Andres Freund	bcb2cf41f9	Allow retrieving x86 TSC frequency/flags from CPUID This adds additional x86 specific CPUID checks for flags needed for determining whether the Time-Stamp Counter (TSC) is usable on a given system, as well as a helper function to retrieve the TSC frequency from CPUID. This is intended for a future patch that will utilize the TSC to lower the overhead of timing instrumentation. In passing, always make pg_cpuid_subleaf reset the variables used for its result, to avoid accidentally using stale results if __get_cpuid_count errors out and the caller doesn't check for it. Author: Lukas Fittl <lukas@fittl.com> Author: David Geier <geidav.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: John Naylor <john.naylor@postgresql.org> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> (in an earlier version) Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de	2026-04-07 13:00:24 -04:00
Andres Freund	0022622c93	instrumentation: Standardize ticks to nanosecond conversion method The timing infrastructure (INSTR_* macros) measures time elapsed using clock_gettime() on POSIX systems, which returns the time as nanoseconds, and QueryPerformanceCounter() on Windows, which is a specialized timing clock source that returns a tick counter that needs to be converted to nanoseconds using the result of QueryPerformanceFrequency(). This conversion currently happens ad-hoc on Windows, e.g. when calling INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every invocation, despite the frequency being stable after program start, incurring unnecessary overhead. It also causes a fractured implementation where macros are defined differently between platforms. To ease code readability, and prepare for a future change that intends to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce new pg_ticks_to_ns() / pg_ns_to_ticks() functions that get called from INSTR_* macros on all platforms. These functions rely on a separately initialized ticks_per_ns_scaled value, that represents the conversion ratio. This value is initialized from QueryPerformanceFrequency() on Windows, and set to zero on x86-64 POSIX systems, which results in the ticks being treated as nanoseconds. Other architectures always directly return the original ticks. To support this, pg_initialize_timing() is introduced, and is now mandatory for both the backend and any frontend programs to call before utilizing INSTR_* macros. In passing, fix variable names in comment documenting INSTR_TIME_ADD_NANOSEC(). Author: Lukas Fittl <lukas@fittl.com> Author: David Geier <geidav.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de	2026-04-07 13:00:24 -04:00
Jacob Champion	b977bd308a	oauth: Allow validators to register custom HBA options OAuth validators can already use custom GUCs to configure behavior globally. But we currently provide no ability to adjust settings for individual HBA entries, because the original design focused on a world where a provider covered a "single audience" of users for one database cluster. This assumption does not apply to multitenant use cases, where a single validator may be controlling access for wildly different user groups. To improve this use case, add two new API calls for use by validator callbacks: RegisterOAuthHBAOptions() and GetOAuthHBAOption(). Registering options "foo" and "bar" allows a user to set "validator.foo" and "validator.bar" in an oauth HBA entry. These options are stringly typed (syntax validation is solely the responsibility of the defining module), and names are restricted to a subset of ASCII to avoid tying our hands with future HBA syntax improvements. Unfortunately, we can't check the custom option names during a reload of the configuration, like we do with standard HBA options, without requiring all validators to be loaded via shared_preload_libraries. (I consider this to be a nonstarter: most validators should probably use session_preload_libraries at most, since requiring a full restart just to update authentication behavior will be unacceptable to many users.) Instead, the new validator.* options are checked against the registered list at connection time. Multiple alternatives were proposed and/or prototyped, including extending the GUC system to allow per-HBA overrides, joining forces with recent refactoring work on the reloptions subsystem, and giving the ability to customize HBA options to all PostgreSQL extensions. I personally believe per-HBA GUC overrides are the best option, because several existing GUCs like authentication_timeout and pre_auth_delay would fit there usefully. But the recent addition of SNI per-host settings in `4f433025f` indicates that a more general solution is needed, and I expect that to take multiple releases' worth of discussion. This compromise patch, then, is intentionally designed to be an architectural dead end: simple to describe, cheap to maintain, and providing just enough functionality to let validators move forward for PG19. The hope is that it will be replaced in the future by a solution that can handle per-host, per-HBA, and other per-context configuration with the same functionality that GUCs provide today. In the meantime, the bulk of the code in this patch consists of strict guardrails on the simple API, to try to ensure that we don't have any reason to regret its existence during its unknown lifespan. I owe particular thanks here to Zsolt Parragi, who prototyped several approaches that guided the final design. Suggested-by: Zsolt Parragi <zsolt.parragi@percona.com> Suggested-by: VASUKI M <vasukianand0119@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/CAN4CZFM3b8u5uNNNsY6XCya257u%2BDofms3su9f11iMCxvCacag%40mail.gmail.com	2026-04-07 08:15:19 -07:00
Jacob Champion	6d00fb9048	libpq: Split PGOAUTHDEBUG=UNSAFE into multiple options PGOAUTHDEBUG is a blunt instrument: you get all the debugging features, or none of them. The most annoying consequence during manual use is the Curl debug trace, which tends to obscure the device flow prompt entirely. The promotion of PGOAUTHCAFILE into its own feature in `993368113` improved the situation somewhat, but there's still the discomfort of knowing you have to opt into many dangerous behaviors just to get the single debug feature you wanted. Explode the PGOAUTHDEBUG syntax into a comma-separated list. The old "UNSAFE" value enables everything, like before. Any individual unsafe features still require the envvar to begin with an "UNSAFE:" prefix, to try to interrupt the flow of someone who is about to do something they should not. So now, rather than PGOAUTHDEBUG=UNSAFE # enable all the unsafe things a developer can say PGOAUTHDEBUG=call-count # only show me the call count. safe! PGOAUTHDEBUG=UNSAFE:trace # print secrets, but don't allow HTTP To avoid adding more build system scaffolding to libpq-oauth, implement this entirely in a small private header. This unfortunately can't be standalone, so it needs a headerscheck exception. Author: Zsolt Parragi <zsolt.parragi@percona.com> Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/CAOYmi%2B%3DfbZNJSkHVci%3DGpR8XPYObK%3DH%2B2ERRha0LDTS%2BifsWnw%40mail.gmail.com Discussion: https://postgr.es/m/CAN4CZFMmDZMH56O9vb_g7vHqAk8ryWFxBMV19C39PFghENg8kA%40mail.gmail.com	2026-04-07 08:15:14 -07:00
Álvaro Herrera	e76d8c749c	Reserve replication slots specifically for REPACK Add a new GUC max_repack_replication_slots, which lets the user reserve some additional replication slots for concurrent repack (and only concurrent repack). With this, the user doesn't have to worry about changing the max_replication_slots in order to cater for use of concurrent repack. (We still use the same pool of bgworkers though, but that's less commonly a problem than slots.) Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com> Discussion: https://postgr.es/m/202604012148.nnnmyxxrr6nh@alvherre.pgsql	2026-04-07 16:55:29 +02:00
Heikki Linnakangas	979387f188	Fix harmless leftover in _hash_kill_items() Checking for 'havePin' is sufficient here. An earlier version of the patch didn't have the 'havePin' variable and used 'so->hashso_bucket_buf == so->currPos.buf' as the condition when both locking and unlocking the page. The havePin variable was added later during development, but the unlocking condition wasn't fully updated. Tidy it up. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/b9de8d05-3b02-4a27-9b0b-03972fa4bfd3@iki.fi	2026-04-07 17:38:11 +03:00
Andrew Dunstan	55890a9194	Add errdetail() with PID and UID about source of termination signal. When a backend is terminated via pg_terminate_backend() or an external SIGTERM, the error message now includes the sender's PID and UID as errdetail, making it easier to identify the source of unexpected terminations in multi-user environments. On platforms that support SA_SIGINFO (Linux, FreeBSD, and most modern Unix systems), the signal handler captures si_pid and si_uid from the siginfo_t structure. On platforms without SA_SIGINFO, the detail is simply omitted. Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Chao Li <1356863904@qq.com> Discussion: https://postgr.es/m/CAKZiRmyrOWovZSdixpLd3PGMQXuQL_zw2Ght5XhHCkQ1uDsxjw@mail.gmail.com	2026-04-07 10:22:33 -04:00
Robert Haas	c10edb102a	pg_stash_advice: Allow stashed advice to be persisted to disk. If pg_stash_advice.persist = true, stashed advice will be written to pg_stash_advice.tsv in the data directory, periodically and at shutdown. On restart, stash modifications are locked out until this file has been reloaded, but queries will not be, so there may be a short window after startup during which previously-stashed advice is not automatically applied. Author: Robert Haas <rhaas@postgresql.org> Co-authored-by: Lukas Fittl <lukas@fittl.com> Discussion: https://postgr.es/m/CA+Tgmob87qsWa-VugofU6epuV0H5XjWZGMbQas4Q-ADKmvSyBg@mail.gmail.com	2026-04-07 10:11:36 -04:00
Andres Freund	29e7dbf5e4	Minimal fix for WAIT FOR ... MODE 'standby_flush' The investigation into the negative test performance impact of `7e8aeb9e48` led to discovering that there are a few issues with WAIT FOR. This commit is just a minimal fix to prevent hangs in standby_flush mode, due to WAIT FOR ... 'standby_flush' seeing a 0 LSN if a newly started walreceiver does not receive any writes, because the standby is already caught up. There are several other issues and this isn't necessarily the best fix. But this way we get the hangs out of the way. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/zqbppucpmkeqecfy4s5kscnru4tbk6khp3ozqz6ad2zijz354k@w4bdf4z3wqoz	2026-04-07 09:48:09 -04:00
Heikki Linnakangas	9480c585df	Tidy up #ifdef USE_INJECTION_POINTS guards Remove unnecessary #ifdef guard around the function prototypes; they are already inside a larger #ifdef block. Move #include "subsystems.h" inside the USE_INJECTION_POINTS guard; it's needed for InjectionPointShmemCallbacks, which is also inside the guard. Reported-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/87y0iz2c1v.fsf@wibble.ilmari.org	2026-04-07 16:18:31 +03:00
Álvaro Herrera	be142fa008	Fix tests under wal_level=minimal Buildfarm members that are specifically configured to use wal_level=minimal fail the repack regression tests, which require wal_level=replica. Add a temp config file to fix that.	2026-04-07 15:14:32 +02:00
Tomas Vondra	884f9b3c76	Use add_size/mul_size for index instrumentation size calculations Use overflow-safe size arithmetic in the Index[Only]Scan and parallel instrumentation functions, consistent with other executor nodes (Hash, Sort, Agg, Memoize). This was an oversight in `dd78e69cfc`. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Lukas Fittl <lukas@fittl.com> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me	2026-04-07 12:47:28 +02:00
Tomas Vondra	9c18b47e61	Fix BitmapHeapScan non-parallel-aware EXPLAIN ANALYZE Allocates shared bitmap table scan instrumentation for all parallel scans. Previously, the instrumentation was only allocated for parallel-aware scans, so other bitmap heap scans in the parallel query had no shared instrumentation and EXPLAIN didn't report exact/lossy pages. This affected cases like scans on the outside of a parallel join or queries run with debug_parallel_query=regress. Fixed by allocating a separate DSM chunk for shared instrumentation and doing so regardless of parallel-awareness. The instrumentation is allocated in its own DSM chunk, separate from ParallelBitmapHeapState. Report and initial patch by me. The approach with a separate DSM was proposed and implemented by Melanie. Not backpatched. The issue affects Postgres 18 (since `5a1e6df3b8`), but having multiple DSM chunks is possible only since `dd78e69cfc`. If we decide to fix this in backbranches too, it will need to be done in a less invasive way. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Lukas Fittl <lukas@fittl.com> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me	2026-04-07 12:47:13 +02:00
Álvaro Herrera	0d3dba38c7	Allow logical replication snapshots to be database-specific By default, the logical decoding assumes access to shared catalogs, so the snapshot builder needs to consider cluster-wide XIDs during startup. That in turn means that, if any transaction is already running (and has XID assigned), the snapshot builder needs to wait for its completion, as it does not know if that transaction performed catalog changes earlier. A possible problem with this concept is that if REPACK (CONCURRENTLY) is running in some database, backends running the same command in other databases get stuck until the first one has committed. Thus only a single backend in the cluster can run REPACK (CONCURRENTLY) at any time. Likewise, REPACK (CONCURRENTLY) can block walsenders starting on behalf of subscriptions throughout the cluster. This patch adds a new option to logical replication output plugin, to declare that it does not use shared catalogs (i.e. catalogs that can be changed by transactions running in other databases in the cluster). In that case, no snapshot the backend will use during the decoding needs to contain information about transactions running in other databases. Thus the snapshot builder only needs to wait for completion of transactions in the current database. Currently we only use this option in the REPACK background worker. It could possibly be used in the plugin for logical replication too, however that would need thorough analysis of that plugin. Bump WAL version number, due to a new field in xl_running_xacts. Author: Antonin Houska <ah@cybertec.at> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/90475.1775218118@localhost	2026-04-07 12:31:18 +02:00
Álvaro Herrera	a3b069ef90	Avoid different-size pointer-to-integer cast Buildfarm member mamba is unhappy that I wrote "(Datum) NULL" in commit `28d534e2ae`: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-07%2005%3A08%3A08 Use "(Datum) 0" which is what we do everywhere else. Discussion: https://postgr.es/m/CANWCAZaOs_+WPH13ow33Q==+FwBwVZkqzm4vND=WEB4_NBmv1Q@mail.gmail.com	2026-04-07 12:28:05 +02:00
Heikki Linnakangas	6f5ad00ab7	Optimize sort and deduplication in ginExtractEntries() Remove NULLs from the array first, and use qsort to deduplicate only the non-NULL items. This simplifies the comparison function. Also replace qsort_arg() with a templated version so that the comparison function can be inlined. These changes make ginExtractEntries() a little faster especially for simple datatypes like integers. Author: David Geier <geidav.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/6d16b6bd-a1ff-4469-aefb-a1c8274e561a@iki.fi	2026-04-07 13:26:39 +03:00
Peter Eisentraut	b6ccd30d8f	Add isolation tests for UPDATE/DELETE FOR PORTION OF Add documentation about concurrency issues related to UPDATE/DELETE FOR PORTION OF as well as supporting isolation tests. Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/ec498c3d-5f2b-48ec-b989-5561c8aa2024%40illuminatedcomputing.com	2026-04-07 11:22:11 +02:00
Álvaro Herrera	5bcc3fbd19	Fix valgrind failure Buildfarm member skink reports that the new REPACK code is trying to write uninitialized bytes to disk, which correspond to padding space in the SerializedSnapshotData struct. Silence that by initializing the memory in SerializeSnapshot() to all zeroes. Co-authored-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com> Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/1976915.1775537087@sss.pgh.pa.us	2026-04-07 11:13:50 +02:00
John Naylor	8c3e22a8f8	Use .h for the file containing the page checksum code fragment Commit `5e13b0f24` used a .c file for a file containing a code fragment, to avoid adding an exception to headerscheck. That turned out to be too clever, since it meant installation didn't happen by the usual mechanism. Make it look like a normal header and add the requisite exception. Bug: #19450 Reported-by: RekGRpth <rekgrpth@gmail.com> Discussion: https://postgr.es/m/19450-bb0612c50c6786e5@postgresql.org	2026-04-07 15:52:55 +07:00
John Naylor	30229be755	Simplify SortSupport for the macaddr data type As of commit `6aebedc38` Datums are 64-bit values. Since MAC addresses have only 6 bytes, the abbreviated key always contains the entire MAC address and is thus authoritative (for practical purposes -- the tuple sort machinery has no way of knowing that). Abbreviating this datatype is cheap, and aborting abbreviation prevents optimizations like radix sort, so remove cardinality estimation. Author: Aleksander Alekseev <aleksander@tigerdata.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Michael Paquier <michael@paquier.xyz> Suggested-by: John Naylor <johncnaylorls@gmail.com> Discussion: https://postgr.es/m/CAJ7c6TMk10rF_LiMz6j9rRy1rqk-5s+wBPuBefLix4cY+-4s1w@mail.gmail.com	2026-04-07 13:29:27 +07:00
Michael Paquier	49cc0d4148	Mark JumbleState as a const in the post_parse_analyze hook This commit changes the post_parse_analyze_hook_type() hook to take a const JumbleState, to tell external modules that they are not allowed to touch the JumbleState that has been compiled by the core code. This fixes a pretty old problem with pg_stat_statements, that had always the idea of modifying the lengths of the constants stored in the JumbleState. The previous state could confuse extensions that need to look at a JumbleState depending on the loading order, if pg_stat_statements is part of the stack loaded. Another piece included in this commit is the move of the routine fill_in_constant_lengths() to queryjumblefuncs.c, to give an option to extensions to compile the lengths of the constants, if necessary. I was surprised by the number of external code that carries a copy of this routine (see the thread for details). Previously, this routine modified JumbleState. It now copies the set of LocationLens from JumbleState, and fills the constant lengths for separate use. pg_stat_statements is updated to use the new ComputeConstantLengths(). JumbleState is now marked with a const in the module, where relevant. Author: Sami Imseih <samimseih@gmail.com> Co-authored-by: Lukas Fittl <lukas@fittl.com> Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com	2026-04-07 15:22:49 +09:00
John Naylor	51098839cf	Split CREATE STATISTICS error reasons out into errdetails Some errmsgs in statscmds.c were phrased as "...cannot be used because...". Put the reasons into errdetails. While at it, switch from passive voice to "cannot create..." for the errmsg. Author: Yugo Nagata <nagata@sraoss.co.jp> Suggested-by: John Naylor <johncnaylorls@gmail.com> Discussion: https://postgr.es/m/CANWCAZaZeX0omWNh_ZbD_JVujzYQdRUW8UZOQ4dWh9Sg7OcAow@mail.gmail.com	2026-04-07 11:37:48 +07:00
Michael Paquier	3284e3f63c	Fix injection point detach timing problem in TAP test for lock stats injection_points_detach() could fail because of a concurrent cleanup triggered by injection_points_set_local() when a session finishes. This problem could be reproduced by adding a hardcoded sleep in InjectionPointDetach(), and has been detected by the CI. Since the test is designed so that the injection point is detached before being woken up, there is no need for it to be local, similarly to test 010_index_concurrently_upsert. This commit removes injection_points_set_local(), replacing it with a confirmation that the point has been attached in the session expected to block on a lock. With this removal, the detach can no longer happen concurrently; it happens only before the point is woken up. Issue introduced by `557a9f1e3e`, where the test was added. Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/rp6wz4lnz5qn4zlh7uxtavzfrmqvycy2g42z4zasfss2gxi54f@zzcsjdvdflwp	2026-04-07 13:17:13 +09:00
Michael Paquier	17132f55c5	Fix shmem allocation of fixed-sized custom stats kind StatsShmemSize(), that computes the shmem size needed for pgstats, includes the amount of shared memory wanted by all the custom stats kinds registered. However, the shared memory allocation was done by ShmemAlloc() in StatsShmemInit(), meaning that the space reserved was not used, wasting some memory. These extra allocations would show up under "<anonymous>" in pg_shmem_allocations, as the allocations done by ShmemAlloc() are not tracked by ShmemIndexEnt. Issue introduced by `7949d95945`. Author: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/04b04387-92f5-476c-90b0-4064e71c5f37@iki.fi Backpatch-through: 18	2026-04-07 11:59:49 +09:00
Amit Langote	5c54c3ed1b	Fix deferred FK check batching introduced by commit `b7b27eb41a` That commit introduced AfterTriggerIsActive() to detect whether we are inside the after-trigger firing machinery, so that RI trigger functions can take the batched fast path. It was implemented using query_depth >= 0, which correctly identified immediate trigger firing but missed the deferred case where query_depth is -1 at COMMIT via AfterTriggerFireDeferred(). This caused deferred FK checks to fall back to the per-row fast path instead of the batched path. The correct check is whether we are inside an after-trigger firing loop specifically. Introduce afterTriggerFiringDepth, a counter incremented around the trigger-firing loops in AfterTriggerEndQuery, AfterTriggerFireDeferred, and AfterTriggerSetState, and decremented after FireAfterTriggerBatchCallbacks() returns. AfterTriggerIsActive() now returns afterTriggerFiringDepth > 0. Reported-by: Chao Li <li.evan.chao@gmail.com> Author: Chao Li <li.evan.chao@gmail.com> Co-authored-by: Amit Langote <amitlangote09@gmail.com> Discussion: https://postgr.es/m/C2133B47-79CD-40FF-B088-02D20D654806@gmail.com	2026-04-07 10:45:59 +09:00
Michael Paquier	9897957805	Fix shared memory size of template code for custom fixed-sized pgstats On HEAD, the template code for custom fixed-sized pgstats is in the test module test_custom_stats. On REL_18_STABLE, this code lives in the test module injection_points. Both cases were underestimating the size of the shared memory area required for the storage of the stats data, using a single entry rather than the whole area. This underestimation meant that there was no memory allocated for the LWLock required for the stats, and even more. This problem would be also misleading for extension developers looking at this code. This issue has been noticed while digging into a different bug reported by Heikki Linnakangas, showing that the underestimation was causing failures in the TAP tests of the test modules for 32-bit builds. The other issue reported, related to the memory allocation of custom fixed-sized pgstats, will be fixed in a follow-up commit. Discussion: https://postgr.es/m/adMk_lWbnz3HDOA8@paquier.xyz Backpatch-through: 18	2026-04-07 08:24:32 +09:00
Melanie Plageman	dd78e69cfc	Allocate separate DSM chunk for parallel Index[Only]Scan instrumentation Previously, parallel index and index-only scans packed the parallel scan descriptor and shared instrumentation (for EXPLAIN ANALYZE) into a single DSM allocation. Since scans may be instrumented without being parallel-aware, and vice versa, using separate DSM chunks -- each with its own TOC key -- is cleaner. A future commit will extend this pattern to other scan node types. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me	2026-04-06 19:10:19 -04:00
Melanie Plageman	43222b8e53	Assert no duplicate keys in shm_toc_insert() shm_toc_insert() silently accepts duplicate keys. Since shm_toc_lookup() returns the first matching entry, any later entry with the same key would be unreachable. Add an assertion to catch this. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me	2026-04-06 18:41:47 -04:00
Nathan Bossart	87f61f0c82	Add pg_stat_autovacuum_scores system view. This view contains one row for each table in the current database, showing the current autovacuum scores for that specific table. It also shows whether autovacuum would vacuum or analyze the table. Bumps catversion. Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com	2026-04-06 16:56:33 -05:00
Daniel Gustafsson	b3a37ffbc5	Use PG_DATA_CHECKSUM_OFF instead of hardcoded value For a long time, the online checksums patchset kept the "off" state as literal zero without a label, to be consistent with the previous coding, which only had a label for the "on" state. Later, when an "off" label was added, not all uses in the code got the memo. Fix by setting these to PG_DATA_CHECKSUM_OFF. While there, fix a duplicate word in a comment introduced by the same commit. Author: Aleksander Alekseev <aleksander@tigerdata.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CAJ7c6TPRTnQFXXX1CRcYoTLXw2swtDH==uSz1MYoMKdLrKZHjA@mail.gmail.com	2026-04-06 22:11:53 +02:00
Álvaro Herrera	28d534e2ae	Add CONCURRENTLY option to REPACK When this flag is specified, REPACK no longer acquires access-exclusive lock while the new copy of the table is being created; instead, it creates the initial copy under share-update-exclusive lock only (same as vacuum, etc), and it follows an MVCC snapshot; it sets up a replication slot starting at that snapshot, and uses a concurrent background worker to do logical decoding starting at the snapshot to populate a stash of concurrent data changes. Those changes can then be re-applied to the new copy of the table just before swapping the relfilenodes. Applications can continue to access the original copy of the table normally until just before the swap, which is the only point at which the access-exclusive lock is needed. There are some loose ends in this commit: 1. concurrent repack needs its own replication slot in order to apply logical decoding, which are a scarce resource and easy to run out of. 2. due to the way the historic snapshot is initially set up, only one REPACK process can be running at any one time on the whole system. 3. there's a danger of deadlocking (and thus abort) due to the lock upgrade required at the final phase. These issues will be addressed in upcoming commits. The design and most of the code are by Antonin Houska, heavily based on his own pg_squeeze third-party implementation. 
Author: Antonin Houska <ah@cybertec.at> Co-authored-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Noriyoshi Shinoda <noriyoshi.shinoda@hpe.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Discussion: https://postgr.es/m/5186.1706694913@antos Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql	2026-04-06 21:55:08 +02:00
Alexander Korotkov	7e8aeb9e48	Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup() When the standby is passed as a PostgreSQL::Test::Cluster instance, use the WAIT FOR LSN command on the standby server to implement wait_for_catchup() for replay, write, and flush modes. This is more efficient than polling pg_stat_replication on the upstream, as the WAIT FOR LSN command uses a latch-based wakeup mechanism. The optimization applies when: - The standby is passed as a Cluster object (not just a name string) - The mode is 'replay', 'write', or 'flush' (not 'sent') Rather than pre-checking pg_is_in_recovery() on the standby (which would add an extra round-trip on every call), we issue WAIT FOR LSN directly and handle the 'not in recovery' result as a signal to fall back to polling. For 'sent' mode, when the standby is passed as a string (e.g., a subscription name for logical replication), when the standby has been promoted, or when WAIT FOR LSN is interrupted by a recovery conflict, the function falls back to the original polling-based approach using pg_stat_replication on the upstream. The recovery conflict fallback is necessary because some conflicts are unavoidable - for example, ResolveRecoveryConflictWithTablespace() kills all backends unconditionally, regardless of what they are doing. The recovery conflict detection matches the English error message "conflict with recovery", which is reliable because the test suite runs with LC_MESSAGES=C. Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>	2026-04-06 22:47:26 +03:00
Alexander Korotkov	834038c1f8	Avoid syscache lookup while building a WAIT FOR tuple descriptor Use TupleDescInitBuiltinEntry instead of TupleDescInitEntry when building the result tuple descriptor for the WAIT FOR command. This avoids a syscache access that could re-establish a catalog snapshot after we've explicitly released all snapshots before the wait. Discussion: https://postgr.es/m/CABPTF7U%2BSUnJX_woQYGe%3D%3DR9Oz%2B-V6X0VO2stBLPGfJmH_LEhw%40mail.gmail.com Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2026-04-06 22:47:26 +03:00
Nathan Bossart	775fe51daa	Remove recheck_relation_needs_vacanalyze(). This function is a thin wrapper around relation_needs_vacanalyze() that handles fetching and freeing the pgstat entry for the table. Since all callers of relation_needs_vacanalyze() do that anyway, we can teach that function to fetch/free the pgstat entry and use it instead. Suggested-by: Álvaro Herrera <alvherre@kurilemu.de> Author: Sami Imseih <samimseih@gmail.com> Co-authored-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com	2026-04-06 14:30:52 -05:00
Robert Haas	e972dff6c3	auto_explain: Add new GUC, auto_explain.log_extension_options. The associated value should look like something that could be part of an EXPLAIN options list, but restricted to EXPLAIN options added by extensions. For example, if pg_overexplain is loaded, you could set auto_explain.log_extension_options = 'DEBUG, RANGE_TABLE'. You can also specify arguments to these options in the same manner as normal e.g. 'DEBUG 1, RANGE_TABLE false'. Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com	2026-04-06 15:19:42 -04:00
Tom Lane	d516974840	Support more object types within CREATE SCHEMA. Having rejected the principle that we should know how to re-order the sub-commands of CREATE SCHEMA, there is not really anything except a little coding to stop us from supporting more object types. This patch adds support for creating functions (including procedures and aggregates), operators, types (including domains), collations, and text search objects. SQL:2021 specifies that we should allow functions, procedures, types, domains, and collations, so this moves us a great deal closer to full SQL compatibility of CREATE SCHEMA. What remains missing from their list are casts, transforms, roles, and some object types we don't support yet (e.g. CREATE CHARACTER SET). Supporting casts or transforms would be problematic because they don't have names at all, let alone schema-qualified names, so it'd be quite a stretch to say that they belong to a schema. Roles likewise are not schema-qualified, plus they are global to a cluster, making it even less reasonable to consider them as belonging to a schema. So I don't see us trying to complete the list. User-defined aggregates and operators are outside the spec's ken, as are text search objects, so adding them does not do anything for spec compatibility. But they go along with these other object types, plus it takes no additional code to support them since they are represented as DefineStmts like some variants of CREATE TYPE. It would indeed take some effort to reject them. Author: Kirill Reshke <reshkekirill@gmail.com> Author: Jian He <jian.universality@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CALdSSPh4jUSDsWu3K58hjO60wnTRR0DuO4CKRcwa8EVuOSfXxg@mail.gmail.com	2026-04-06 15:16:25 -04:00
Tom Lane	404db8f9ed	Execute foreign key constraints in CREATE SCHEMA at the end. The previous patch simplified CREATE SCHEMA's behavior to "execute all subcommands in the order they are written". However, that's a bit too simple, as the spec clearly requires forward references in foreign key constraint clauses to work, see feature F311-01. (Most other SQL implementations seem to read more into the spec than that, but it's not clear that there's justification for more in the text, and this is the only case that doesn't introduce unresolvable issues.) We never implemented that before, but let's do so now. To fix it, transform FOREIGN KEY clauses into ALTER TABLE ... ADD FOREIGN KEY commands and append them to the end of the CREATE SCHEMA's subcommand list. This works because the foreign key constraints are independent and don't affect any other DDL that might be in CREATE SCHEMA. For simplicity, we do this for all FOREIGN KEY clauses even if they would have worked where they were. Author: Jian He <jian.universality@gmail.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us	2026-04-06 15:16:25 -04:00
Tom Lane	a9c350d9ee	Don't try to re-order the subcommands of CREATE SCHEMA. transformCreateSchemaStmtElements has always believed that it is supposed to re-order the subcommands of CREATE SCHEMA into a safe execution order. However, it is nowhere near being capable of doing that correctly. Nor is there reason to think that it ever will be, or that that is a well-defined requirement. (The SQL standard does say that it should be possible to do foreign-key forward references within CREATE SCHEMA, but it's not clear that the text requires anything more than that.) Moreover, the problem will get worse as we add more subcommand types. Let's just drop the whole idea and execute the commands in the order given, which seems like a much less astonishment-prone definition anyway. The foreign-key issue will be handled in a follow-up patch. This will result in a release-note-worthy incompatibility, which is that forward references like CREATE SCHEMA myschema CREATE VIEW myview AS SELECT * FROM mytable CREATE TABLE mytable (...); used to work and no longer will. Considering how many closely related variants never worked, this isn't much of a loss. Along the way, pass down a ParseState so that we can provide an error cursor for "wrong schema name" and related errors, and fix transformCreateSchemaStmtElements so that it doesn't scribble on the parsetree passed to it. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us	2026-04-06 15:16:25 -04:00
Masahiko Sawada	1ff3180ca0	Allow autovacuum to use parallel vacuum workers. Previously, autovacuum always disabled parallel vacuum regardless of the table's index count or configuration. This commit enables autovacuum workers to use parallel index vacuuming and index cleanup, using the same parallel vacuum infrastructure as manual VACUUM. Two new configuration options control the feature. The GUC autovacuum_max_parallel_workers sets the maximum number of parallel workers a single autovacuum worker may launch; it defaults to 0, preserving existing behavior unless explicitly enabled. The per-table storage parameter autovacuum_parallel_workers provides per-table limits. A value of 0 disables parallel vacuum for the table, a positive value caps the worker count (still bounded by the GUC), and -1 (the default) defers to the GUC. To handle cases where autovacuum workers receive a SIGHUP and update their cost-based vacuum delay parameters mid-operation, a new propagation mechanism is added to vacuumparallel.c. The leader stores its effective cost parameters in a DSM segment. Parallel vacuum workers poll for changes in vacuum_delay_point(); if an update is detected, they apply the new values locally via VacuumUpdateCosts(). A new test module, src/test/modules/test_autovacuum, is added to verify that parallel autovacuum workers are correctly launched and that cost-parameter updates are propagated as expected. The patch was originally proposed by Maxim Orlov, but the implementation has undergone significant architectural changes since then during the review process. 
Author: Daniil Davydov <3danissimo@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: zengman <zengman@halodbtech.com> Discussion: https://postgr.es/m/CACG=ezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw=cw@mail.gmail.com	2026-04-06 11:48:29 -07:00
Álvaro Herrera	c0b53ec063	Rename cluster.c to repack.c (and corresponding .h) CLUSTER is no longer the favored way to invoke this functionality, and the code is about to shift its focus to the REPACK more ambitiously. Rename the file to avoid leaving an unnecessary historical artifact around. Author: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/202603271635.owyhm7btgoic@alvherre.pgsql	2026-04-06 20:11:01 +02:00
Tom Lane	21c69dc73f	Disallow system columns in COPY FROM WHERE conditions. These columns haven't been computed yet when the filtering happens (since we've not written the candidate tuple into the table); so any check on them is wrong or useless. Worse, since `aa606b931` such a reference results in an access off the end of a TupleDesc, potentially causing a phony "generated columns are not supported in COPY FROM WHERE conditions" error; and since `c98ad086a` it throws an Assert instead. Actually we could allow tableoid, which has been set to the OID of the table named as the COPY target. However, plausible uses for tests of tableoid would involve a partitioned target table, and the user would wish it to read as the OID of the destination partition. There has been some discussion of changing things to make it work like that, but pending that happening we should just disallow tableoid along with other system columns. It seems best though to install this prohibition only in HEAD. In the back branches we'll just guard the unsafe TupleDesc access, and people will keep getting whatever semantics they got before. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/6f435023-8ab6-47c2-ba07-035d0c4212f9@gmail.com	2026-04-06 14:05:01 -04:00
Tom Lane	f7da81f68b	Add missing .gitignore files. contrib/pg_stash_advice and src/test/modules/test_shmem missed these, leading to complaints from git after an in-tree check-world run. Use our standard boilerplate list of ignorable subdirectories, although the two modules presently create different subsets of that.	2026-04-06 13:25:29 -04:00
Tom Lane	6582010c80	Fix null-bitmap combining in array_agg_array_combine(). This code missed the need to update the combined state's nullbitmap if state1 already had a bitmap but state2 didn't. We need to extend the existing bitmap with 1's but didn't. This could result in wrong output from a parallelized array_agg(anyarray) calculation, if the input has a mix of null and non-null elements. The errors depended on timing of the parallel workers, and therefore would vary from one run to another. Also install guards against integer overflow when calculating the combined object's sizes, and make some trivial cosmetic improvements. Author: Dmytro Astapov <dastapov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAFQUnFj2pQ1HbGp69+w2fKqARSfGhAi9UOb+JjyExp7kx3gsqA@mail.gmail.com Backpatch-through: 16	2026-04-06 13:14:53 -04:00
Robert Haas	0442f1c9ef	Add a guc_check_handler to the EXPLAIN extension mechanism. It would be useful to be able to tell auto_explain to set a custom EXPLAIN option, but it would be bad if it tried to do so and the option name or value wasn't valid, because then every query would fail with a complaint about the EXPLAIN option. So add a guc_check_handler that auto_explain will be able to use to only try to set option name/value/type combinations that have been determined to be legal, and to emit useful messages about ones that aren't. Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com	2026-04-06 12:31:47 -04:00
Nathan Bossart	e3481edfd1	Remove autoanalyze corner case. The restructuring in commit `53b8ca6881` revealed an interesting corner case: if a table needs vacuuming for wraparound prevention and autovacuum is disabled for it, we might still choose to analyze it. Research seems to indicate this was an accidental addition by commit `48188e1621`, and further discussion indicates there is consensus that it is unnecessary and can be removed. Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Shinya Kato <shinya11.kato@gmail.com> Discussion: https://postgr.es/m/adB9nSsm_S0D9708%40nathan	2026-04-06 11:28:46 -05:00
Robert Haas	e0e819cc08	Expose helper functions scan_quoted_identifier and scan_identifier. Previously, this logic was embedded within SplitIdentifierString, SplitDirectoriesString, and SplitGUCList. Factoring it out saves a bit of duplicated code, and also makes it available to extensions that might want to do similar things without necessarily wanting to do exactly the same thing. Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com	2026-04-06 11:13:25 -04:00
Fujii Masao	ca2b5443e2	Add TAP tests for log_lock_waits This commit updates 011_lock_stats.pl to verify log_lock_waits behavior. The tests check that messages are emitted both when a wait occurs and when the lock is acquired, and that the "still waiting for" message is logged exactly once per wait, even if the backend wakes up during the wait. The latter covers the behavior introduced by commit `fd6ecbfa75`. Author: Hüseyin Demir <huseyin.d3r@gmail.com> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAB5wL7YB1my9W5k5i=SY+=sTjeozyJ0YkvGXrVfeDNzuRkoTPg@mail.gmail.com	2026-04-06 23:49:40 +09:00
Fujii Masao	93dc1ace20	Release postmaster working memory context in slotsync worker Child processes do not need the postmaster's working memory context and normally release it at the start of their main entry point. However, the slotsync worker forgot to do so. This commit makes the slotsync worker release the postmaster's working memory context at startup, preventing unintended use. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Tiancheng Ge <getiancheng_2012@163.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAHGQGwHO05JaUpgKF8FBDmPdBUJsK22axRRcgmAUc2Jyi8OK8g@mail.gmail.com	2026-04-06 23:04:18 +09:00
Heikki Linnakangas	ed71d7356e	Fix memory leaks introduced by commit `283e823f9d` When freeing pending_shmem_requests we should also free the ->options. Author: Aleksander Alekseev <aleksander@tigerdata.com> Discussion: https://www.postgresql.org/message-id/CAJ7c6TN9tp8MTc0WXM0zfSWqjfBqU8gpe+o5KqHB1-cQ7409Kw@mail.gmail.com	2026-04-06 15:46:03 +03:00
Heikki Linnakangas	2670a0fcc6	Fix compilation without injection points with some compilers Some compilers didn't like the empty initializer when compiled without USE_INJECTION_POINTS. Per buildfarm member 'drongo', using Visual Studio 2019. Author: Michael Paquier <michael@paquier.xyz> Discussion: https://www.postgresql.org/message-id/adNHcBVJO5gIOp1l@paquier.xyz	2026-04-06 15:46:00 +03:00
Robert Haas	e8ec19aa32	Add pg_stash_advice contrib module. This module allows plan advice strings to be provided automatically from an in-memory advice stash. Advice stashes are stored in dynamic shared memory and must be recreated and repopulated after a server restart. If pg_stash_advice.stash_name is set to the name of an advice stash, and if query identifiers are enabled, the query identifier for each query will be looked up in the advice stash and the associated advice string, if any, will be used each time that query is planned. Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Discussion: http://postgr.es/m/CA+TgmoaeNuHXQ60P3ZZqJLrSjP3L1KYokW9kPfGbWDyt+1t=Ng@mail.gmail.com	2026-04-06 07:41:28 -04:00
Michael Paquier	404a17c155	Use single LWLock for lock statistics in pgstats Previously, one LWLock was used for each lock type, adding complexity without an observable performance benefit as data is gathered only for paths involving lock waits, at least currently. This commit replaces the per-type set of LWLocks with a single LWLock protecting the stats data of all the lock types, like the stats kinds for SLRU or WAL. A good chunk of the callbacks get simpler thanks to this change. The previous approach also had one bug in the flush callback when nowait was called with "true": a backend iterating over all entries could successfully flush some entries while skipping others due to contention, then unconditionally reset the pending data. This would cause some stats data loss. Oversight in `4019f725f5`. Reported-by: Tomas Vondra <tomas@vondra.me> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/1af63e6d-16d5-4d5b-9b03-11472ef1adf9@vondra.me	2026-04-06 14:01:04 +09:00
Michael Paquier	283c5fb22b	Improve more stability of worker_spi termination test Alexander Lakhin has noticed that it can be possible on machines with slow storage to have the spawned workers be stuck in initialize_worker_spi(), before they reach their main loop. Waiting for a flush to happen would block the interrupt attempts done by the database commands, causing the test to fail on timeout once the number of interrupt attempts is reached in CountOtherDBBackends(). This commit switches the test to wait for the spawned bgworkers to reach their main loops before attempting the database commands that would trigger the interrupts, napping for a time larger than the default, with worker_spi.naptime set at 10 minutes. Another thing that could be attempted is to enforce a larger number of tries in CountOtherDBBackends(), if what is done here is not enough. Let's see first if what this commit does is enough for the buildfarm members widowbird and jay. Analyzed-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/f913fba1-da59-404c-9eb3-07c7304be637@gmail.com	2026-04-06 13:23:28 +09:00
Fujii Masao	d78a4f0bf0	Simplify redundant current_database() subqueries in stats.sql regression test Previously the stats.sql regression test used conditions like "datname = (SELECT current_database())" to check the current database name. The subquery is unnecessary, so this commit simplifies these expressions to "datname = current_database()". Author: Chao Li <lic@highgo.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/A1535A8F-65AF-4C3D-ACBE-25891CB5D38B@gmail.com	2026-04-06 13:19:45 +09:00
Richard Guo	3a08a2a8b4	Fix volatile function evaluation in eager aggregation Pushing aggregates containing volatile functions below a join can violate volatility semantics by changing the number of times the function is executed. Here we check the Aggref nodes in the targetlist and havingQual for volatile functions and disable eager aggregation when such functions are present. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com	2026-04-06 11:54:08 +09:00
Richard Guo	bd94845e8c	Fix collation handling for grouping keys in eager aggregation When determining if it is safe to use an expression as a grouping key for partial aggregation, eager aggregation relies on the B-tree equalimage support function to ensure that equality implies image equality. Previously, the code incorrectly passed the default collation of the expression's data type to the equalimage procedure, rather than the expression's actual collation. As a result, if a column used a non-deterministic collation but the base type's default collation was deterministic, eager aggregation would incorrectly assume that the column was safe for byte-level grouping. This could cause rows to be prematurely grouped and subsequently discarded by strict join conditions, resulting in incorrect query results. This patch fixes the issue by passing the expression's actual collation to the equalimage procedure. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com	2026-04-06 11:52:33 +09:00
Fujii Masao	a8f45dee91	Add wal_sender_shutdown_timeout GUC to limit shutdown wait for replication Previously, during shutdown, walsenders always waited until all pending data was replicated to receivers. This ensures sender and receiver stay in sync after shutdown, which is important for physical replication switchovers, but it can significantly delay shutdown. For example, in logical replication, if apply workers are blocked on locks, walsenders may wait until those locks are released, preventing shutdown from completing for a long time. This commit introduces a new GUC, wal_sender_shutdown_timeout, which specifies the maximum time a walsender waits during shutdown for all pending data to be replicated. When set, shutdown completes once all data is replicated or the timeout expires. A value of -1 (the default) disables the timeout. This can reduce shutdown time when replication is slow or stalled. However, if the timeout is reached, the sender and receiver may be left out of sync, which can be problematic for physical replication switchovers. 
Author: Andrey Silitskiy <a.silitskiy@postgrespro.ru> Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru> Reviewed-by: Ronan Dunklau <ronan@dunklau.fr> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com	2026-04-06 11:35:03 +09:00
John Naylor	8194c4a9dd	Fix unportable use of __builtin_constant_p On MSVC Arm, USE_ARMV8_CRC32C is defined, but __builtin_constant_p is not available. Use pg_integer_constant_p and add appropriate guards. There is a similar potential hazard for the x86 path, but for now let's get the buildfarm green. Oversight in commit `fbc57f2bc`, per buildfarm member hoatzin.	2026-04-06 09:30:01 +07:00
Daniel Gustafsson	07009121c2	Test stabilization for online checksums Postcommit review and buildfarm/CI failures revealed a few issues in the test code which this commit attempts to resolve. These failures are verified using synthetic means. * Wait for launcher exit in enable/disable checksum tests When enabling or disabling data checksums in a test with waiting for an end state (on or off), the test typically wants to perform more tests against the cluster immediately. Make sure to wait for the launcher to exit in these cases before returning in order to know it can immediately be acted on. This is a more generic way of implementing `0036232ba8`. * Refactor injection point tests to use the injection_points test extension. Two injection points added for online checksums were better expressed using the injection_points extension with the test code embedded in datachecksum_state.c. * Make tests less timing dependent and allow transitions to "on" and not just "inprogress-on" in case a test manages to finish before it's checked for state. * When waiting on a blocking background psql keeping a temporary table open, the test first closed the background session and then the server. This could cause data checksums to manage to get enabled in the brief window between dropping the temporary table and closing the server. Fix by closing the server first before the background session. * Remove a few superfluous duplicate checks and general cleanup of comments as well as making LSN logging consistent. These issues were reported by Andres as well as spotted in the buildfarm and on CI. Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/92F25C14-801E-4198-994D-D83E31FEB0D8@yesql.se	2026-04-06 02:03:10 +02:00
Daniel Gustafsson	d771b0a907	Handle checksumworker startup wait race If the background worker for processing databases manages to finish before the launcher starts waiting for it, the launcher would treat it erroneously as an error. Fix by ensuring that the result state is checked in this case. Identified on CI and synthetically reproduced during local testing. Also, while at it, make sure to properly lock the shared memory structure before updating the result state. Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/4fxw37ge47v5baeozla5phymi233hxbcjbwwsfwv3mpg3kyl2z@6jk4nkf6jp4	2026-04-06 01:55:06 +02:00
Michael Paquier	557a9f1e3e	Add tests for lock statistics, take two Commit `7c64d56fd9` has removed the isolation test providing coverage for lock statistics due to some instability in the CI, where the deadlock timeout may not have enough time to process, preventing the stats data to be updated. These also relied on a set of hardcoded sleeps. This commit switches the test suite to TAP, instead, that uses an injection point with a wait to avoid the sleeps. The injection point is added in ProcSleep(), once we know that the deadlock timeout has fired and that the stats have been updated. Multiple lock patterns are checked, all rely on the same workflow, with two sessions: - session 1 holds a given lock type. - session 2 attaches to the new injection point with the wait action. - session 2 attempts to acquire a lock conflicting with the lock of session 1, waiting for the injection point to be reached. - session 1 releases its lock, session 2 commits. - pg_stat_lock is polled until the counters are updated for the lock type. Bertrand's version of the patch introduced a new routine to BackgroundPsql() to detect the blocked background sessions. I have tweaked the test so as we use the same method as some of the other tests instead, based on some \echo commands. This test has been run multiple times in the CI, all passing, so I'd like to think that this is more stable than the first version attempted. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/acNTR1lLHwQJ0o+P@ip-10-97-1-34.eu-west-3.compute.internal	2026-04-06 08:51:30 +09:00
Heikki Linnakangas	9b5acad3f4	Convert all remaining subsystems to use the new shmem allocation API This removes all remaining uses of ShmemInitStruct() and ShmemInitHash() from built-in code. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:13:10 +03:00
Heikki Linnakangas	a4b6139dcc	Convert buffer manager to use the new shmem allocation functions This rectifies the initialization functions a little, making the "buffer strategy" stuff in freelist.c and buffer mapping hash table in buf_init.c top-level "subsystems" of their own, registered directly in subsystemlist.h. Previously they were called indirectly from BufferManagerShmemInit() and BufferManagerShmemSize() Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:13:08 +03:00
Heikki Linnakangas	dacfe81a0d	Add alignment option to ShmemRequestStruct() The buffer blocks, converted to use ShmemRequestStruct() in the next commit, are IO-aligned. This might come in handy in other places too, so make it an explicit feature of ShmemRequestStruct(). Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:13:06 +03:00
Heikki Linnakangas	58a1573385	Convert AIO to use the new shmem allocation functions This replaces the "shmem_size" and "shmem_init" callbacks in the IO methods table with the same ShmemCallback struct that we now use in other subsystems Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:13:04 +03:00
Heikki Linnakangas	2e0943a859	Convert SLRUs to use the new shmem allocation functions I replaced the old SimpleLruInit() function without a backwards compatibility wrapper, because few extensions define their own SLRUs. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:13:02 +03:00
Heikki Linnakangas	4c9eca5afe	Refactor shmem initialization code in predicate.c This is in preparation to convert it to use the new shmem allocation functions, making the next commit that does that smaller. This inlines SerialInit() to the caller, and moves all the initialization steps within PredicateLockShmemInit() to happen after all the ShmemInit{Struct|Hash}() calls. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:13:01 +03:00
Heikki Linnakangas	c6d55714ba	Use the new shmem allocation functions in a few core subsystems These subsystems have some complicating properties, making them slightly harder to convert than most: - The initialization callbacks of some of these subsystems have dependencies, i.e. they need to be initialized in the right order. - The ProcGlobal pointer still needs to be inherited by the BackendParameters mechanism on EXEC_BACKEND builds, because ProcGlobal is required by InitProcess() to get a PGPROC entry, and the PGPROC entry is required to use LWLocks, and usually attaching to shared memory areas requires the use of LWLocks. - Similarly, ProcSignal pointer still needs to be handled by BackendParameters, because query cancellation connections access it without calling InitProcess Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:12:59 +03:00
Heikki Linnakangas	a006bc7b16	Convert lwlock.c to use the new shmem allocation functions It seems like a good candidate to convert first because it needs to be initialized before any other subsystem, but other than that it's nothing special. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:12:57 +03:00
Heikki Linnakangas	1fc2e9fbc0	Introduce a registry of built-in shmem subsystems To add a new built-in subsystem, add it to subsystemslist.h. That hooks up its shmem callbacks so that they get called at the right times during postmaster startup. For now this is unused, but will replace the current SubsystemShmemSize() and SubsystemShmemInit() calls in the next commits. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:12:55 +03:00
Heikki Linnakangas	6409994c7d	Add a test module to test after-startup shmem allocations The old ShmemInit{Struct/Hash}() functions could be used after postmaster startup, as long as the allocation is small enough to fit in spare shmem reserved at startup. I believe some extensions do that, although we hadn't really documented it and had no coverage for it. The new test module covers that after-startup usage with the new ShmemRequestStruct() functions. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:12:51 +03:00
Heikki Linnakangas	283e823f9d	Introduce a new mechanism for registering shared memory areas This replaces the [Subsystem]ShmemSize() and [Subsystem]ShmemInit() functions called at postmaster startup with a new set of callbacks. The new mechanism is designed to be more ergonomic. Notably, the size of each shmem area is specified in the same ShmemRequestStruct() call, together with its name. The same mechanism is used in extensions, replacing the shmem_{request/startup}_hooks. ShmemInitStruct() and ShmemInitHash() become backwards-compatibility wrappers around the new functions. In future commits, I will replace all ShmemInitStruct() and ShmemInitHash() calls with the new functions, although we'll still need to keep them around for extensions. Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:12:50 +03:00
Heikki Linnakangas	6ef9bee293	Move some code from shmem.c and shmem.h A little refactoring in preparation for the next commit, to make the material changes in that commit more clear. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com	2026-04-06 02:12:48 +03:00
Andres Freund	5a79e78501	instrumentation: Separate per-node logic from other uses Previously, different places (e.g. query "total time") were repurposing the Instrumentation struct initially introduced for capturing per-node statistics during execution. This overuse of the same struct is confusing, e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated code paths, and prevents future refactorings. Instead, simplify the Instrumentation struct to only track time and WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of per-node instrumentation - these calls were added without any apparent benefit since the relevant fields were never read. Introduce the NodeInstrumentation struct to carry forward the per-node instrumentation information. WorkerInstrumentation is renamed to WorkerNodeInstrumentation for clarity. In passing, clarify that InstrAggNode is expected to only run after InstrEndLoop (as it does in practice), and drop unused code. This also fixes a consequence-less bug: Previously ->async_mode was only set when a non-zero instrument_option was passed. That turns out to be harmless right now, as ->async_mode only affects a timing related field. Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com	2026-04-05 19:04:24 -04:00
Andres Freund	7d9b74df53	instrumentation: Separate trigger logic from other uses Introduce TriggerInstrumentation to capture trigger timing and firings (previously counted in "ntuples"), to aid a future refactoring that splits out all Instrumentation fields beyond timing and WAL/buffers into more specific structs. In passing, drop the "n" argument to InstrAlloc, as all remaining callers need exactly one Instrumentation struct. The duplication between InstrAlloc() and InstrInit(), as well as the conditional initialization of async_mode will be addressed in a subsequent commit. Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com	2026-04-05 16:56:50 -04:00
Andres Freund	6c7bce28c8	Fixups for `a4f774cf1c` The database name was warned about when building with -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS, leading to BF and CI failures. It is somewhat confusing that the required prefix is different for databases than other object types. Also fix a pgindent violation that caused koel to start to fail. Discussion: https://postgr.es/m/ptyiexyhmtxf4lm524s7o7w64r26ra237uusv4tjav4yhpmeoo@vfwwllz7tivb	2026-04-05 15:36:34 -04:00
Andres Freund	df6949ccf7	Add tid_block() and tid_offset() accessor functions The two new functions allow to extract the block number and offset from a tid. There are existing ways to do so (e.g. by doing (ctid::text::point)[0]), but they are hard to remember and not pretty. tid_block() returns int8 (bigint) because BlockNumber is uint32, which exceeds the range of int4. tid_offset() returns int4 (integer) because OffsetNumber is uint16, which fits safely in int4. Bumps catversion. Author: Ayush Tiwari <ayushtiwari.slg01@gmail.com> Discussion: https://postgr.es/m/CAJTYsWUzok2+mvSYkbVUwq_SWWg-GdHqCuYumN82AU97SjwjCA@mail.gmail.com	2026-04-05 15:17:05 -04:00
Heikki Linnakangas	f10b6be258	Check that the tranche name is unique in RequestNamedLWLockTranche You could request two tranches with the same name, but things would get confusing when you called GetNamedLWLockTranche() to get the LWLocks allocated for them; it would always return the first tranche with the name. That doesn't make sense, so forbid duplicates. We still allow duplicates with LWLockNewTrancheId(). That works better as you don't use the name to look up the tranche ID later. It's still confusing in wait events, for example, but it's not dangerous in the same way. Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://www.postgresql.org/message-id/463a28db-0c0b-4af6-bac6-3891828bbbfe@iki.fi	2026-04-05 21:05:20 +03:00
Heikki Linnakangas	92a685e407	Improve test_lwlock_tranches While working on refactoring how shmem is allocated, I made a mistake where the main LWLock array did not reserve space for the LWLocks allocated with RequestNamedLWLockTranche(), and the test still passed. Matthias van de Meent spotted that before it got committed, but in order to catch such mistakes in the future, add checks in test_lwlock_tranches that the locks allocated with RequestNamedLWLockTranche() can be acquired and released. Another change is to stop requesting multiple tranches with the same name with RequestNamedLWLockTranche(). As soon as I started to test using the locks I realized that's bogus, and the next commit will forbid it. Keep test coverage for duplicates requested with LWLockNewTrancheId() for now, but make it more clear that that's what the test does. Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://www.postgresql.org/message-id/463a28db-0c0b-4af6-bac6-3891828bbbfe@iki.fi Discussion: https://www.postgresql.org/message-id/CAEze2WjgCROMMXY0+j8FFdm3iFcr7By-+6Mwiz=PgGSEydiW3A@mail.gmail.com	2026-04-05 21:05:15 +03:00
Andrew Dunstan	a4f774cf1c	Add pg_get_database_ddl() function Add a new SQL-callable function that returns the DDL statements needed to recreate a database. It takes a regdatabase argument and an optional VARIADIC text argument for options that are specified as alternating name/value pairs. The following options are supported: pretty (boolean) for formatted output, owner (boolean) to include OWNER and tablespace (boolean) to include TABLESPACE. The return is one or multiple rows where the first row is a CREATE DATABASE statement and subsequent rows are ALTER DATABASE statements to set some database properties. The caller must have CONNECT privilege on the target database. Author: Akshay Joshi <akshay.joshi@enterprisedb.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Co-authored-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Quan Zongliang <quanzongliang@yeah.net> Discussion: https://postgr.es/m/CANxoLDc6FHBYJvcgOnZyS+jF0NUo3Lq_83-rttBuJgs9id_UDg@mail.gmail.com Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net	2026-04-05 10:54:54 -04:00
Andrew Dunstan	b99fd9fd7f	Add pg_get_tablespace_ddl() function Add a new SQL-callable function that returns the DDL statements needed to recreate a tablespace. It takes a tablespace name or OID and an optional VARIADIC text argument for options that are specified as alternating name/value pairs. The following options are supported: pretty (boolean) for formatted output and owner (boolean) to include OWNER. (It includes two variants because there is no regtablespace pseudotype.) The return is one or multiple rows where the first row is a CREATE TABLESPACE statement and subsequent rows are ALTER TABLESPACE statements to set some tablespace properties. The caller must have SELECT privilege on pg_tablespace. get_reloptions() in ruleutils.c is made non-static so it can be called from the new ddlutils.c file. Author: Nishant Sharma <nishant.sharma@enterprisedb.com> Author: Manni Wood <manni.wood@enterprisedb.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Co-authored-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAKWEB6rmnmGKUA87Zmq-s=b3Scsnj02C0kObQjnbL2ajfPWGEw@mail.gmail.com Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net	2026-04-05 10:54:54 -04:00
Andrew Dunstan	76e514ebb4	Add pg_get_role_ddl() function Add a new SQL-callable function that returns the DDL statements needed to recreate a role. It takes a regrole argument and an optional VARIADIC text argument for options that are specified as alternating name/value pairs. The following options are supported: pretty (boolean) for formatted output and memberships (boolean) to include GRANT statements for role memberships and membership options. The return is one or multiple rows where the first row is a CREATE ROLE statement and subsequent rows are ALTER ROLE statements to set some role properties. Password information is never included in the output. The caller must have SELECT privilege on pg_authid. Author: Mario Gonzalez <gonzalemario@gmail.com> Author: Bryan Green <dbryan.green@gmail.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Co-authored-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Quan Zongliang <quanzongliang@yeah.net> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/4c5f895e-3281-48f8-b943-9228b7da6471@gmail.com Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net	2026-04-05 10:54:54 -04:00
Andrew Dunstan	4881981f92	Add infrastructure for pg_get_*_ddl functions Add parse_ddl_options(), append_ddl_option(), and append_guc_value() helper functions in a new ddlutils.c file that provide common option parsing and output formatting for the pg_get_*_ddl family of functions which will follow in later patches. These accept VARIADIC text arguments as alternating name/value pairs. Callers declare an array of DdlOption descriptors specifying the accepted option names and their types (boolean, text, or integer). parse_ddl_options() matches each supplied pair against the array, validates the value, and fills in the result fields. This descriptor-based scheme is based on an idea from Euler Taveira. This is placed in a new ddlutils.c file which will contain the pg_get_*_ddl functions. Author: Akshay Joshi <akshay.joshi@enterprisedb.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Co-authored-by: Euler Taveira <euler@eulerto.com> Discussion: https://postgr.es/m/CAKWEB6rmnmGKUA87Zmq-s=b3Scsnj02C0kObQjnbL2ajfPWGEw@mail.gmail.com Discussion: https://postgr.es/m/4c5f895e-3281-48f8-b943-9228b7da6471@gmail.com Discussion: https://postgr.es/m/CANxoLDc6FHBYJvcgOnZyS+jF0NUo3Lq_83-rttBuJgs9id_UDg@mail.gmail.com Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net	2026-04-05 10:54:54 -04:00
Álvaro Herrera	caec9d9fad	Allow index_create to suppress index_build progress reporting A future REPACK patch wants a way to suppress index_build doing its progress reports when building an index, because that would interfere with repack's own reporting; so add an INDEX_CREATE_SUPPRESS_PROGRESS bit that enables this. Furthermore, change the index_create_copy() API so that it takes flag bits for index_create() and passes them unchanged. This gives its callers more direct control, which eases the interface -- now its callers can pass the INDEX_CREATE_SUPPRESS_PROGRESS bit directly. We use it for the current caller in REINDEX CONCURRENTLY, since it's also not interested in progress reporting, since it doesn't want index_build() to be called at all in the first place. One thing to keep in mind, pointed out by Mihail, is that we're not suppressing the index-AM-specific progress report updates which happen during ambuild(). At present this is not a problem, because the values updated by those don't overlap with those used by commands other than CREATE INDEX; but maybe in the future we'll want the ability to suppress them also. (Alternatively we might want to display how each index-build-subcommand progresses during REPACK and others.) Author: Antonin Houska <ah@cybertec.at> Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Discussion: https://postgr.es/m/102906.1773668762@localhost	2026-04-05 13:34:08 +02:00
Etsuro Fujita	de28140ded	postgres_fdw: Inherit the local transaction's access/deferrable modes. READ ONLY transactions should prevent modifications to foreign data as well as local data, but postgres_fdw transactions declared as READ ONLY that reference foreign tables mapped to a remote view executing volatile functions would modify data on remote servers, as it would open remote transactions in READ WRITE mode. Similarly, DEFERRABLE transactions should not abort due to a serialization failure even when accessing foreign data, but postgres_fdw transactions declared as DEFERRABLE would abort due to that failure in a remote server, as it would open remote transactions in NOT DEFERRABLE mode. To fix, modify postgres_fdw to open remote transactions in the same access/deferrable modes as the local transaction. This commit also modifies it to open remote subtransactions in the same access mode as the local subtransaction. This commit changes the behavior of READ ONLY/DEFERRABLE transactions using postgres_fdw; in particular, it doesn't allow the READ ONLY transactions to modify data on remote servers anymore, so such transactions should be redeclared as READ WRITE or rewritten using other tools like dblink. The release notes should note this as an incompatibility. These issues exist since the introduction of postgres_fdw, but to avoid the incompatibility in the back branches, fix them in master only. Author: Etsuro Fujita <etsuro.fujita@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAPmGK16n_hcUUWuOdmeUS%2Bw4Q6dZvTEDHb%3DOP%3D5JBzo-M3QmpQ%40mail.gmail.com Discussion: https://postgr.es/m/E1uLe9X-000zsY-2g%40gemulon.postgresql.org	2026-04-05 18:55:00 +09:00
Thomas Munro	fc44f10665	aio: Simplify pgaio_worker_submit(). Merge pgaio_worker_submit_internal() and pgaio_worker_submit(). The separation didn't serve any purpose. Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com	2026-04-05 18:07:21 +12:00
Andres Freund	f63ca33790	read_stream: Only increase read-ahead distance when waiting for IO This avoids increasing the distance to the maximum in cases where the I/O subsystem is already keeping up. This turns out to be important for performance for two reasons: - Pinning a lot of buffers is not cheap. If additional pins allow us to avoid IO waits, it's definitely worth it, but if we can already do all the necessary readahead at a distance of 16, reading ahead 512 buffers can increase the CPU overhead substantially. This is particularly noticeable when the to-be-read blocks are already in the kernel page cache. - If the read stream is read to completion, reading in data earlier than needed is of limited consequences, leaving aside the CPU costs mentioned above. But if the read stream will not be fully consumed, e.g. because it is on the inner side of a nested loop join, the additional IO can be a serious performance issue. This is not that commonly a problem for current read stream users, but the upcoming work, to use a read stream to fetch table pages as part of an index scan, frequently encounters this. Note that this commit would have substantial performance downsides without earlier commits: - Commit `6e36930f9a`, which avoids decreasing the readahead distance when there was recent IO, is crucial, as otherwise we very often would end up not reading ahead aggressively enough anymore with this commit, due to increasing the distance less often. - "read stream: Split decision about look ahead for AIO and combining" is important as we would otherwise not perform IO combining when the IO subsystem can keep up. - "aio: io_uring: Trigger async processing for large IOs" is important to continue to benefit from memory copy parallelism when using fewer IOs. 
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Tested-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com	2026-04-05 00:43:54 -04:00
Andres Freund	8ca147d582	read stream: Split decision about look ahead for AIO and combining In a subsequent commit the read-ahead distance will only be increased when waiting for IO. Without further work that would cause a regression: As IO combining and read-ahead are currently controlled by the same mechanism, we would end up not allowing IO combining when never needing to wait for IO (as the distance ends up too small to allow for full sized IOs), which can increase CPU overhead. A typical reason to not have to wait for IO completion at a low look-ahead distance is use of io_uring with the to-be-read data in the page cache. But even with worker the IO submission rate may be low enough for the worker to keep up. One might think that we could just always perform IO combining, but doing so at the start of a scan can cause performance regressions: 1) Performing a large IO commonly has a higher latency than smaller IOs. That is not a problem once reading ahead far enough, but at the start of a stream it can lead to longer waits for IO completion. 2) Sometimes read streams will not be read to completion. Immediately starting with full sized IOs leads to more wasted effort. This is not commonly an issue with existing read stream users, but the upcoming use of read streams to fetch table pages as part of an index scan frequently encounters this. Solve this issue by splitting ReadStream->distance into ->combine_distance and ->readahead_distance. Right now they are increased/decreased at the same time, but that will change in the next commit. One of the comments in read_stream_should_look_ahead() refers to a motivation that only really exists as of the next commit, but without it the code doesn't make sense on its own. 
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com	2026-04-05 00:43:54 -04:00
Andres Freund	434dab76ba	read_stream: Move logic about IO combining & issuing to helpers The long if statements were hard to read and hard to document. Splitting them into inline helpers makes it much easier to explain each part separately. This is done in preparation for making the logic more complicated... Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu	2026-04-05 00:43:54 -04:00
Andres Freund	a9ee668817	aio: io_uring: Trigger async processing for large IOs io_method=io_uring has a heuristic to trigger asynchronous processing of IOs once the IO depth is a bit larger. That heuristic is important when doing buffered IO from the kernel page cache, to allow parallelizing of the memory copy, as otherwise io_method=io_uring would be a lot slower than io_method=worker in that case. An upcoming commit will make read_stream.c only increase the read-ahead distance if we needed to wait for IO to complete. If to-be-read data is in the kernel page cache, io_uring will synchronously execute IO, unless the IO is flagged as async. Therefore the aforementioned change in read_stream.c heuristic would lead to a substantial performance regression with io_uring when data is in the page cache, as we would never reach a deep enough queue to actually trigger the existing heuristic. Parallelizing the copy from the page cache is mainly important when doing a lot of IO, which commonly is only possible when doing largely sequential IO. The reason we don't just mark all io_uring IOs as asynchronous is that the dispatch to a kernel thread has overhead. This overhead is mostly noticeable with small random IOs with a low queue depth, as in that case the gain from parallelizing the memory copy is small and the latency cost high. The facts from the two prior paragraphs show a way out: Use the size of the IO in addition to the depth of the queue to trigger asynchronous processing. One might think that just using the IO size might be enough, but experimentation has shown that not to be the case - with deep look-ahead distances being able to parallelize the memory copy is important even with smaller IOs. 
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com	2026-04-05 00:43:54 -04:00
John Naylor	2849fe4c97	Fix unused function warning on Arm platforms Guard the definition of pg_pmull_available() on compile-time availability of PMULL. Oversight in `fbc57f2bc`. In passing, remove "inline" hint for consistency. Reported-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/f153d5a4-a9be-4211-b0b2-7e99b56d68d5@vondra.me	2026-04-05 08:49:47 +07:00
Álvaro Herrera	69c11f0545	Modernize struct declarations in snapbuild.h Just a cosmetic cleanup.	2026-04-05 00:21:53 +02:00
Álvaro Herrera	33bf7318f9	Make index_concurrently_create_copy more general Also rename it to index_create_copy. Add a 'boolean concurrent' option, and make it work for both cases: in concurrent mode, just create the catalog entries; caller is responsible for the actual building later. In non-concurrent mode, the index is built right away. This allows it to be reused for other purposes -- specifically, for concurrent REPACK. (With the CONCURRENTLY option, REPACK cannot simply swap the heap file and rebuild its indexes. Instead, it needs to build a separate set of indexes, including their system catalog entries, before the actual swap, to reduce the time AccessExclusiveLock needs to be held for. This approach is different from what CREATE INDEX CONCURRENTLY does.) Per a suggestion from Mihail Nikalayeu. Author: Antonin Houska <ah@cybertec.at> Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/41104.1754922120@localhost	2026-04-04 20:38:26 +02:00
Peter Geoghegan	2d3490dd99	heapam: Keep buffer pins across index scan resets. Avoid dropping the heap page pin (xs_cbuf) and visibility map pin (xs_vmbuffer) within heapam_index_fetch_reset. Retaining these pins saves cycles during certain nested loop joins and merge joins that frequently restore a saved mark: cases where the next tuple fetched after a reset often falls on the same heap page will now avoid the cost of repeated pinning and unpinning. Avoiding dropping the scan's heap page buffer pin is preparation for an upcoming patch that will add I/O prefetching to index scans. Testing of that patch (which makes heapam tend to pin more buffers concurrently than was typical before now) shows that the aforementioned cases get a small but clearly measurable benefit from this optimization. Upcoming work to add a slot-based table AM interface for index scans (which is further preparation for prefetching) will move VM checks for index-only scans out of the executor and into heapam. That will expand the role of xs_vmbuffer to include VM lookups for index-only scans (the field won't just be used for setting pages all-visible during on-access pruning via the enhancement recently introduced by commit `b46e1e54`). Avoiding dropping the xs_vmbuffer pin will preserve the historical behavior of nodeIndexonlyscan.c, which always kept this pin on a rescan; that aspect of this commit isn't really new. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com	2026-04-04 13:49:37 -04:00
Heikki Linnakangas	fda5300132	Remove unnecessary #include "spin.h" from shmem.h Commit `6b8238cb6a` removed the last usage of slock_t from the file. proc.c was relying on the indirect #include, so add it to proc.c directly.	2026-04-04 20:22:04 +03:00
Peter Geoghegan	c7d09595e4	heapam: Track heap block in IndexFetchHeapData. Add an explicit BlockNumber field (xs_blk) to IndexFetchHeapData that tracks which heap block is currently pinned in xs_cbuf. heapam_index_fetch_tuple now uses xs_blk to determine when buffer switching is needed, replacing the previous approach that compared buffer identities via ReleaseAndReadBuffer on every non-HOT-chain call. This is preparatory work for an upcoming commit that will add index prefetching using a read stream. Delegating the release of a currently pinned buffer to ReleaseAndReadBuffer won't work anymore -- at least not when the next buffer that the scan needs to pin is one returned by read_stream_next_buffer (not a buffer returned by ReadBuffer). Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com	2026-04-04 11:45:33 -04:00
Peter Geoghegan	a29fdd6c8d	Move heapam_handler.c index scan code to new file. Move the heapam index fetch callbacks (index_fetch_begin, index_fetch_reset, index_fetch_end, and index_fetch_tuple) into a new dedicated file. Also move heap_hot_search_buffer over. This is a purely mechanical move with no functional impact. Upcoming work to add a slot-based table AM interface for index scans will substantially expand this code. Keeping it in heapam_handler.c would clutter a file whose primary role is to wire up the TableAmRoutine callbacks. Bitmap heap scans and sequential scans would benefit from similar separation in the future. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh	2026-04-04 11:30:41 -04:00
Peter Geoghegan	1adff1a0c5	Rename heapam_index_fetch_tuple argument for clarity. Rename heapam_index_fetch_tuple's call_again argument to heap_continue, for consistency with the pointed-to variable name (IndexScanDescData's xs_heap_continue field). Preparation for an upcoming commit that will move index scan related heapam functions into their own file. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh	2026-04-04 11:30:05 -04:00
John Naylor	519acd1be5	Fix indentation Per buildfarm member koel	2026-04-04 21:50:54 +07:00
John Naylor	fbc57f2bc2	Compute CRC32C on ARM using the Crypto Extension where available In similar vein to commit `3c6e8c123`, the ARMv8 cryptography extension has 64x64 -> 128-bit carryless multiplication instructions suitable for computing CRC. This was tested to be around twice as fast as scalar CRC instructions for longer inputs. We now do a runtime check, even for builds that target "armv8-a+crc", but those builds can still use a direct call for constant inputs, which we assume are short. As for x86, the MIT-licensed implementation was generated with the "generate" program from https://github.com/corsix/fast-crc32/ Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/CANWCAZaKhE+RD5KKouUFoxx1EbUNrNhcduM1VQ=DkSDadNEFng@mail.gmail.com	2026-04-04 20:47:01 +07:00
John Naylor	5e13b0f240	Use AVX2 for calculating page checksums where available We already rely on autovectorization for computing page checksums, but on x86 we can get a further several-fold performance increase by annotating pg_checksum_block() with a function target attribute for the AVX2 instruction set extension. Not only does that use 256-bit registers, it can also use vector multiplication rather than the vector shifts and adds used in SSE2. Similar to other hardware-specific paths, we set a function pointer on first use. We don't bother to avoid this on platforms without AVX2 since the overhead of indirect calls doesn't matter for multi-kilobyte inputs. However, we do arrange so that only core has the function pointer mechanism. External programs will continue to build a normal static function and don't need to be aware of this. This matters most when using io_uring since in that case the checksum computation is not done in parallel by IO workers. Co-authored-by: Matthew Sterrett <matthewsterrett2@gmail.com> Co-authored-by: Andrew Kim <andrew.kim@intel.com> Reviewed-by: Oleg Tselebrovskiy <o.tselebrovskiy@postgrespro.ru> Tested-by: Ants Aasma <ants.aasma@cybertec.at> Tested-by: Stepan Neretin <slpmcf@gmail.com> (earlier version) Discussion: https://postgr.es/m/CA+vA85_5GTu+HHniSbvvP+8k3=xZO=WE84NPwiKyxztqvpfZ3Q@mail.gmail.com Discussion: https://postgr.es/m/20250911054220.3784-1-root%40ip-172-31-36-228.ec2.internal	2026-04-04 18:07:15 +07:00
Heikki Linnakangas	c06443063f	Add missing shmem size estimate for fast-path locking struct It's been missing ever since fast-path locking was introduced. It's a small discrepancy, about 4 kB, but let's be tidy. This doesn't seem worth backpatching, however; in stable branches we were less precise about the estimates and e.g. added a 10% margin to the hash table estimates, which is usually much bigger than this discrepancy.	2026-04-04 11:46:11 +03:00
Thomas Munro	bab656bb87	More tar portability adjustments. For the three implementations that have caused problems so far: * GNU and BSD (libarchive) tar both understand --format=ustar * ustar doesn't support large UID/GID values, so set them to 0 to avoid a hard error from at least GNU tar * OpenBSD tar needs -F ustar, and it appears to warn but carry on with "nobody" if a UID is too large * -f /dev/null is a more portable way to throw away the output, since the default destination might be a tape device depending on build options that a distribution might change * Windows ships BSD tar but lacks /dev/null, so ask perl for its name Based on their manuals, the other two implementations the tests are likely to encounter in the wild don't seem to need any special handling: * Solaris/illumos tar uses ustar and replaces large UIDs with 60001 * AIX tar uses ustar (unless --format=pax) and truncates large UIDs Backpatch-through: 18 Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Sami Imseih <samimseih@gmail.com> (large UIDs) Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (earlier version) Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> (OpenBSD) Reviewed-by: Andrew Dunstan <andrew@dunslane.net> (Windows) Discussion: https://postgr.es/m/3676229.1775170250%40sss.pgh.pa.us Discussion: https://postgr.es/m/CAA5RZ0tt89MgNi4-0F4onH%2B-TFSsysFjMM-tBc6aXbuQv5xBXw%40mail.gmail.com	2026-04-04 13:54:21 +13:00
Heikki Linnakangas	4953a25b7f	Remove HASH_DIRSIZE, always use the default algorithm to select it It's not very useful to specify a non-standard directory size. The HASH_DIRSIZE option was only used for shared memory hash tables, and those always used hash_select_dirsize() to choose the size, which in turn just uses the default algorithm anyway. That assumption was ingrained in hash_estimate_size(), too. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi	2026-04-04 02:40:28 +03:00
Heikki Linnakangas	9fe9ecd516	Allocate all parts of shmem hash table from a single contiguous area Previously, the shared header (HASHHDR) and the directory were allocated by the caller, and passed to hash_create(), while the actual elements were allocated separately with ShmemAlloc(). After this commit, all the memory needed by the header, the directory, and all the elements is allocated using a single ShmemInitStruct() call, and the different parts are carved out of that allocation. This way the ShmemIndex entries (and thus pg_shmem_allocations) reflect the size of the whole hash table, rather than just the directories. Commit `f5930f9a98` attempted this earlier, but it had to be reverted. The new strategy is to let dynahash.c perform all the allocations with the alloc function, but have the alloc function carve out the parts from the one larger allocation. The shared header and the directory are now also allocated with alloc calls, instead of passing the area for those directly from the caller. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi	2026-04-04 02:40:25 +03:00
Heikki Linnakangas	999e9ebb51	Prevent shared memory hash tables from growing beyond initial size Set HASH_FIXED_SIZE on all shared memory hash tables, to prevent them from growing after the initial allocation. It was always weirdly nondeterministic that if one hash table used up all the unused shared memory, you could not use that space for other things anymore until restart. We just got rid of that behavior for the LOCK and PROCLOCK tables, but it's similarly weird for all other hash tables. Increase SHMEM_INDEX_SIZE because we were already above the max size on that one, and it's now a hard limit. Some callers of ShmemInitHash() still pass HASH_FIXED_SIZE, but that's now unnecessary. They should perhaps now be removed, but it doesn't do any harm either to pass it. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi	2026-04-04 02:40:24 +03:00
Heikki Linnakangas	9ebe1c4f2c	Merge init and max size options on shmem hash tables Replace the separate init and max size options with a single size option. We didn't make much use of the feature, all callers except the ones in wait_event.c already used the same size for both, and the hash tables in wait_event.c are small so there's little harm in just allocating them to the max size. The only reason why you might want to not reserve the max size upfront is to make the memory available for other hash tables to grow beyond their max size. Letting hash tables grow much beyond their max size is bad for performance, however, because we cannot resize the directory, and we never had very much "wiggle room" to grow to anyway so you couldn't really rely on it. We recently marked the LOCK and PROCLOCK tables with HASH_FIXED_SIZE, so there's nothing left in core that would benefit from more unallocated shared memory. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi	2026-04-04 02:40:20 +03:00
Jacob Champion	d438a36591	oauth: Let validators provide failure DETAILs At the moment, the only way for a validator module to report error details on failure is to log them separately before returning from validate_cb. Independently of that problem, the ereport() calls that we make during validation failure partially duplicate some of the work of auth_failed(). The end result is overly verbose and confusing for readers of the logs: [768233] LOG: [my_validator] bad signature in bearer token [768233] LOG: OAuth bearer authentication failed for user "jacob" [768233] DETAIL: Validator failed to authorize the provided token. [768233] FATAL: OAuth bearer authentication failed for user "jacob" [768233] DETAIL: Connection matched file ".../pg_hba.conf" line ... Solve both problems by making use of the existing logdetail pointer that's provided by ClientAuthentication. Validator modules may set ValidatorModuleResult->error_detail to override our default generic message. The end result looks something like [242284] FATAL: OAuth bearer authentication failed for user "jacob" [242284] DETAIL: [my_validator] bad signature in bearer token Connection matched file ".../pg_hba.conf" line ... Reported-by: Álvaro Herrera <alvherre@kurilemu.de> Reported-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/202601241015.y5uvxd7oxnfs%40alvherre.pgsql	2026-04-03 16:05:33 -07:00
Daniel Gustafsson	0036232ba8	Make data checksum tests more resilient for slow machines The test for re-running checksum enabling was only checking for the data checksum state to transition to 'on', but didn't account for the launcher process having had time to exit, thus getting an error instead of the expected no-op. Adding a pg_stat_activity check for the launcher exiting resolves the error, verified by inducing delay in the launcher. Also wrap a variable only used in injection point tests within the correct USE macros to avoid warning for an unused variable. All per the buildfarm. Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Buildfarm Discussion: https://postgr.es/m/1CB288C9-564B-4664-B096-C2F4377D17AB@yesql.se	2026-04-04 00:25:07 +02:00
Nathan Bossart	01876ace13	Add elevel parameter to relation_needs_vacanalyze(). This will be used in a follow-up commit to avoid emitting debug logs from this function. Author: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com	2026-04-03 17:04:28 -05:00
Nathan Bossart	53b8ca6881	Teach relation_needs_vacanalyze() to always compute scores. Presently, this function only computes component scores when the corresponding threshold is reached. A follow-up commit will add a view that shows tables' autovacuum scores, and we anticipate that users will want to use this view to discover tables that are nearing autovacuum eligibility. This commit teaches this function to always compute autovacuum scores, even when a threshold has not been reached or autovacuum is disabled. The restructuring in this commit revealed an interesting edge case. If the table needs vacuuming for wraparound prevention and autovacuum is disabled for it, we might still choose to analyze it. It's not clear if this is intentional, but it has been this way for nearly 20 years, so it seems best to avoid changing it without further discussion. Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com	2026-04-03 16:44:41 -05:00
Daniel Gustafsson	f19c0eccae	Online enabling and disabling of data checksums This allows data checksums to be enabled, or disabled, in a running cluster without restricting access to the cluster during processing. Data checksums could prior to this only be enabled during initdb or when the cluster is offline using the pg_checksums app. This commit introduces functionality to enable, or disable, data checksums while the cluster is running regardless of how it was initialized. A background worker launcher process is responsible for launching a dynamic per-database background worker which will mark all buffers dirty for all relations with storage in order for them to have data checksums calculated on write. Once all relations in all databases have been processed, the data_checksums state will be set to on and the cluster will at that point be identical to one which had data checksums enabled during initialization or via offline processing. When data checksums are being enabled, concurrent I/O operations from backends other than the data checksums worker will write the checksums but not verify them on reading. Only when all backends have absorbed the procsignalbarrier for setting data_checksums to on will they also start verifying checksums on reading. The same process is repeated during disabling; all backends write checksums but do not verify them until the barrier for setting the state to off has been absorbed by all. This in-progress state is used to ensure there are no false negatives (or positives) due to reading a checksum which is not in sync with the page. A new test module, test_checksums, is introduced with an extensive set of tests covering both online and offline data checksum mode changes. The tests which run concurrent pgbench during online processing are gated behind the PG_TEST_EXTRA flag due to being very expensive to run. Two levels of PG_TEST_EXTRA flags exist to turn on a subset of the expensive tests, or the full suite of multiple runs. 
This work is based on an earlier version of this patch which was reviewed by among others Heikki Linnakangas, Robert Haas, Andres Freund, Tomas Vondra, Michael Banck and Andrey Borodin. During the work on this new version, Tomas Vondra has given invaluable assistance with not only coding and reviewing but very in-depth testing. Author: Daniel Gustafsson <daniel@yesql.se> Author: Magnus Hagander <magnus@hagander.net> Co-authored-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/CABUevExz9hUUOLnJVr2kpw9Cx=o4MCr1SVKwbupzuxP7ckNutA@mail.gmail.com Discussion: https://postgr.es/m/20181030051643.elbxjww5jjgnjaxg@alap3.anarazel.de Discussion: https://postgr.es/m/CABUevEwE3urLtwxxqdgd5O2oQz9J717ZzMbh+ziCSa5YLLU_BA@mail.gmail.com	2026-04-03 22:58:51 +02:00
Nathan Bossart	8261ee24fe	Refactor relation_needs_vacanalyze(). This commit adds an early return to this function, allowing us to remove a level of indentation on a decent chunk of code. This is preparatory work for follow-up commits that will add a new system view to show tables' autovacuum scores. Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com	2026-04-03 14:03:12 -05:00
Heikki Linnakangas	79534f9065	Change default of max_locks_per_transactions to 128 The previous commits reduced the amount of memory available for locks by eliminating the "safety margins" and by settling the split between LOCK and PROCLOCK tables at startup. The allocation is now more deterministic, but it also means that you often hit one of the limits sooner than before. To compensate for that, bump up max_locks_per_transactions from 64 to 128. With that there is a little more space in both hash tables than what was the effective maximum size for either table before the previous commits. This only changes the default, so if you had changed max_locks_per_transactions in postgresql.conf, you will still have fewer locks available than before for the same setting value. This should be noted in the release notes. A good rule of thumb is that if you double max_locks_per_transactions, you should be able to get as many locks as before. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi	2026-04-03 20:27:46 +03:00
Heikki Linnakangas	e1ad034809	Make the lock hash tables fixed-sized This prevents the LOCK table from "stealing" space that was originally calculated for the PROCLOCK table, and vice versa. That was weirdly nondeterministic so that if you e.g. took a lot of locks consuming all the available shared memory for the LOCK table, subsequent transactions that needed more space for the PROCLOCK table would fail, but if you restarted the system then the space would be available for PROCLOCK again. Better to be strict and predictable, even though that means that in many cases you can acquire far fewer locks than before. This also prevents the lock hash tables from using up the general-purpose 100 kB reserve we set aside for "stuff that's too small to bother estimating" in CalculateShmemSize(). We are pretty good at accounting for everything nowadays, so we could probably make that reservation smaller, but I'll leave that for another commit. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi	2026-04-03 20:27:16 +03:00
Heikki Linnakangas	3e854d2ff1	Remove 10% safety margin from lock manager hash table estimates As the comment says, the hash table sizes are just estimates, but that doesn't mean we need a "safety margin" here. hash_estimate_size() estimates the needed size in bytes pretty accurately for the given number of elements, so if we wanted room for more elements in the table, we should just use a larger max_table_size in the hash_estimate_size() call. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi	2026-04-03 20:26:18 +03:00
Heikki Linnakangas	feb03dfecd	Remove bogus "safety margin" from predicate.c shmem estimates The 10% safety margin was copy-pasted from lock.c when the predicate locking code was originally added. However, we later (commit `7c797e7194`) added the HASH_FIXED_SIZE flag to the hash tables, which means that they cannot actually use the safety margin that we're calculating for them. The extra memory was mainly used by the main lock manager, which is the only shmem hash table of non-trivial size that does not use the HASH_FIXED_SIZE flag. If we wanted to have more space for the lock manager, we should reserve it directly in lock.c. After this commit, the lock manager will just have less memory available than before. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi	2026-04-03 20:25:57 +03:00
Amit Langote	b7b27eb41a	Optimize fast-path FK checks with batched index probes Instead of probing the PK index on each trigger invocation, buffer FK rows in a new per-constraint cache entry (RI_FastPathEntry) and flush them as a batch. On each trigger invocation, the new ri_FastPathBatchAdd() buffers the FK row in RI_FastPathEntry. When the buffer fills (64 rows) or the trigger-firing cycle ends, the new ri_FastPathBatchFlush() probes the index for all buffered rows, sharing a single CommandCounterIncrement, snapshot, permission check, and security context switch across the batch, rather than repeating each per row as the SPI path does. Per-flush CCI is safe because all AFTER triggers for the buffered rows have already fired by flush time. For single-column foreign keys, the new ri_FastPathFlushArray() builds an ArrayType from the buffered FK values (casting to the PK-side type if needed) and constructs a scan key with the SK_SEARCHARRAY flag. The index AM sorts and deduplicates the array internally, then walks matching leaf pages in one ordered traversal instead of descending from the root once per row. A matched[] bitmap tracks which batch items were satisfied; the first unmatched item is reported as a violation. Multi-column foreign keys fall back to per-row probing via the new ri_FastPathFlushLoop(). The fast path introduced in the previous commit (`2da86c1ef9`) yields ~1.8x speedup. This commit adds ~1.6x on top of that, for a combined ~2.9x speedup over the unpatched code (int PK / int FK, 1M rows, PK table and index cached in memory). FK tuples are materialized via ExecCopySlotHeapTuple() into a new purpose-specific memory context (flush_cxt), child of TopTransactionContext, which is also used for per-flush transient work: cast results, the search array, and index scan allocations. It is reset after each flush and deleted in teardown. 
The PK relation, index, tuple slots, and fast-path metadata are cached in RI_FastPathEntry across trigger invocations within a trigger-firing batch, avoiding repeated open/close overhead. The snapshot and IndexScanDesc are taken fresh per flush. The entry is not subject to cache invalidation: cached relations are held with locks for the transaction duration, and the entry's lifetime is bounded by the trigger-firing cycle. Lifecycle management for RI_FastPathEntry relies on three new mechanisms: - AfterTriggerBatchCallback: A new general-purpose callback mechanism in trigger.c. Callbacks registered via RegisterAfterTriggerBatchCallback() fire at the end of each trigger-firing batch (AfterTriggerEndQuery for immediate constraints, AfterTriggerFireDeferred at COMMIT, and AfterTriggerSetState for SET CONSTRAINTS IMMEDIATE). The RI code registers ri_FastPathEndBatch as a batch callback. - Batch callbacks only fire at the outermost query level (checked inside FireAfterTriggerBatchCallbacks), so nested queries from SPI inside other AFTER triggers do not tear down the cache mid-batch. - XactCallback: ri_FastPathXactCallback NULLs the static cache pointer at transaction end, handling the abort path where the batch callback never fired. - SubXactCallback: ri_FastPathSubXactCallback NULLs the static cache pointer on subtransaction abort, preventing the batch callback from accessing already-released resources. - AfterTriggerBatchIsActive(): A new exported accessor that returns true when afterTriggers.query_depth >= 0. During ALTER TABLE ... ADD FOREIGN KEY validation, RI triggers are called directly outside the after-trigger framework, so batch callbacks would never fire. The fast-path code uses this to fall back to the non-cached per-invocation path in that context. ri_FastPathEndBatch() flushes any partial batch before tearing down cached resources. Since the FK relation may already be closed by flush time (e.g. 
for deferred constraints at COMMIT), it reopens the relation using entry->fk_relid if needed. The existing ALTER TABLE validation path bypasses batching and continues to call ri_FastPathCheck() directly per row, because RI triggers are called outside the after-trigger framework there and batch callbacks would never fire to flush the buffer. Suggested-by: David Rowley <dgrowleyml@gmail.com> Author: Amit Langote <amitlangote09@gmail.com> Co-authored-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Haibo Yan <tristan.yim@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Tested-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA+HiwqF4C0ws3cO+z5cLkPuvwnAwkSp7sfvgGj3yQ=Li6KNMqA@mail.gmail.com	2026-04-03 14:33:53 +09:00
Thomas Munro	be21341e13	jit: No backport::SectionMemoryManager for LLVM 22. LLVM 22 has the fix that we copied into our tree in commit `9044fc1d` and a new function to reach it[1][2], so we only need to use our copy for Aarch64 + LLVM < 22. The only change to the final version that our copy didn't get is a new LLVM_ABI macro, but that isn't appropriate for us. Our copy is hopefully now frozen and would only need maintenance if bugs are found in the upstream code. Non-Aarch64 systems now also use the new API with LLVM 22. It allocates all sections with one contiguous mmap() instead of one per section. We could have done that earlier, but commit `9044fc1d` wanted to limit the blast radius to the affected systems. We might as well benefit from that small improvement everywhere now that it is available out of the box. We can't delete our copy until LLVM 22 is our minimum supported version, or we switch to the newer JITLink API for at least Aarch64. [1] https://github.com/llvm/llvm-project/pull/71968 [2] https://github.com/llvm/llvm-project/pull/174307 Backpatch-through: 14 Discussion: https://postgr.es/m/CA%2BhUKGJTumad75o8Zao-LFseEbt%3DenbUFCM7LZVV%3Dc8yg2i7dg%40mail.gmail.com	2026-04-03 14:55:11 +13:00
Tom Lane	ebba64c08d	Further harden tests that might use not-so-compatible tar versions. Buildfarm testing shows that OpenSUSE (and perhaps related platforms?) configures GNU tar in such a way that it'll archive sparse WAL files by default, thus triggering the pax-extension detection code added by `bc30c704a`. Thus, we need something similar to `852de579a` but for GNU tar's option set. "--format=ustar" seems to do the trick. Moreover, the buildfarm shows that pg_verifybackup's 003_corruption.pl test script is also triggering creation of pax-format tar files on that platform. We had not noticed because those test cases all fail (intentionally) before getting to the point of trying to verify WAL data. Since that means two TAP scripts need this option-selection logic, and plausibly more will do so in future, factor it out into a subroutine in Test::Utils. We also need to back-patch the 003_corruption.pl fix into v18, where it's also failing. While at it, clean up some places where guards for $tar being empty or undefined were incomplete or even outright backwards. Presumably, we missed noticing because the set of machines that run TAP tests and don't have tar installed is empty. But if we're going to try to handle that scenario, we should do it correctly. Reported-by: Tomas Vondra <tomas@vondra.me> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/02770bea-b3f3-4015-8a43-443ae345379c@vondra.me Backpatch-through: 18	2026-04-02 17:21:27 -04:00
Andrew Dunstan	bd4f879a9c	Add additional jsonpath string methods Add the following jsonpath methods: * l/r/btrim() * lower(), upper() * initcap() * replace() * split_part() Each simply dispatches to the standard string processing functions. These depend on the locale, but since it's set at `initdb`, they can be considered immutable and therefore allowed in any jsonpath expression. Author: Florents Tselai <florents.tselai@gmail.com> Co-authored-by: David E. Wheeler <david@justatheory.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Discussion: https://postgr.es/m/CA+v5N40sJF39m0v7h=QN86zGp0CUf9F1WKasnZy9nNVj_VhCZQ@mail.gmail.com	2026-04-02 15:19:49 -04:00
Andrew Dunstan	a35c9d524e	Rename jsonpath method arg tokens This is just cleanup in the jsonpath grammar. Rename the `csv_` tokens to `int_`, because they represent signed or unsigned integers, as follows: * `csv_elem` => `int_elem` * `csv_list` => `int_list` * `opt_csv_list` => `opt_int_list` Rename the `datetime_precision` tokens to `uint_arg`, as they represent unsigned integers and will be useful for other methods in the future, as follows: * `datetime_precision` => `uint_elem` * `opt_datetime_precision` => `opt_uint_arg` Rename the `datetime_template` tokens to `str_arg`, as they represent strings and will be useful for other methods in the future, as follows: * `datetime_template` => `str_elem` * `opt_datetime_template` => `opt_str_arg` Author: David E. Wheeler <david@justatheory.com> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Discussion: https://postgr.es/m/CA+v5N40sJF39m0v7h=QN86zGp0CUf9F1WKasnZy9nNVj_VhCZQ@mail.gmail.com	2026-04-02 15:19:49 -04:00
Masahiko Sawada	fd7a25af11	Add target_relid parameter to pg_get_publication_tables(). When a tablesync worker checks whether a specific table is published, it previously issued a query to the publisher calling pg_get_publication_tables() and filtering the result by relid via a WHERE clause. Because the function itself was fully evaluated before the filter was applied, this forced the publisher to enumerate all tables in the publication. For publications covering a large number of tables, this resulted in expensive catalog scans and unnecessary CPU overhead on the publisher. This commit adds a new overloaded form of pg_get_publication_tables() that accepts an array of publication names and a target table OID. Instead of enumerating all published tables, it evaluates membership for the specified relation via syscache lookups, using the new is_table_publishable_in_publication() helper. This helper correctly accounts for publish_via_partition_root, ALL TABLES with EXCEPT clauses, schema publications, and partition inheritance, while avoiding the overhead of building the complete published table list. The existing VARIADIC array form of pg_get_publication_tables() is preserved for backward compatibility. Tablesync workers use the new two-argument form when connected to a publisher running PostgreSQL 19 or later. Bump catalog version. Reported-by: Marcos Pegoraro <marcos@f10.com.br> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Haoyan Wang <wanghaoyan20@163.com> Discussion: https://postgr.es/m/CAB-JLwbBFNuASyEnZWP0Tck9uNkthBZqi6WoXNevUT6+mV8XmA@mail.gmail.com	2026-04-02 11:34:50 -07:00
Tom Lane	bc30c704ad	Harden astreamer tar parsing logic against archives it can't handle. Previously, there was essentially no verification in this code that the input is a tar file at all, let alone that it fits into the subset of valid tar files that we can handle. This was exposed by the discovery that we couldn't handle files that FreeBSD's tar makes, because it's fairly aggressive about converting sparse WAL files into sparse tar entries. To fix: * Bail out if we find a pax extension header. This covers the sparse-file case, and also protects us against scenarios where the pax header changes other file properties that we care about. (Eventually we may extend the logic to actually handle such headers, but that won't happen in time for v19.) * Be more wary about tar file type codes in general: do not assume that anything that's neither a directory nor a symlink must be a regular file. Instead, we just ignore entries that are none of the three supported types. * Apply pg_dump's isValidTarHeader to verify that a purported header block is actually in tar format. To make this possible, move isValidTarHeader into src/port/tar.c, which is probably where it should have been since that file was created. I also took the opportunity to const-ify the arguments of isValidTarHeader and tarChecksum, and to use symbols not hard-wired constants inside tarChecksum. Back-patch to v18 but not further. Although this code exists inside pg_basebackup in older branches, it's not really exposed in that usage to tar files that weren't generated by our own code, so it doesn't seem worth back-porting these changes across `3c9056981` and `f80b09bac`. I did choose to include a back-patch of `5868372bb` into v18 though, to minimize cosmetic differences between these two branches. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/3049460.1775067940@sss.pgh.pa.us Backpatch-through: 18	2026-04-02 12:20:36 -04:00
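
The three checks listed in the last entry above (`bc30c704ad`) can be illustrated outside the PostgreSQL tree. The following is a minimal Python sketch, not the code from src/port/tar.c: the function names `tar_checksum`, `is_valid_tar_header`, and `entry_kind` are hypothetical, and the real isValidTarHeader may test more than the checksum. It relies only on the standard ustar layout: a 512-byte header block, an octal checksum field at bytes 148-155 that is counted as eight spaces when summing, and a one-byte type code at offset 156, where `x` and `g` mark pax extension headers.

```python
import tarfile

BLOCK_SIZE = 512        # size of one tar header block
CHKSUM_OFFSET = 148     # start of the checksum field in a ustar header
CHKSUM_LEN = 8          # width of the checksum field
TYPEFLAG_OFFSET = 156   # one-byte entry type code


def tar_checksum(header: bytes) -> int:
    """Sum every header byte, counting the checksum field as eight spaces."""
    if len(header) != BLOCK_SIZE:
        raise ValueError("a tar header block must be exactly 512 bytes")
    before = sum(header[:CHKSUM_OFFSET])
    after = sum(header[CHKSUM_OFFSET + CHKSUM_LEN:])
    return before + after + CHKSUM_LEN * ord(" ")


def is_valid_tar_header(header: bytes) -> bool:
    """Return True if the stored octal checksum matches the computed one."""
    if len(header) != BLOCK_SIZE:
        return False
    field = header[CHKSUM_OFFSET:CHKSUM_OFFSET + CHKSUM_LEN]
    digits = field.strip(b"\0 ")  # the field is padded with NULs and spaces
    if not digits or any(c not in b"01234567" for c in digits):
        return False
    return int(digits, 8) == tar_checksum(header)


def entry_kind(header: bytes) -> str:
    """Classify an entry: pax headers are rejected, unknown types ignored."""
    flag = header[TYPEFLAG_OFFSET:TYPEFLAG_OFFSET + 1]
    if flag in (b"x", b"g"):
        return "pax"        # extension header: a strict reader bails out here
    if flag in (b"0", b"\0"):
        return "file"
    if flag == b"5":
        return "directory"
    if flag == b"2":
        return "symlink"
    return "ignored"        # neither file, directory, nor symlink


if __name__ == "__main__":
    # Build a ustar header with the standard library and check it.
    header = tarfile.TarInfo("base/1/1259").tobuf(format=tarfile.USTAR_FORMAT)
    print(is_valid_tar_header(header), entry_kind(header))
```

Returning "ignored" for unrecognized type codes, rather than treating them as regular files, mirrors the more cautious handling the commit message describes.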

47855 commits