Commit graph

63889 commits

Author SHA1 Message Date
John Naylor
fbc57f2bc2 Compute CRC32C on ARM using the Crypto Extension where available
In similar vein to commit 3c6e8c123, the ARMv8 cryptography extension
has 64x64 -> 128-bit carryless multiplication instructions suitable
for computing CRC. This was tested to be around twice as fast as
scalar CRC instructions for longer inputs.

We now do a runtime check, even for builds that target "armv8-a+crc",
but those builds can still use a direct call for constant inputs,
which we assume are short.

As for x86, the MIT-licensed implementation was generated with the
"generate" program from

https://github.com/corsix/fast-crc32/

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CANWCAZaKhE+RD5KKouUFoxx1EbUNrNhcduM1VQ=DkSDadNEFng@mail.gmail.com
2026-04-04 20:47:01 +07:00
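Note on fbc57f2bc2: the two ingredients named in the message can be sketched in a few lines. The following is a minimal illustration in Python, not the generated ARM code: a bit-at-a-time CRC-32C routine of the kind a faster path could be compared against, and a software version of the 64x64 -> 128-bit carryless multiply. The function names are hypothetical.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def crc32c_reference(data: bytes, crc: int = 0) -> int:
    """Bit-at-a-time CRC-32C; slow, but easy to compare against."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def clmul64(a: int, b: int) -> int:
    """64x64 -> 128-bit carryless (polynomial over GF(2)) multiply."""
    a &= (1 << 64) - 1
    b &= (1 << 64) - 1
    result = 0
    while b:
        if b & 1:
            result ^= a  # XOR instead of add, so no carries propagate
        a <<= 1
        b >>= 1
    return result
```

Passing a previous result as the `crc` argument continues the computation over a further chunk of input, so the reference can be fed data piecewise.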
John Naylor
5e13b0f240 Use AVX2 for calculating page checksums where available
We already rely on autovectorization for computing page checksums,
but on x86 we can get a further several-fold performance increase by
annotating pg_checksum_block() with a function target attribute for
the AVX2 instruction set extension. Not only does that use 256-bit
registers, it can also use vector multiplication rather than the
vector shifts and adds used in SSE2.

Similar to other hardware-specific paths, we set a function pointer
on first use. We don't bother to avoid this on platforms without AVX2
since the overhead of indirect calls doesn't matter for multi-kilobyte
inputs. However, we do arrange so that only core has the function
pointer mechanism. External programs will continue to build a normal
static function and don't need to be aware of this.

This matters most when using io_uring since in that case the checksum
computation is not done in parallel by IO workers.

Co-authored-by: Matthew Sterrett <matthewsterrett2@gmail.com>
Co-authored-by: Andrew Kim <andrew.kim@intel.com>
Reviewed-by: Oleg Tselebrovskiy <o.tselebrovskiy@postgrespro.ru>
Tested-by: Ants Aasma <ants.aasma@cybertec.at>
Tested-by: Stepan Neretin <slpmcf@gmail.com> (earlier version)
Discussion: https://postgr.es/m/CA+vA85_5GTu+HHniSbvvP+8k3=xZO=WE84NPwiKyxztqvpfZ3Q@mail.gmail.com
Discussion: https://postgr.es/m/20250911054220.3784-1-root%40ip-172-31-36-228.ec2.internal
2026-04-04 18:07:15 +07:00
Heikki Linnakangas
c06443063f Add missing shmem size estimate for fast-path locking struct
It's been missing ever since fast-path locking was introduced. It's a
small discrepancy, about 4 kB, but let's be tidy. This doesn't seem
worth backpatching, however; in stable branches we were less precise
about the estimates and e.g. added a 10% margin to the hash table
estimates, which is usually much bigger than this discrepancy.
2026-04-04 11:46:11 +03:00
Thomas Munro
bab656bb87 More tar portability adjustments.
For the three implementations that have caused problems so far:

* GNU and BSD (libarchive) tar both understand --format=ustar
* ustar doesn't support large UID/GID values, so set them to 0 to
  avoid a hard error from at least GNU tar
* OpenBSD tar needs -F ustar, and it appears to warn but carry
  on with "nobody" if a UID is too large
* -f /dev/null is a more portable way to throw away the output, since
  the default destination might be a tape device depending on build
  options that a distribution might change
* Windows ships BSD tar but lacks /dev/null, so ask perl for its name

Based on their manuals, the other two implementations the tests are
likely to encounter in the wild don't seem to need any special handling:

* Solaris/illumos tar uses ustar and replaces large UIDs with 60001
* AIX tar uses ustar (unless --format=pax) and truncates large UIDs

Backpatch-through: 18
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Co-authored-by: Sami Imseih <samimseih@gmail.com> (large UIDs)
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (earlier version)
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> (OpenBSD)
Reviewed-by: Andrew Dunstan <andrew@dunslane.net> (Windows)
Discussion: https://postgr.es/m/3676229.1775170250%40sss.pgh.pa.us
Discussion: https://postgr.es/m/CAA5RZ0tt89MgNi4-0F4onH%2B-TFSsysFjMM-tBc6aXbuQv5xBXw%40mail.gmail.com
2026-04-04 13:54:21 +13:00
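Note on bab656bb87: the option selection described in the list can be sketched roughly as follows. This is a Python illustration, not the Perl test code; the `flavor` labels and the function name are hypothetical, detecting the flavor is out of scope, and the large UID/GID handling is deliberately left out because the message does not spell out the flags used for it.

```python
import os
from typing import List


def tar_create_prefix(flavor: str) -> List[str]:
    """Return tar arguments that request ustar output and discard the archive.

    `flavor` is a hypothetical label ("gnu", "bsd" or "openbsd").
    """
    if flavor in ("gnu", "bsd"):
        format_args = ["--format=ustar"]
    elif flavor == "openbsd":
        format_args = ["-F", "ustar"]
    else:
        raise ValueError(f"unknown tar flavor: {flavor}")
    # os.devnull is "/dev/null" on POSIX systems and "nul" on Windows,
    # which plays the role of asking the scripting language for the name.
    return ["tar", "-c", *format_args, "-f", os.devnull]
```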
Heikki Linnakangas
4953a25b7f Remove HASH_DIRSIZE, always use the default algorithm to select it
It's not very useful to specify a non-standard directory size. The
HASH_DIRSIZE option was only used for shared memory hash tables, and
those always used hash_select_dirsize() to choose the size, which in
turn just uses the default algorithm anyway. That assumption was
ingrained in hash_estimate_size(), too.

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-04-04 02:40:28 +03:00
Heikki Linnakangas
9fe9ecd516 Allocate all parts of shmem hash table from a single contiguous area
Previously, the shared header (HASHHDR) and the directory were
allocated by the caller, and passed to hash_create(), while the actual
elements were allocated separately with ShmemAlloc(). After this
commit, all the memory needed by the header, the directory, and all
the elements is allocated using a single ShmemInitStruct() call, and
the different parts are carved out of that allocation. This way the
ShmemIndex entries (and thus pg_shmem_allocations) reflect the size of
the whole hash table, rather than just the directories.

Commit f5930f9a98 attempted this earlier, but it had to be reverted.
The new strategy is to let dynahash.c perform all the allocations with
the alloc function, but have the alloc function carve out the parts
from the one larger allocation. The shared header and the directory
are now also allocated with alloc calls, instead of passing the area
for those directly from the caller.

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-04-04 02:40:25 +03:00
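Note on 9fe9ecd516: the idea of carving the header, directory, and elements out of one larger allocation can be sketched as a simple bump allocator. This is a Python illustration with hypothetical names and sizes; it does not reproduce the dynahash.c layout.

```python
def align_up(n: int, alignment: int = 8) -> int:
    """Round n up to the next multiple of alignment."""
    return -(-n // alignment) * alignment


class Region:
    """One fixed-size allocation from which smaller parts are carved."""

    def __init__(self, size: int):
        self.size = size
        self.used = 0

    def alloc(self, nbytes: int) -> int:
        """Return the offset of a new part, or raise if the region is full."""
        start = align_up(self.used)
        if start + nbytes > self.size:
            raise MemoryError("region exhausted")
        self.used = start + nbytes
        return start


def carve_hash_table(header_size, directory_size, element_size, nelements):
    """Size one region for all parts, then carve each part out of it."""
    total = (align_up(header_size) + align_up(directory_size)
             + element_size * nelements)
    region = Region(total)
    layout = {
        "header": region.alloc(header_size),
        "directory": region.alloc(directory_size),
        "elements": region.alloc(element_size * nelements),
    }
    return region, layout
```

Because a single region backs every part, the size recorded for that one allocation covers the whole table, which is the property the message wants pg_shmem_allocations to reflect.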
Heikki Linnakangas
999e9ebb51 Prevent shared memory hash tables from growing beyond initial size
Set HASH_FIXED_SIZE on all shared memory hash tables, to prevent them
from growing after the initial allocation. It was always weirdly
nondeterministic that if one hash table used up all the unused shared
memory, you could not use that space for other things anymore until
restart. We just got rid of that behavior for the LOCK and PROCLOCK
tables, but it's similarly weird for all other hash tables.

Increase SHMEM_INDEX_SIZE because we were already above the max size
on that one, and it's now a hard limit.

Some callers of ShmemInitHash() still pass HASH_FIXED_SIZE, but that's
now unnecessary. Those flags could perhaps be removed now, but passing
them does no harm either.

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-04-04 02:40:24 +03:00
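Note on 999e9ebb51: the behavioral change, a hard upper limit instead of opportunistic growth, can be sketched with a toy table. This is a Python illustration with a hypothetical class name, not the dynahash API.

```python
class FixedSizeTable:
    """Toy hash table that never grows past its initial capacity."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._entries = {}

    def insert(self, key, value) -> bool:
        """Insert or update; return False when a new key does not fit."""
        if key not in self._entries and len(self._entries) >= self.max_entries:
            return False  # the caller decides how to report the failure
        self._entries[key] = value
        return True

    def __len__(self) -> int:
        return len(self._entries)
```

Updating an existing key still succeeds when the table is full; only new keys are refused, so one table can no longer consume space that another table was sized to use.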
Heikki Linnakangas
9ebe1c4f2c Merge init and max size options on shmem hash tables
Replace the separate init and max size options with a single size
option. We didn't make much use of the feature: all callers except the
ones in wait_event.c already used the same size for both, and the hash
tables in wait_event.c are small, so there's little harm in just
allocating them to the max size.

The only reason why you might want to not reserve the max size upfront
is to make the memory available for other hash tables to grow beyond
their max size. Letting hash tables grow much beyond their max size is
bad for performance, however, because we cannot resize the directory,
and we never had very much "wiggle room" to grow to anyway so you
couldn't really rely on it. We recently marked the LOCK and PROCLOCK
tables with HASH_FIXED_SIZE, so there's nothing left in core that would
benefit from more unallocated shared memory.

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-04-04 02:40:20 +03:00
Jacob Champion
d438a36591 oauth: Let validators provide failure DETAILs
At the moment, the only way for a validator module to report error
details on failure is to log them separately before returning from
validate_cb. Independently of that problem, the ereport() calls that we
make during validation failure partially duplicate some of the work of
auth_failed().

The end result is overly verbose and confusing for readers of the logs:

    [768233] LOG:  [my_validator] bad signature in bearer token
    [768233] LOG:  OAuth bearer authentication failed for user "jacob"
    [768233] DETAIL:  Validator failed to authorize the provided token.
    [768233] FATAL:  OAuth bearer authentication failed for user "jacob"
    [768233] DETAIL:  Connection matched file ".../pg_hba.conf" line ...

Solve both problems by making use of the existing logdetail pointer
that's provided by ClientAuthentication. Validator modules may set
ValidatorModuleResult->error_detail to override our default generic
message.

The end result looks something like

    [242284] FATAL:  OAuth bearer authentication failed for user "jacob"
    [242284] DETAIL:  [my_validator] bad signature in bearer token
        Connection matched file ".../pg_hba.conf" line ...

Reported-by: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/202601241015.y5uvxd7oxnfs%40alvherre.pgsql
2026-04-03 16:05:33 -07:00
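Note on d438a36591: the override rule for the DETAIL text can be sketched as below. This is a Python illustration only; the real interface is the C struct ValidatorModuleResult, and the class and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Generic text taken from the "before" log excerpt in the commit message.
DEFAULT_DETAIL = "Validator failed to authorize the provided token."


@dataclass
class ValidatorResult:
    authorized: bool
    error_detail: Optional[str] = None  # optionally set by the validator


def failure_detail(result: ValidatorResult) -> Optional[str]:
    """Return the DETAIL text for a failed validation, or None on success."""
    if result.authorized:
        return None
    # A validator-supplied message replaces the generic one.
    return result.error_detail or DEFAULT_DETAIL
```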
Daniel Gustafsson
0036232ba8 Make data checksum tests more resilient for slow machines
The test for re-running checksum enabling was only checking for the
data checksum state to transition to 'on', but didn't account for
the launcher process having had time to exit, thus getting an error
instead of the expected no-op.  Adding a pg_stat_activity check for
the launcher exiting resolves the error, verified by inducing delay
in the launcher.

Also wrap a variable only used in injection point tests within the
correct USE macros to avoid warning for an unused variable.

All per the buildfarm.

Author: Daniel Gustafsson <daniel@yesql.se>
Reported-by: Buildfarm
Discussion: https://postgr.es/m/1CB288C9-564B-4664-B096-C2F4377D17AB@yesql.se
2026-04-04 00:25:07 +02:00
Nathan Bossart
01876ace13 Add elevel parameter to relation_needs_vacanalyze().
This will be used in a follow-up commit to avoid emitting debug
logs from this function.

Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
2026-04-03 17:04:28 -05:00
Nathan Bossart
53b8ca6881 Teach relation_needs_vacanalyze() to always compute scores.
Presently, this function only computes component scores when the
corresponding threshold is reached.  A follow-up commit will add a
view that shows tables' autovacuum scores, and we anticipate that
users will want to use this view to discover tables that are
nearing autovacuum eligibility.  This commit teaches this function
to always compute autovacuum scores, even when a threshold has not
been reached or autovacuum is disabled.

The restructuring in this commit revealed an interesting edge case.
If the table needs vacuuming for wraparound prevention and
autovacuum is disabled for it, we might still choose to analyze it.
It's not clear if this is intentional, but it has been this way for
nearly 20 years, so it seems best to avoid changing it without
further discussion.

Author: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
2026-04-03 16:44:41 -05:00
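Note on 53b8ca6881: the difference between "compute a score only once the threshold is reached" and "always compute it" can be sketched as follows. This is a Python illustration; the threshold formula follows the documented autovacuum rule (base threshold plus scale factor times row count), while defining the score as the ratio of dead tuples to that threshold is an assumption made for this sketch, and the function name is hypothetical.

```python
def vacuum_score(dead_tuples: int, reltuples: int,
                 base_threshold: int = 50, scale_factor: float = 0.2,
                 autovacuum_enabled: bool = True):
    """Return (score, needs_vacuum) for one table.

    The score is computed unconditionally so that a table approaching
    eligibility is visible even before the threshold is reached, or when
    autovacuum is disabled for it.
    """
    threshold = base_threshold + scale_factor * reltuples
    # Assumed definition: how far along the table is toward its threshold.
    score = dead_tuples / threshold if threshold > 0 else 0.0
    needs_vacuum = autovacuum_enabled and dead_tuples > threshold
    return score, needs_vacuum
```

A score below 1.0 with `needs_vacuum` false is exactly the "nearing eligibility" case the planned view is meant to expose.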
Daniel Gustafsson
f19c0eccae Online enabling and disabling of data checksums
This allows data checksums to be enabled, or disabled, in a running
cluster without restricting access to the cluster during processing.

Prior to this, data checksums could only be enabled during initdb or
when the cluster is offline using the pg_checksums app. This commit
introduces functionality to enable, or disable, data checksums while
the cluster is running regardless of how it was initialized.

A background worker launcher process is responsible for launching a
dynamic per-database background worker which will mark all buffers
dirty for all relations with storage in order for them to have data
checksums calculated on write.  Once all relations in all databases
have been processed, the data_checksums state will be set to on and
the cluster will at that point be identical to one which had data
checksums enabled during initialization or via offline processing.

When data checksums are being enabled, concurrent I/O operations
from backends other than the data checksums worker will write the
checksums but not verify them on reading.  Only when all backends
have absorbed the procsignalbarrier for setting data_checksums to
on will they also start verifying checksums on reading.  The same
process is repeated during disabling; all backends write checksums
but do not verify them until the barrier for setting the state to
off has been absorbed by all.  This in-progress state is used to
ensure there are no false negatives (or positives) due to reading
a checksum which is not in sync with the page.

A new test module, test_checksums, is introduced with an extensive
set of tests covering both online and offline data checksum mode
changes.  The tests which run concurrent pgbench during online
processing are gated behind the PG_TEST_EXTRA flag due to being
very expensive to run.  Two levels of PG_TEST_EXTRA flags exist
to turn on a subset of the expensive tests, or the full suite of
multiple runs.

This work is based on an earlier version of this patch which was
reviewed by among others Heikki Linnakangas, Robert Haas, Andres
Freund, Tomas Vondra, Michael Banck and Andrey Borodin.  During
the work on this new version, Tomas Vondra has given invaluable
assistance with not only coding and reviewing but very in-depth
testing.

Author: Daniel Gustafsson <daniel@yesql.se>
Author: Magnus Hagander <magnus@hagander.net>
Co-authored-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CABUevExz9hUUOLnJVr2kpw9Cx=o4MCr1SVKwbupzuxP7ckNutA@mail.gmail.com
Discussion: https://postgr.es/m/20181030051643.elbxjww5jjgnjaxg@alap3.anarazel.de
Discussion: https://postgr.es/m/CABUevEwE3urLtwxxqdgd5O2oQz9J717ZzMbh+ziCSa5YLLU_BA@mail.gmail.com
2026-04-03 22:58:51 +02:00
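Note on f19c0eccae: the write/verify rules for the in-progress states can be sketched as two predicates. This is a Python illustration; the state names are hypothetical labels for the four situations the message describes, not necessarily the identifiers used in the code.

```python
from enum import Enum


class ChecksumState(Enum):
    OFF = "off"
    INPROGRESS_ON = "inprogress-on"    # enabling: not yet on everywhere
    ON = "on"
    INPROGRESS_OFF = "inprogress-off"  # disabling: not yet off everywhere


def must_write_checksum(state: ChecksumState) -> bool:
    """Checksums are written in every state except fully off."""
    return state is not ChecksumState.OFF


def must_verify_checksum(state: ChecksumState) -> bool:
    """Checksums are verified on read only once the state is fully on."""
    return state is ChecksumState.ON
```

Writing without verifying in both transitional states is what avoids spurious verification failures while backends have not all absorbed the barrier yet.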
Nathan Bossart
8261ee24fe Refactor relation_needs_vacanalyze().
This commit adds an early return to this function, allowing us to
remove a level of indentation on a decent chunk of code.  This is
preparatory work for follow-up commits that will add a new system
view to show tables' autovacuum scores.

Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
2026-04-03 14:03:12 -05:00
Heikki Linnakangas
79534f9065 Change default of max_locks_per_transaction to 128
The previous commits reduced the amount of memory available for locks
by eliminating the "safety margins" and by settling the split between
LOCK and PROCLOCK tables at startup. The allocation is now more
deterministic, but it also means that you often hit one of the limits
sooner than before. To compensate for that, bump up
max_locks_per_transaction from 64 to 128. With that there is a little
more space in both hash tables than what was the effective maximum
size for either table before the previous commits.

This only changes the default, so if you had changed
max_locks_per_transaction in postgresql.conf, you will still have
fewer locks available than before for the same setting value. This
should be noted in the release notes. A good rule of thumb is that if
you double max_locks_per_transaction, you should be able to get as
many locks as before.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
2026-04-03 20:27:46 +03:00
Heikki Linnakangas
e1ad034809 Make the lock hash tables fixed-sized
This prevents the LOCK table from "stealing" space that was originally
calculated for the PROCLOCK table, and vice versa. That was weirdly
nondeterministic: if you e.g. took a lot of locks, consuming all
the available shared memory for the LOCK table, subsequent
transactions that needed more space for the PROCLOCK table would
fail, but if you restarted the system then the space would be
available for PROCLOCK again. Better to be strict and predictable,
even though that means that in many cases you can acquire far fewer
locks than before.

This also prevents the lock hash tables from using up the
general-purpose 100 kB reserve we set aside for "stuff that's too
small to bother estimating" in CalculateShmemSize(). We are pretty
good at accounting for everything nowadays, so we could probably make
that reservation smaller, but I'll leave that for another commit.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
2026-04-03 20:27:16 +03:00
Heikki Linnakangas
3e854d2ff1 Remove 10% safety margin from lock manager hash table estimates
As the comment says, the hash table sizes are just estimates, but that
doesn't mean we need a "safety margin" here. hash_estimate_size()
estimates the needed size in bytes pretty accurately for the given
number of elements, so if we wanted room for more elements in the
table, we should just use larger max_table_size in the
hash_estimate_size() call.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
2026-04-03 20:26:18 +03:00
Heikki Linnakangas
feb03dfecd Remove bogus "safety margin" from predicate.c shmem estimates
The 10% safety margin was copy-pasted from lock.c when the predicate
locking code was originally added. However, we later (commit
7c797e7194) added the HASH_FIXED_SIZE flag to the hash tables, which
means that they cannot actually use the safety margin that we're
calculating for them.

The extra memory was mainly used by the main lock manager, which is
the only shmem hash table of non-trivial size that does not use the
HASH_FIXED_SIZE flag. If we wanted to have more space for the lock
manager, we should reserve it directly in lock.c. After this commit,
the lock manager will just have less memory available than before.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
2026-04-03 20:25:57 +03:00
Amit Langote
b7b27eb41a Optimize fast-path FK checks with batched index probes
Instead of probing the PK index on each trigger invocation, buffer
FK rows in a new per-constraint cache entry (RI_FastPathEntry) and
flush them as a batch.

On each trigger invocation, the new ri_FastPathBatchAdd() buffers
the FK row in RI_FastPathEntry.  When the buffer fills (64 rows)
or the trigger-firing cycle ends, the new ri_FastPathBatchFlush()
probes the index for all buffered rows, sharing a single
CommandCounterIncrement, snapshot, permission check, and security
context switch across the batch, rather than repeating each per row
as the SPI path does.  Per-flush CCI is safe because all AFTER
triggers for the buffered rows have already fired by flush time.

For single-column foreign keys, the new ri_FastPathFlushArray()
builds an ArrayType from the buffered FK values (casting to the
PK-side type if needed) and constructs a scan key with the
SK_SEARCHARRAY flag.  The index AM sorts and deduplicates the array
internally, then walks matching leaf pages in one ordered traversal
instead of descending from the root once per row.  A matched[] bitmap
tracks which batch items were satisfied; the first unmatched item is
reported as a violation.  Multi-column foreign keys fall back to
per-row probing via the new ri_FastPathFlushLoop().

The fast path introduced in the previous commit (2da86c1ef9) yields
~1.8x speedup.  This commit adds ~1.6x on top of that, for a combined
~2.9x speedup over the unpatched code (int PK / int FK, 1M rows, PK
table and index cached in memory).

FK tuples are materialized via ExecCopySlotHeapTuple() into a new
purpose-specific memory context (flush_cxt), child of
TopTransactionContext, which is also used for per-flush transient
work: cast results, the search array, and index scan allocations.
It is reset after each flush and deleted in teardown.

The PK relation, index, tuple slots, and fast-path metadata are
cached in RI_FastPathEntry across trigger invocations within a
trigger-firing batch, avoiding repeated open/close overhead.  The
snapshot and IndexScanDesc are taken fresh per flush.  The entry is
not subject to cache invalidation: cached relations are held with
locks for the transaction duration, and the entry's lifetime is
bounded by the trigger-firing cycle.

Lifecycle management for RI_FastPathEntry relies on the following
mechanisms:

  - AfterTriggerBatchCallback: A new general-purpose callback
    mechanism in trigger.c.  Callbacks registered via
    RegisterAfterTriggerBatchCallback() fire at the end of each
    trigger-firing batch (AfterTriggerEndQuery for immediate
    constraints, AfterTriggerFireDeferred at COMMIT, and
    AfterTriggerSetState for SET CONSTRAINTS IMMEDIATE).  The RI
    code registers ri_FastPathEndBatch as a batch callback.

  - Batch callbacks only fire at the outermost query level
    (checked inside FireAfterTriggerBatchCallbacks), so nested
    queries from SPI inside other AFTER triggers do not tear down
    the cache mid-batch.

  - XactCallback: ri_FastPathXactCallback NULLs the static cache
    pointer at transaction end, handling the abort path where the
    batch callback never fired.

  - SubXactCallback: ri_FastPathSubXactCallback NULLs the static
    cache pointer on subtransaction abort, preventing the batch
    callback from accessing already-released resources.

  - AfterTriggerBatchIsActive(): A new exported accessor that
    returns true when afterTriggers.query_depth >= 0.  During
    ALTER TABLE ... ADD FOREIGN KEY validation, RI triggers are
    called directly outside the after-trigger framework, so batch
    callbacks would never fire.  The fast-path code uses this to
    fall back to the non-cached per-invocation path in that
    context.

ri_FastPathEndBatch() flushes any partial batch before tearing
down cached resources.  Since the FK relation may already be
closed by flush time (e.g. for deferred constraints at COMMIT),
it reopens the relation using entry->fk_relid if needed.

The existing ALTER TABLE validation path bypasses batching and
continues to call ri_FastPathCheck() directly per row, because
RI triggers are called outside the after-trigger framework there
and batch callbacks would never fire to flush the buffer.

Suggested-by: David Rowley <dgrowleyml@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Co-authored-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA+HiwqF4C0ws3cO+z5cLkPuvwnAwkSp7sfvgGj3yQ=Li6KNMqA@mail.gmail.com
2026-04-03 14:33:53 +09:00
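Note on b7b27eb41a: the buffer-then-flush flow for a single-column key can be sketched as below. This is a Python illustration in which a set stands in for the PK index; the class name and the exception type are hypothetical, and none of the snapshot, permission, or memory-context handling is modeled.

```python
BATCH_SIZE = 64  # buffer capacity stated in the commit message


class FKBatch:
    """Buffer FK values and check them against the PK side in batches."""

    def __init__(self, pk_index):
        self.pk_index = pk_index  # stand-in for the PK index: a set of keys
        self.pending = []

    def add(self, fk_value):
        """Buffer one row; flush automatically when the buffer fills."""
        self.pending.append(fk_value)
        if len(self.pending) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        """Probe once for the whole batch and report the first violation."""
        batch, self.pending = self.pending, []
        if not batch:
            return
        # Sort and deduplicate so each distinct key is probed once, in order.
        probe_keys = sorted(set(batch))
        found = {key for key in probe_keys if key in self.pk_index}
        # matched[i] records whether batch item i was satisfied.
        matched = [value in found for value in batch]
        for value, ok in zip(batch, matched):
            if not ok:
                raise ValueError(f"foreign key violation: {value!r}")
```

The caller is expected to invoke `flush()` once more at the end of the trigger-firing cycle so that a partially filled buffer is still checked.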
Thomas Munro
be21341e13 jit: No backport::SectionMemoryManager for LLVM 22.
LLVM 22 has the fix that we copied into our tree in commit 9044fc1d and
a new function to reach it[1][2], so we only need to use our copy for
Aarch64 + LLVM < 22.  The only change to the final version that our copy
didn't get is a new LLVM_ABI macro, but that isn't appropriate for us.
Our copy is hopefully now frozen and would only need maintenance if bugs
are found in the upstream code.

Non-Aarch64 systems now also use the new API with LLVM 22.  It allocates
all sections with one contiguous mmap() instead of one per
section.  We could have done that earlier, but commit 9044fc1d wanted to
limit the blast radius to the affected systems.  We might as well
benefit from that small improvement everywhere now that it is available
out of the box.

We can't delete our copy until LLVM 22 is our minimum supported version,
or we switch to the newer JITLink API for at least Aarch64.

[1] https://github.com/llvm/llvm-project/pull/71968
[2] https://github.com/llvm/llvm-project/pull/174307

Backpatch-through: 14
Discussion: https://postgr.es/m/CA%2BhUKGJTumad75o8Zao-LFseEbt%3DenbUFCM7LZVV%3Dc8yg2i7dg%40mail.gmail.com
2026-04-03 14:55:11 +13:00
Tom Lane
ebba64c08d Further harden tests that might use not-so-compatible tar versions.
Buildfarm testing shows that OpenSUSE (and perhaps related platforms?)
configures GNU tar in such a way that it'll archive sparse WAL files
by default, thus triggering the pax-extension detection code added by
bc30c704a.  Thus, we need something similar to 852de579a but for
GNU tar's option set.  "--format=ustar" seems to do the trick.

Moreover, the buildfarm shows that pg_verifybackup's 003_corruption.pl
test script is also triggering creation of pax-format tar files on
that platform.  We had not noticed because those test cases all fail
(intentionally) before getting to the point of trying to verify WAL
data.

Since that means two TAP scripts need this option-selection logic, and
plausibly more will do so in future, factor it out into a subroutine
in Test::Utils.  We also need to back-patch the 003_corruption.pl fix
into v18, where it's also failing.

While at it, clean up some places where guards for $tar being empty
or undefined were incomplete or even outright backwards.  Presumably,
we missed noticing because the set of machines that run TAP tests
and don't have tar installed is empty.  But if we're going to try
to handle that scenario, we should do it correctly.

Reported-by: Tomas Vondra <tomas@vondra.me>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/02770bea-b3f3-4015-8a43-443ae345379c@vondra.me
Backpatch-through: 18
2026-04-02 17:21:27 -04:00
Andrew Dunstan
bd4f879a9c Add additional jsonpath string methods
Add the following jsonpath methods:

*   l/r/btrim()
*   lower(), upper()
*   initcap()
*   replace()
*   split_part()

Each simply dispatches to the standard string processing functions.
These depend on the locale, but since it's set at `initdb`, they can be
considered immutable and therefore allowed in any jsonpath expression.

Author: Florents Tselai <florents.tselai@gmail.com>
Co-authored-by: David E. Wheeler <david@justatheory.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CA+v5N40sJF39m0v7h=QN86zGp0CUf9F1WKasnZy9nNVj_VhCZQ@mail.gmail.com
2026-04-02 15:19:49 -04:00
Andrew Dunstan
a35c9d524e Rename jsonpath method arg tokens
This is just cleanup in the jsonpath grammar.

Rename the `csv_` tokens to `int_`, because they represent signed or
unsigned integers, as follows:

*   `csv_elem` => `int_elem`
*   `csv_list` => `int_list`
*   `opt_csv_list` => `opt_int_list`

Rename the `datetime_precision` tokens to `uint_arg`, as they represent
unsigned integers and will be useful for other methods in the future, as
follows:

*   `datetime_precision` => `uint_elem`
*   `opt_datetime_precision` => `opt_uint_arg`

Rename the `datetime_template` tokens to `str_arg`, as they represent
strings and will be useful for other methods in the future, as follows:

*   `datetime_template` => `str_elem`
*   `opt_datetime_template` => `opt_str_arg`

Author: David E. Wheeler <david@justatheory.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CA+v5N40sJF39m0v7h=QN86zGp0CUf9F1WKasnZy9nNVj_VhCZQ@mail.gmail.com
2026-04-02 15:19:49 -04:00
Masahiko Sawada
fd7a25af11 Add target_relid parameter to pg_get_publication_tables().
When a tablesync worker checks whether a specific table is published,
it previously issued a query to the publisher calling
pg_get_publication_tables() and filtering the result by relid via a
WHERE clause. Because the function itself was fully evaluated before
the filter was applied, this forced the publisher to enumerate all
tables in the publication. For publications covering a large number of
tables, this resulted in expensive catalog scans and unnecessary CPU
overhead on the publisher.

This commit adds a new overloaded form of pg_get_publication_tables()
that accepts an array of publication names and a target table
OID. Instead of enumerating all published tables, it evaluates
membership for the specified relation via syscache lookups, using the
new is_table_publishable_in_publication() helper. This helper
correctly accounts for publish_via_partition_root, ALL TABLES with
EXCEPT clauses, schema publications, and partition inheritance, while
avoiding the overhead of building the complete published table list.

The existing VARIADIC array form of pg_get_publication_tables() is
preserved for backward compatibility. Tablesync workers use the new
two-argument form when connected to a publisher running PostgreSQL 19
or later.

Bump catalog version.

Reported-by: Marcos Pegoraro <marcos@f10.com.br>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Haoyan Wang <wanghaoyan20@163.com>
Discussion: https://postgr.es/m/CAB-JLwbBFNuASyEnZWP0Tck9uNkthBZqi6WoXNevUT6+mV8XmA@mail.gmail.com
2026-04-02 11:34:50 -07:00
Tom Lane
bc30c704ad Harden astreamer tar parsing logic against archives it can't handle.
Previously, there was essentially no verification in this code that
the input is a tar file at all, let alone that it fits into the
subset of valid tar files that we can handle.  This was exposed by
the discovery that we couldn't handle files that FreeBSD's tar
makes, because it's fairly aggressive about converting sparse WAL
files into sparse tar entries.  To fix:

* Bail out if we find a pax extension header.  This covers the
sparse-file case, and also protects us against scenarios where
the pax header changes other file properties that we care about.
(Eventually we may extend the logic to actually handle such
headers, but that won't happen in time for v19.)

* Be more wary about tar file type codes in general: do not assume
that anything that's neither a directory nor a symlink must be a
regular file.  Instead, we just ignore entries that are none of the
three supported types.

* Apply pg_dump's isValidTarHeader to verify that a purported
header block is actually in tar format.  To make this possible,
move isValidTarHeader into src/port/tar.c, which is probably where
it should have been since that file was created.

I also took the opportunity to const-ify the arguments of
isValidTarHeader and tarChecksum, and to use symbols not hard-wired
constants inside tarChecksum.

Back-patch to v18 but not further.  Although this code exists inside
pg_basebackup in older branches, it's not really exposed in that
usage to tar files that weren't generated by our own code, so it
doesn't seem worth back-porting these changes across 3c9056981
and f80b09bac.  I did choose to include a back-patch of 5868372bb
into v18 though, to minimize cosmetic differences between these
two branches.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/3049460.1775067940@sss.pgh.pa.us
Backpatch-through: 18
2026-04-02 12:20:36 -04:00
Fujii Masao
5770679918 Remove redundant SetLatch() calls in interrupt handling functions
Interrupt handling functions (e.g., HandleCatchupInterrupt(),
HandleParallelApplyMessageInterrupt()) are called only by
procsignal_sigusr1_handler(), which already calls SetLatch()
for the current process at the end of its processing.
Therefore, these interrupt handling functions do not need to
call SetLatch() themselves.

However, previously, some of these functions redundantly
called SetLatch(). This commit removes those unnecessary
calls.

While duplicate SetLatch() calls are redundant, they are
harmless, so this change is not backpatched.

Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/CALj2ACWd5apddj6Cd885WwJ6LquYu_G81C4GoR4xSoDV1x-FEA@mail.gmail.com
2026-04-02 23:55:30 +09:00
John Naylor
effaa464af Check for __cpuidex and __get_cpuid_count separately
Previously we would only check for the availability of __cpuidex if
the related __get_cpuid_count was not available on a platform.

Future commits will need to access hypervisor information about
the TSC frequency of x86 CPUs. For that case __cpuidex is the only
viable option for accessing a high leaf (e.g. 0x40000000), since
__get_cpuid_count does not allow that.

__cpuidex is defined in cpuid.h for gcc/clang, but in intrin.h
for MSVC, so adjust tests to suit. We also need to cast the array
of unsigned ints to signed, since gcc (with -Wall) and clang emit
warnings otherwise.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: John Naylor <john.naylor@postgresql.org>
Discussion: https://postgr.es/m/CAP53PkyooCeR8YV0BUD_xC7oTZESHz8OdA=tP7pBRHFVQ9xtKg@mail.gmail.com
2026-04-02 19:39:57 +07:00
Andrew Dunstan
bb6ae9707c Use command_ok for pg_regress calls in 002_pg_upgrade and 027_stream_regress
Now that command_ok() captures and displays failure output, use it
instead of system() plus manual diff-dumping in these two tests.  This
simplifies both scripts and produces consistent, truncated output on
failure.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://postgr.es/m/DFYFWM053WHS.10K8ZPJ605UFK@jeltef.nl
2026-04-02 08:13:44 -04:00
Andrew Dunstan
b8da9869b8 perl tap: Use croak instead of die in our helper modules
Replace die with croak throughout Cluster.pm and Utils.pm (except in
INIT blocks and signal handlers, where die is correct) so that error
messages report the test script's line number rather than the helper
module's.

Add @CARP_NOT in Utils.pm listing PostgreSQL::Test::Cluster, so that
when a Utils function is called through a Cluster.pm wrapper, croak
skips both packages and reports the actual test-script caller.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/DFYFWM053WHS.10K8ZPJ605UFK@jeltef.nl
2026-04-02 08:13:44 -04:00
Andrew Dunstan
76540fdedf perl tap: Show die reason in TAP output
Install a $SIG{__DIE__} handler in the INIT block of Utils.pm that emits
the die message as a TAP diagnostic.  Previously, an unexpected die
(e.g. from safe_psql) produced only "no plan was declared" with no
indication of the actual error.  The handler also calls done_testing()
to suppress that confusing message.

Dies during compilation ($^S undefined) and inside eval ($^S == 1) are
left alone.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/DFYFWM053WHS.10K8ZPJ605UFK@jeltef.nl
Discussion: https://postgr.es/m/20220222181924.eehi7o4pmneeb4hm%40alap3.anarazel.de
2026-04-02 08:13:44 -04:00
Andrew Dunstan
1402b8d2fc perl tap: Show failed command output
Capture stdout and stderr from command_ok() and command_fails() and emit
them as TAP diagnostics on failure.  Output is truncated to the first
and last 30 lines per channel to avoid flooding.

A new helper _diag_command_output() is introduced in Utils.pm so
both functions share the same truncation and formatting logic.
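
As an illustration of the truncation rule, here is a minimal Python sketch. The real helper is Perl code in Utils.pm; the function names, the wording of the "omitted" marker and the channel headings below are assumptions made for the example.

```python
def truncate_output(text, keep=30):
    """Keep the first and last `keep` lines, marking what was dropped."""
    lines = text.splitlines()
    if len(lines) <= 2 * keep:
        return lines                      # short enough: show everything
    omitted = len(lines) - 2 * keep
    marker = f"... {omitted} lines omitted ..."
    return lines[:keep] + [marker] + lines[-keep:]


def diag_command_output(stdout, stderr, keep=30):
    """Format both channels as TAP diagnostic lines (prefixed with '# ')."""
    out = []
    for name, text in (("stdout", stdout), ("stderr", stderr)):
        if text:                          # skip channels with no output
            out.append(f"# ---- {name} ----")
            out.extend("# " + line for line in truncate_output(text, keep))
    return out
```

Truncating each channel separately means a noisy stdout cannot push a short but important stderr message out of the report.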

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/DFYFWM053WHS.10K8ZPJ605UFK@jeltef.nl
2026-04-02 08:13:44 -04:00
Andrew Dunstan
5720ae0143 pg_regress: Include diffs in TAP output
When pg_regress fails it is often tedious to find the actual diffs,
especially in CI where you must navigate a file browser.  Emit the first
80 lines of the combined regression.diffs as TAP diagnostics so the
failure reason is visible directly in the test output.

The line limit is across all failing tests in a single pg_regress run to
avoid flooding when a crash causes every subsequent test to fail.

New DIAG_DETAIL / DIAG_END tap output types are added, mirroring the
existing NOTE_DETAIL / NOTE_END pair, so that long diff lines can be
emitted without spurious '#' prefixes on continuation lines.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/DFYFWM053WHS.10K8ZPJ605UFK@jeltef.nl
2026-04-02 08:13:44 -04:00
Tomas Vondra
7f8c88c2b8 jit: Change the default to off.
While JIT can speed up large analytical queries, it can also cause
serious performance issues on otherwise very fast queries. Compiling
and optimizing the expressions may be so expensive that it completely
outweighs the JIT benefits for shorter queries.

Ideally, we'd address this in the cost model, but the part deciding
whether to enable JIT for a query is rather simple, partially because we
don't have any reliable estimates of how expensive the LLVM compilation
and optimization is.

Sometimes seemingly unrelated changes (for example a couple additional
INSERTs into a table) increase the cost just enough to enable JIT,
resulting in a performance cliff.

Because of these risks, most large-scale deployments already disable JIT
by default. Notably, this includes all hyperscalers.

This commit changes our default to align with that established practice.
If we improve the JIT (be it better costing or cheaper execution), we
can consider enabling it by default again.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://postgr.es/m/DG1VZJEX1AQH.2EH4OKGRUDB71@jeltef.nl
2026-04-02 13:40:29 +02:00
Heikki Linnakangas
148fe2b05d Test pg_stat_statements across crash restart
Add 'pg_stat_statements' to the crash restart test, to test that
shared memory and LWLock initialization works across crash restart in
a library listed in shared_preload_libraries. We had no test coverage
for that.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-02 13:33:06 +03:00
Amit Kapila
4441d6b2e4 Doc: Fix oversight in commit 55cefadde8.
The description of pg_publication_rel.prrelid refers to sequences, whereas the catalog stores information only about tables.

Author: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAHut+Pv1UKR_bxmN7wcCCpQveHoYprvH-hbdFq8gsaH1Ye7B_w@mail.gmail.com
2026-04-02 10:16:53 +05:30
Thomas Munro
de6b80e5ff jit: Stop emitting lifetime.end for LLVM 22.
The lifetime.end intrinsic can now only be used for stack memory
allocated with alloca[1][2][3].  We use it to tell LLVM about the
lifetime of function arguments/isnull values that we keep in palloc'd
memory, so that it can avoid spilling registers to memory.

We might need to rearrange things and put them on the stack, but that'll
take some research.  In the meantime, unbreak the build on LLVM 22.

[1] https://github.com/llvm/llvm-project/pull/149310
[2] https://llvm.org/docs/LangRef.html#llvm-lifetime-end-intrinsic
[3] https://llvm.org/docs/LangRef.html#i-alloca

Backpatch-through: 14
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> (earlier attempt)
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> (earlier attempt)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier attempt)
Discussion: https://postgr.es/m/CA%2BhUKGJTumad75o8Zao-LFseEbt%3DenbUFCM7LZVV%3Dc8yg2i7dg%40mail.gmail.com
2026-04-02 15:52:48 +13:00
David Rowley
331d829e62 Fix nocachegetattr() so it again supports deforming cstrings
c456e3911 added various optimizations to the tuple deformation routines.
One optimization assumed that heap tuples would never contain cstrings.
That optimization also made its way into nocachegetattr(), which isn't
correct as ROW() types get formed into HeapTuples by ExecEvalRow() and
those can contain cstring Datums.  nocachegetattr() gets used to extract
Datums from those tuples.

Here we remove the pg_assume(), which was there to instruct the compiler
to omit the attlen == -2 related code in att_addlength_pointer().

Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/80aeac57-8f50-4732-a5b4-c2373c3f8149@gmail.com
2026-04-02 14:11:17 +13:00
Andres Freund
82c0cb4e67 pg_test_timing: Reduce per-loop overhead
The pg_test_timing program was previously using INSTR_TIME_GET_NANOSEC on an
absolute instr_time value in order to do a diff, which goes against the spirit
of how the GET_* macros are supposed to be used, and will cause overhead in a
future change that assumes these macros are typically used on intervals only.

Additionally the program was doing unnecessary work in the test loop by
measuring the time elapsed, instead of checking the existing current time
measurement against a target end time. To support that, introduce a new
INSTR_TIME_ADD_NANOSEC macro that allows adding user-defined nanoseconds
to an instr_time variable.
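
The loop shape described above can be sketched in Python as follows. This only models the idea (compute the end time once, then compare each reading against it); it does not use the instr_time macros, and the function name and the "largest delta" measurement are assumptions made for illustration.

```python
import time


def timed_loop(duration_ns):
    """Spin for duration_ns, comparing each clock reading to a fixed end time."""
    start = time.perf_counter_ns()
    end = start + duration_ns      # computed once, before the loop
    prev = start
    loops = 0
    largest_delta = 0
    while True:
        cur = time.perf_counter_ns()
        if cur >= end:             # no per-iteration "elapsed" subtraction
            break
        # the only subtraction left is between two consecutive readings
        largest_delta = max(largest_delta, cur - prev)
        prev = cur
        loops += 1
    return loops, largest_delta
```

In Python the saving is negligible; the point is only that the termination test reuses the reading the loop already has.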

While modifying the relevant code anyway, simplify it by not handling
durations <= 0 in test_timing(), since duration is unsigned and 0 is
disallowed by the caller.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53Pkyxv3-3gX+aOxC5tX0p2v9RHU+XH0iyvb64+ZnBXj92vg@mail.gmail.com
2026-04-01 20:07:38 -04:00
Andres Freund
6e36930f9a read_stream: Prevent distance from decaying too quickly
Until now we reduced the look-ahead distance by 1 on every hit, and doubled it
on every miss. That is problematic because there are very common IO patterns
where this prevents us from ever reaching a sufficiently high distance (e.g. a
miss followed by a hit will never have the distance grow beyond 2). In many
such cases, if we had ever reached a sufficient look-ahead distance, things
would have been fine, because we grow the distance faster than we decrease it.

One might think that the most obvious answer to this problem would be to never
reduce the distance. However, that would not work well, as (particularly with
upcoming users of read streams), it is reasonably common to at first have a
lot of misses and then to transition to a fully cached workload, e.g. because
the same blocks are needed repeatedly within one stream. Doing unnecessarily
deep readahead can be costly, due to having to pin a lot more buffers, which
increases CPU overhead.

Because the cost of a synchronously handled miss can be very high (multiple
milliseconds for every IO with commonly used storage) compared to the CPU
overhead of keeping the distance too high, we want to err on the side of not
reducing the distance too early.

The insight that a decrease of the distance by 1 at every hit may be ok at
large distances, but not at low distances, shows a way out: If we only allow
decreasing the distance once there were no misses for our maximum look-ahead
distance, we will keep the distance high as long as readahead has a chance to
do IO asynchronously, but usually not otherwise.
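
A toy simulation makes the difference between the two policies visible. This is a simplified model, not the logic in read_stream.c: the function, its parameters and the "hits_needed" counter are hypothetical, and the start distance of 1 and maximum of 16 are arbitrary choices for the example.

```python
def simulate(pattern, max_distance=16, hold_after_miss=False):
    """Return (final distance, peak distance) for a sequence of hits/misses.

    hold_after_miss=False: old policy, every hit shrinks the distance by 1.
    hold_after_miss=True:  a hit may shrink the distance only once
                           max_distance hits have passed since the last miss.
    """
    distance = 1
    peak = 1
    hits_needed = 0                # hits still required before decay is allowed
    for event in pattern:
        if event == "miss":
            distance = min(distance * 2, max_distance)
            if hold_after_miss:
                hits_needed = max_distance
        elif hold_after_miss and hits_needed > 0:
            hits_needed -= 1       # hit, but too soon after a miss to decay
        else:
            distance = max(distance - 1, 1)
        peak = max(peak, distance)
    return distance, peak
```

With an alternating miss/hit pattern, simulate(["miss", "hit"] * 8) never gets past a distance of 2, while the same pattern with hold_after_miss=True climbs to the maximum and stays there. A long run of hits still brings the distance back down, so a workload that becomes fully cached does not keep pinning buffers indefinitely.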

Several folks have written variants of this patch, including at least Thomas
Munro, Melanie Plageman and me.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
2026-04-01 19:50:03 -04:00
Andres Freund
cceb1bf45e read_stream: Issue IO synchronously while in fast path
While in fast-path, execute any IO that we might encounter synchronously.
Because we are, in that moment, not reading ahead, dispatching any occasional
IO to workers has the dispatch overhead, without any realistic chance of the
IO completing before we need it.

This helps io_method=worker performance for workloads that have only
occasional cache misses, but where those occasional misses still take long
enough to matter.  It is likely this is only measurable with fast local
storage or workloads with the data in the kernel page cache, as with remote
storage the IO latency, not the dispatch-to-worker latency, is the determining
factor.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
2026-04-01 19:22:44 -04:00
Heikki Linnakangas
1bdbb211bb Make ShmemIndex visible in the pg_shmem_allocations view
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-04-01 23:56:51 +03:00
Álvaro Herrera
db89a47115 Give an 'options' parameter to tuple_delete/_update
The tuple_insert() method already has an equivalent argument, so this
makes sense just on consistency grounds, for future growth.

table_delete() can immediately use it to carry the 'changingPart'
boolean; for table_update we don't have any options at present.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> (older version)
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/202603171606.kf6pmhscqbqz@alvherre.pgsql
2026-04-01 20:26:57 +02:00
Peter Eisentraut
8e72d914c5 Add UPDATE/DELETE FOR PORTION OF
This is an extension of the UPDATE and DELETE commands to do a
"temporal update/delete" based on a range or multirange column.  The
user can say UPDATE t FOR PORTION OF valid_at FROM '2001-01-01' TO
'2002-01-01' SET ... (or likewise with DELETE) where valid_at is a
range or multirange column.

The command is automatically limited to rows overlapping the targeted
portion, and only history within those bounds is changed.  If a row
represents history partly inside and partly outside the bounds, then
the command truncates the row's application time to fit within the
targeted portion, then it inserts one or more "temporal leftovers":
new rows containing all the original values, except with the
application-time column changed to only represent the untouched part
of history.

To compute the temporal leftovers that are required, we use the *_minus_multi
set-returning functions defined in 5eed8ce50c.
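
For a plain (non-multirange) column, the split into a touched portion and temporal leftovers can be sketched as below. This is an illustrative Python model only: the function name is hypothetical, ranges are modelled as half-open (lower, upper) tuples of any comparable values, and it does not call or reproduce the *_minus_multi functions.

```python
def split_for_portion(row_range, target):
    """Split a row's range against the targeted portion.

    Returns (touched, leftovers): `touched` is the part of the row inside
    the target (None if they do not overlap), and `leftovers` are the parts
    of the row's range that lie outside it and keep the original values.
    """
    lo, hi = row_range
    t_lo, t_hi = target
    start, end = max(lo, t_lo), min(hi, t_hi)
    if start >= end:
        return None, []                  # no overlap: the row is not affected
    leftovers = []
    if lo < start:
        leftovers.append((lo, start))    # history before the targeted portion
    if end < hi:
        leftovers.append((end, hi))      # history after the targeted portion
    return (start, end), leftovers
```

For example, a row valid over (2000, 2005) updated FOR PORTION OF (2001, 2002) is narrowed to (2001, 2002), and two leftover rows covering (2000, 2001) and (2002, 2005) are inserted with the old values.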

- Added bison support for FOR PORTION OF syntax.  The bounds must be
  constant, so we forbid column references, subqueries, etc. We do
  accept functions like NOW().
- Added logic to executor to insert new rows for the "temporal
  leftover" part of a record touched by a FOR PORTION OF query.
- Documented FOR PORTION OF.
- Added tests.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/ec498c3d-5f2b-48ec-b989-5561c8aa2024%40illuminatedcomputing.com
2026-04-01 19:06:03 +02:00
Álvaro Herrera
ec2f81766a Fix vicinity of tuple_insert to use uint32, not int, for options
Oversight in commit 1bd6f22f43: I was way too optimistic about the
compiler letting me know what variables needed to be updated, and missed
a few of them.  Clean it up.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/40E570EE-5A60-49D8-B8F7-2F8F2B7C8DFA@gmail.com
2026-04-01 18:14:51 +02:00
Dean Rasheed
f7f4052a4e Add support for extended statistics on virtual generated columns.
This allows both univariate and multivariate statistics to be built on
virtual generated columns and expressions that refer to virtual
generated columns. The restriction disallowing extended statistics on
a single column is lifted in the case of a single virtual generated
column, since it is treated as a single expression.

In the catalogs, references to virtual generated columns are stored
as-is. They are expanded at ANALYZE time to build the statistics, and
at planning time to allow the optimizer to make use of the statistics.
This allows the statistics to be correctly rebuilt using ANALYZE, if a
column's generation expression is altered (which causes any existing
statistics data to be deleted).

Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/20250422181006.dd6f9d1d81299f5b2ad55e1a@sraoss.co.jp
2026-04-01 17:02:24 +01:00
Nathan Bossart
196bf448e0 doc: Add missing description for DROP SUBSCRIPTION IF EXISTS.
Oversight in commit 665d1fad99.

Author: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHut%2BPv72haFerrCdYdmF6hu6o2jKcGzkXehom%2BsP-JBBmOVDg%40mail.gmail.com
Backpatch-through: 14
2026-04-01 09:48:48 -05:00
Andres Freund
513374a47a bufmgr: Return whether WaitReadBuffers() needed to wait
Thanks to the previous commit, pgaio_wref_check_done() will now detect whether
IO has completed even if userspace has not yet consumed the kernel completion.
This knowledge can be useful for callers of WaitReadBuffers() to know whether
it needed to wait or not, e.g. for adjusting read-ahead aggressiveness or for
instrumentation.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
Discussion: https://postgr.es/m/a177a6dd-240b-455a-8f25-aca0b1c08c6e@vondra.me
2026-04-01 09:26:43 -04:00
Andres Freund
6e648e353f aio: io_uring: Allow IO methods to check if IO completed in the background
Until now pgaio_wref_check_done() with io_method=io_uring would not detect IOs
that the kernel knows to have completed, but whose completion has not yet
been consumed by userspace.  This can lead to inferior performance and also
makes it harder to use smarter feedback logic in read_stream, because we
cannot use knowledge about whether an IO completed to control the readahead
distance.

This commit just adds the io_uring specific infrastructure. Later commits will
return whether a wait was needed from WaitReadBuffers() and then use that
knowledge.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
2026-04-01 09:26:43 -04:00
Amit Langote
edee563456 Make FastPathMeta self-contained by copying FmgrInfo structs
FastPathMeta stored pointers into ri_compare_cache entries via
compare_entries[], creating a dependency on that cache remaining
stable.  If ri_compare_cache entries were invalidated after fpmeta
was populated, the pointers would dangle.

Replace compare_entries[] with inline copies of the two FmgrInfo
fields actually needed (cast_func_finfo and eq_opr_finfo), copied
at populate time via fmgr_info_copy().  fpmeta now depends only on
riinfo remaining valid, which is already handled by the invalidation
callback.

Introduced by commit 2da86c1ef9 ("Add fast path for foreign key
constraint checks"), noticed while reviewing code for robustness
under CLOBBER_CACHE_ALWAYS.

Discussion: https://postgr.es/m/CA+HiwqFQ+ZA7hSOygv4uv_t75B3r0_gosjadetCsAEoaZwTu6g@mail.gmail.com
2026-04-01 18:43:40 +09:00
Amit Langote
e484b0eea6 Fix two issues in fast-path FK check introduced by commit 2da86c1ef9
First, under CLOBBER_CACHE_ALWAYS, the RI_ConstraintInfo entry can
be invalidated by relcache callbacks triggered inside table_open()
or index_open(), leaving ri_FastPathCheck() calling
ri_populate_fastpath_metadata() with a stale entry whose valid flag
is false.  Fix by moving the fpmeta initialization to after
ri_CheckPermissions(), reloading riinfo first to ensure it is
valid, then calling ri_ExtractValues() and build_index_scankeys()
immediately after before any further operations that could trigger
invalidation.

Second, fpmeta allocated in TopMemoryContext was not freed when the
entry was invalidated in InvalidateConstraintCacheCallBack(),
leaking memory each time the constraint cache entry was recycled.
Fix by freeing and NULLing fpmeta at invalidation time.

Noticed locally when testing with CLOBBER_CACHE_ALWAYS.

Discussion: https://postgr.es/m/CA+HiwqGBU__7-VZZhQWQ3EQuwLYNPd9==ngnzduhGWKHMj9mvw@mail.gmail.com
2026-04-01 17:30:33 +09:00