Plus a similar fix to the README.
Backpatch as far back as the sgml issue exists. The README issue does
exist in v14, but that seems unlikely to harm anyone.
Author: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/ed3db7ea-55b4-4809-86af-81ad3bb2c7d3@gmail.com
Backpatch-through: 15
Make _bt_readpage pass down the current scan direction to various
utility functions within its pstate variable. Also have _bt_readpage
work off of a local copy of scan->ignore_killed_tuples within its
per-tuple loop (rather than using scan->ignore_killed_tuples directly).
Testing has shown that this significantly benefits large range scans,
which are naturally able to take full advantage of the pstate.startikey
optimization added by commit 8a510275. Running a pgbench script with a
"SELECT abalance FROM pgbench_accounts WHERE aid BETWEEN ..." query
shows an increase in transaction throughput of over 5%. There also
appears to be a small performance benefit when running pgbench's
built-in select-only script.
Follow-up to commit 65d6acbc.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzmwMwcwKFgaf+mYPwiz3iL4AqpXnwtW_O0vqpWPXRom9Q@mail.gmail.com
Quite a bit of code within nbtutils.c is only called by _bt_readpage.
Move _bt_readpage and all of the nbtutils.c functions it depends on into
a new .c file, nbtreadpage.c. Also reorder some of the functions within
the new file for clarity.
This commit has no functional impact. It is strictly mechanical.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzmwMwcwKFgaf+mYPwiz3iL4AqpXnwtW_O0vqpWPXRom9Q@mail.gmail.com
The code in BootStrapXLOG() and in pg_test_fsync.c tried to align WAL
buffers in complicated ways. Also, they still used XLOG_BLCKSZ for
the alignment, even though that should now be PG_IO_ALIGN_SIZE. This
can now be simplified and made more consistent by using
PGAlignedXLogBlock, either directly in BootStrapXLOG() and using
alignas in pg_test_fsync.c.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/f462a175-b608-44a1-b428-bdf351e914f4%40eisentraut.org
A race condition could cause a newly created replication slot to become
invalidated between WAL reservation and a checkpoint.
Previously, if the required WAL was removed, we retried the reservation
process. However, the slot could still be invalidated before the retry if
the WAL was not yet removed but the checkpoint advanced the redo pointer
beyond the slot's intended restart LSN and computed the minimum LSN that
needs to be preserved for the slots.
The fix is to acquire an exclusive lock on ReplicationSlotAllocationLock
during WAL reservation to serialize WAL reservation and checkpoint's
minimum restart_lsn computation. This ensures that, if WAL reservation
occurs first, the checkpoint waits until restart_lsn is updated before
removing WAL. If the checkpoint runs first, subsequent WAL reservations
pick a position at or after the latest checkpoint's redo pointer.
We can't use the same fix for branch 17 and prior because commit
2090edc6f3 changed to compute to the minimum restart_LSN among slot's at
the beginning of checkpoint (or restart point). The fix for 17 and prior
branches is under discussion and will be committed separately.
Reported-by: suyu.cmj <mengjuan.cmj@alibaba-inc.com>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/5e045179-236f-4f8f-84f1-0f2566ba784c.mengjuan.cmj@alibaba-inc.com
The error details updated in this commit can be reached in the
regression tests. They did not follow the project style, and they
should be written them as full sentences.
Some of the errors are switched to use an elog(), for cases that involve
paths that cannot be reached based on the previous state of the parser
processing the input data (array start, object end, etc.). The error
messages for these cases use now a more consistent style across the
board, with the state of the parser reported for debugging.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Michael Paquier <michael@paquier.xyz>
Co-authored-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://postgr.es/m/1353179.1764901790@sss.pgh.pa.us
The general case for converting to a JSONB numeric value is to run the
source datatype's output function and then numeric_in, but we can do
substantially better than that for integer and numeric source values.
This patch improves the speed of jsonb_agg by 30% for integer input,
and nearly 2X for numeric input.
Sadly, the obvious idea of using float4_numeric and float8_numeric
to speed up those cases doesn't work: they are actually slower than
the generic coerce-via-I/O method, and not by a small amount.
They might round off differently than this code has historically done,
too. Leave that alone pending possible changes in those functions.
We can also do better than the existing code for text/varchar/bpchar
source data; this optimization is similar to one that already exists
in the json_agg() code. That saves 20% or so for such inputs.
Also make a couple of other minor improvements, such as not giving
JSONTYPE_CAST its own special case outside the switch when it could
perfectly well be handled inside, and not using dubious string hacking
to detect infinity and NaN results.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/1060917.1753202222@sss.pgh.pa.us
The various variants of jsonb_agg() operate as follows,
for each aggregate input value:
1. Build a JsonbValue tree representation of the input value.
2. Flatten the JsonbValue tree into a Jsonb in on-disk format.
3. Iterate through the Jsonb, building a JsonbValue that is part
of the aggregate's state stored in aggcontext, but is otherwise
identical to what phase 1 built.
This is very slightly less silly than it sounds, because phase 1
involves calling non-JSONB code such as datatype output functions,
which are likely to leak memory, and we don't want to leak into the
aggcontext. Nonetheless, phases 2 and 3 are accomplishing exactly
nothing that is useful if we can make phase 1 put the JsonbValue
tree where we need it. We could probably do that with a bunch of
MemoryContextSwitchTo's, but what seems more robust is to give
pushJsonbValue the responsibility of building the JsonbValue tree
in a specified non-current memory context. The previous patch
created the infrastructure for that, and this patch simply makes
the aggregate functions use it and then rips out phases 2 and 3.
For me, this makes jsonb_agg() with a text column as input run
about 2X faster than before. It's not yet on par with json_agg(),
but this removes a whole lot of the difference.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/1060917.1753202222@sss.pgh.pa.us
Instead of passing "JsonbParseState **" to pushJsonbValue(),
pass a pointer to a JsonbInState, which will contain the
parseState stack pointer as well as other useful fields.
Also, instead of returning a JsonbValue pointer that is often
meaningless/ignored, return the top-level JsonbValue pointer
in the "result" field of the JsonbInState.
This involves a lot of (mostly mechanical) edits, but I think
the results are notationally cleaner and easier to understand.
Certainly the business with sometimes capturing the result of
pushJsonbValue() and sometimes not was bug-prone and incapable of
mechanical verification. In the new arrangement, JsonbInState.result
remains null until we've completed a valid sequence of pushes, so
that an incorrect sequence will result in a null-pointer dereference,
not mistaken use of a partial result.
However, this isn't simply an exercise in prettier notation.
The real reason for doing it is to provide a mechanism whereby
pushJsonbValue() can be told to construct the JsonbValue tree
in a context that is not CurrentMemoryContext. That happens
when a non-null "outcontext" is specified in the JsonbInState.
No callers exercise that option in this patch, but the next
patch in the series will make use of it.
I tried to improve the comments in this area too.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/1060917.1753202222@sss.pgh.pa.us
The SQL standard says that corr() and friends should return NULL in
the mathematically-undefined case where all the inputs in one of
the columns have the same value. We were checking that by seeing
if the sums Sxx and Syy were zero, but that approach is very
vulnerable to roundoff error: if a sum is close to zero but not
exactly that, we'd come out with a pretty silly non-NULL result.
Instead, directly track whether the inputs are all equal by
remembering the common value in each column. Once we detect
that a new input is different from before, represent that by
storing NaN for the common value. (An objection to this scheme
is that if the inputs are all NaN, we will consider that they
were not all equal. But under IEEE float arithmetic rules,
one NaN is never equal to another, so this behavior is arguably
correct. Moreover it matches what we did before in such cases.)
Then, leave the sums at their exact value of zero for as long
as we haven't detected different input values.
This solution requires the aggregate transition state to contain
8 float values not 6, which is not problematic, and it seems to add
less than 1% to the aggregates' runtime, which seems acceptable.
While we're here, improve corr()'s final function to cope with
overflow/underflow in the final calculation, and to clamp its
result to [-1, 1] in case of roundoff error.
Although this is arguably a bug fix, it requires a catversion bump
due to the change in aggregates' initial states, so it can't be
back-patched.
Patch written by me, but many of the ideas are due to Dean Rasheed,
who also did a deal of testing.
Bug: #19340
Reported-by: Oleg Ivanov <o15611@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Co-authored-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/19340-6fb9f6637f562092@postgresql.org
Due to an off-by-one error, the code failed to find matches at the
end of the haystack. Fix by rewriting the loop.
While at it, fix a comment that claimed that the function could find
a zero-length match. Such a match could send a caller into an endless
loop. However, zero-length matches only make sense with an empty
search string, and that case is explicitly excluded by all callers.
To make sure it stays that way, add an Assert and a comment.
Bug: #19341
Reported-by: Adam Warland <adam.warland@infor.com>
Author: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19341-1d9a22915edfec58@postgresql.org
Backpatch-through: 18
apply_scanjoin_target_to_paths wants to avoid useless work and
platform-specific dependencies by throwing away the path list created
prior to applying the final scan/join target and constructing a whole
new one using the final scan/join target. However, this is only valid
when we'll consider all the same strategies after the pathlist reset
as before.
After resetting the path list, we reconsider Append and MergeAppend
paths with the modified target list; therefore, it's only valid for a
partitioned relation. However, what the previous coding missed is that
it cannot be a partitioned join relation, because that also has paths
that are not Append or MergeAppend paths and will not be reconsidered.
Thus, before this patch, we'd sometimes choose a partitionwise strategy
with a higher total cost than cheapest non-partitionwise strategy,
which is not good.
We had a surprising number of tests cases that were relying on this
bug to work as they did. A big part of the reason for this is that row
counts in regression test cases tend to be low, which brings the cost
of partitionwise and non-partitionwise strategies very close together,
especially for merge joins, where the real and perceived advantages of
a partitionwise approach are minimal. In addition, one test case
included a row-count-inflating join. In such cases, a partitionwise
join can easily be a loser on cost, because the total number of tuples
passing through an Append node is much higher than it is with a
non-partitionwise strategy. That test case is adjusted by adding
additional join clauses to avoid the row count inflation.
Although the failure of the planner to choose the lowest-cost path is a
bug, we generally do not back-patch fixes of this type, because planning
is not an exact science and there is always a possibility that some user
will end up with a plan that has a lower estimated cost but actually
runs more slowly. Hence, no backpatch here, either.
The code change here is exactly what was originally proposed by
Ashutosh, but the changes to the comments and test cases have been
very heavily rewritten by me, helped along by some very useful advice
from Richard Guo.
Reported-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Author: Robert Haas <rhaas@postgresql.org>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Arne Roland <arne.roland@malkut.net>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: http://postgr.es/m/CAExHW5toze58+jL-454J3ty11sqJyU13Sz5rJPQZDmASwZgWiA@mail.gmail.com
Newest versions of gcc are able to detect cases where code implicitly
casts away const by assigning the result of strchr() or a similar
function applied to a "const char *" value to a target variable
that's just "char *". This of course creates a hazard of not getting
a compiler warning about scribbling on a string one was not supposed
to, so fixing up such cases is good.
This patch fixes a dozen or so places where we were doing that.
Most are trivial additions of "const" to the target variable,
since no actually-hazardous change was occurring. There is one
place in ecpg.trailer where we were indeed violating the intention
of not modifying a string passed in as "const char *". I believe
that's harmless not a live bug, but let's fix it by copying the
string before modifying it.
There is a remaining trouble spot in ecpg/preproc/variable.c,
which requires more complex surgery. I've left that out of this
commit because I want to study that code a bit more first.
We probably will want to back-patch this once compilers that detect
this pattern get into wider circulation, but for now I'm just
going to apply it to master to see what the buildfarm says.
Thanks to Bertrand Drouvot for finding a couple more spots than
I had.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/1324889.1764886170@sss.pgh.pa.us
Commit 76b78721ca introduced two new columns in pg_stat_replication_slots
to improve monitoring of slot synchronization. One of these columns was
named slotsync_skip_at, which is inconsistent with the naming convention
used for similar columns in other system views.
Columns that store timestamps of the most recent event typically use the
'last_' in the column name (e.g., last_autovacuum, checksum_last_failure).
Renaming slotsync_skip_at to slotsync_last_skip aligns with this pattern,
making the purpose of the column clearer and improving overall consistency
across the views.
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Michael Banck <mbanck@gmx.net>
Discussion: https://postgr.es/m/20251128091552.GB13635@p46.dedyn.io;lightning.p46.dedyn.io
Discussion: https://postgr.es/m/CAE9k0PkhfKrTEAsGz4DjOhEj1nQ+hbQVfvWUxNacD38ibW3a1g@mail.gmail.com
Some of the buildfarm members with some old gcc versions have been
complaining about an always-true test for a NULL pointer caused by a
combination of SOFT_ERROR_OCCURRED() and a local ErrorSaveContext
variable.
These warnings are taken care of by removing SOFT_ERROR_OCCURRED(),
switching to a direct variable check, like 56b1e88c80.
Oversights in e1405aa5e3 and 44eba8f06e.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1341064.1764895052@sss.pgh.pa.us
Adjust the prune_freeze_setup() parameter types of new_relfrozen_xid and
new_relmin_mxid to prevent misleading Coverity analysis.
heap_page_prune_and_freeze() compared these values against NULL when
passing them to prune_freeze_setup(), causing Coverity to assume they
could be NULL and flag a possible null-pointer dereference later, even
though it occurs inside a directly related conditional.
Reported-by: Coverity
Author: Melanie Plageman <melanieplageman@gmail.com>
The key is the first member of PrivateRefCountEntry, which has type
Buffer. This commit changes the key size from sizeof(int32) to
sizeof(Buffer). This appears to be an oversight in commit
4b4b680c3d, but it's of no consequence because Buffer has been a
signed 32-bit integer for a long time.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aS77DTpl0fOkIKSZ%40ip-10-97-1-34.eu-west-3.compute.internal
The assertion checking MyProcNumber used MaxBackends as the upper
bound, but the procInfos array is allocated with size
MaxBackends + NUM_AUXILIARY_PROCS. This inconsistency would cause
a false assertion failure if an auxiliary process calls WaitForLSN().
Author: Xuneng Zhou <xunengzhou@gmail.com>
In an upcoming patch more wait events will be added to the wait event
class (for buffer locking), making the current name too
specific. Alternatively we could introduce a dedicated wait event class for
those, but it seems somewhat confusing to have a BUFFERPIN and a BUFFER wait
event class.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.
The waiting mechanism was broken in a few scenarios:
- When nextMulti was advanced without WAL-logging the next
multixid. For example, if a later multixid was already assigned and
WAL-logged before the previous one was WAL-logged, and then the
server crashed. In that case the next offset would never be set in
the offsets SLRU, and a query trying to read it would get stuck
waiting for it. Same thing could happen if pg_resetwal was used to
forcibly advance nextMulti.
- In hot standby mode, a deadlock could happen where one backend waits
for the next multixid assignment record, but WAL replay is not
advancing because of a recovery conflict with the waiting backend.
The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.
Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
Backpatch-through: 14
The comment for the Pointer type says 'XXX Pointer arithmetic is done
with this, so it can't be void * under "true" ANSI compilers.'. This
fixes that. Change from Pointer to use char * explicitly where
pointer arithmetic is needed. This makes the meaning of the code
clearer locally and removes a dependency on the actual definition of
the Pointer type. (The definition of the Pointer type is not changed
in this commit.)
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org
Both dsa_get_total_size() and dsa_get_total_size_from_handle() take
an exclusive lock just to read a variable. This commit reduces the
lock level to LW_SHARED in those functions.
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/aS8fMzWs9e8iHxk2%40nathan
To increase our test coverage in general, and because I will add onto
this in the next commit to also test amcheck with incomplete splits.
This is copied from the similar test we had for GIN indexes. B-tree's
incomplete splits work similarly to GIN's, so with small changes, the
same test works for B-tree too.
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/abd65090-5336-42cc-b768-2bdd66738404@iki.fi
Presently, this view reports NULL for the size of DSAs and dshash
tables because 1) the current backend might not be attached to them
and 2) the registry doesn't save the pointers to the dsa_area or
dshash_table in local memory. Also, the view doesn't show
partially-initialized entries to avoid ambiguity, since those
entries would report a NULL size as well.
This commit introduces a function that looks up the size of a DSA
given its handle (transiently attaching to the control segment if
needed) and teaches pg_dsm_registry_allocations to use it to show
the size of successfully-initialized DSA and dshash entries.
Furthermore, the view now reports partially-initialized entries
with a NULL size.
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aSeEDeznAsHR1_YF%40nathan
This idea (implemented in commits and bc32a12e0d and 9e8fa05d34) of
using notices to detect that a session is sleeping was unreliable, so
simplify the concurrency controller session to just look at
pg_stat_activity for a process sleeping on the injection point we want
it to hit. This change allows us to remove a secondary injection point
and the alternative expected output files.
Reproduced by Alexander Lakhin following a report in buildfarm member
skink (which runs the server under valgrind).
Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/3e302c96-cdd2-45ec-af84-03dbcdccde4a@gmail.com
When planning queries with ON CONFLICT on partitioned tables, the
indexes to consider as arbiters for each partition are determined based
on those found in the parent table. However, it's possible for an index
on a partition to be reindexed, and in that case, the auxiliary indexes
created on the partition must be considered as arbiters as well; failing
to do that may result in spurious "duplicate key" errors given
sufficient bad luck.
We fix that in this commit by matching every index that doesn't have a
parent to each initially-determined arbiter index. Every unparented
matching index is considered an additional arbiter index.
Closely related to the fixes in bc32a12e0d and 2bc7e886fc, and for
identical reasons, not backpatched (for now) even though it's a
longstanding issue.
Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CANtu0ojXmqjmEzp-=aJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg@mail.gmail.com
This removes some casts where the input already has the same type as
the type specified by the cast. Their presence could cause risks of
hiding actual type mismatches in the future or silently discarding
qualifiers. It also improves readability. Same kind of idea as
7f798aca1d and ef8fe69360. (This does not change all such
instances, but only those hand-picked by the author.)
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/aSQy2JawavlVlEB0%40ip-10-97-1-34.eu-west-3.compute.internal
Instead of complicated pointer arithmetic, overlay a uint32 array and
just access the array members. That's safe thanks to
XLogRecGetBlockData() returning a MAXALIGNed buffer.
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/aSQy2JawavlVlEB0%40ip-10-97-1-34.eu-west-3.compute.internal
This commit updates two functions that convert "timestamptz" to
"timestamp", and vice-versa, to use the soft error reporting rather than
a their own logic to do the same. These are now named as follows:
- timestamp2timestamptz_safe()
- timestamptz2timestamp_safe()
These functions were suffixed with "_opt_overflow", previously.
This shaves some code, as it is possible to detect how a timestamp[tz]
overflowed based on the returned value rather than a custom state. It
is optionally possible for the callers of these functions to rely on the
error generated internally by these functions, depending on the error
context.
Similar work has been done in d03668ea05 and 4246a977ba.
Reviewed-by: Amul Sul <sulamul@gmail.com>
Discussion: https://postgr.es/m/aS09YF2GmVXjAxbJ@paquier.xyz
The regex mechanism scans through the first "max_chr" character values
to cache character property ranges (isalpha, etc.). For single-byte
encodings, there's no sense in scanning beyond UCHAR_MAX; but for
UTF-8 it makes sense to cache higher code point values (though not all
of them; only up to MAX_SIMPLE_CHR).
Prior to 5a38104b36, the logic about how many character values to scan
was based on the pg_regex_strategy, which was dependent on the
provider. Commit 5a38104b36 preserved that logic exactly, allowing
different providers to define the "max_chr".
Now, change it to depend only on the encoding and whether
ctype_is_c. For this specific calculation, distinguishing between
providers creates more complexity than it's worth.
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
When REINDEX CONCURRENTLY is processing the index that supports a
constraint, there are periods during which multiple indexes match the
constraint index's definition. Those must all be included in the set of
inferred index for INSERT ON CONFLICT, in order to avoid spurious
"duplicate key" errors.
To fix, we set things up to match all indexes against attributes,
expressions and predicates of the constraint index, then return all
indexes that match those, rather than just the one constraint index.
This is more onerous than before, where we would just test the named
constraint for validity, but it's not more onerous than processing
"conventional" inference (where a list of attribute names etc is given).
This is closely related to the misbehaviors fixed by bc32a12e0d, for a
different situation. We're not backpatching this one for now either,
for the same reasons.
Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CANtu0ojXmqjmEzp-=aJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg@mail.gmail.com
This one is almost a textbook example of an aliasing violation, and it
is straightforward to fix, so clean it up. (The warning only shows up
if you remove the -fno-strict-aliasing option.) Also, move the code
after the error checking. Doesn't make a difference technically, but
it seems strange to do actions before errors are checked.
Reported-by: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/20240724.155525.366150353176322967.ishii%40postgresql.org
This split exists for most of the other RMGRs, and makes cleaner the
separation between the WAL code, the redo code and the record
description code (already in its own file) when it comes to the sequence
RMGR. The redo and masking routines are moved to a new file,
sequence_xlog.c. All the RMGR routines are now located in a new header,
sequence_xlog.h.
This separation is useful for a different patch related to sequences
that I have been working on, where it makes a refactoring of sequence.c
easier if its RMGR routines and its core routines are split.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/aSfTxIWjiXkTKh1E@paquier.xyz
This commit changes some functions related to the data types date and
timestamp to use the soft error reporting rather than a custom boolean
flag called "overflow", used to let the callers of these functions know
if an overflow happens.
This results in the removal of some boilerplate code, as it is possible
to rely on an error context rather than a custom state, with the
possibility to use the error generated inside the functions updated
here, if necessary.
These functions were suffixed with "_opt_overflow". They are now
renamed to use "_safe" as suffix.
This work is similar to 4246a977ba.
Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAAJ_b95HEmFyzHZfsdPquSHeswcopk8MCG1Q_vn4tVkZ+xxofw@mail.gmail.com
42473b3b3 added prosupport infrastructure to allow simplification of
Aggrefs during constant-folding. In some cases the context->root that's
given to eval_const_expressions_mutator() can be NULL. 42473b3b3 failed
to take that into account, which could result in a crash.
To fix, add a check and only call simplify_aggref() when the PlannerInfo
is set.
Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Birler, Altan <altan.birler@tum.de>
Discussion: https://postgr.es/m/132d4da23b844d5ab9e352d34096eab5@tum.de
We have some limited ability to detect redundant and contradictory
conditions involving an nbtree row comparison key following commits
f09816a0 and bd3f59fd: we can do so in simple cases involving IS NULL
and IS NOT NULL keys on a row compare key's first column. We can
likewise determine that a scan's qual is unsatisfiable given a row
compare whose first subkey's arg is NULL. Update obsolete comments that
claimed that we merely copied row compares into the output key array
"without any editorialization".
Also update another _bt_preprocess_keys header comment paragraph: add a
parenthetical remark that points out that preprocessing will generate a
skip array for the preceding example qual. That will ultimate lead to
preprocessing marking the example's lower-order y key required -- which
is exactly what the example supposes cannot happen. Keep the original
comment, though, since it accurately describes the mechanical rules that
determine which keys get marked required in the absence of skip arrays
(which can occasionally still matter). This fixes an oversight in
commit 92fe23d9, which added the nbtree skip scan optimization.
Author: Peter Geoghegan <pg@bowt.ie>
Backpatch-through: 18
Formerly, when updating an auto-updatable view, or a relation with
rules, if the original query had any data-modifying CTEs, the rewriter
would rewrite those CTEs multiple times as RewriteQuery() recursed
into the product queries. In most cases that was harmless, because
RewriteQuery() is mostly idempotent. However, if the CTE involved
updating an always-generated column, it would trigger an error because
any subsequent rewrite would appear to be attempting to assign a
non-default value to the always-generated column.
This could perhaps be fixed by attempting to make RewriteQuery() fully
idempotent, but that looks quite tricky to achieve, and would probably
be quite fragile, given that more generated-column-type features might
be added in the future.
Instead, fix by arranging for RewriteQuery() to rewrite each CTE
exactly once (by tracking the number of CTEs already rewritten as it
recurses). This has the advantage of being simpler and more efficient,
but it does make RewriteQuery() dependent on the order in which
rewriteRuleAction() joins the CTE lists from the original query and
the rule action, so care must be taken if that is ever changed.
Reported-by: Bernice Southey <bernice.southey@gmail.com>
Author: Bernice Southey <bernice.southey@gmail.com>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CAEDh4nyD6MSH9bROhsOsuTqGAv_QceU_GDvN9WcHLtZTCYM1kA@mail.gmail.com
Backpatch-through: 14
There was a pg_isblank() function that claimed to be a replacement for
the standard isblank() function, which was thought to be "not very
portable yet". We can now assume that it's portable (it's in C99).
But pg_isblank() actually diverged from the standard isblank() by also
accepting '\r', while the standard one only accepts space and tab.
This was added to support parsing pg_hba.conf under Windows. But the
hba parsing code now works completely differently and already handles
line endings before we get to pg_isblank(). The other user of
pg_isblank() is for ident protocol message parsing, which also handles
'\r' separately. So this behavior is now obsolete and confusing.
To improve clarity, I separated those concerns. The ident parsing now
gets its own function that hardcodes the whitespace characters
mentioned by the relevant RFC. pg_isblank() is now static in hba.c
and is a wrapper around the standard isblank(), with some extra logic
to ensure robust treatment of non-ASCII characters.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/170308e6-a7a3-4484-87b2-f960bb564afa%40eisentraut.org
Introduce a new column, slotsync_skip_reason, in the pg_replication_slots
view. This column records the reason why the last slot synchronization was
skipped. It is primarily relevant for logical replication slots on standby
servers where the 'synced' field is true. The value is NULL when
synchronization succeeds.
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAE9k0PkhfKrTEAsGz4DjOhEj1nQ+hbQVfvWUxNacD38ibW3a1g@mail.gmail.com
The solution used in 0ca3b1697 to determine the Parallel TID Range
Scan's start location was to modify the signature of
table_block_parallelscan_startblock_init() to allow the startblock
to be passed in as a parameter. This allows the scan limits to be
adjusted before that function is called so that the limits are picked up
when the parallel scan starts. The commit made it so the call to
table_block_parallelscan_startblock_init uses the HeapScanDesc's
rs_startblock to pass the startblock to the parallel scan. That all
works ok for Parallel TID Range scans as the HeapScanDesc rs_startblock
gets set by heap_setscanlimits(), but for Parallel Seq Scans, initscan()
does not initialize rs_startblock, and that results in passing an
uninitialized value to table_block_parallelscan_startblock_init() as
noted by the buildfarm member skink, running Valgrind.
To fix this issue, make it so initscan() sets the rs_startblock for
parallel scans unless we're doing a rescan. This makes it so
table_block_parallelscan_startblock_init() will be called with the
startblock set to InvalidBlockNumber, and that'll allow the syncscan
code to find the correct start location (when enabled). For Parallel
TID Range Scans, this InvalidBlockNumber value will be overwritten in
the call to heap_setscanlimits().
initscan() is a bit light on documentation on what's meant to get
initialized where for parallel scans. From what I can tell, it looks like
it just didn't matter prior to 0ca3b1697 that rs_startblock was left
uninitialized for parallel scans. To address the light documentation,
I've also added some comments to mention that the syncscan location for
parallel scans is figured out in table_block_parallelscan_startblock_init.
I've also taken the liberty to adjust the if/else if/else code in
initscan() to make it clearer which parts apply to parallel scans and
which parts are for the serial scans.
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvqALm+k7FyfdQdCw1yF_8HojvR61YRrNhwRQPE=zSmnQA@mail.gmail.com
This commit introduces new internal bufmgr routines for marking shared
buffers as dirty:
* MarkDirtyUnpinnedBuffer()
* MarkDirtyRelUnpinnedBuffers()
* MarkDirtyAllUnpinnedBuffers()
These functions provide an efficient mechanism to respectively mark one
buffer, all the buffers of a relation, or the entire shared buffer pool
as dirty, something that can be useful to force patterns for the
checkpointer. MarkDirtyUnpinnedBufferInternal(), an extra routine, is
used by these three, to mark as dirty an unpinned buffer.
They are intended as developer tools to manipulate buffer dirtiness in
bulk, and will be used in a follow-up commit.
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Aidar Imamov <a.imamov@postgrespro.ru>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Joseph Koshakow <koshy44@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Yuhang Qiu <iamqyh@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw@mail.gmail.com
Normally, if a WHERE clause is implied by the predicate of a partial
index, we drop that clause from the set of quals used with the index,
since it's redundant to test it if we're scanning that index.
However, if it's a hash index (or any !amoptionalkey index), this
could result in dropping all available quals for the index's first
key, preventing us from generating an indexscan.
It's fair to question the practical usefulness of this case. Since
hash only supports equality quals, the situation could only arise
if the index's predicate is "WHERE indexkey = constant", implying
that the index contains only one hash value, which would make hash
a really poor choice of index type. However, perhaps there are
other !amoptionalkey index AMs out there with which such cases are
more plausible.
To fix, just don't filter the candidate indexquals this way if
the index is !amoptionalkey. That's a bit hokey because it may
result in testing quals we didn't need to test, but to do it
more accurately we'd have to redundantly identify which candidate
quals are actually usable with the index, something we don't know
at this early stage of planning. Doesn't seem worth the effort.
Reported-by: Sergei Glukhov <s.glukhov@postgrespro.ru>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/e200bf38-6b45-446a-83fd-48617211feff@postgrespro.ru
Backpatch-through: 14
transformJsonFuncExpr() used exprType()/exprLocation() on the
possibly coerced path expression, which could be NULL when
coercion to jsonpath failed, leading to "cache lookup failed
for type 0" errors.
Preserve the original expression node so that type and location
in the "must be of type jsonpath" error are reported correctly.
Add regression tests to cover these cases.
Reported-by: Jian He <jian.universality@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CACJufxHunVg81JMuNo8Yvv_hJD0DicgaVN2Wteu8aJbVJPBjZA@mail.gmail.com
Backpatch-through: 17
In v14, bb437f995 added support for scanning for ranges of TIDs using a
dedicated executor node for the purpose. Here, we allow these scans to
be parallelized. The range of blocks to scan is divvied up similarly to
how a Parallel Seq Scans does that, where 'chunks' of blocks are
allocated to each worker and the size of those chunks is slowly reduced
down to 1 block per worker by the time we're nearing the end of the
scan. Doing that means workers finish at roughly the same time.
Allowing TID Range Scans to be parallelized removes the dilemma from the
planner as to whether a Parallel Seq Scan will cost less than a
non-parallel TID Range Scan due to the CPU concurrency of the Seq Scan
(disk costs are not divided by the number of workers). It was possible
the planner could choose the Parallel Seq Scan which would result in
reading additional blocks during execution than the TID Scan would have.
Allowing Parallel TID Range Scans removes the trade-off the planner
makes when choosing between reduced CPU costs due to parallelism vs
additional I/O from the Parallel Seq Scan due to it scanning blocks from
outside of the required TID range. There is also, of course, the
traditional parallelism performance benefits to be gained as well, which
likely doesn't need to be explained here.
Author: Cary Huang <cary.huang@highgo.ca>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Rafia Sabih <rafia.pghackers@gmail.com>
Reviewed-by: Steven Niu <niushiji@gmail.com>
Discussion: https://postgr.es/m/18f2c002a24.11bc2ab825151706.3749144144619388582@highgo.ca
This adds SupportRequestSimplifyAggref to allow pg_proc.prosupport
functions to receive an Aggref and allow them to determine if there is a
way that the Aggref call can be optimized.
Also added is a support function to allow transformation of COUNT(ANY)
into COUNT(*). This is possible to do when the given "ANY" cannot be
NULL and also that there are no ORDER BY / DISTINCT clauses within the
Aggref. This is a useful transformation to do as it is common that
people write COUNT(1), which until now has added unneeded overhead.
When counting a NOT NULL column. The overheads can be worse as that
might mean deforming more of the tuple, which for large fact tables may
be many columns in.
It may be possible to add prosupport functions for other aggregates. We
could consider if ORDER BY could be dropped for some calls, e.g. the
ORDER BY is quite useless in MAX(c ORDER BY c).
There is a little bit of passing fallout from adjusting
expr_is_nonnullable() to handle Const which results in a plan change in
the aggregates.out regression test. Previously, nothing was able to
determine that "One-Time Filter: (100 IS NOT NULL)" was always true,
therefore useless to include in the plan.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAApHDvqGcPTagXpKfH=CrmHBqALpziThJEDs_MrPqjKVeDF9wA@mail.gmail.com
If DSM registry entry initialization fails, backends could try to
use an uninitialized DSM segment, DSA, or dshash table (since the
entry is still added to the registry). To fix, restructure the
code so that the registry retries initialization as needed. This
commit also modifies pg_get_dsm_registry_allocations() to leave out
partially-initialized entries, as they shouldn't have any allocated
memory.
DSM registry entry initialization shouldn't fail often in practice,
but retrying was deemed better than leaving entries in a
permanently failed state (as was done by commit 1165a933aa, which
has since been reverted).
Suggested-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/E1vJHUk-006I7r-37%40gemulon.postgresql.org
Backpatch-through: 17
Previously, the caller needed to check ctype_is_c first for some
routines and not others. Now, the APIs consistently work, and the
caller can just check ctype_is_c for optimization purposes.
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
This reverts commit 1165a933aa (and the corresponding commits on
the back-branches). In a follow-up commit, we'll teach the
registry to retry entry initialization instead of leaving it in a
permanently failed state.
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/E1vJHUk-006I7r-37%40gemulon.postgresql.org
Backpatch-through: 17
Refactor the setup and planning phases of pruning and freezing into
helpers. This streamlines heap_page_prune_and_freeze() and makes it more
clear when the examination of tuples ends and page modifications begin.
No code change beyond what was required to extract the code into helper
functions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/mhf4vkmh3j57zx7vuxp4jagtdzwhu3573pgfpmnjwqa6i6yj5y%40sy4ymcdtdklo
There is no straightforward way to determine if a cluster is running
in EXEC_BACKEND mode or not, which is useful for tests to know. This
adds a GUC debug_exec_backend similar to debug_assertions which will
be true when the server is running in EXEC_BACKEND mode.
Author: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/5F301096-921A-427D-8EC1-EBAEC2A35082@yesql.se
The code comment said, "It is expected that this routine will
eventually be replaced with the C99 hypot() function.", so let's do
that now.
This function is tested via the geometry regression test, so if it is
faulty on any platform, it will show up there.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/170308e6-a7a3-4484-87b2-f960bb564afa%40eisentraut.org
pg_dependencies is used as data type for the contents of dependencies
extended statistics. This new input function consumes the format that
has been established by e76defbcf0 for the output function of
pg_dependencies, enforcing some sanity checks for:
- Checks for the input object, which should be a one-dimension array
with correct attributes and values.
- The key names: "attributes", "dependency", "degree". All are
required, other key names are blocked.
- Value types for each key: "attributes" requires an array of integers,
"dependency" an attribute number, "degree" a float.
- List of attributes. In this case, it is possible that some
dependencies are not listed in the statistics data, as items with a
degree of 0 are discarded when building the statistics. This commit
includes checks for simple scenarios, like duplicated attributes, or
overlapping values between the list of "attributes" and the "dependency"
value. Even if the input function considers the input as valid, a value
still needs to be cross-checked with the attributes defined in a
statistics object at import.
- Based on the discussion, the checks on the values are loose, as there
is also an argument for potentially stats injection. For example,
"degree" should be defined in [0.0,1.0], but a check is not enforced.
This is required for a follow-up patch that aims to implement the import
of extended statistics. Some tests are added to check the code paths of
the JSON parser checking the shape of the pg_dependencies inputs, with
91% of code coverage reached. The tests are located in their own new
test file, for clarity.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Yuefei Shi <shiyuefei1004@gmail.com>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
pg_ndistinct is used as data type for the contents of ndistinct extended
statistics. This new input function consumes the format that has been
established by 1f927cce44 for the output function of pg_ndistinct,
enforcing some sanity checks for:
- Checks for the input object, which should be a one-dimension array
with correct attributes and values.
- The key names: "attributes", "ndistinct". Both are required, other
key names are blocked.
- Value types for each key: "attributes" requires an array of integers,
and "ndistinct" an integer.
- List of attributes. Note that this enforces a check so as an
attribute list has to be a subset of the longest attribute list found.
This does not enforce that a full group of attribute sets exist, based
on how the groups are generated when the ndistinct objects are
generated, making the list of ndistinct items a bit loose. Note a check
would still be required at import to see if the attributes listed match
with the attribute numbers set in the definition of a statistics object.
- Based on the discussion, the checks on the values are loose, as there
is also an argument for potentially stats injection. The relation and
attribute level stats follow the same line of argument for the values.
This is required for a follow-up patch that aims to implement the import
of extended statistics. Some tests are added to check the code paths of
the JSON parser checking the shape of the pg_ndistinct inputs, with 90%
of code coverage reached. The tests are located in their own new test
file, for clarity.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Yuefei Shi <shiyuefei1004@gmail.com>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
heap_page_prune_and_freeze() requires the caller to initialize
PruneFreezeParams->cutoffs so that the function can correctly evaluate
whether tuples should be frozen. This requirement previously existed
only in comments and was easy to miss, especially after “cutoffs” was
converted from a direct function parameter to a field of the newly
introduced PruneFreezeParams struct (added in 1937ed7062). Adding an
assert makes this requirement explicit and harder to violate.
Also, fix a minor typo while we're at it.
Author: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/0AC177F5-5E26-45EE-B273-357C51212AC5%40gmail.com
Previously, gen_guc_tables.pl would emit "Use of uninitialized value"
warnings if required fields were missing in guc_parameters.dat (for
example, when an integer or real GUC omitted the 'max' value). The
resulting error messages were unclear and did not identify which GUC
entry was problematic.
Add explicit validation of required fields depending on the parameter
type, and fail with a clear and specific message such as:
guc_parameters.dat:1909: error: entry "max_index_keys" of type "int" is missing required field "max"
No changes to generated guc_tables.c.
Author: Chao Li <lic@highgo.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2%3DoP4LgHi771_OKhPPUS7B-CTqCs%3D%3DuQcNXWrwBoAm5Vg%40mail.gmail.com
The issue occurred because the replication slot was not released in the
slotsync worker when a slot synchronization cycle was skipped. This skip
happened because the required WAL was not received and flushed on the
standby server. As a result, in the next cycle, when attempting to acquire
the slot, an assertion failure was triggered.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Discussion: https://postgr.es/m/CAA4eK1KMwYUYy=oAVHu9mam+vX50ixxfhO4_C=kgQC8VCQHEfw@mail.gmail.com
This patch adds two new columns to the pg_stat_replication_slots view:
slotsync_skip_count - the total number of times a slotsync operation was
skipped.
slotsync_skip_at - the timestamp of the most recent skip.
These additions provide better visibility into replication slot
synchronization behavior.
A future patch will introduce the slotsync_skip_reason column in
pg_replication_slots to capture the reason for skip.
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAE9k0PkhfKrTEAsGz4DjOhEj1nQ+hbQVfvWUxNacD38ibW3a1g@mail.gmail.com
This comment should probably have been moved to pg_locale_libc.c in
commit 66ac94cdc7 (2024), but upon closer examination it was already
completely obsolete then.
The first part of the comment has been obsolete since commit
85feb77aa0 (2017), which required that the system provides the
required wide-character functions.
The second part has been obsolete since commit e9931bfb75 (2024),
which eliminated code paths depending on the global LC_CTYPE setting.
Discussion: https://www.postgresql.org/message-id/flat/170308e6-a7a3-4484-87b2-f960bb564afa%40eisentraut.org
This commit renames write_chunk and read_chunk to respectively
pgstat_write_chunk() and pgstat_read_chunk(), along with the *_s
convenience macros.
These are made available for plug-ins, so as any code that decides to
write and/or read stats data can rely on a single code path for this
work.
Extracted from a larger patch by the same author.
Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s9SDOu+Z6veoJCHWk+kDeTktAtC-KY9fQ9Z6BJdDUirQ@mail.gmail.com
Accidentally the code in LWLockWakeup() checked the list of to-be-woken up
processes to see if LW_FLAG_HAS_WAITERS should be unset. That means that
HAS_WAITERS would not get unset immediately, but only during the next,
unnecessary, call to LWLockWakeup().
Luckily, as the code stands, this is just a small efficiency issue.
However, if there were (as in a patch of mine) a case in which LWLockWakeup()
would not find any backend to wake, despite the wait list not being empty,
we'd wrongly unset LW_FLAG_HAS_WAITERS, leading to potentially hanging.
While the consequences in the backbranches are limited, the code as-is
confusing, and it is possible that there are workloads where the additional
wait list lock acquisitions hurt, therefore backpatch.
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Backpatch-through: 14
We've long had a practice of making views temporary by default if they
reference any temporary tables. However the implementation was pretty
incomplete, in that it only searched for RangeTblEntry references to
temp relations. Uses of temporary types, regclass constants, etc
were not detected even though the dependency mechanism considers them
grounds for dropping the view. Thus a view not believed to be temp
could silently go away at session exit anyhow.
To improve matters, replace the ad-hoc isQueryUsingTempRelation()
logic with use of the dependency-based infrastructure introduced by
commit 572c40ba9. This is complete by definition, and it's less code
overall.
While we're at it, we can also extend the warning NOTICE (or ERROR
in the case of a materialized view) to mention one of the temp
objects motivating the classification of the view as temp, as was
done for functions in 572c40ba9.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Discussion: https://postgr.es/m/19cf6ae1-04cd-422c-a760-d7e75fe6cba9@uni-muenster.de
Previously, we would only consider indexes marked indisvalid as usable
for INSERT ON CONFLICT. But that's problematic during CREATE INDEX
CONCURRENTLY and REINDEX CONCURRENTLY, because concurrent transactions
would end up with inconsistents lists of inferred indexes, leading to
deadlocks and spurious errors about unique key violations (because two
transactions are operating on different indexes for the speculative
insertion tokens). Change this function to return indexes even if
invalid. This fixes the spurious errors and deadlocks.
Because such indexes might not be complete, we still need uniqueness to
be verified in a different way. We do that by requiring that at least
one index marked valid is part of the set of indexes returned. It is
that index that is going to help ensure that the inserted tuple is
indeed unique.
This does not fix similar problems occurring with partitioned tables or
with named constraints. These problems will be fixed in follow-up
commits.
We have no user report of this problem, even though it exists in all
branches. Because of that and given that the fix is somewhat tricky, I
decided not to backpatch for now.
Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CANtu0ogv+6wqRzPK241jik4U95s1pW3MCZ3rX5ZqbFdUysz7Qw@mail.gmail.com
We don't have an official concept of temporary functions. (You can
make one explicitly in pg_temp, but then you have to explicitly
schema-qualify it on every call.) However, until now we were quite
laissez-faire about whether a non-temporary function could depend on
a temporary object, such as a temp table or view. If one does,
it will silently go away at end of session, due to the automatic
DROP ... CASCADE on the session's temporary objects. People have
complained that that's surprising; however, we can't really forbid
it because other people (including our own regression tests) rely
on being able to do it. Let's compromise by emitting a NOTICE
at CREATE FUNCTION time. This is somewhat comparable to our
ancient practice of emitting a NOTICE when forcing a view to
become temp because it depends on temp tables.
Along the way, refactor recordDependencyOnExpr() so that the
dependencies of an expression can be combined with other
dependencies, instead of being emitted separately and perhaps
duplicatively.
We should probably make the implementation of temp-by-default
views use the same infrastructure used here, but that's for
another patch. It's unclear whether there are any other object
classes that deserve similar treatment.
Author: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19cf6ae1-04cd-422c-a760-d7e75fe6cba9@uni-muenster.de
This request allows a support function to replace a function call
appearing in FROM (typically a set-returning function) with an
equivalent SELECT subquery. The subquery will then be subject
to the planner's usual optimizations, potentially allowing a much
better plan to be generated. While the planner has long done this
automatically for simple SQL-language functions, it's now possible
for extensions to do it for functions outside that group.
Notably, this could be useful for functions that are presently
implemented in PL/pgSQL and work by generating and then EXECUTE'ing
a SQL query.
Author: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/09de6afa-c33d-4d94-a5cb-afc6cea0d2bb@illuminatedcomputing.com
The existing range_minus function raises an exception when the range is
"split", because then the result can't be represented by a single range.
For example '[0,10)'::int4range - '[4,5)' would be '[0,4)' and '[5,10)'.
This commit adds new set-returning functions so that callers can get
results even in the case of splits. There is no risk of an exception for
multiranges, but a set-returning function lets us handle them the same
way we handle ranges.
Both functions return zero results if the subtraction would give an
empty range/multirange.
The main use-case for these functions is to implement UPDATE/DELETE FOR
PORTION OF, which must compute the application-time of "temporal
leftovers": the part of history in an updated/deleted row that was not
changed. To preserve the untouched history, we will implicitly insert
one record for each result returned by range/multirange_minus_multi.
Using a set-returning function will also let us support user-defined
types for application-time update/delete in the future.
Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/ec498c3d-5f2b-48ec-b989-5561c8aa2024%40illuminatedcomputing.com
This changes a few union members that only existed to ensure
alignments and replaces them with the C11 alignas specifier.
This change only uses fundamental alignments (meaning approximately
alignments of basic types), which all C11 compilers must support.
There are opportunities for similar changes using extended alignments,
for example in PGIOAlignedBlock, but these are not necessarily
supported by all compilers, so they are kept as a separate change.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/46f05236-d4d4-4b4e-84d4-faa500f14691%40eisentraut.org
Commit 4bea91f added support for "COPY table TO" with partitioned
tables. This commit enhances initial table synchronization in logical
replication to use "COPY table TO" for partitioned tables if possible,
instead of "COPY (SELECT ...) TO" variant, improving performance.
Author: Ajin Cherian <itsajin@gmail.com>
Discussion: https://postgr.es/m/CAFPTHDY=w+xmEof=yyjhbDzaLxhBkoBzKcksEofXcT6EcjMbtQ@mail.gmail.com
This conforms more closely with the style of other struct initializers
in the code base. Initializing multiple fields on a single line is
unpopular in part because pgindent won't permit a space after the comma
before the next field's period.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://postgr.es/m/87see87fnq.fsf%40wibble.ilmari.org
During pruning and freezing in phase I of vacuum, we delay clearing
all_visible and all_frozen in the presence of dead items. This allows
opportunistic freezing if the page would otherwise be fully frozen,
since those dead items are later removed in vacuum phase III.
To move the VM update into the same WAL record that
prunes and freezes tuples, we must know whether the page will
be marked all-visible/all-frozen before emitting WAL. Previously we
waited until after emitting WAL to update all_visible/all_frozen to
their correct values.
The only barrier to updating these flags immediately after deciding
whether to opportunistically freeze was that while emitting WAL for a
record freezing tuples, we use the pre-corrected value of all_frozen to
compute the snapshot conflict horizon. By determining the conflict
horizon earlier, we can update the flags immediately after making the
opportunistic freeze decision.
This is required to set the VM in the XLOG_HEAP2_PRUNE_VACUUM_SCAN
record emitted by pruning and freezing.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
Previously, we relied on all_visible and all_frozen being used together
to ensure that all_frozen was correct, but it is better to keep both
fields updated.
Future changes will separate their usage, so we should not depend on
all_visible for the validity of all_frozen.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
heap_page_prune_and_freeze() had accumulated an unwieldy number of input
parameters and upcoming work to handle VM updates in this function will
add even more.
Introduce a new PruneFreezeParams struct to group the function’s input
parameters, improving readability and maintainability.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/yn4zp35kkdsjx6wf47zcfmxgexxt4h2og47pvnw2x5ifyrs3qc%407uw6jyyxuyf7
This commit updates encode() and decode() so that when an invalid encoding
is specified, their error message includes a HINT listing all valid encodings.
This helps users quickly see which encodings are supported without needing
to consult the documentation.
Author: Shinya Sugamoto <shinya34892@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAAe3y+99sfPv8UDF1VM-rC1i5HBdqxUh=2HrbJJFm2+i=1OwOw@mail.gmail.com
If both sides of the operator have most-common-value statistics,
eqjoinsel wants to check which MCVs have matches on the other side.
Formerly it did this with a dumb compare-all-the-entries loop,
which had O(N^2) behavior for long MCV lists. When that code was
written, twenty-plus years ago, that seemed tolerable; but nowadays
people frequently use much larger statistics targets, so that the
O(N^2) behavior can hurt quite a bit.
To add insult to injury, when asked for semijoin semantics, the
entire comparison loop was done over, even though we frequently
know that it will yield exactly the same results.
To improve matters, switch to using a hash table to perform the
matching. Testing suggests that depending on the data type, we may
need up to about 100 MCVs on each side to amortize the extra costs
of setting up the hash table and performing hash-value computations;
so continue to use the old looping method when there are fewer MCVs
than that.
Also, refactor so that we don't repeat the matching work unless
we really need to, which occurs only in the uncommon case where
eqjoinsel_semi decides to truncate the set of inner MCVs it
considers. The refactoring also got rid of the need to use the
presented operator's commutator. Real-world operators that are
using eqjoinsel should pretty much always have commutators, but
at the very least this saves a few syscache lookups.
Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Co-authored-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20ea8bf5-3569-4e46-92ef-ebb2666debf6@tantorlabs.com
The fix for bug #19055 (commit b0cc0a71e) allowed CTE references in
sub-selects within aggregate functions to affect the semantic levels
assigned to such aggregates. It turns out this broke some related
cases, leading to assertion failures or strange planner errors such
as "unexpected outer reference in CTE query". After experimenting
with some alternative rules for assigning the semantic level in
such cases, we've come to the conclusion that changing the level
is more likely to break things than be helpful.
Therefore, this patch undoes what b0cc0a71e changed, and instead
installs logic to throw an error if there is any reference to a
CTE that's below the semantic level that standard SQL rules would
assign to the aggregate based on its contained Var and Aggref nodes.
(The SQL standard disallows sub-selects within aggregate functions,
so it can't reach the troublesome case and hence has no rule for
what to do.)
Perhaps someone will come along with a legitimate query that this
logic rejects, and if so probably the example will help us craft
a level-adjustment rule that works better than what b0cc0a71e did.
I'm not holding my breath for that though, because the previous
logic had been there for a very long time before bug #19055 without
complaints, and that bug report sure looks to have originated from
fuzzing not from real usage.
Like b0cc0a71e, back-patch to all supported branches, though
sadly that no longer includes v13.
Bug: #19106
Reported-by: Kamil Monicz <kamil@monicz.dev>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19106-9dd3668a0734cd72@postgresql.org
Backpatch-through: 14
This file is written for 8-space tabs, since we expect that most
users who edit their configuration files use 8-space tabs.
However, most of PostgreSQL is written for 4-space tabs, and at
least one popular web interface defaults to 4-space tabs. Rather
than trying to standardize on a particular tab width for this file,
let's just switch to spaces.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aReNUKdMgKxLqmq7%40nathan
We need separate pairing heaps for different WaitLSNType's, because there
might be waiters for different LSN's at the same time. However, one process
can wait only for one type of LSN at a time. So, no need for inHeap
and heapNode fields to be arrays.
Discussion: https://postgr.es/m/CAPpHfdsBR-7sDtXFJ1qpJtKiohfGoj%3DvqzKVjWxtWsWidx7G_A%40mail.gmail.com
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
This patch renames the sync_error_count column to sync_table_error_count
in the pg_stat_subscription_stats view. The new name makes the purpose
explicit now that a separate column exists to track sequence
synchronization errors.
Additionally, the column seq_sync_error_count is renamed to
sync_seq_error_count to maintain a consistent naming pattern, making it
easier for users to group, and query synchronization related counters.
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CALDaNm3WwJmz=-4ybTkhniB-Nf3qmFG9Zx1uKjyLLoPF5NYYXA@mail.gmail.com
Remove bogus stripping of RelabelTypes: that can result in building
an output SAOP tree with incorrect exposed exprType for the operands,
which might confuse polymorphic operators. Moreover it demonstrably
prevents folding some OR-trees to SAOPs when the RHS expressions
have different base types that were coerced to the same type by
RelabelTypes.
Reduce prohibition on type_is_rowtype to just disallow type RECORD.
We need that because otherwise we would happily fold multiple RECORD
Consts into a RECORDARRAY Const even if they aren't the same record
type. (We could allow that perhaps, if we checked that they all have
the same typmod, but the case doesn't seem worth that much effort.)
However, there is no reason at all to disallow the transformation
for named composite types, nor domains over them: as long as we can
find a suitable array type we're good.
Remove some assertions that seem rather out of place (it's not
this code's duty to verify that the RestrictInfo structure is
sane). Rewrite some comments.
The issues with RelabelType stripping seem severe enough to
back-patch this into v18 where the code was introduced.
Author: Tender Wang <tndrwang@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAHewXN=aH7GQBk4fXU-WaEeVmQWUmBAeNyBfJ3VKzPphyPKUkQ@mail.gmail.com
Backpatch-through: 18
PostgreSQL 18 deprecated password_encryption='md5', but the
comments for this GUC in the sample configuration file did
not mention the deprecation. Update comments with a notice
to make as many users as possible aware of it. Also add a
comment to the related md5_password_warnings GUC while there.
Author: Michael Banck <mbanck@gmx.net>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Backpatch-through: 18
The existing format of pg_dependencies uses a single-object JSON
structure, with each key value embedding all the knowledge about the
set attributes tracked, like:
{"1 => 5": 1.000000, "5 => 1": 0.423130}
While this is a very compact format, it is confusing to read and it is
difficult to manipulate the values within the object, particularly when
tracking multiple attributes.
The new output format introduced in this commit is a JSON array of
objects, with:
- A key named "degree", with a float value.
- A key named "attributes", with an array of attribute numbers.
- A key named "dependency", with an attribute number.
The values use the same underlying type as previously when printed, with
a new output format that shows now as follows:
[{"degree": 1.000000, "attributes": [1], "dependency": 5},
{"degree": 0.423130, "attributes": [5], "dependency": 1}]
This new format will become handy for a follow-up set of changes, so as
it becomes possible to inject extended statistics rather than require an
ANALYZE, like in a dump/restore sequence or after pg_upgrade on a new
cluster.
This format has been suggested by Tomas Vondra. The key names are
defined in the header introduced by 1f927cce44, to ease the
integration of frontend-specific changes that are still under
discussion. (Again a personal note: if anybody comes up with better
name for the keys, of course feel free.)
The bulk of the changes come from the regression tests, where
jsonb_pretty() is now used to make the outputs generated easier to
parse.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
The existing format of pg_ndistinct uses a single-object JSON structure
where each key is itself a comma-separated list of attnums, like:
{"3, 4": 11, "3, 6": 11, "4, 6": 11, "3, 4, 6": 11}
While this is a very compact format, it is confusing to read and it is
difficult to manipulate the values within the object.
The new output format introduced in this commit is an array of objects,
with:
- A key named "attributes", that contains an array of attribute numbers.
- A key named "ndistinct", represented as an integer.
The values use the same underlying type as previously when printed, with
a new output format that shows now as follows:
[{"ndistinct": 11, "attributes": [3,4]},
{"ndistinct": 11, "attributes": [3,6]},
{"ndistinct": 11, "attributes": [4,6]},
{"ndistinct": 11, "attributes": [3,4,6]}]
This new format will become handy for a follow-up set of changes, so as
it becomes possible to inject extended statistics rather than require an
ANALYZE, like in a dump/restore sequence or after pg_upgrade on a new
cluster.
This format has been suggested by Tomas Vondra. The key names are
defined in a new header, to ease with the integration of
frontend-specific changes that are still under discussion. (Personal
note: I am not specifically wedded to these key names, but if there are
better name suggestions for this release, feel free.)
The bulk of the changes come from the regression tests, where
jsonb_pretty() is now used to make the outputs generated easier to
parse.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
Until d2ea2d310d, the PS_USE_PS_STRINGS
option was used on the GNU/Hurd. As this option got removed and
PS_USE_CLOBBER_ARGV appears to work fine nowadays on the Hurd, define
this one to re-enable process title changes on this platform.
In the 14 and 15 branches, the existing test for __hurd__ (added 25
years ago by commit 209aa77d, removed in 16 by the above commit) is left
unchanged for now as it was activating slightly different code paths and
would need investigation by a Hurd user.
Author: Michael Banck <mbanck@debian.org>
Discussion: https://postgr.es/m/CA%2BhUKGJMNGUAqf27WbckYFrM-Mavy0RKJvocfJU%3DJ2XcAZyv%2Bw%40mail.gmail.com
Backpatch-through: 16
"long" is 32 bits on Windows 64-bit. Switch to a datatype that's 64-bit
on all platforms. While we're there, use an unsigned type as these
fields count things that have occurred, of which it's not possible to
have negative numbers of.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/CAApHDvoGFjSA3aNyVQ3ivbyc4ST=CC5L-_VjEUQ92HbE2Cxovg@mail.gmail.com
When instrumenting a MERGE command containing both WHEN NOT MATCHED BY
SOURCE and WHEN NOT MATCHED BY TARGET actions using EXPLAIN ANALYZE, a
concurrent update of the target relation could lead to an Assert
failure in show_modifytable_info(). In a non-assert build, this would
lead to an incorrect value for "skipped" tuples in the EXPLAIN output,
rather than a crash.
This could happen if the concurrent update caused a matched row to no
longer match, in which case ExecMerge() treats the single originally
matched row as a pair of not matched rows, and potentially executes 2
not-matched actions for the single source row. This could then lead to
a state where the number of rows processed by the ModifyTable node
exceeds the number of rows produced by its source node, causing
"skipped_path" in show_modifytable_info() to be negative, triggering
the Assert.
Fix this in ExecMergeMatched() by incrementing the instrumentation
tuple count on the source node whenever a concurrent update of this
kind is detected, if both kinds of merge actions exist, so that the
number of source rows matches the number of actions potentially
executed, and the "skipped" tuple count is correct.
Back-patch to v17, where support for WHEN NOT MATCHED BY SOURCE
actions was introduced.
Bug: #19111
Reported-by: Dilip Kumar <dilipbalaut@gmail.com>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/19111-5b06624513d301b3@postgresql.org
Backpatch-through: 17
WaitLSNWakeup() incorrectly returned early when called with
InvalidXLogRecPtr (meaning "wake all waiters"), because the fast-path
check compared minWaitedLSN > 0 without validating currentLSN first.
This caused WAIT FOR LSN commands to wait indefinitely during standby
promotion until random signals woke them.
Add an XLogRecPtrIsValid() check before the comparison so that
InvalidXLogRecPtr bypasses the fast-path and wakes all waiters immediately.
Discussion: https://postgr.es/m/CABPTF7UieOYbOgH3EnQCasaqcT1T4N6V2wammwrWCohQTnD_Lw%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
All settings in this file should be commented out. In addition to
fixing that, also fix the indentation for this line.
Oversight in commit c758119e5b.
Reported-by: Daniel Gustafsson <daniel@yesql.se>
Author: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/19727040-3EE4-4719-AF4F-2548544113D7%40yesql.se
Backpatch-through: 18
Commit 5e4fcbe531 added a check_rights parameter to this function
for use by ALTER TABLE commands that re-create statistics objects.
However, we intentionally ignore check_rights when verifying
relation ownership because this function's lookup could return a
different answer than the caller's. This commit adds a note to
this effect so that we remember it down the road.
Reviewed-by: Noah Misch <noah@leadboat.com>
Backpatch-through: 14
Path expansion might expose characters like spaces which would cause
command failure, so double-quote the examples. While %f doesn't need
quoting since it uses a fixed character set, it is best to be
consistent.
Discussion: https://postgr.es/m/aROPCQCfvKp9Htk4@momjian.us
Backpatch-through: master
This reverts commit 1fd981f053, based on concerns that the logging
improvements do not justify the protocol breakage of dropping an unnamed
portal once its execution has completed.
It seems unlikely that one would try to send an execute or describe
message after the portal has been used, but if they do such
post-completion messages would not be able to process as the previous
versions. Let's revert this change for now so as we keep compatibility
and consider a different solution.
The tests added by 76bba03312 track the pre-1fd981f05369 behavior, and
are still valid.
Discussion: https://postgr.es/m/CA+TgmoYFJyJNQw3RT7veO3M2BWRE9Aw4hprC5rOcawHZti-f8g@mail.gmail.com
PostgreSQL's Windows port has never been able to handle files larger
than 2GB due to the use of off_t for file offsets, only 32-bit on
Windows. This causes signed integer overflow at exactly 2^31 bytes when
trying to handle files larger than 2GB, for the routines touched by this
commit.
Note that large files are forbidden by ./configure (3c6248a828) and
meson (recent change, see 79cd66f28c). This restriction also exists
in v16 and older versions for the now-dead MSVC scripts.
The code base already defines pgoff_t as __int64 (64-bit) on Windows for
this purpose, and some function declarations in headers use it, but many
internals still rely on off_t. This commit switches more routines to
use pgoff_t, offering more portability, for areas mainly related to file
extensions and storage.
These are not critical for WAL segments yet, which have currently a
maximum size allowed of 1GB (well, this opens the door at allowing a
larger size for them). This matters more for segment files if we want
to lift the large file restriction in ./configure and meson in the
future, which would make sense to remove once/if all traces of off_t are
gone from the tree. This can additionally matter for out-of-core code
that may want files larger than 2GB in places where off_t is four bytes
in size.
Note that off_t is still used in other parts of the tree like
buffile.c, WAL sender/receiver, base backup, pg_combinebackup, etc.
These other code paths can be addressed separately, and their update
will be required if we want to remove the large file restriction in the
future. This commit is a good first cut in itself towards more
portability, hopefully.
On Unix-like systems, pgoff_t is defined as off_t, so this change only
affects Windows behavior.
Author: Bryan Green <dbryan.green@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/0f238ff4-c442-42f5-adb8-01b762c94ca1@gmail.com
pg_logical_slot_get_changes_guts() previously assigned InvalidXLogRecPtr to
the local variable upto_nchanges, which is of type int32, not XLogRecPtr.
While this caused no functional issue since InvalidXLogRecPtr is defined as 0,
it was semantically incorrect.
This commit fixes the issue by updating pg_logical_slot_get_changes_guts()
to set upto_nchanges to 0 instead of InvalidXLogRecPtr.
No backpatch is needed, as the previous behavior was harmless.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Steven Niu <niushiji@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHKHuR5NGnGxU3+ebz7cbC1ZAR=AgG4Bueq==Lj6iX8Sw@mail.gmail.com
This comment seems to refer to some stuff that was removed during
development in 2005.
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/aRJFDxKJLFE_1Iai%40nathan
If DSM entry initialization fails, backends could try to use an
uninitialized DSM segment, DSA, or dshash table (since the entry is
still added to the registry). To fix, keep track of whether
initialization completed, and ERROR if a backend tries to attach to
an uninitialized entry. We could instead retry initialization as
needed, but that seemed complicated, error prone, and unlikely to
help most cases. Furthermore, such problems probably indicate a
coding error.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/dd36d384-55df-4fc2-825c-5bc56c950fa9%40gmail.com
Backpatch-through: 17
Before we started to freeze async notify entries (commit 8eeb4a0f7c),
no one looked at the 'xid' on an entry with invalid 'dboid'. But now
we might actually need to freeze it later. Initialize them with
InvalidTransactionId to begin with, to avoid that work later.
Álvaro pointed this out in review of commit 8eeb4a0f7c, but I forgot
to include this change there.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://www.postgresql.org/message-id/202511071410.52ll56eyixx7@alvherre.pgsql
Backpatch-through: 14
Previous commit fixed a bug where VACUUM would truncate the CLOG
that's still needed to check the commit status of XIDs in the async
notify queue, but as mentioned in the commit message, it wasn't a full
fix. If a backend is executing asyncQueueReadAllNotifications() and
has just made a local copy of an async SLRU page which contains old
XIDs, vacuum can concurrently truncate the CLOG covering those XIDs,
and the backend still gets an error when it calls
TransactionIdDidCommit() on those XIDs in the local copy. This commit
fixes that race condition.
To fix, hold the SLRU bank lock across the TransactionIdDidCommit()
calls in NOTIFY processing.
Per Tom Lane's idea. Backpatch to all supported versions.
Reviewed-by: Joel Jacobson <joel@compiler.org>
Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Discussion: https://www.postgresql.org/message-id/2759499.1761756503@sss.pgh.pa.us
Backpatch-through: 14
The async notification queue contains the XID of the sender, and when
processing notifications we call TransactionIdDidCommit() on the
XID. But we had no safeguards to prevent the CLOG segments containing
those XIDs from being truncated away. As a result, if a backend didn't
for some reason process its notifications for a long time, or when a
new backend issued LISTEN, you could get an error like:
test=# listen c21;
ERROR: 58P01: could not access status of transaction 14279685
DETAIL: Could not open file "pg_xact/000D": No such file or directory.
LOCATION: SlruReportIOError, slru.c:1087
To fix, make VACUUM "freeze" the XIDs in the async notification queue
before truncating the CLOG. Old XIDs are replaced with
FrozenTransactionId or InvalidTransactionId.
Note: This commit is not a full fix. A race condition remains, where a
backend is executing asyncQueueReadAllNotifications() and has just
made a local copy of an async SLRU page which contains old XIDs, while
vacuum concurrently truncates the CLOG covering those XIDs. When the
backend then calls TransactionIdDidCommit() on those XIDs from the
local copy, you still get the error. The next commit will fix that
remaining race condition.
This was first reported by Sergey Zhuravlev in 2021, with many other
people hitting the same issue later. Thanks to:
- Alexandra Wang, Daniil Davydov, Andrei Varashen and Jacques Combrink
for investigating and providing reproducable test cases,
- Matheus Alcantara and Arseniy Mukhin for review and earlier proposed
patches to fix this,
- Álvaro Herrera and Masahiko Sawada for reviews,
- Yura Sokolov aka funny-falcon for the idea of marking transactions
as committed in the notification queue, and
- Joel Jacobson for the final patch version. I hope I didn't forget
anyone.
Backpatch to all supported versions. I believe the bug goes back all
the way to commit d1e027221d, which introduced the SLRU-based async
notification queue.
Discussion: https://www.postgresql.org/message-id/16961-25f29f95b3604a8a@postgresql.org
Discussion: https://www.postgresql.org/message-id/18804-bccbbde5e77a68c2@postgresql.org
Discussion: https://www.postgresql.org/message-id/CAK98qZ3wZLE-RZJN_Y%2BTFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com
Backpatch-through: 14
Previously, if async notify processing encountered an error, we would
report the error to the client and advance our read position past the
offending entry to prevent trying to process it over and over
again. Trying to continue after an error has a few problems however:
- We have no way of telling the client that a notification was
lost. They get an ERROR, but that doesn't tell you much. As such,
it's not clear if keeping the connection alive after losing a
notification is a good thing. Depending on the application logic,
missing a notification could cause the application to get stuck
waiting, for example.
- If the connection is idle, PqCommReadingMsg is set and any ERROR is
turned into FATAL anyway.
- We bailed out of the notification processing loop on first error
without processing any subsequent notifications. The subsequent
notifications would not be processed until another notify interrupt
arrives. For example, if there were two notifications pending, and
processing the first one caused an ERROR, the second notification
would not be processed until someone sent a new NOTIFY.
This commit changes the behavior so that any ERROR while processing
async notifications is turned into FATAL, causing the client
connection to be terminated. That makes the behavior more consistent
as that's what happened in idle state already, and terminating the
connection is a clear signal to the application that it might've
missed some notifications.
The reason to do this now is that the next commits will change the
notification processing code in a way that would make it harder to
skip over just the offending notification entry on error.
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Discussion: https://www.postgresql.org/message-id/fedbd908-4571-4bbe-b48e-63bfdcc38f64@iki.fi
Backpatch-through: 14
This creates a src/backend/catalog/pg_tablespace.c supporting file
containing a new function get_tablespace_location(), which lets the code
underlying pg_tablespace_location() be reused for other purposes.
Author: Manni Wood <manni.wood@enterprisedb.com>
Author: Nishant Sharma <nishant.sharma@enterprisedb.com>
Reviewed-by: Vaibhav Dalvi <vaibhav.dalvi@enterprisedb.com>
Reviewed-by: Ian Lawrence Barwick <barwick@gmail.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAKWEB6rmnmGKUA87Zmq-s=b3Scsnj02C0kObQjnbL2ajfPWGEw@mail.gmail.com
The range for commit_siblings was incorrectly listed as starting on 1
instead of 0 in the sample configuration file. Backpatch down to all
supported branches.
Author: Man Zeng <zengman@halodbtech.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/tencent_53B70BA72303AE9C6889E78E@qq.com
Backpatch-through: 14
In order to make the errorhandling code in backend libpq be thread-
safe the global variable used by the certificate verification call-
back need to be replaced with passing private data.
This moves the threadsafety needle a little but forwards, the call
to strerror_r also needs to be replaced with the error buffer made
thread local. This is left as future work for when add the thread
primitives required for this to the tree.
Author: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/353226C7-97A1-4507-A380-36AA92983AE6@yesql.se
This new file is named pg_dependencies.c and includes all the code
directly related to the data type pg_dependencies, extracted from the
extended statistics code.
Some patches are under discussion to change its input and output
functions, and this separation makes the follow-up changes cleaner by
separating the logic related to the data type and the functional
dependencies statistics core logic in dependencies.c.
Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aQ2k8--a0FfwSwX9@paquier.xyz
This new file is named pg_ndistinct.c and includes all the code directly
related to the data type pg_ndistinct, extracted from the extended
statistics code.
Some patches are under discussion to change its input and output
functions, and this separation makes the follow-up changes cleaner by
separating the logic related to the data type and the multivariate
ndistinct coefficient core logic in mvdistinct.c.
Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aQ2k8--a0FfwSwX9@paquier.xyz
Previously we had both in code and comments. Keep the more common and
accepted variant.
Author: Chao Li <lic@highgo.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/5EBF1771-0566-4D08-9F9B-CDCDEF4BDC98@gmail.com
Previously, error messages for oversized injection point names, libraries,
and functions showed buffer sizes (64, 128, 128) instead of the usable
character limits (63, 127, 127) as it did not count for the
zero-terminated byte, which was confusing. These messages are adjusted
to show better the reality.
The limit enforced for the private area was also too strict by one byte,
as specifying a zone worth exactly INJ_PRIVATE_MAXLEN should be able to
work because three is no zero-terminated byte in this case.
This is a stylistic change (well, mostly, a private_area size of exactly
1024 bytes can be defined with this change, something that nobody seem
to care about based on the lack of complaints). However, this is a
testing facility let's keep the logic consistent across all the branches
where this code exists, as there is an argument in favor of out-of-core
extensions that use injection points.
Author: Xuneng Zhou <xunengzhou@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CABPTF7VxYp4Hny1h+7ejURY-P4O5-K8WZg79Q3GUx13cQ6B2kg@mail.gmail.com
Backpatch-through: 17
guc_var_compare() is invoked from qsort() on an array of struct
config_generic, but the function accesses these directly as
strings (char *). This relies on the name being the first field, so
this works. But we can write this more clearly by using the struct
and then accessing the field through the struct. Before the
reorganization of the GUC structs (commit a13833c35f), the old code
was probably more convenient, but now we can write this more clearly
and correctly.
After this change, it is no longer required that the name is the first
field in struct config_generic, so remove that comment.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/2c961fa1-14f6-44a2-985c-e30b95654e8d%40eisentraut.org
This omission allowed table owners to create statistics in any
schema, potentially leading to unexpected naming conflicts. For
ALTER TABLE commands that require re-creating statistics objects,
skip this check in case the user has since lost CREATE on the
schema. The addition of a second parameter to CreateStatistics()
breaks ABI compatibility, but we are unaware of any impacted
third-party code.
Reported-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Security: CVE-2025-12817
Backpatch-through: 13
It seems plausible that someone might want to experiment with
different values. The pressing reason though is that I'm reviewing a
patch that requires pg_upgrade to manipulate SLRU files. That patch
needs to access SLRU_PAGES_PER_SEGMENT from pg_upgrade code, and
slru.h, where SLRU_PAGES_PER_SEGMENT is currently defined, cannot be
included from frontend code. Moving it to pg_config_manual.h makes it
accessible.
Now that it's a little more likely that someone might change
SLRU_PAGES_PER_SEGMENT, add a cluster compatibility check for it.
Bump catalog version because of the new field in the control file.
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://www.postgresql.org/message-id/c7a4ea90-9f7b-4953-81be-b3fcb47db057@iki.fi
This warning was disabled in meson.build (warning 4273). If you
enable it, it looks like this:
../src/backend/utils/misc/ps_status.c(27): warning C4273: '__p__environ': inconsistent dll linkage
C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h(1158): note: see previous definition of '__p__environ'
The declaration in ps_status.c was:
#if !defined(WIN32) || defined(_MSC_VER)
extern char **environ;
#endif
The declaration in the OS header file is:
_DCRTIMP char*** __cdecl __p__environ (void);
#define _environ (*__p__environ())
So it is evident that this could be problematic.
The old declaration was required by the old MSVCRT library, but we
don't support that anymore with MSVC.
To fix, disable the re-declaration in ps_status.c, and also in some
other places that use the same code pattern but didn't trigger the
warning.
Then we can also re-enable the warning (delete the disablement in
meson.build).
Reviewed-by: Bryan Green <dbryan.green@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/bf060644-47ff-441b-97cf-c685d0827757@eisentraut.org
This commit adds a new column, seq_sync_error_count, to the
pg_stat_subscription_stats view. This counter tracks the number of errors
encountered by the sequence synchronization worker during operation.
Since a single worker handles the synchronization of all sequences, this
value may reflect errors from multiple sequences. This addition improves
observability of sequence synchronization behavior and helps monitor
potential issues during replication.
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
The prior commit made it legal to modify BufferDesc.state while the buffer
header spinlock is held. This allows us to replace the CAS loop
inUnpinBufferNoOwner() with an atomic sub. This improves scalability
significantly. See the prior commits for more background.
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Until now BufferDesc.state was not allowed to be modified while the buffer
header spinlock was held. This meant that operations like unpinning buffers
needed to use a CAS loop, waiting for the buffer header spinlock to be
released before updating.
The benefit of that restriction is that it allowed us to unlock the buffer
header spinlock with just a write barrier and an unlocked write (instead of a
full atomic operation). That was important to avoid regressions in
48354581a4. However, since then the hottest buffer header spinlock uses have
been replaced with atomic operations (in particular, the most common use of
PinBuffer_Locked(), in GetVictimBuffer() (formerly in BufferAlloc()), has been
removed in 5e89985928).
This change will allow, in a subsequent commit, to release buffer pins with a
single atomic-sub operation. This previously was not possible while such
operations were not allowed while the buffer header spinlock was held, as an
atomic-sub would not have allowed a race-free check for the buffer header lock
being held.
Using atomic-sub to unpin buffers is a nice scalability win, however it is not
the primary motivation for this change (although it would be sufficient). The
primary motivation is that we would like to merge the buffer content lock into
BufferDesc.state, which will result in more frequent changes of the state
variable, which in some situations can cause a performance regression, due to
an increased CAS failure rate when unpinning buffers. The regression entirely
vanishes when using atomic-sub.
Naively implementing this would require putting CAS loops in every place
modifying the buffer state while holding the buffer header lock. To avoid
that, introduce UnlockBufHdrExt(), which can set/add flags as well as the
refcount, together with releasing the lock.
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
A couple of ereports were making use of StringInfos as temporary storage
for the portions of the WARNING message. One was doing this to avoid
having 2 separate ereports. This was all fairly unnecessary and
resulted in more code rather than less code.
Refactor out the additional StringInfos and make
check_publications_origin_tables() use 2 ereports.
In passing, adjust pubnames to become a stack-allocated StringInfoData to
avoid having to palloc the temporary StringInfoData. This follows on
from the efforts made in 6d0eba662.
Author: Mats Kindahl <mats.kindahl@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/0b381b02-cab9-41f9-a900-ad6c8d26c1fc%40gmail.com
Now that commit 06edbed478 has introduced XLogRecPtrIsValid(), we can
use that instead of:
- XLogRecPtrIsInvalid()
- direct comparisons with InvalidXLogRecPtr
- direct comparisons with literal 0
This makes the code more consistent.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aQB7EvGqrbZXrMlg@ip-10-97-1-34.eu-west-3.compute.internal
Stored generated columns are not yet computed when the filtering
happens, so we need to prohibit them to avoid incorrect behavior.
Virtual generated columns currently error out ("unexpected virtual
generated column reference"). They could probably work if we expand
them in the right place, but for now let's keep them consistent with
the stored variant. This doesn't change the behavior, it only gives a
nicer error message.
Co-authored-by: jian he <jian.universality@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CACJufxHb8YPQ095R_pYDr77W9XKNaXg5Rzy-WP525mkq+hRM3g@mail.gmail.com
Before commit e25626677f, spinlocks were implemented using semaphores
on some platforms (--disable-spinlocks). That made it necessary to
initialize semaphores early, before any spinlocks could be used. Now
that we don't support --disable-spinlocks anymore, we can allocate the
shared memory needed for semaphores the same way as other shared
memory structures. Since the semaphores are used only in the PGPROC
array, move the semaphore shmem size estimation and initialization
calls to ProcGlobalShmemSize() and InitProcGlobal().
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5seSZpPx-znjidVZNzdagGHOk06F+Ds88MpPUbxd1kTaA@mail.gmail.com
Test failure on buildfarm member prion:
The test failed due to an unexpected LOCATION: line appearing between the
WARNING and ERROR messages. This occurred because the prion machine uses
log_error_verbosity = verbose, which includes additional context in error
messages. The test was originally checking for both WARNING and ERROR
messages in sequence sync, but the extra LOCATION: line disrupted this
pattern. To make the test robust across different verbosity settings, it
now only checks for the presence of the WARNING message after the test,
which is sufficient to validate the intended behavior.
Failure to sync sequences with quoted names:
The previous implementation did not correctly quote sequence names when
querying remote information, leading to failures when quoted sequence
names were used. This fix ensures that sequence names are properly quoted
during remote queries, allowing sequences with quoted identifiers to be
synced correctly.
Author: Vignesh C <vignesh21@gmail.com>
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CALDaNm0WcdSCoNPiE-5ek4J2dMJ5o111GPTzKCYj9G5i=ONYtQ@mail.gmail.com
Discussion: https://postgr.es/m/CAOzEurQOSN=Zcp9uVnatNbAy=2WgMTJn_DYszYjv0KUeQX_e_A@mail.gmail.com
Like relation_stats.c, these structures are used to track the argument
number, names and types of pg_restore_attribute_stats() and
pg_clear_attribute_stats().
Extracted from a larger patch by the same author, reworded by me for
consistency with relation_stats.c.
Author: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
03d40e4b5 added code to provide better row estimates for when a UNION
query ended up only with a single child due to other children being
found to be dummy rels. In that case, ordinarily it would be ok to call
estimate_num_groups() on the targetlist of the only child path, however
that's not safe to do if the UNION child is the result of some other set
operation as we generate targetlists containing Vars with varno==0 for
those, which estimate_num_groups() can't handle. This could lead to:
ERROR: XX000: no relation entry for relid 0
Fix this by avoiding doing this when the only child is the result of
another set operation. In that case we'll fall back on the
assume-all-rows-are-unique method.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/cfbc99e5-9d44-4806-ba3c-f36b57a85e21@gmail.com
Various places that were using StringInfo but didn't need that
StringInfo to exist beyond the scope of the function were using
makeStringInfo(), which allocates both a StringInfoData and the buffer it
uses as two separate allocations. It's more efficient for these cases to
use a StringInfoData on the stack and initialize it with initStringInfo(),
which only allocates the string buffer. This also simplifies the cleanup,
in a few cases.
Author: Mats Kindahl <mats.kindahl@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4379aac8-26f1-42f2-a356-ff0e886228d3@gmail.com
We've successfully used libsanitizer for awhile with the undefined
and alignment sanitizers, but with some other sanitizers (at least
thread and hwaddress) it crashes due to internal recursion before
it's fully initialized itself. It turns out that that's due to the
"__ubsan_default_options" hack installed by commit f686ae82f, and we
can fix it by ensuring that __ubsan_default_options is built without
any sanitizer instrumentation hooks.
Reported-by: Emmanuel Sibi <emmanuelsibi.mec@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Diagnosed-by: Emmanuel Sibi <emmanuelsibi.mec@gmail.com>
Fix-suggested-by: Jacob Champion <jacob.champion@enterprisedb.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/F7543B04-E56C-4D68-A040-B14CCBAD38F1@gmail.com
Discussion: https://postgr.es/m/dbf77bf7-6e54-ed8a-c4ae-d196eeb664ce@gmail.com
Backpatch-through: 16
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee to see
these changes are on standby.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why separate utility command seems appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitation in this aspect.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.
The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring
This infrastructure can be used by features that need to wait for WAL
operations to complete.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
In generate_orderedappend_paths(), the function does not handle the
case where the paths in total_subpaths and fractional_subpaths are
identical. This situation is not uncommon, and as a result, it may
generate two exactly identical ordered append paths.
Fix by checking whether total_subpaths and fractional_subpaths contain
the same paths, and skipping creation of the ordered append path for
the fractional case when they are identical.
Given the lack of field complaints about this, I'm a bit hesitant to
back-patch, but let's clean it up in HEAD.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-OYsgA75tGGiBARt87G0y_z_GBTSLrzudcJxAzndYkYw@mail.gmail.com
In generate_orderedappend_paths(), there is an assumption that a child
relation's row estimate is always greater than zero. There is an
Assert verifying this assumption, and the estimate is also used to
convert an absolute tuple count into a fraction.
However, this assumption is not always valid -- for example, upper
relations can have their row estimates unset, resulting in a value of
zero. This can cause an assertion failure in debug builds or lead to
the tuple fraction being computed as infinity in production builds.
To fix, use the row estimate from the cheapest_total path to compute
the tuple fraction. The row estimate in this path should already have
been forced to a valid value.
In passing, update the comment for generate_orderedappend_paths() to
note that the function also considers the cheapest-fractional case
when not all tuples need to be retrieved. That is, it collects all
the cheapest fractional paths and builds an ordered append path for
each interesting ordering.
Backpatch to v18, where this issue was introduced.
Bug: #19102
Reported-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/19102-93480667e1200169@postgresql.org
Backpatch-through: 18
This patch introduces sequence synchronization. Sequences that are synced
will have 2 states:
- INIT (needs [re]synchronizing)
- READY (is already synchronized)
A new sequencesync worker is launched as needed to synchronize sequences.
A single sequencesync worker is responsible for synchronizing all
sequences. It begins by retrieving the list of sequences that are flagged
for synchronization, i.e., those in the INIT state. These sequences are
then processed in batches, allowing multiple entries to be synchronized
within a single transaction. The worker fetches the current sequence
values and page LSNs from the remote publisher, updates the corresponding
sequences on the local subscriber, and finally marks each sequence as
READY upon successful synchronization.
Sequence synchronization occurs in 3 places:
1) CREATE SUBSCRIPTION
- The command syntax remains unchanged.
- The subscriber retrieves sequences associated with publications.
- Published sequences are added to pg_subscription_rel with INIT
state.
- Initiate the sequencesync worker to synchronize all sequences.
2) ALTER SUBSCRIPTION ... REFRESH PUBLICATION
- The command syntax remains unchanged.
- Dropped published sequences are removed from pg_subscription_rel.
- Newly published sequences are added to pg_subscription_rel with INIT
state.
- Initiate the sequencesync worker to synchronize only newly added
sequences.
3) ALTER SUBSCRIPTION ... REFRESH SEQUENCES
- A new command introduced for PG19 by f0b3573c3a.
- All sequences in pg_subscription_rel are reset to INIT state.
- Initiate the sequencesync worker to synchronize all sequences.
- Unlike "ALTER SUBSCRIPTION ... REFRESH PUBLICATION" command,
addition and removal of missing sequences will not be done in this
case.
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
Previously, unnamed portals were kept until the next Bind message or the
end of the transaction. This could cause temporary files to persist
longer than expected and make logging not reflect the actual SQL
responsible for the temporary file.
This patch changes exec_execute_message() to drop unnamed portals
immediately after execution to completion at the end of an Execute
message, making their removal more aggressive. This forces temporary
file cleanups to happen at the same time as the completion of the portal
execution, with statement logging correctly reflecting to which
statements these temporary files were attached to (see the diffs in the
TAP test updated by this commit for an idea).
The documentation is updated to describe the lifetime of unnamed
portals, and test cases are updated to verify temporary file removal and
proper statement logging after unnamed portal execution. This changes
how unnamed portals are handled in the protocol, hence no backpatch is
done.
Author: Frédéric Yhuel <frederic.yhuel@dalibo.com>
Co-Authored-by: Sami Imseih <samimseih@gmail.com>
Co-Authored-by: Mircea Cadariu <cadariu.mircea@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0tTrTUoEr3kDXCuKsvqYGq8OOHiBwoD-dyJocq95uEOTQ%40mail.gmail.com
The comment for ChangeVarNodes() refers to a parameter named
change_RangeTblRef, which does not exist in the code.
The comment for ChangeVarNodesExtended() contains an extra space,
while the comment for replace_relid_callback() has an awkward line
break and a typo.
This patch fixes these issues and revises some sentences for smoother
wording.
Oversights in commits ab42d643c and fc069a3a6.
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs480j16HC1JtjKCgj5WshivT8ZJYkOfTyZAM0POjFomJkg@mail.gmail.com
Backpatch-through: 18
These assertions may prove to become useful to make sure that no process
other than the startup process calls the routines where these checks are
added, as we expect that these do not interfere with a WAL receiver
switched to a "stopping" state by a startup process.
The assumption that only the startup process can use this code has
existed for many years, without a check enforcing it.
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/aQmGeVLYl51y1m_0@paquier.xyz
First, the assertions in assign_io_method() were the wrong way round. Second,
the lengthof() assertion checked the length of io_method_options, which is the
wrong array to check and is always longer than pgaio_method_ops_table.
While add it, add a static assert to ensure pgaio_method_ops_table and
io_method_options stay in sync.
Per coverity and Tom Lane.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 18
In 2a0faed9d7, which added JIT compilation support for expressions, I
accidentally used sizeof(LLVMBasicBlockRef *) instead of
sizeof(LLVMBasicBlockRef) as part of computing the size of an allocation. That
turns out to have no real negative consequences due to LLVMBasicBlockRef being
a pointer itself (and thus having the same size). It still is wrong and
confusing, so fix it.
Reported by coverity.
Backpatch-through: 13
Allow pg_newlocale_from_collation(C_COLLATION_OID) to work even if
there's no catalog access, which some extensions expect.
Not known to be a bug without extensions involved, but backport to 18.
Also corrects an issue in master with dummy_c_locale (introduced in
commit 5a38104b36) where deterministic was not set. That wasn't a bug,
but could have been if that structure was used more widely.
Reported-by: Alexander Kukushkin <cyberdemn@gmail.com>
Reviewed-by: Alexander Kukushkin <cyberdemn@gmail.com>
Discussion: https://postgr.es/m/CAFh8B=nj966ECv5vi_u3RYij12v0j-7NPZCXLYzNwOQp9AcPWQ@mail.gmail.com
Backpatch-through: 18
This commit adds CHECK_FOR_INTERRUPTS to the shared buffer iteration
loops in EvictRelUnpinnedBuffers and EvictAllUnpinnedBuffers. These
functions, used by pg_buffercache's pg_buffercache_evict_relation and
pg_buffercache_evict_all, can now be interrupted during long-running
operations.
Backpatch to version 18, where these functions and their corresponding
pg_buffercache functions were introduced.
Author: Yuhang Qiu <iamqyh@gmail.com>
Discussion: https://postgr.es/m/8DC280D4-94A2-4E7B-BAB9-C345891D0B78%40gmail.com
Backpatch-through: 18
03d40e4b5 allowed dummy UNION [ALL] children to be removed from the plan
by checking for is_dummy_rel(). That commit neglected to still account
for the relids from the dummy rel so that the correct UPPERREL_SETOP
RelOptInfo could be found and used for adding the Paths to.
Not doing this could result in processing of subsequent UNIONs using the
same RelOptInfo as a previously processed UNION, which could result in
add_path() freeing old Paths that are needed by the previous UNION.
The same fix was independently submitted (2 mins later) by Richard Guo.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/bee34aec-659c-46f1-9ab7-7bbae0b7616c@gmail.com
Commit a95e3d84c0 added ActiveSnapshot push+pop when processing
work-items (BRIN autosummarization), but forgot to handle the case of
a transaction failing during the run, which drops the snapshot untimely.
Fix by making the pop conditional on an element being actually there.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Backpatch-through: 13
Discussion: https://postgr.es/m/202511041648.nofajnuddmwk@alvherre.pgsql
The parallel GIN builds perform "freezing" of TID lists when merging
chunks built earlier. This means determining what part of the list can
no longer change, depending on the last received chunk. The frozen part
can be evicted from memory and written out.
The code attempted to freeze items right before merging the old and new
TID list, after already attempting to trim the current buffer. That
means part of the data may get frozen based on the new TID list, but
will be trimmed later (on next loop). This increases memory usage.
This inverts the order, so that we freeze data first (before trimming).
The benefits are likely relatively small, but it's also virtually free
with no other downsides.
Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
When building intermediate TID lists during parallel GIN builds, split
the sorted lists into smaller chunks, to limit the amount of memory
needed when merging the chunks later.
The leader may need to keep in memory up to one chunk per worker, and
possibly one extra chunk (before evicting some of the data). The code
processing item pointers uses regular palloc/repalloc calls, which means
it's subject to the MaxAllocSize (1GB) limit.
We could fix this by allowing huge allocations, but that'd require
changes in many places without much benefit. Larger chunks do not
actually improve performance, so the memory usage would be wasted.
Fixed by limiting the chunk size to not hit MaxAllocSize. Each worker
gets a fair share.
This requires remembering the number of participating workers, in a
place that can be accessed from the callback. Luckily, the bs_worker_id
field in GinBuildState was unused, so repurpose that.
Report by Greg Smith, investigation and fix by me. Batchpatched to 18,
where parallel GIN builds were introduced.
Reported-by: Gregory Smith <gregsmithpgsql@gmail.com>
Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
Backpatch-through: 18
We have never had a SET syntax that allows setting a GUC_LIST_INPUT
parameter to be an empty list. A locution such as
SET search_path = '';
doesn't mean that; it means setting the GUC to contain a single item
that is an empty string. (For search_path the net effect is much the
same, because search_path ignores invalid schema names and '' must be
invalid.) This is confusing, not least because configuration-file
entries and the set_config() function can easily produce empty-list
values.
We considered making the empty-string syntax do this, but that would
foreclose ever allowing empty-string items to be valid in list GUCs.
While there isn't any obvious use-case for that today, it feels like
the kind of restriction that might hurt someday. Instead, let's
accept the forbidden-up-to-now value NULL and treat that as meaning an
empty list. (An objection to this could be "what if we someday want
to allow NULL as a GUC value?". That seems unlikely though, and even
if we did allow it for scalar GUCs, we could continue to treat it as
meaning an empty list for list GUCs.)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrei Klychkov <andrew.a.klychkov@gmail.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Discussion: https://postgr.es/m/CA+mfrmwsBmYsJayWjc8bJmicxc3phZcHHY=yW5aYe=P-1d_4bg@mail.gmail.com
A generated column may end up being part of the partition key
expression, if it's specified as an expression e.g. "(<generated
column name>)" or if the partition key expression contains a whole-row
reference, even though we do not allow a generated column to be part
of partition key expression. Fix this hole.
Co-authored-by: jian he <jian.universality@gmail.com>
Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Discussion: https://www.postgresql.org/message-id/flat/CACJufxF%3DWDGthXSAQr9thYUsfx_1_t9E6N8tE3B8EqXcVoVfQw%40mail.gmail.com
It's possible to define BRIN indexes on functions that require a
snapshot to run, but the autosummarization feature introduced by commit
7526e10224 fails to provide one. This causes autovacuum to leave a
BRIN placeholder tuple behind after a failed work-item execution, making
such indexes less efficient. Repair by obtaining a snapshot prior to
running the task, and add a test to verify this behavior.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: Giovanni Fabris <giovanni.fabris@icon.it>
Reported-by: Arthur Nascimento <tureba@gmail.com>
Backpatch-through: 13
Discussion: https://postgr.es/m/202511031106.h4fwyuyui6fz@alvherre.pgsql
When the startup process switches from streaming to archive as WAL
source, we avoid calling ShutdownWalRcv() if the WAL receiver is not
streaming, based on WalRcvStreaming(). WALRCV_STOPPING is a state set
by ShutdownWalRcv(), called only by the startup process, meaning that it
should not be possible to reach this state while in
WaitForWALToBecomeAvailable().
This commit adds an assertion to make sure that a WAL receiver is never
in a WALRCV_STOPPING state should the startup process attempt to reset
InstallXLogFileSegmentActive.
Idea suggested by Noah Misch.
Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org
This has come up as useful as an alternative of WalRcvStreaming(), to be
able to do sanity checks based on the state of a WAL receiver. This
will be used in a follow-up commit.
Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org
Commit b4f584f9d2 (affecting v15~, later backpatched down to 13 as of
3635a0a35a) introduced an unconditional WAL receiver shutdown when
switching from streaming to archive WAL sources. This causes problems
during a timeline switch, when a WAL receiver enters WALRCV_WAITING
state but remains alive, waiting for instructions.
The unconditional shutdown can break some monitoring scenarios as the
WAL receiver gets repeatedly terminated and re-spawned, causing
pg_stat_wal_receiver.status to show a "streaming" instead of "waiting"
status, masking the fact that the WAL receiver is waiting for a new TLI
and a new LSN to be able to continue streaming.
This commit changes the WAL receiver behavior so as the shutdown becomes
conditional, with InstallXLogFileSegmentActive being always reset to
prevent the regression fixed by b4f584f9d2: only terminate the WAL
receiver when it is actively streaming (WALRCV_STREAMING,
WALRCV_STARTING, or WALRCV_RESTARTING). When in WALRCV_WAITING state,
just reset InstallXLogFileSegmentActive flag to allow archive
restoration without killing the process. WALRCV_STOPPED and
WALRCV_STOPPING are not reachable states in this code path. For the
latter, the startup process is the one in charge of setting
WALRCV_STOPPING via ShutdownWalRcv(), waiting for the WAL receiver to
reach a WALRCV_STOPPED state after switching walRcvState, so
WaitForWALToBecomeAvailable() cannot be reached while a WAL receiver is
in a WALRCV_STOPPING state.
A regression test is added to check that a WAL receiver is not stopped
on timeline jump, that fails when the fix of this commit is reverted.
Reported-by: Ryan Bird <ryanzxg@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org
Backpatch-through: 13
The new wal_fpi_bytes counter calculates the total amount of full page
images inserted in WAL records, in bytes. This commit adds this
information to VACUUM and ANALYZE logs alongside the existing counters,
building upon f9a09aa295.
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aQMMSSlFXy4Evxn3@paquier.xyz
The order in this list was previously pretty random and had grown
organically over time. This made it unnecessarily cumbersome to
maintain these lists, as there was no clear guidelines about where to
put new entries. Also, after the merger of the type-specific GUC
structs, the list still reflected the previous type-specific
super-order.
By using alphabetical order, the place for new entries becomes clear,
and often related entries will be listed close together.
This patch reorders the existing entries in guc_parameters.dat, and it
also augments the generation script to error if an entry is found at
the wrong place.
Note: The order is actually checked after lower-casing, to handle the
likes of "DateStyle".
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org
We've been nibbling away at removing uses of "long" for a long time,
since its width is platform-dependent. Here's one more: change the
remaining "long" fields in Plan nodes to Cardinality, since the three
surviving examples all represent group-count estimates. The upstream
planner code was converted to Cardinality some time ago; for example
the corresponding fields in Path nodes are type Cardinality, as are
the arguments of the make_foo_path functions. Downstream in the
executor, it turns out that these all feed to the table-size argument
of BuildTupleHashTable. Change that to "double" as well, and fix it
so that it safely clamps out-of-range values to the uint32 limit of
simplehash.h, as was not being done before.
Essentially, this is removing all the artificial datatype-dependent
limitations on these values from upstream processing, and applying
just one clamp at the moment where we're forced to do so by the
datatype choices of simplehash.h.
Also, remove BuildTupleHashTable's misguided attempt to enforce
work_mem/hash_mem_limit. It doesn't have enough information
(particularly not the expected tuple width) to do that accurately,
and it has no real business second-guessing the caller's choice.
For all these plan types, it's really the planner's responsibility
to not choose a hashed implementation if the hashtable is expected
to exceed hash_mem_limit. The previous patch improved the
accuracy of those estimates, and even if BuildTupleHashTable had
more information it should arrive at the same conclusions.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
For several types of plan nodes that use TupleHashTables, the
planner estimated the expected size of the table as basically
numEntries * (MAXALIGN(dataWidth) + MAXALIGN(SizeofHeapTupleHeader)).
This is pretty far off, especially for small data widths, because
it doesn't account for the overhead of the simplehash.h hash table
nor for any per-tuple "additional space" the plan node may request.
Jeff Janes noted a case where the estimate was off by about a factor
of three, even though the obvious hazards such as inaccurate estimates
of numEntries or dataWidth didn't apply.
To improve matters, create functions provided by the relevant executor
modules that can estimate the required sizes with reasonable accuracy.
(We're still not accounting for effects like allocator padding, but
this at least gets the first-order effects correct.)
I added functions that can estimate the tuple table sizes for
nodeSetOp and nodeSubplan; these rely on an estimator for
TupleHashTables in general, and that in turn relies on one for
simplehash.h hash tables. That feels like kind of a lot of mechanism,
but if we take any short-cuts we're violating modularity boundaries.
The other places that use TupleHashTables are nodeAgg, which took
pains to get its numbers right already, and nodeRecursiveunion.
I did not try to improve the situation for nodeRecursiveunion because
there's nothing to improve: we are not making an estimate of the hash
table size, and it wouldn't help us to do so because we have no
non-hashed alternative implementation. On top of that, our estimate
of the number of entries to be hashed in that module is so suspect
that we'd likely often choose the wrong implementation if we did have
two ways to do it.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
Add comments explaining when and where it is safe for nbtree to treat
row compare keys as if they were simple scalar inequality keys on the
row's most significant column. This is particularly important within
_bt_advance_array_keys, which deals with required inequality keys in a
general and uniform way, without any special handling for row compares.
Also spell out the implications of _bt_check_rowcompare's approach of
_conditionally_ evaluating lower-order row compare subkeys, particularly
when one of its lower-order subkeys might see NULL index tuple values
(these may or may not affect whether the qual as a whole is satisfied).
The behavior in this area isn't particularly intuitive, so these issues
seem worth going into.
In passing, add a few more defensive/documenting row comparison related
assertions to _bt_first and _bt_check_rowcompare.
Follow-up to commits bd3f59fd and ec986020.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Reviewed-By: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wznwkak_K7pcAdv9uH8ZfNo8QO7+tHXOaCUddMeTfaCCFw@mail.gmail.com
Backpatch-through: 18
_bt_first reliably uses the same equality key (on each index column) for
initial positioning purposes as the one that _bt_checkkeys can use to
end the scan following commit f09816a0. _bt_first no longer applies its
own independent rules to determine which initial positioning key to use
on each column (for equality and inequality keys alike). Preprocessing
is now fully in control of determining which keys start and end each
scan, ensuring that _bt_first and _bt_checkkeys have symmetric behavior.
Remove obsolete comments that described why _bt_first was expected to
use at least one of the available required equality keys for initial
positioning purposes. The rules in this area are now maximally strict
and uniform, so there's no reason to draw attention to equality keys.
Any column with a required equality key cannot have a redundant required
inequality key (nor can it have a redundant required equality key).
Oversight in commit f09816a0, which removed similar comments from
_bt_first, but missed these comments.
Author: Peter Geoghegan <pg@bowt.ie>
Backpatch-through: 18
Several functions in the codebase accept "Datum *" parameters but do
not modify the pointed-to data. These have been updated to take
"const Datum *" instead, improving type safety and making the
interfaces clearer about their intent. This change helps the compiler
catch accidental modifications and better documents immutability of
arguments.
Most of "Datum *" parameters have a pairing "bool *isnull" parameter,
they are constified as well.
No functional behavior is changed by this patch.
Author: Chao Li <lic@highgo.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2msfT0knvzUa72ZBwu9LR_RLY4on85w2a9YpE-o2By5HQ@mail.gmail.com
For all extant uses of TupleHashTables, execGrouping.c itself does
nothing with the "tablecxt" except to allocate new hash entries in it,
and the callers do nothing with it except to reset the whole context.
So this is an ideal use-case for a BumpContext, and the hash tables
are frequently big enough for the savings to be significant.
(Commit cc721c459 already taught nodeAgg.c this idea, but neglected
the other callers of BuildTupleHashTable.)
While at it, let's clean up some ill-advised leftovers from rebasing
TupleHashTables on simplehash.h:
* Many comments and variable names were based on the idea that the
tablecxt holds the whole TupleHashTable, whereas now it only holds the
hashed tuples (plus any caller-defined "additional storage"). Rename
to names like tuplescxt and tuplesContext, and adjust the comments.
Also adjust the memory context names to be like "<Foo> hashed tuples".
* Make ResetTupleHashTable() reset the tuplescxt rather than relying
on the caller to do so; that was fairly bizarre and seems like a
recipe for leaks. This is less efficient in the case where nodeAgg.c
uses the same tuplescxt for several different hashtables, but only
microscopically so because mcxt.c will short-circuit the extra resets
via its isReset flag. I judge the extra safety and intellectual
cleanliness well worth those few cycles.
* Remove the long-obsolete "allow_jit" check added by ac88807f9;
instead, just Assert that metacxt and tuplescxt are different.
We need that anyway for this definition of ResetTupleHashTable() to
be safe.
There is a side issue of the extent to which this change invalidates
the planner's estimates of hashtable memory consumption. However,
those estimates are already pretty bad, so improving them seems like
it can be a separate project. This change is useful to do first to
establish consistent executor behavior that the planner can expect.
A loose end not addressed here is that the "entrysize" calculation
in BuildTupleHashTable seems wrong: "sizeof(TupleHashEntryData) +
additionalsize" corresponds neither to the size of the simplehash
entries nor to the total space needed per tuple. It's questionable
why BuildTupleHashTable is second-guessing its caller's nbuckets
choice at all, since the original source of the number should have had
more information. But that all seems wrapped up with the planner's
estimation logic, so let's leave it for the planned followup patch.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Reported-by: David Rowley <dgrowleyml@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
Discussion: https://postgr.es/m/2268409.1761512111@sss.pgh.pa.us
The original is pretty baroque for no apparent reason; arguably, commit
2f9661311b should have done this. Noted while reviewing related code
for bug #18984. This is cosmetic (though I'm surprised that my compiler
generates shorter assembly this way), so no backpatch.
Discussion: https://postgr.es/m/18984-0f4778a6599ac3ae@postgresql.org
The new wal_fpi_bytes counter calculates the total amount of full page
images inserted in WAL records, in bytes. This commit exposes this
information in EXPLAIN (ANALYZE, WAL) alongside the existing counters,
for both the text and JSON/YAML outputs, building upon f9a09aa295.
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discusssion: https://postgr.es/m/CAOzEurQtZEAfg6P0kU3Wa-f9BWQOi0RzJEMPN56wNTOmJLmfaQ@mail.gmail.com
This commit reverts 818fefd8fd, that has been introduced to address a
an instability in some of the TAP tests due to the presence of random
standby snapshot WAL records, when slots are invalidated by
InvalidatePossiblyObsoleteSlot().
Anyway, this commit had also the consequence of introducing a behavior
regression. After 818fefd8fd, the code may determine that a slot needs
to be invalidated while it may not require one: the slot may have moved
from a conflicting state to a non-conflicting state between the moment
when the mutex is released and the moment when we recheck the slot, in
InvalidatePossiblyObsoleteSlot(). Hence, the invalidations may be more
aggressive than they actually have to.
105b2cb336 has tackled the test instability in a way that should be
hopefully sufficient for the buildfarm, even for slow members:
- In v18, the test relies on an injection point that bypasses the
creation of the random records generated for standby snapshots,
eliminating the random factor that impacted the test. This option was
not available when 818fefd8fd was discussed.
- In v16 and v17, the problem was bypassed by disallowing a slot to
become active in some of the scenarios tested.
While on it, this commit adds a comment to document that it is fine for
a recheck to use xmin and LSN values stored in the slot, without storing
and reusing them across multiple checks.
Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/f492465f-657e-49af-8317-987460cb68b0.mengjuan.cmj@alibaba-inc.com
Backpatch-through: 16
RIGHT_SEMI joins rely on the HEAP_TUPLE_HAS_MATCH flag to guarantee
that only the first match for each inner tuple is considered.
However, in a parallel hash join, the inner relation is stored in a
shared global hash table that can be probed by multiple workers
concurrently. This allows different workers to inspect and set the
match flags of the same inner tuples at the same time.
If two workers probe the same inner tuple concurrently, both may see
the match flag as unset and emit the same tuple, leading to duplicate
output rows and violating RIGHT_SEMI join semantics.
For now, we disable parallel plans for RIGHT_SEMI joins. In the long
term, it may be possible to support parallel execution by performing
atomic operations on the match flag, for example using a CAS or
similar mechanism.
Backpatch to v18, where RIGHT_SEMI join was introduced.
Bug: #19094
Reported-by: Lori Corbani <Lori.Corbani@jax.org>
Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19094-6ed410eb5b256abd@postgresql.org
Backpatch-through: 18
Because long is 32-bit on 64-bit Windows, it isn't a good datatype to
store the difference between 2 pointers. The under-sized type could
overflow and lead to scary warnings in MEMORY_CONTEXT_CHECKING builds,
such as:
WARNING: problem in alloc set ExecutorState: bad single-chunk %p in block %p
However, the problem lies only in the code running the check, not from
an actual memory accounting bug.
Fix by using "Size" instead of "long". This means using an unsigned
type rather than the previous signed type. If the block's freeptr was
corrupted, we'd still catch that if the unsigned type wrapped. Unsigned
allows us to avoid further needless complexities around comparing signed
and unsigned types.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 13
Discussion: https://postgr.es/m/CAApHDvo-RmiT4s33J=aC9C_-wPZjOXQ232V-EZFgKftSsNRi4w@mail.gmail.com
Previously, we'd fill all fields except ccbin, and only later obtain and
detoast ccbin, with hypothetical failures being possible. If ccbin is
null (rare catalog corruption I have never witnessed) or its a corrupted
toast entry, we leak a tiny bit of memory in CacheMemoryContext from
having strdup'd the constraint name. Repair these by only attempting to
fill the struct once ccbin has been detoasted.
Author: Ranier Vilela <ranier.vf@gmail.com>
Discussion: https://postgr.es/m/CAEudQAr=i3_Z4GvmediX900+sSySTeMkvuytYShhQqEwoGyvhA@mail.gmail.com
Instead of having five separate GUC structs, one for each type, with
the generic part contained in each of them, flip it around and have
one common struct, with the type-specific part has a subfield.
The very original GUC design had type-specific structs and
type-specific lists, and the membership in one of the lists defined
the type. But now the structs themselves know the type (from the
.vartype field), and they are all loaded into a common hash table at
run time, and so this original separation no longer makes sense. It
creates a bunch of inconsistencies in the code about whether the
type-specific or the generic struct is the primary struct, and a lot
of casting in between, which makes certain assumptions about the
struct layouts.
After the change, all these casts are gone and all the data is
accessed via normal field references. Also, various code is
simplified because only one kind of struct needs to be processed.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org
XLogRecordAssemble() may be called multiple times before inserting a
record in XLogInsertRecord(), and the amount of FPIs generated inside
a record whose insertion is attempted multiple times may vary.
The logic added in f9a09aa295 touched directly pgWalUsage in
XLogRecordAssemble(), meaning that it could be possible for pgWalUsage
to be incremented multiple times for a single record. This commit
changes the code to use the same logic as the number of FPIs added to a
record, where XLogRecordAssemble() returns this information and feeds it
to XLogInsertRecord(), updating pgWalUsage only when a record is
inserted.
Reported-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAOzEurSiSr+rusd0GzVy8Bt30QwLTK=ugVMnF6=5WhsSrukvvw@mail.gmail.com
I have never seen this be a problem in practice, but it came up when
purposely corrupting catalog contents to study the fix for a nearby bug:
we'd try to decrement relchecks, but since it's zero we error out and
fail to drop the constraint. The fix is to downgrade the error to
warning, skip decrementing the counter, and otherwise proceed normally.
Given lack of field complaints, no backpatch.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/202508291058.q2zscdcs64fj@alvherre.pgsql
Some recent changes were made to remove the explicit dependency on
btree indexes in some parts of the code. One of these changes was
made in commit 9ef1851685, which allows non-btree indexes to be used
in get_actual_variable_range(). A follow-up commit ee1ae8b99f fixes
the cases where an index doesn’t have a sortopfamily as this is a
prerequisite to be used in get_actual_variable_range().
However, it was found that indexes that have amcanorder = true but do
not allow index-only-scans (amcanreturn returns false or is NULL) will
pass all of the conditions, while they should be rejected since
get_actual_variable_range() uses the index-only-scan machinery in
get_actual_variable_endpoint(). Such an index might cause errors like
ERROR: no data returned for index-only scan
during query planning.
The fix is to add a check in get_actual_variable_range() to reject
indexes that do not allow index-only scans.
Author: Maxime Schoemans <maxime.schoemans@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/20ED852A-C2D9-41EB-8671-8C8B9D418BE9%40enterprisedb.com
This new counter, called "wal_fpi_bytes", tracks the total amount in
bytes of full page images (FPIs) generated in WAL. This data becomes
available globally via pg_stat_wal, and for backend statistics via
pg_stat_get_backend_wal().
Previously, this information could only be retrieved with pg_waldump or
pg_walinspect, which may not be available depending on the environment,
and are expensive to execute. It offers hints about how much FPIs
impact the WAL generated, which could be a large percentage for some
workloads, as well as the effects of wal_compression or page holes.
Bump catalog version.
Bump PGSTAT_FILE_FORMAT_ID, due to the addition of wal_fpi_bytes in
PgStat_WalCounters.
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOzEurQtZEAfg6P0kU3Wa-f9BWQOi0RzJEMPN56wNTOmJLmfaQ@mail.gmail.com
Extend logicalrep_worker_stop, logicalrep_worker_wakeup, and
logicalrep_worker_find to accept a worker type argument. This change
enables differentiation between logical replication worker types, such as
apply workers and table sync workers. While preserving existing behavior,
it lays the groundwork for upcoming patch to add sequence synchronization
workers.
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
This type is just char * underneath, it provides no real value, no
type safety, and just makes the code one level more mysterious. It is
more idiomatic to refer to blobs of memory by a combination of void *
and size_t, so change it to that.
Also, since this type hides the pointerness, we can't apply qualifiers
to what is pointed to, which requires some unconstify nonsense. This
change allows fixing that.
Extension code that uses the Item type can change its code to use
void * to be backward compatible.
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/c75cccf5-5709-407b-a36a-2ae6570be766@eisentraut.org
Previously, the check_hook for synchronized_standby_slots attempted to
validate that each specified slot existed and was physical. However, these
checks were not performed during server startup. As a result, if users
configured non-existent slots before startup, the misconfiguration would
go undetected initially. This could later cause parallel query failures,
as newly launched workers would detect the issue and raise an ERROR.
This patch improves the check_hook by validating the syntax and format of
slot names. Validation of slot existence and type is deferred to the WAL
sender process, aligning with the behavior of the check_hook for
primary_slot_name.
Reported-by: Fabrice Chapuis <fabrice636861@gmail.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Backpatch-through: 17, where it was introduced
Discussion: https://postgr.es/m/CAA5-nLCeO4MQzWipCXH58qf0arruiw0OeUc1+Q=Z=4GM+=v1NQ@mail.gmail.com
When dealing with ResultRelInfos for partitions, there are cases where
there are mixed requirements for the ri_RootResultRelInfo. There are
cases when the partition itself requires a NULL ri_RootResultRelInfo and
in the same query, the same partition may require a ResultRelInfo with
its parent set in ri_RootResultRelInfo. This could cause the column
mapping between the partitioned table and the partition not to be done
which could result in crashes if the column attnums didn't match
exactly.
The fix is simple. We now check that the ri_RootResultRelInfo matches
what the caller passed to ExecGetTriggerResultRel() and only return a
cached ResultRelInfo when the ri_RootResultRelInfo matches what the
caller wants, otherwise we'll make a new one.
Author: David Rowley <dgrowleyml@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Reported-by: Dmitry Fomin <fomin.list@gmail.com>
Discussion: https://postgr.es/m/7DCE78D7-0520-4207-822B-92F60AEA14B4@gmail.com
Backpatch-through: 15
These two functions expect there to be room to insert another item
in the FreePageBtree's array, but their assertions were too weak
to guarantee that. This has little practical effect granting that
the callers are not buggy, but it seems to be misleading late-model
Coverity into complaining about possible array overrun.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/799984.1761150474@sss.pgh.pa.us
Backpatch-through: 13
This patch adds support for a new SQL command:
ALTER SUBSCRIPTION ... REFRESH SEQUENCES
This command updates the sequence entries present in the
pg_subscription_rel catalog table with the INIT state to trigger
resynchronization.
In addition to the new command, the following subscription commands have
been enhanced to automatically refresh sequence mappings:
ALTER SUBSCRIPTION ... REFRESH PUBLICATION
ALTER SUBSCRIPTION ... ADD PUBLICATION
ALTER SUBSCRIPTION ... DROP PUBLICATION
ALTER SUBSCRIPTION ... SET PUBLICATION
These commands will perform the following actions:
Add newly published sequences that are not yet part of the subscription.
Remove sequences that are no longer included in the publication.
This ensures that sequence replication remains aligned with the current
state of the publication on the publisher side.
Note that the actual synchronization of sequence data/values will be
handled in a subsequent patch that introduces a dedicated sequence sync
worker.
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
Commit 883a95646a introduced overflow entries in the replication lag tracker
to fix an issue where lag columns in pg_stat_replication could stall when
the replay LSN stopped advancing.
This commit adds comments clarifying the purpose and behavior of overflow
entries to improve code readability and understanding.
Since commit 883a95646a was recently applied and backpatched to all
supported branches, this follow-up commit is also backpatched accordingly.
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VxqQA_DePxyZ7Y8V+ErYyXkmwJ1P6NC+YC+cvxMipWKw@mail.gmail.com
Backpatch-through: 13
The "else" code block having single statement with comments on a
separate line should have been surrounded by braces.
Reported-by: Chao Li <lic@highgo.com>
Suggested-by: David Rowley <dgrowleyml@gmail.com>
Author: Tatsuo Ishii <ishii@postgresql.org>
Discussion: https://postgr.es/m/20251020.125847.997839131426057290.ishii%40postgresql.org
When JIT deformed tuples (controlled via the jit_tuple_deforming GUC),
types narrower than sizeof(Datum) would be zero-extended up to Datum
width. This wasn't the same as what fetch_att() does in the standard
tuple deforming code. Logically the values are the same when fetching
via the DatumGet*() marcos, but negative numbers are not the same in
binary form.
In the report, the problem was manifesting itself with:
ERROR: could not find memoization table entry
in a query which had a "Cache Mode: binary" Memoize node. However, it's
currently unclear what else is affected. Anything that uses
datum_image_eq() or datum_image_hash() on a Datum from a tuple deformed by
JIT could be affected, but it may not be limited to that.
The fix for this is simple: use signed extension instead of zero
extension.
Many thanks to Emmanuel Touzery for reporting this issue and providing
steps and backup which allowed the problem to easily be recreated.
Reported-by: Emmanuel Touzery <emmanuel.touzery@plandela.si>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/DB8P194MB08532256D5BAF894F241C06393F3A@DB8P194MB0853.EURP194.PROD.OUTLOOK.COM
Backpatch-through: 13
We had several places that used cast-to-unsigned-int as a substitute
for properly checking for overflow. Coverity has started objecting
to that practice as likely introducing Y2038 bugs. An extra
comparison is surely not much compared to the cost of time(NULL), nor
is this coding practice particularly readable. Let's do it honestly,
with explicit logic covering the cases of first-time-through and
clock-went-backwards.
I don't feel a need to back-patch though: our released versions
will be out of support long before 2038, and besides which I think
the code would accidentally work anyway for another 70 years or so.
This small function is only used in one place, and it fails to
handle quoted table names (although the table name portion of the
input should never be quoted in current usage). In addition to
removing make_temptable_name_n() in favor of open-coding it where
needed, this commit ensures the "diff" table name is properly
quoted in order to future-proof this area a bit.
Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAJ7c6TO3a5q2NKRsjdJ6sLf8isVe4aMaaX1-Hj2TdHdhFw8zRA%40mail.gmail.com
Previously, if primary_slot_name was set to an invalid slot name and
the configuration file was reloaded, both the postmaster and all other
backend processes reported a WARNING. With many processes running,
this could produce a flood of duplicate messages. The problem was that
the GUC check hook for primary_slot_name reported errors at WARNING
level via ereport().
This commit changes the check hook to use GUC_check_errdetail() and
GUC_check_errhint() for error reporting. As with other GUC parameters,
this causes non-postmaster processes to log the message at DEBUG3,
so by default, only the postmaster's message appears in the log file.
Backpatch to all supported versions.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/CAHGQGwFud-cvthCTfusBfKHBS6Jj6kdAPTdLWKvP2qjUX6L_wA@mail.gmail.com
Backpatch-through: 13
Previously it was mistakenly assumed that there's only one window
function argument which needs to be processed by WinGetFuncArgInFrame
or WinGetFuncArgInPartition when IGNORE NULLS option is specified. To
eliminate the limitation, WindowObject->notnull_info is modified from
"uint8 *" to "uint8 **" so that WindowObject->notnull_info could store
pointers to "uint8 *" which holds NOT NULL info corresponding to each
window function argument. Moreover, WindowObject->num_notnull_info is
changed from "int" to "int64 *" so that WindowObject->num_notnull_info
could store the number of NOT NULL info corresponding to each function
argument. Memories for these data structures will be allocated when
WinGetFuncArgInFrame or WinGetFuncArgInPartition is called. Thus no
memory except the pointers is allocated for function arguments which
do not call these functions
Also fix the set mark position logic in WinGetFuncArgInPartition to
not raise a "cannot fetch row before WindowObject's mark position"
error in IGNORE NULLS case.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Tatsuo Ishii <ishii@postgresql.org>
Discussion: https://postgr.es/m/2952409.1760023154%40sss.pgh.pa.us
Previously, when the replay LSN reported in feedback messages from a standby
stopped advancing, for example, due to a recovery conflict, the write_lag and
flush_lag columns in pg_stat_replication would initially update but then stop
progressing. This prevented users from correctly monitoring replication lag.
The problem occurred because when any LSN stopped updating, the lag tracker's
cyclic buffer became full (the write head reached the slowest read head).
In that state, the lag tracker could no longer compute round-trip lag values
correctly.
This commit fixes the issue by handling the slowest read entry (the one
causing the buffer to fill) as a separate overflow entry and freeing space
so the write and other read heads can continue advancing in the buffer.
As a result, write_lag and flush_lag now continue updating even if the reported
replay LSN remains stalled.
Backpatch to all supported versions.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwGdGQ=1-X-71Caee-LREBUXSzyohkoQJd4yZZCMt24C0g@mail.gmail.com
Backpatch-through: 13
5983a4cff added CompactAttribute for storing commonly used fields from
FormData_pg_attribute. 5983a4cff didn't go to the trouble of adjusting
every location where we can use CompactAttribute rather than
FormData_pg_attribute, so here we change the remaining ones.
There are some locations where I've left the code using
FormData_pg_attribute. These are mostly in the ALTER TABLE code. Using
CompactAttribute here seems more risky as often the TupleDesc is being
changed and those changes may not have been flushed to the
CompactAttribute yet.
I've also left record_recv(), record_send(), record_cmp(), record_eq()
and record_image_eq() alone as it's not clear to me that accessing the
CompactAttribute is a win here due to the FormData_pg_attribute still
having to be accessed for most cases. Switching the relevant parts to
use CompactAttribute would result in having to access both for common
cases. Careful benchmarking may reveal that something can be done to
make this better, but in absence of that, the safer option is to leave
these alone.
In ReorderBufferToastReplace(), there was a check to skip attnums < 0
while looping over the TupleDesc. Doing this is redundant since
TupleDescs don't store < 0 attnums. Removing that code allows us to
move to using CompactAttribute.
The change in validateDomainCheckConstraint() just moves fetching the
FormData_pg_attribute into the ERROR path, which is cold due to calling
errstart_cold() and results in code being moved out of the common path.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAApHDvrMy90o1Lgkt31F82tcSuwRFHq3vyGewSRN=-QuSEEvyQ@mail.gmail.com
Previously, tsearch used the database's CTYPE setting, which only
matches the database default collation if the locale provider is libc.
Note that tsearch types (tsvector and tsquery) are not collatable
types. The locale affects parsing the original text, which is a lossy
process, so a COLLATE clause on the already-parsed value would not
make sense.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
Currently there's no bug, because we have no code path where we
invalidate relcache entries where it'd cause a problem. But it's more
robust to do it this way in case we introduce such a path later, as some
Postgres forks reportedly already have.
Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Stepan Neretin <slpmcf@gmail.com>
Discussion: https://postgr.es/m/CAJDiXgj3FNzAhV+jjPqxMs3jz=OgPohsoXFj_fh-L+nS+13CKQ@mail.gmail.com
A BlockNumber (32-bit) might not be large enough to add bo_pagesPerRange
to when the table contains close to 2^32 pages. At worst, this could
result in a cancellable infinite loop during the BRIN index scan with
power-of-2 pagesPerRange, and slow (inefficient) BRIN index scans and
scanning of unneeded heap blocks for non power-of-2 pagesPerRange.
Backpatch to all supported versions.
Author: sunil s <sunilfeb26@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOG6S4-tGksTQhVzJM19NzLYAHusXsK2HmADPZzGQcfZABsvpA@mail.gmail.com
Backpatch-through: 13
The comment fixed in this commit described the function as dealing with
database blocks, but in reality it processes shared memory allocations.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aH4DDhdiG9Gi0rG7@ip-10-97-1-34.eu-west-3.compute.internal
Backpatch-through: 18
67a54b9e8 taught the planner to push down HAVING clauses even when
grouping sets are present, as long as the clause does not reference
any columns that are nullable by the grouping sets. However, there
was an oversight: if any empty grouping sets are present, the
aggregation node can produce a row that did not come from the input,
and pushing down a HAVING clause in this case may cause us to fail to
filter out that row.
Currently, non-degenerate HAVING clauses are not pushed down when
empty grouping sets are present, since the empty grouping sets would
nullify the vars they reference. However, degenerate (variable-free)
HAVING clauses are not subject to this restriction and may be
incorrectly pushed down.
To fix, explicitly check for the presence of empty grouping sets and
retain degenerate clauses in HAVING when they are present. This
ensures that we don't emit a bogus aggregated row. A copy of each
such clause is also put in WHERE so that query_planner() can use it in
a gating Result node.
To facilitate this check, this patch expands the groupingSets tree of
the query to a flat list of grouping sets before applying the HAVING
pushdown optimization. This does not add any additional planning
overhead, since we need to do this expansion anyway.
In passing, make a small tweak to preprocess_grouping_sets() by
reordering its initial operations a bit.
Backpatch to v18, where this issue was introduced.
Reported-by: Yuhang Qiu <iamqyh@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/0879D9C9-7FE2-4A20-9593-B23F7A0B5290@gmail.com
Backpatch-through: 18
Previously, COPY TO command didn't support directly specifying
partitioned tables so users had to use COPY (SELECT ...) TO variant.
This commit adds direct COPY TO support for partitioned
tables, improving both usability and performance. Performance tests
show it's faster than the COPY (SELECT ...) TO variant as it avoids
the overheads of query processing and sending results to the COPY TO
command.
When used with partitioned tables, COPY TO copies the same rows as
SELECT * FROM table. Row-level security policies of the partitioned
table are applied in the same way as when executing COPY TO on a plain
table.
Author: jian he <jian.universality@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Melih Mutlu <m.melihmutlu@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CACJufxEZt%2BG19Ors3bQUq-42-61__C%3Dy5k2wk%3DsHEFRusu7%3DiQ%40mail.gmail.com
Our configure script intended to ensure this, but it supposed that
expr(1) would report an error for integer overflow. Maybe that
was true when the code was written (commit 3c6248a82 of 2008-05-02),
but all the modern expr's I tried will deliver bigger-than-int32
results without complaint. Moreover, if you use --with-segsize-blocks
then there's no check at all.
Ideally we'd add a test in configure itself to check that the value
fits in int, but to do that we'd need to suppose that test(1) handles
bigger-than-int32 numbers correctly. Probably modern ones do, but
that's an assumption I could do without; and I'm not too trusting
about meson either. Instead, let's install a static assertion, so
that even people who ignore all the compiler warnings you get from
such values will be forced to confront the fact that it won't work.
This has been hazardous for awhile, but given that we hadn't heard
a complaint about it till now, I don't feel a need to back-patch.
Reported-by: Casey Shobe <casey.allen.shobe@icloud.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/C5DC82D6-C76D-4E8F-BC2E-DF03EFC4FA24@icloud.com
Coverity complains that the return value from gettuple_eval_partition
(stored in variable "datum") in a do..while loop in
WinGetFuncArgInPartition is overwritten when exiting the while
loop. This commit tries to fix the issue by changing the
gettuple_eval_partition call to:
(void) gettuple_eval_partition()
explicitly stating that we discard the return value. We are just
interested in whether we are inside or outside of partition, NULL or
NOT NULL here.
Also enhance some comments for easier code reading.
Reported-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aPCOabSE4VfJLaky%40paquier.xyz
This removes a few static functions and replaces them with 2 functions
which aim to be more reusable. The upper planner's pathkey requirements
can be simplified down to operations which require pathkeys in the same
order as the pathkeys for the given operation, and operations which can
make use of a Path's pathkeys in any order.
Here we also add some short-circuiting to truncate_useless_pathkeys(). At
any point we discover that all pathkeys are useful to a single operation,
we can stop checking the remaining operations as we're not going to be
able to find any further useful pathkeys - they're all possibly useful
already. Adjusting this seems to warrant trying to put the checks
roughly in order of least-expensive-first so that the short-circuits
have the most chance of skipping the more expensive checks.
In passing clean up has_useful_pathkeys() as it seems to have grown a
redundant check for group_pathkeys. This isn't needed as
standard_qp_callback will set query_pathkeys if there's any requirement
to have group_pathkeys. All this code does is waste run-time effort and
take up needless space.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpbsEoTksvW5901MMoZo-hHf78E5up3uDOfkJnxDe_WAw@mail.gmail.com
This fixes an unlikely issue when fetching GROUPING SET results from
their internally stored hash tables. It was possible in rare cases that
the hash iterator would be set up incorrectly which could result in a
crash.
This was introduced in 4d143509c, so backpatch to v18.
Many thanks to Yuri Zamyatin for reporting and helping to debug this
issue.
Bug: #19078
Reported-by: Yuri Zamyatin <yuri@yrz.am>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/19078-dfd62f840a2c0766@postgresql.org
Backpatch-through: 18
Commit a1b4f289be improved the hashjoin sizing to also consider the
memory used by BufFiles for batches. The code however had multiple
issues, making it ineffective or not working as expected in some cases.
* The amount of memory needed by buffers was calculated using uint32,
so it would overflow for nbatch >= 262144. If this happened the loop
would exit prematurely and the memory usage would not be reduced.
The nbatch overflow is fixed by reworking the condition to not use a
multiplication at all, so there's no risk of overflow. An explicit
cast was added to a similar calculation in ExecHashIncreaseBatchSize.
* The loop adjusting the nbatch value used hash_table_bytes to calculate
the old/new size, but then updated only space_allowed. The consequence
is the total memory usage was not reduced, but all the memory saved by
reducing the number of batches was used for the internal hash table.
This was fixed by using only space_allowed. This is also more correct,
because hash_table_bytes does not account for skew buckets.
* The code was also doubling multiple parameters (e.g. the number of
buckets for hash table), but was missing overflow protections.
The loop now checks for overflow, and terminates if needed. It'd be
possible to cap the value and continue the loop, but it's not worth
the complexity. And the overflow implies the in-memory hash table is
already very large anyway.
While at it, rework the comment explaining how the memory balancing
works, to make it more concise and easier to understand.
The initial nbatch overflow issue was reported by Vaibhav Jain. The
other issues were noticed by me and Melanie Plageman. Fix by me, with a
lot of review and feedback by Melanie.
Backpatch to 18, where the hashjoin memory balancing was introduced.
Reported-by: Vaibhav Jain <jainva@google.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/CABa-Az174YvfFq7rLS+VNKaQyg7inA2exvPWmPWqnEn6Ditr_Q@mail.gmail.com
The URL for "Sorting and Searching Algorithms: A Cookbook"
by Thomas Niemann has started returning 404, and since we
refer to the page for license terms this replaces the now
defunct link with one to the copy on archive.org.
Author: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/6DED3DEF-875E-4D1D-8F8F-7353D5AF7B79@gmail.com
This information appears to have been unused since commit
c5b7ba4e67. We could not find any references in third-party code,
either.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aO_CyFRpbVMtgJWM%40nathan
To support the upcoming addition of a sequence synchronization worker,
this patch extracts common synchronization logic shared by table sync
workers and the new sequence sync worker into a dedicated file. This
modularization improves code reuse, maintainability, and clarity in the
logical workers framework.
Author: vignesh C <vignesh21@gmail.com>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
EvalPlanQualStart() failed to propagate es_partition_directory into
the child EState used for EPQ rechecks. When execution time partition
pruning ran during the EPQ scan, executor code dereferenced a NULL
partition directory and crashed.
Previously, propagating es_partition_directory into the EPQ EState was
unnecessary because CreatePartitionPruneState(), which sets it on
demand, also initialized the exec-pruning context. After commit
d47cbf474, CreatePartitionPruneState() now initializes only the init-
time pruning context, leaving exec-pruning context initialization to
ExecInitNode(). Since EvalPlanQualStart() runs only ExecInitNode() and
not CreatePartitionPruneState(), it can encounter a NULL
es_partition_directory. Other executor fields initialized during
CreatePartitionPruneState() are already copied into the child EState
thanks to commit 8741e48e5d, but es_partition_directory was missed.
Fix by borrowing the parent estate's es_partition_directory in
EvalPlanQualStart(), and by clearing that field in EvalPlanQualEnd()
so the parent remains responsible for freeing the directory.
Add an isolation test permutation that triggers EPQ with execution-
time partition pruning, the case that reproduces this crash.
Bug: #19078
Reported-by: Yuri Zamyatin <yuri@yrz.am>
Diagnosed-by: David Rowley <dgrowleyml@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Co-authored-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/19078-dfd62f840a2c0766@postgresql.org
Backpatch-through: 18
This commit adjusts RangeVarCallbackForReindexIndex() to handle an
extremely unlikely race condition involving concurrent OID reuse.
In short, if REINDEX INDEX is executed at the same time that the
index is re-created with the same name and OID but a different
parent table OID, we might lock the wrong parent table. To fix,
simply detect when this happens and emit an ERROR. Unfortunately,
we can't gracefully handle this situation because we will have
already locked the index, and we must lock the parent table before
the index to avoid deadlocks.
While at it, I've replaced all but one early return in this
callback function with ERRORs that should be unreachable. While I
haven't verified the presence of a live bug, the checks in question
appear to be unnecessary, and the early returns seem prone to
breaking the parent table locking code in subtle ways. If nothing
else, this simplifies the code a bit.
This is a bug fix and could be back-patched, but given the presumed
rarity of the race condition and the lack of reports, I'm not going
to bother.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/Z8zwVmGzXyDdkAXj%40nathan
Per-character pg_locale_t APIs. Useful for tsearch parsing and
potentially other places.
Significant overlap with the regc_wc_isalpha() and related functions
in regc_pg_locale.c, but this change leaves those intact for
now.
Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
Presently, these functions look up the relation's OID, lock it, and
then check privileges. Not only does this approach provide no
guarantee that the locked relation matches the arguments of the
lookup, but it also allows users to briefly lock relations for
which they do not have privileges, which might enable
denial-of-service attacks. This commit adjusts these functions to
use RangeVarGetRelidExtended(), which is purpose-built to avoid
both of these issues. The new RangeVarGetRelidCallback function is
somewhat complicated because it must handle both tables and
indexes, and for indexes, we must check privileges on the parent
table and lock it first. Also, it needs to handle a couple of
extremely unlikely race conditions involving concurrent OID reuse.
A downside of this change is that the coding doesn't allow for
locking indexes in AccessShare mode anymore; everything is locked
in ShareUpdateExclusive mode. Per discussion, the original choice
of lock levels was intended for a now defunct implementation that
used in-place updates, so we believe this change is okay.
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/Z8zwVmGzXyDdkAXj%40nathan
Backpatch-through: 18
The log output functionality of log_autovacuum_min_duration applies to
both VACUUM and ANALYZE, so it is not possible to separate the VACUUM
and ANALYZE log output thresholds. Logs are likely to be output only for
VACUUM and not for ANALYZE.
Therefore, we decided to separate the threshold for log output of VACUUM
by autovacuum (log_autovacuum_min_duration) and the threshold for log
output of ANALYZE by autovacuum (log_autoanalyze_min_duration).
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Kasahara Tatsuhito <kasaharatt@oss.nttdata.com>
Discussion: https://www.postgresql.org/message-id/flat/CAOzEurQtfV4MxJiWT-XDnimEeZAY+rgzVSLe8YsyEKhZcajzSA@mail.gmail.com
This patch replaces ALTER SUBSCRIPTION REFRESH with
ALTER SUBSCRIPTION REFRESH PUBLICATION in comments and error messages to
improve clarity and support future extensibility. The change aligns with
upcoming addition REFRESH SEQUENCES for sequence synchronization.
Author: vignesh C <vignesh21@gmail.com>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
This function only requires a few fields from LVRelState, so pass them
in individually.
This change allows calling heap_page_is_all_visible() from code such as
pruneheap.c, which does not have access to an LVRelState.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek
These functions appeared prominently in a profile of a patch that sets
the visibility map on-access. Inline them to remove call overhead and
make them cheaper to use in hot paths.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek
After scanning the line pointers on a heap page during the first phase
of vacuum, we use the information collected to decide whether to use
the assembled freeze plans.
Move this decision logic into a helper function to improve readability.
While here, rename a PruneState member and disambiguate some local
variables in heap_page_prune_and_freeze().
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek
Previously WinCheckAndInitializeNullTreatment() used elog() to emit an
error message. ereport() should be used instead because it's a
user-facing error. Also use existing get_func_name() to get a
function's name, rather than own implementation.
Moreover add an assertion to validate winobj parameter, just like
other window function API.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/2952409.1760023154%40sss.pgh.pa.us
Instead of emitting a separate XLOG_HEAP2_VISIBLE WAL record for each
page that becomes all-visible in vacuum's third phase, specify the VM
changes in the already emitted XLOG_HEAP2_PRUNE_VACUUM_CLEANUP record.
Visibility checks are now performed before marking dead items unused.
This is safe because the heap page is held under exclusive lock for the
entire operation.
This reduces the number of WAL records generated by VACUUM phase III by
up to 50%.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
log_error() would probably fail completely if used, and would
certainly print garbage for anything that needed to be interpolated
into the message, because it was failing to use the correct printing
subroutine for a va_list argument.
This bug likely went undetected because the error cases this code
is used for are rarely exercised - they only occur when Windows
security API calls fail catastrophically (out of memory, security
subsystem corruption, etc).
The FRONTEND variant can be fixed just by calling vfprintf()
instead of fprintf(). However, there was no va_list variant
of write_stderr(), so create one by refactoring that function.
Following the usual naming convention for such things, call
it vwrite_stderr().
Author: Bryan Green <dbryan.green@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAF+pBj8goe4fRmZ0V3Cs6eyWzYLvK+HvFLYEYWG=TzaM+tWPnw@mail.gmail.com
Backpatch-through: 13
Remove a variable that is no longer in use following commit 9a2e2a28.
It's not immediately clear why there were no compiler warnings about
this oversight.
Author: Peter Geoghegan <pg@bowt.ie>
Backpatch-through: 18
Commit 71f4c8c6f7 (which implemented DETACH CONCURRENTLY) added code
to create a separate table constraint when a table is detached
concurrently, identical to the partition constraint, on the theory that
such a constraint was needed in case the optimizer had constructed any
query plans that depended on the constraint being there. However, that
theory was apparently bogus because any such plans would be invalidated.
For hash partitioning, those constraints are problematic, because their
expressions reference the OID of the parent partitioned table, to which
the detached table is no longer related; this causes all sorts of
problems (such as inability of restoring a pg_dump of that table, and
the table no longer working properly if the partitioned table is later
dropped).
We'd like to get rid of all those constraints. In fact, for branch
master, do that -- no longer create any substitute constraints.
However, out of fear that some users might somehow depend on these
constraints for other partitioning strategies, for stable branches
(back to 14, which added DETACH CONCURRENTLY), only do it for hash
partitioning.
(If you repeatedly DETACH CONCURRENTLY and then ATTACH a partition, then
with this constraint addition you don't need to scan the table in the
ATTACH step, which presumably is good. But if users really valued this
feature, they would have requested that it worked for non-concurrent
DETACH also.)
Author: Haiyang Li <mohen.lhy@alibaba-inc.com>
Reported-by: Fei Changhong <feichanghong@qq.com>
Reported-by: Haiyang Li <mohen.lhy@alibaba-inc.com>
Backpatch-through: 14
Discussion: https://postgr.es/m/18371-7fef49f63de13f02@postgresql.org
Discussion: https://postgr.es/m/19070-781326347ade7c57@postgresql.org
Introduced by my (Álvaro's) commit 9e4f914b5e, which was itself
backpatched to pg10, though only pg15 and up contain the problem
because of commit 9c08aea6a3.
This isn't a particularly significant leak, but given the fix is
trivial, we might as well backpatch to all branches where it applies, so
do that.
Author: Nathan Bossart <nathandbossart@gmail.com>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/x4odfdlrwvsjawscnqsqjpofvauxslw7b4oyvxgt5owoyf4ysn@heafjusodrz7
An assertion in _bt_killitems expected the scan's currPos state to
contain a valid LSN, saved from when currPos's page was initially read.
The assertion failed to account for the fact that even logged relations
can have leaf pages with an invalid LSN when built with wal_level set to
"minimal". Remove the faulty assertion.
Oversight in commit e6eed40e (though note that the assertion was
backpatched to stable branches before 18 by commit 7c319f54).
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Matthijs van der Vleuten <postgresql@zr40.nl>
Bug: #19082
Discussion: https://postgr.es/m/19082-628e62160dbbc1c1@postgresql.org
Backpatch-through: 13
An error happening while a slot data is saved on disk in
SaveSlotToPath() could cause a state.tmp file (temporary file holding
the slot state data, renamed to its permanent name at the end of the
function) to remain around after it has been created. This temporary
file is created with O_EXCL, meaning that if an existing state.tmp is
found, its creation would fail. This would prevent the slot data to be
saved, requiring a manual intervention to remove state.tmp before being
able to save again a slot. Possible scenarios where this temporary file
could remain on disk is for example a ENOSPC case (no disk space) while
writing, syncing or renaming it. The bug reports point to a write
failure as the principal cause of the problems.
Using O_TRUNC has been argued back in 2019 as a potential solution to
discard any temporary file that could exist. This solution was rejected
as O_EXCL can also act as a safety measure when saving the slot state,
crash recovery offering cleanup guarantees post-crash. This commit uses
the alternative approach that has been suggested by Andres Freund back
in 2019. When the temporary state file cannot be written, synced,
closed or renamed (note: not when created!), an unlink() is used to
remove the temporary state file while holding the in-progress I/O
LWLock, so as any follow-up attempts to save a slot's data would not
choke on an existing file that remained around because of a previous
failure.
This problem has been reported a few times across the years, going back
to 2019, but for some reason I have never come back to do something
about it and it has been forgotten. A recent report has reminded me
that this was still a problem.
Reported-by: Kevin K Biju <kevinkbiju@gmail.com>
Reported-by: Sergei Kornilov <sk@zsrv.org>
Reported-by: Grigory Smolkin <g.smolkin@postgrespro.ru>
Discussion: https://postgr.es/m/CAM45KeHa32soKL_G8Vk38CWvTBeOOXcsxAPAs7Jt7yPRf2mbVA@mail.gmail.com
Discussion: https://postgr.es/m/3559061693910326@qy4q4a6esb2lebnz.sas.yp-c.yandex.net
Discussion: https://postgr.es/m/08bbfab1-a61d-3750-fc18-4ab2c1aa7f09@postgrespro.ru
Backpatch-through: 13
Instead of emitting a separate WAL XLOG_HEAP2_VISIBLE record for setting
bits in the VM, specify the VM block changes in the
XLOG_HEAP2_MULTI_INSERT record.
This halves the number of WAL records emitted by COPY FREEZE.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
The processing of the PARALLEL option for VACUUM was not quite
following what the DefElem code had intended. defGetInt32() already has
code to handle missing parameters and returns a perfectly good error
message for when that happens.
Here we get rid of the ExecVacuum() error:
ERROR: parallel option requires a value between 0 and N
and leave defGetInt32() handle it, which will give:
ERROR: parallel requires an integer value
defGetInt32() was already handling the non-integer parameter case, so it
may as well handle the missing parameter case too.
Additionally, parameterize the option name to make translator work easier,
and also use errhint_internal() rather than errhint() for the
BUFFER_USAGE_LIMIT option since there isn't any work for a translator to
do for "%s".
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAApHDvovH14tNWB+WvP6TSbfi7-=TysQ9h5tQ5AgavwyWRWKHA@mail.gmail.com
An error context callback function might leak some memory into
ErrorContext, since those functions are run with ErrorContext as
current context. In the case where the elevel is ERROR, this is
no problem since the code level that catches the error should do
FlushErrorState to clean up, and that will reset ErrorContext.
However, if the elevel is less than ERROR then no such cleanup occurs.
In principle, repeated leaks while emitting log messages or client
notices could accumulate arbitrarily much leaked data, if no ERROR
occurs in the session.
To fix, let errfinish() perform an ErrorContext reset if it is
at the outermost error nesting level. (If it isn't, we'll delay
cleanup until the outermost nesting level is exited.)
The only actual leakage of this sort that I've been able to observe
within our regression tests was recently introduced by commit
f727b63e8. While it seems plausible that there are other such
leaks not reached in the regression tests, the lack of field
reports suggests that they're not a big problem. Accordingly,
I won't take the risk of back-patching this now. We can always
back-patch later if we get field reports of leaks.
Reported-by: Andres Freund <andres@anarazel.de>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/jngsjonyfscoont4tnwi2qoikatpd5hifsg373vmmjvugwiu6g@m6opxh7uisgd
While pgoutput caches relation synchronization information in
RelationSyncCache that resides in CacheMemoryContext, each entry's
information (such as row filter expressions and column lists) is
stored in the entry's private memory context (entry_cxt in
RelationSyncEntry), which is a descendant memory context of the
decoding context. If a logical decoding invoked via SQL functions like
pg_logical_slot_get_binary_changes fails with an error, subsequent
logical decoding executions could access already-freed memory of the
entry's cache, resulting in a crash.
With this change, it's ensured that RelationSyncCache is cleaned up
even in error cases by using a memory context reset callback function.
Backpatch to 15, where entry_cxt was introduced for column filtering
and row filtering.
While the backbranches v13 and v14 have a similar issue where
RelationSyncCache persists even after an error when pgoutput is used
via SQL API, we decided not to backport this fix. This decision was
made because v13 is approaching its final minor release, and we won't
have an chance to fix any new issues that might arise. Additionally,
since using pgoutput via SQL API is not a common use case, the risk
outwights the benefit. If we receive bug reports, we can consider
backporting the fixes then.
Author: vignesh C <vignesh21@gmail.com>
Co-authored-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CALDaNm0x-aCehgt8Bevs2cm=uhmwS28MvbYq1=s2Ekf0aDPkOA@mail.gmail.com
Backpatch-through: 15
Some of the buildfarm is still unhappy with WinGetFuncArgInPartition
even after 2273fa32b. While it seems to be just very old compilers,
we can suppress the warnings and arguably make the code more readable
by not initializing these variables till closer to where they are
used. While at it, make a couple of cosmetic comment improvements.
This patch adds support for the ALL SEQUENCES clause in publications,
enabling synchronization/replication of all sequences that is useful for
upgrades.
Publications can now include all sequences via FOR ALL SEQUENCES.
psql enhancements:
\d shows publications for a given sequence.
\dRp indicates if a publication includes all sequences.
ALL SEQUENCES can be combined with ALL TABLES, but not with other options
like TABLE or TABLES IN SCHEMA. We can extend support for more granular
clauses in future.
The view pg_publication_sequences provides information about the mapping
between publications and sequences.
This patch enables publishing of sequences; subscriber-side support will
be added in upcoming patches.
Author: vignesh C <vignesh21@gmail.com>
Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
SQL/JSON functions such as JSON_VALUE could fail with "unrecognized
node type" errors when a DEFAULT clause contained an explicit COLLATE
expression. That happened because assign_collations_walker() could
invoke exprSetCollation() on a JsonBehavior expression whose DEFAULT
still contained a CollateExpr, which exprSetCollation() does not
handle.
For example:
SELECT JSON_VALUE('{"a":1}', '$.c' RETURNING text
DEFAULT 'A' COLLATE "C" ON EMPTY);
Fix by validating in transformJsonBehavior() that the DEFAULT
expression's collation matches the enclosing JSON expression’s
collation. In exprSetCollation(), replace the recursive call on the
JsonBehavior expression with an assertion that its collation already
matches the target, since the parser now enforces that condition.
Reported-by: Jian He <jian.universality@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CACJufxHVwYYSyiVQ6o+PsRX6zQ7rAFinh_fv1kCfTsT1xG4Zeg@mail.gmail.com
Backpatch-through: 17
truncate_useless_pathkeys() seems to have neglected to account for
PathKeys that might be useful for WindowClause evaluation. Modify it so
that it properly accounts for that.
Making this work required adjusting two things:
1. Change from checking query_pathkeys to check sort_pathkeys instead.
2. Add explicit check for window_pathkeys
For #1, query_pathkeys gets set in standard_qp_callback() according to the
sort order requirements for the first operation to be applied after the
join planner is finished, so this changes depending on which upper
planner operations a particular query needs. If the query has window
functions and no GROUP BY, then query_pathkeys gets set to
window_pathkeys. Before this change, this meant PathKeys useful for the
ORDER BY were not accounted for in queries with window functions.
Because of #1, #2 is now required so that we explicitly check to ensure
we don't truncate away PathKeys useful for window functions.
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrj3HTKmXoLMbUjTO=_MNMxM=cnuCSyBKidAVibmYPnrg@mail.gmail.com
Previously StrategyGetBuffer() acquired the buffer header spinlock for every
buffer, whether it was reusable or not. If reusable, it'd be returned, with
the lock held, to GetVictimBuffer(), which then would pin the buffer with
PinBuffer_Locked(). That's somewhat violating the spirit of the guidelines for
holding spinlocks (i.e. that they are only held for a few lines of consecutive
code) and necessitates using PinBuffer_Locked(), which scales worse than
PinBuffer() due to holding the spinlock. This alone makes it worth changing
the code.
However, the main reason to change this is that a future commit will make
PinBuffer_Locked() slower (due to making UnlockBufHdr() slower), to gain
scalability for the much more common case of pinning a pre-existing buffer. By
pinning the buffer with a single atomic operation, iff the buffer is reusable,
we avoid any potential regression for miss-heavy workloads. There strictly are
fewer atomic operations for each potential buffer after this change.
The price for this improvement is that freelist.c needs two CAS loops and
needs to be able to set up the resource accounting for pinned buffers. The
latter is achieved by exposing a new function for that purpose from bufmgr.c,
that seems better than exposing the entire private refcount infrastructure.
The improvement seems worth the complexity.
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
We're planning to merge buffer content locks into BufferDesc.state. To reduce
the size of that patch, centralize calls to BufferDescriptorGetContentLock().
The biggest part of the change is in assertions, by introducing
BufferIsLockedByMe[InMode]() (and removing BufferIsExclusiveLocked()). This
seems like an improvement even without aforementioned plans.
Additionally replace some direct calls to LWLockAcquire() with calls to
LockBuffer().
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
BM_PERMANENT is defined as 1U<<31, which is a negative number when interpreted
as a signed integer. Unfortunately the mask variable in BufferSync() was
signed. This has been wrong for a long time, but failed to fail, due to
integer conversion rules.
However, in an upcoming patch the width of the state variable will be
increased, with the wrong signedness leading to never flushing permanent
buffers - luckily caught in a test.
It seems better to fix this separately, instead of doing so as part of a
large, otherwise mechanical, patch.
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
While testing a new potential use for ReadRecentBuffer(), Andres
reported that it scales badly when called concurrently for the same
buffer by many backends. Instead of a naive (but wrong) coding with
PinBuffer(), it used the spinlock, so that it could be careful to pin
only if the buffer was valid and holding the expected block, to avoid
breaking invariants in eg GetVictimBuffer(). Unfortunately that made it
less scalable than PinBuffer(), which uses compare-exchange instead.
We can fix that by giving PinBuffer() a new skip_if_not_valid mode that
doesn't pin invalid buffers. It might occasionally skip when it
shouldn't due to the unlocked read of the header flags, but that's
unlikely and perfectly acceptable for an opportunistic optimisation
routine, and it can only succeed when it really should due to the
compare-exchange loop.
Note that this fixes ReadRecentBuffer()'s failure to bump the usage
count. While this could be seen as a bug, there currently aren't cases
affected by this in core, so it doesn't seem worth backpatching that portion.
Author: Thomas Munro <thomas.munro@gmail.com>
Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/20230627020546.t6z4tntmj7wmjrfh%40awork3.anarazel.de
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
This commit introduces a new column mem_exceeded_count to the
pg_stat_replication_slots view. This counter tracks how often the
memory used by logical decoding exceeds the logical_decoding_work_mem
limit. The new statistic helps users determine whether exceeding the
logical_decoding_work_mem limit is a rare occurrences or a frequent
issue, information that wasn't available through existing statistics.
Bumps catversion.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/978D21E8-9D3B-40EA-A4B1-F87BABE7868C@yesql.se
These hooks allow plugins to get control at the earliest point at
which the PlannerGlobal object is fully initialized, and then just
before it gets destroyed. This is useful in combination with the
extendable plan state facilities (see extendplan.h) and perhaps for
other purposes as well.
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com
This allows extensions to have access to any data they've stored
in the ExplainState during planning. Unfortunately, it won't help
with EXPLAIN EXECUTE is used, but since that case is less common,
this still seems like an improvement.
Since planner() has quite a few arguments now, also add some
documentation of those arguments and the return value.
Author: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com
Eager aggregation is a query optimization technique that partially
pushes aggregation past a join, and finalizes it once all the
relations are joined. Eager aggregation may reduce the number of
input rows to the join and thus could result in a better overall plan.
In the current planner architecture, the separation between the
scan/join planning phase and the post-scan/join phase means that
aggregation steps are not visible when constructing the join tree,
limiting the planner's ability to exploit aggregation-aware
optimizations. To implement eager aggregation, we collect information
about aggregate functions in the targetlist and HAVING clause, along
with grouping expressions from the GROUP BY clause, and store it in
the PlannerInfo node. During the scan/join planning phase, this
information is used to evaluate each base or join relation to
determine whether eager aggregation can be applied. If applicable, we
create a separate RelOptInfo, referred to as a grouped relation, to
represent the partially-aggregated version of the relation and
generate grouped paths for it.
Grouped relation paths can be generated in two ways. The first method
involves adding sorted and hashed partial aggregation paths on top of
the non-grouped paths. To limit planning time, we only consider the
cheapest or suitably-sorted non-grouped paths in this step.
Alternatively, grouped paths can be generated by joining a grouped
relation with a non-grouped relation. Joining two grouped relations
is currently not supported.
To further limit planning time, we currently adopt a strategy where
partial aggregation is pushed only to the lowest feasible level in the
join tree where it provides a significant reduction in row count.
This strategy also helps ensure that all grouped paths for the same
grouped relation produce the same set of rows, which is important to
support a fundamental assumption of the planner.
For the partial aggregation that is pushed down to a non-aggregated
relation, we need to consider all expressions from this relation that
are involved in upper join clauses and include them in the grouping
keys, using compatible operators. This is essential to ensure that an
aggregated row from the partial aggregation matches the other side of
the join if and only if each row in the partial group does. This
ensures that all rows within the same partial group share the same
"destiny", which is crucial for maintaining correctness.
One restriction is that we cannot push partial aggregation down to a
relation that is in the nullable side of an outer join, because the
NULL-extended rows produced by the outer join would not be available
when we perform the partial aggregation, while with a
non-eager-aggregation plan these rows are available for the top-level
aggregation. Pushing partial aggregation in this case may result in
the rows being grouped differently than expected, or produce incorrect
values from the aggregate functions.
If we have generated a grouped relation for the topmost join relation,
we finalize its paths at the end. The final paths will compete in the
usual way with paths built from regular planning.
The patch was originally proposed by Antonin Houska in 2017. This
commit reworks various important aspects and rewrites most of the
current code. However, the original patch and reviews were very
useful.
Author: Richard Guo <guofenglinux@gmail.com>
Author: Antonin Houska <ah@cybertec.at> (in an older version)
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me> (in an older version)
Reviewed-by: Andy Fan <zhihuifan1213@163.com> (in an older version)
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> (in an older version)
Discussion: https://postgr.es/m/CAMbWs48jzLrPt1J_00ZcPZXWUQKawQOFE8ROc-ADiYqsqrpBNw@mail.gmail.com
The following information is added in the description of some GIN
records:
- In INSERT_LISTPAGE, the number of tuples and the right link block.
- In UPDATE_META_PAGE, the number of tuples, the previous tail block,
and the right link block.
- In SPLIT, the left and right children blocks.
Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CALdSSPgnAt5L=D_xGXRXLYO5FK1H31_eYEESxdU1n-r4g+6GqA@mail.gmail.com
It is possible to call pg_stat_reset_single_function_counters() for a
single function, but the reset time was missing the system view showing
its statistics. Like all the fields of pg_stat_user_functions, the GUC
track_functions needs to be enabled to show the statistics about
function executions.
Bump catalog version.
Bump PGSTAT_FILE_FORMAT_ID, as a result of the new field added to
PgStat_StatFuncEntry.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aONjnsaJSx-nEdfU@paquier.xyz
Fix several issues pointed out by Coverity (reported by Tome Lane).
- In row_is_in_frame(), return value of window_gettupleslot() was not
checked.
- WinGetFuncArgInPartition() tried to derefference "isout" pointer
even if it could be NULL in some places.
Besides the issues, I also fixed a compiler warning reported by Álvaro
Herrera.
Moreover, in WinGetFuncArgInPartition refactor the do...while loop so
that the codes inside the loop simpler. Also simplify the case when
abs_pos < 0.
Author: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Paul Ramsey <pramsey@cleverelephant.ca>
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reported-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/1686755.1759679957%40sss.pgh.pa.us
Discussion: https://postgr.es/m/202510051612.gw67jlc2iqpw%40alvherre.pgsql
Instead, use the new mechanism that allows planner extensions to store
private state inside a PlannerInfo, treating GEQO as an in-core planner
extension. This is a useful test of the new facility, and also buys
back a few bytes of storage.
To make this work, we must remove innerrel_is_unique_ext's hack of
testing whether join_search_private is set as a proxy for whether
the join search might be retried. Add a flag that extensions can
use to explicitly signal their intentions instead.
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com
Extension that make extensive use of planner hooks may want to
coordinate their efforts, for example to avoid duplicate computation,
but that's currently difficult because there's no really good way to
pass data between different hooks.
To make that easier, allow for storage of extension-managed private
state in PlannerGlobal, PlannerInfo, and RelOptInfo, along very
similar lines to what we have permitted for ExplainState since commit
c65bc2e1d1.
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com
Previously, subqueries were given names only after they were planned,
which makes it difficult to use information from a previous execution of
the query to guide future planning. If, for example, you knew something
about how you want "InitPlan 2" to be planned, you won't know whether
the subquery you're currently planning will end up being "InitPlan 2"
until after you've finished planning it, by which point it's too late to
use the information that you had.
To fix this, assign each subplan a unique name before we begin planning
it. To improve consistency, use textual names for all subplans, rather
than, as we did previously, a mix of numbers (such as "InitPlan 1") and
names (such as "CTE foo"), and make sure that the same name is never
assigned more than once.
We adopt the somewhat arbitrary convention of using the type of sublink
to set the plan name; for example, a query that previously had two
expression sublinks shown as InitPlan 2 and InitPlan 1 will now end up
named expr_1 and expr_2. Because names are assigned before rather than
after planning, some of the regression test outputs show the numerical
part of the name switching positions: what was previously SubPlan 2 was
actually the first one encountered, but we finished planning it later.
We assign names even to subqueries that aren't shown as such within the
EXPLAIN output. These include subqueries that are a FROM clause item or
a branch of a set operation, rather than something that will be turned
into an InitPlan or SubPlan. The purpose of this is to make sure that,
below the topmost query level, there's always a name for each subquery
that is stable from one planning cycle to the next (assuming no changes
to the query or the database schema).
Author: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: http://postgr.es/m/3641043.1758751399@sss.pgh.pa.us
When either inputs of an INTERSECT [ALL] operator are proven not to return
any results (a dummy rel), then mark the entire INTERSECT operation as
dummy.
Likewise, if an EXCEPT [ALL] operation's left input is proven empty, then
mark the entire operation as dummy.
With EXCEPT ALL, we can easily handle the right input being dummy as
we can return the left input without any processing. That can lead to
significant performance gains during query execution. We can't easily
handle dummy right inputs for EXCEPT (without ALL), as that would require
deduplication of the left input. Wiring up those Paths is likely more
complex than it's worth as the gains during execution aren't that great,
so let's leave that one to be handled by the normal Path generation code.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAApHDvri53PPF76c3M94_QNWbJfXjyCnjXuj_2=LYM-0m8WZtw@mail.gmail.com
The prior code, added in 03d40e4b5 attempted to use the targetlist of the
first UNION child when all UNION children were proven as dummy rels.
That's not going to work when some operation atop of the Result node must
find target entries within the Result's targetlist. This could have been
something as simple as trying to sort the results of the UNION operation,
which would lead to:
ERROR: could not find pathkey item to sort
Instead, use the top-level UNION's targetlist and fix the varnos in
setrefs.c. Because set operation targetlists always use varno==0, we
can rewrite those to become varno==1, i.e. use the Vars from the first
UNION child. This does result in showing Vars from relations that are
not present in the final plan, but that's no different to what we see
when normal base relations are proven dummy.
Without this fix it would be possible to see the following error in
EXPLAIN VERBOSE when all UNION inputs were proven empty.
ERROR: bogus varno: 0
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrUASy9sfULMEsM2udvZJP6AoBRCZvHYXYxZTy2tX9FYw@mail.gmail.com
Previously, we attempted to form a posting list tuple even when
ginCompressPostingList() failed to compress the posting list due to
its size. While there was no functional failure, it always wasted one
GinFormTuple() call when item pointers didn't fit in a posting list
tuple.
This commit ensures that a GIN index tuple is formed only when all
item pointers in the posting list are successfully compressed.
Author: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAE7r3M+C=jcpTD93f_RBHrQp3C+=TAXFs+k4tTuZuuxboK8AvA@mail.gmail.com
The hex_encode() and hex_decode() functions serve as the workhorses
for hexadecimal data for bytea's text format conversion functions,
and some workloads are sensitive to their performance. This commit
adds new implementations that use routines from port/simd.h, which
testing indicates are much faster for larger inputs. For small or
invalid inputs, we fall back on the existing scalar versions.
Since we are using port/simd.h, these optimizations apply to both
x86-64 and AArch64.
Author: Nathan Bossart <nathandbossart@gmail.com>
Co-authored-by: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Co-authored-by: Susmitha Devanga <devanga.susmitha@fujitsu.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/aLhVWTRy0QPbW2tl%40nathan
This patch enhances the pg_get_sequence_data function to include the
page-level LSN (Log Sequence Number) of the sequence. This additional
metadata will be used by upcoming patches to support synchronization
of sequences during logical replication.
By exposing the LSN, we enable more accurate tracking of sequence
changes, which is essential for maintaining consistency across
replicated nodes.
Author: vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
The WAL records XLOG_GIN_INSERT and XLOG_GIN_VACUUM_DATA_LEAF_PAGE
included some information about the blocks added to the record.
This information is already provided by XLogRecGetBlockRefInfo() with
much more details about the blocks included in each record, like the
compression information, for example. This commit removes the block
information that existed in the record descriptions specific to GIN.
Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CALdSSPgk=9WRoXhZy5fdk+T1hiau7qbL_vn94w_L1N=gtEdbsg@mail.gmail.com
It is possible to call pg_stat_reset_single_table_counters() on a
relation (index or table) but the reset time was missing from the system
views showing their statistics.
This commit adds the reset time as an attribute of pg_stat_all_tables,
pg_stat_all_indexes, and other relations related to them.
Bump catalog version.
Bump PGSTAT_FILE_FORMAT_ID, as a result of the new field added to
PgStat_StatTabEntry.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aN8l182jKxEq1h9f@paquier.xyz
This adjusts UNION planning so that the planner produces more optimal
plans when one or more of the UNION's subqueries have been proven to be
empty (a dummy rel).
If any of the inputs are empty, then that input can be removed from the
Append / MergeAppend. Previously, a const-false "Result" node would
appear to represent this. Removing empty inputs has a few extra
benefits when only 1 union child remains as it means the Append or
MergeAppend can be removed in setrefs.c, making the plan slightly faster
to execute. Also, we can provide better n_distinct estimates by looking
at the sole remaining input rel's statistics.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAApHDvri53PPF76c3M94_QNWbJfXjyCnjXuj_2=LYM-0m8WZtw@mail.gmail.com
bms_union() causes a new set to be allocated. What this caller needs is
members added to an existing set. bms_add_members() is the tool for
that job.
This is just a matter of fixing an inefficiency due to surplus memory
allocations. No bugs being fixed.
The only other place I found that might be valid to apply this change is
in markNullableIfNeeded(), but I opted not to do that due to the risk to
reward ratio not looking favorable. The risk being that there *could* be
another pointer pointing to the Bitmapset.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAApHDvoCcoS-p5tZNJLTxFOKTYNjqVh7Dwf+5ikDUBwnvWftRw@mail.gmail.com
Make some use of anonymous unions, which are allowed as of C11, as
examples and encouragement for future code, and to test compilers.
This commit changes the DSMRegistryEntry struct.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/aNKsDg0fJwqhZdXX%40nathan
In similar vein to commit ccc8194e42, a reset instance of a shared
memory TID store happened to occupy the same private memory as the old
one for the entry point, since the chunk freed after the last round
of index vacuuming was put on the context's freelist. The failure
to update the vacrel->dead_items pointer was evident by nudging the
system to allocate memory in a different area. This was not discovered
at the time of the earlier commit since our regression tests didn't
cover multiple index passes with parallel vacuum.
Backpatch to v17, when TidStore came in.
Author: Kevin Oommen Anish <kevin.o@zohocorp.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Tested-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/199a07cbdfc.7a1c4aac25838.1675074408277594551%40zohocorp.com
Backpatch-through: 17
The comment incorrectly references the defunct function
BufFileOpenShared(), which was replaced in commit dcac5e7ac.
This patch updates the comment to refer to the current function
BufFileOpenFileSet().
Author: Zhang Mingli <zmlpostgres@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/1cb48b4c-54ab-40cc-b355-0b3c2af6d3f7@Spark
Add IGNORE NULLS/RESPECT NULLS option (null treatment clause) to lead,
lag, first_value, last_value and nth_value window functions. If
unspecified, the default is RESPECT NULLS which includes NULL values
in any result calculation. IGNORE NULLS ignores NULL values.
Built-in window functions are modified to call new API
WinCheckAndInitializeNullTreatment() to indicate whether they accept
IGNORE NULLS/RESPECT NULLS option or not (the API can be called by
user defined window functions as well). If WinGetFuncArgInPartition's
allowNullTreatment argument is true and IGNORE NULLS option is given,
WinGetFuncArgInPartition() or WinGetFuncArgInFrame() will return
evaluated function's argument expression on specified non NULL row (if
it exists) in the partition or the frame.
When IGNORE NULLS option is given, window functions need to visit and
evaluate same rows over and over again to look for non null rows. To
mitigate the issue, 2-bit not null information array is created while
executing window functions to remember whether the row has been
already evaluated to NULL or NOT NULL. If already evaluated, we could
skip the evaluation work, thus we could get better performance.
Author: Oliver Ford <ojford@gmail.com>
Co-authored-by: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Krasiyan Andreev <krasiyan@gmail.com>
Reviewed-by: Andrew Gierth <andrew@tao11.riddles.org.uk>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Fetter <david@fetter.org>
Reviewed-by: Vik Fearing <vik@postgresfriends.org>
Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/flat/CAGMVOdsbtRwE_4+v8zjH1d9xfovDeQAGLkP_B6k69_VoFEgX-A@mail.gmail.com
Oversight in 2c03216d83, when the redo code of GIN got refactored for
the new WAL format where block information has been standardized, as the
payload data got tracked for each block after the change, and not in the
whole record. This is just a cleanup.
Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CALdSSPgnAt5L=D_xGXRXLYO5FK1H31_eYEESxdU1n-r4g+6GqA@mail.gmail.com
Stop including utils/relcache.h in access/genam.h, and stop including
htup_details.h in nodes/tidbitmap.h. Both these files (genam.h and
tidbitmap.h) are widely used in other header files, so it's in our best
interest that they remain as lean as reasonable.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/202509291356.o5t6ny2hoa3q@alvherre.pgsql
During recovery, XLogNeedsFlush() checks the minimum recovery LSN point
instead of the flush LSN point. The same condition checks are used when
updating the minimum recovery point in UpdateMinRecoveryPoint(), but are
written in reverse order.
This commit makes the order of the checks consistent between
XLogNeedsFlush() and UpdateMinRecoveryPoint(), improving the code
clarity. Note that the second check (as ordered by this commit) relies
on InRecovery, which is true only in the startup process. So this makes
XLogNeedsFlush() cheaper in the startup process with the first check
acting as a shortcut while doing crash recovery, where
LocalMinRecoveryPoint is an invalid LSN.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/aMIHNRTP6Wj6vw1s%40paquier.xyz
GROUP BY ALL is a form of GROUP BY that adds any TargetExpr that does
not contain an aggregate or window function into the groupClause of
the query, making it exactly equivalent to specifying those same
expressions in an explicit GROUP BY list.
This feature is useful for certain kinds of data exploration. It's
already present in some other DBMSes, and the SQL committee recently
accepted it into the standard, so we can be reasonably confident in
the syntax being stable. We do have to invent part of the semantics,
as the standard doesn't allow for expressions in GROUP BY, so they
haven't specified what to do with window functions. We assume that
those should be treated like aggregates, i.e., left out of the
constructed GROUP BY list.
In passing, wordsmith some existing documentation about GROUP BY,
and update some neglected synopsis entries in select_into.sgml.
Author: David Christensen <david@pgguru.net>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAHM0NXjz0kDwtzoe-fnHAqPB1qA8_VJN0XAmCgUZ+iPnvP5LbA@mail.gmail.com
Neighbor get_statistics_object_oid() ignores objects in pg_temp, as has
been the standard for non-relation, non-type namespace searches since
CVE-2007-2138. Hence, most operations that name a statistics object
correctly decline to map an unqualified name to a statistics object in
pg_temp. StatisticsObjIsVisibleExt() did not. Consequently,
pg_statistics_obj_is_visible() wrongly returned true for such objects,
psql \dX wrongly listed them, and getObjectDescription()-based ereport()
and pg_describe_object() wrongly omitted namespace qualification. Any
malfunction beyond that would depend on how a human or application acts
on those wrong indications. Commit
d99d58cdc8 introduced this. Back-patch to
v13 (all supported versions).
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/20250920162116.2e.nmisch@google.com
Backpatch-through: 13
For UNION, EXCEPT and INTERSECT, we were not very good at estimating the
PathTarget.width for the set operation. Since the targetlist of the set
operation is made up of Vars with varno==0, this would result in
get_expr_width() applying a default estimate based on the Var's type
rather than taking width estimates from any relation's statistics.
Here we attempt to improve the situation by looking at the width estimates
for the set operation child paths and calculating the average width of the
relevant child paths weighted over the estimated number of rows. For
UNION and INTERSECT, the relevant paths to look at are *all* child paths.
For EXCEPT, since we don't return rows from the right-hand child (only
possibly remove left-hand rows matching those), we use only the left-hand
child for width estimates.
This also adjusts the hashed-UNION Path's PathTarget to use the same
PathTarget as its Append subpath. Both PathTargets will be the same and
are void of any resjunk columns, per generate_append_tlist(). Making
the AggPath use the same PathTarget saves having to adjust the "width"
of the AggPath's PathTarget too.
This was reported as a bug by sunw.fnst, but it's not something we ever
claimed to do properly. Plus, if we were to adjust this in back
branches, plans could change as the estimated input sizes to Sorts and
Hash Aggregates could go up or down. Plan choices aren't something we
want to destabilize in stable versions.
Reported-by: sunw.fnst <936739278@qq.com>
Author: David Rowley <drowleyml@gmail.com>
Discussion: https://postgr.es/m/tencent_34CF8017AB81944A4C08DD089D410AB6C306@qq.com
Stats kinds can set a new option called "track_entry_count" (disabled by
default, available for variable-numbered stats) that will make pgstats
track the number of entries that exist in its shared hashtable.
As there is only one code path where a new entry is added, and one code
path where entries are freed, the count tracking is straight-forward in
its implementation. Reads of these counters are optimistic, and may
change across two calls. The counter is incremented when an entry is
created (not when reused), and is decremented when an entry is freed
from the hashtable (marked for drop with its refcount reaching 0), which
is something that pgstats decides internally.
A first use case of this facility would be pg_stat_statements, where we
need to be able to cap the number of entries that would be stored in the
shared hashtable, based on its "max" GUC. The module currently relies
on hash_get_num_entries(), which offers a cheap way to count how many
entries are in its hash table, but we cannot do that in pgstats for
variable-sized stats kinds as a single hashtable is used for all the
stats kinds. Independently of PGSS, this is useful for other custom
stats kinds that want to cap, control, or track the number of entries
they have, without depending on a potentially expensive sequential scan
to know the number of entries while holding an extra exclusive lock.
Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Keisuke Kuroda <keisuke.kuroda.3862@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aMPKWR81KT5UXvEr@paquier.xyz
transformPLAssignStmt contained many lines cribbed directly from
transformSelectStmt. I had supposed that we could manage to keep
the two copies in sync, but the bug just fixed in 7504d2be9 shows
that that hope was foolish. Let's refactor so there's just one copy.
The main stumbling block to doing this is that transformPLAssignStmt
has a chunk of custom code that has to run after transformTargetList
but before we potentially modify the tlist further during analysis
of ORDER BY and GROUP BY. Rather than make transformSelectStmt fully
aware of PLAssignStmt processing, I put that code into a callback
function. It still feels a little bit ugly, but it's not too awful,
and surely it's better than a hundred lines of duplicated code.
The steps involved in processing a PLAssignStmt remain exactly
the same as before, just in different places.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/31027.1758919078@sss.pgh.pa.us
Because we failed to do this, DISTINCT in GROUP BY DISTINCT would be
ignored in PL/pgSQL assignment statements. It's not surprising that
no one noticed, since such statements will throw an error if the query
produces more than one row. That eliminates most scenarios where
advanced forms of GROUP BY could be useful, and indeed makes it hard
even to find a simple test case. Nonetheless it's wrong.
This is directly the fault of be45be9c3 which added the groupDistinct
field, but I think much of the blame has to fall on c9d529848, in
which I incautiously supposed that we'd manage to keep two copies of
a big chunk of parse-analysis logic in sync. As a follow-up, I plan
to refactor so that there's only one copy. But that seems useful
only in master, so let's use this one-line fix for the back branches.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/31027.1758919078@sss.pgh.pa.us
Backpatch-through: 14
Use our established coding pattern to reduce maintenance pain when
adding other per-process-type characteristics.
Like PG_KEYWORD, PG_CMDTAG, PG_RMGR.
To keep the strings translatable, the relevant makefile now also scans
src/include for this specific file. I didn't want to have it scan all
.h files, as then gettext would have to scan all header files. I didn't
find any way to affect the meson behavior in this respect though.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Co-authored-by: Jonathan Gonzalez V. <jonathan.abdiel@gmail.com>
Discussion: https://postgr.es/m/202507151830.dwgz5nmmqtdy@alvherre.pgsql
This improves the stability of VACUUM when processing btree indexes,
which was previously able to trigger an assertion failure in
_bt_lock_subtree_parent() when an error was previously thrown outside
the scope of _bt_split() when splitting a btree page. VACUUM would
consider the index as in a corrupted state as the right page would not
be zeroed for the error thrown (allocation failure is one pattern).
In a non-assert build, VACUUM is able to succeed, reporting what it sees
as a corruption while attempting to fix the index. This would manifest
as a LOG message, as of:
LOG: failed to re-find parent key in index "idx" for deletion target
page N
CONTEXT: while vacuuming index "idx" of relation "public.tab"
This commit improves the code to rely on two PGAlignedBlocks that are
used as a temporary space for the left and right pages. The main change
concerns the right page, whose contents are now copied into the
"temporary" PGAlignedBlock page while its original space is zeroed. Its
contents are moved from the PGAlignedBlock page back to the page once we
enter in the critical section used for the split. This simplifies the
split logic, as it is not necessary to zero the right page before
throwing an error anymore. Hence errors can now be thrown outside the
split code. For the left page, this shaves one allocation, with
PageGetTempPage() being previously used.
The previous logic originates from commit 8fa30f906b, at a point where
PGAlignedBlock did not exist yet. This could be argued as something
that should be backpatched, but the lack of complaints indicates that it
may not be necessary.
Author: Konstantin Knizhnik <knizhnik@garret.ru>
Discussion: https://postgr.es/m/566dacaf-5751-47e4-abc6-73de17a5d42a@garret.ru
The comment claimed that a TABLESPACE reference was added to the
resulting string, but that's not true. Looks like the comment was
copied from pg_get_indexdef_string() without being adjusted correctly.
Reported-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxHwVPgeu8o9D8oUeDQYEHTAZGt-J5uaJNgYMzkAW7MiCA@mail.gmail.com
... and find_window_run_conditions.
This seems to have been around and unused ever since the Run Condition
feature was added in 9d9c02ccd. Let's remove it to clean things up a
bit.
Author: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/DD26NJ0Y34ZS.2ZOJPHSY12PFI@gmail.com
I noticed the surprising behavior that pg_sleep(0.001) will sleep
for 2ms not the expected 1ms. Apparently the float8 calculation of
time-to-sleep is managing to produce something a hair over 1, which
ceil() rounds up to 2, and then WaitLatch() faithfully waits 2ms.
It could be that this works as-expected for some ranges of current
timestamp but not others, which would account for not having seen
it before. In any case, let's try to avoid it by removing the
float arithmetic in the delay calculation. We're stuck with the
declared input type being float8, but we can convert that to integer
microseconds right away, and then work strictly with integral values.
There might still be roundoff surprises for certain input values,
but at least the behavior won't be time-varying.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/3879137.1758825752@sss.pgh.pa.us
If we already have an extension_state array but see a new extension_id
much larger than the highest the extension_id we've previously seen,
the old code might have failed to expand the array to a large enough
size, leading to disaster. Also, if we don't have an extension array
at all and need to create one, we should make sure that it's big enough
that we don't have to resize it instantly.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/2949591.1758570711@sss.pgh.pa.us
Backpatch-through: 18
When defining an injection point there is no need to wrap the definition
with USE_INJECTION_POINT guards, the INJECTION_POINT macro is available
in all builds. Remove to make the code consistent.
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/OSCPR01MB14966C8015DEB05ABEF2CE077F51FA@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 17
... which silently propagates a lot of headers into many places
via pgstat.h, as evidenced by the variety of headers that this patch
needs to add to seemingly random places. Add a minimum of typedefs to
conflict.h to be able to remove execnodes.h, and fix the fallout.
Backpatch to 18, where conflict.h first appeared.
Discussion: https://postgr.es/m/202509191927.uj2ijwmho7nv@alvherre.pgsql
f83d709760 incorrectly refers to a XLOG_HEAP2_PRUNE_FREEZE WAL record
opcode. No such code exists. The relevant opcodes are
XLOG_HEAP2_PRUNE_ON_ACCESS, XLOG_HEAP2_PRUNE_VACUUM_SCAN, and
XLOG_HEAP2_PRUNE_VACUUM_CLEANUP. Correct it.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/yn4zp35kkdsjx6wf47zcfmxgexxt4h2og47pvnw2x5ifyrs3qc%407uw6jyyxuyf7
Without this, rebuilds can malfunction unless --enable-depend is used.
Historically we've expected that you can get away without
--enable-depend as long as you manually clean after changing *.h
files; the makefiles are supposed to handle other sorts of
dependencies. So add this one.
Follow-on to 635998965, so no need for back-patch.
Discussion: https://postgr.es/m/3121329.1758650878@sss.pgh.pa.us
This commit corrects several issues in function comments:
* The parameter "rel" was incorrectly referred to as "relation" in the comments
for table_tuple_delete(), table_tuple_update(), and table_tuple_lock().
* In table_tuple_delete(), "changingPart" was listed as an output parameter
in the comments but is actually input.
* In table_tuple_update(), "slot" was listed as an input parameter
in the comments but is actually output.
* The comment for "update_indexes" in table_tuple_update() was mis-indented.
* The comments for heap_lock_tuple() incorrectly referenced a non-existent
"tid" parameter.
Author: Chao Li <lic@highgo.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2nB6Ay8g=KEn7L3qbYX_4+sLk9XOMkV0XZqHR4cTY8ZvQ@mail.gmail.com
Format validation and element extraction for intermediate line
strings were inconsistent in their handling of tab delimiters,
which resulted in an unclear error when multiple tab characters
were used as a delimiter. This fixes it by using captures from
the validation regex instead of a separate split() to avoid the
inconsistency. Also, it ensures that \t+ is used consistently
when inspecting the strings.
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/20250729.135638.1148639539103758555.horikyota.ntt@gmail.com
Mappings for 18 characters have changed, affecting 36 code points. This
is a break in compatibility, but these characters are rarely used.
U+E5E5 (Private Use Area) was previously mapped to \xA3A0. This code
point now maps to \x65356535. Attempting to convert \xA3A0 will now
raise an error.
Separate from the 2022 update, the following mappings were previously
swapped, and subsequently corrected in 2000 and later versions:
* U+E7C7 (Private Use Area) now maps to \x8135F437
* U+1E3F (Latin Small Letter M with Acute) now maps to \xA8BC
The 2022 standard mentions the following policy changes, but they
have no effect in our implementation:
66 new ideographs are now required, but these are mapped
algorithmically so were already handled by utf8_and_gb18030.c.
Nine CJK compatibility ideographs are no longer required, but
implementations may retain them, as does the source we use from
the Unicode Consortium.
Release notes: Compatibility section
For further details, see:
https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdfhttps://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
Author: Chao Li <lic@highgo.com>
Author: Zheng Tao <taoz@highgo.com>
Discussion: https://postgr.es/m/966d9fc.169.198741fe60b.Coremail.jiaoshuntian%40highgo.com
Previously, the parallel apply worker used SIGINT to receive a graceful
shutdown signal from the leader apply worker. However, SIGINT is also used
by the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. This
overlap caused the parallel apply worker to miss LOCK_TIMEOUT signals,
leading to incorrect behavior during lock wait/contention.
This patch resolves the conflict by switching the graceful shutdown signal
from SIGINT to SIGUSR2.
Reported-by: Zane Duffield <duffieldzane@gmail.com>
Diagnosed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 16, where it was introduced
Discussion: https://postgr.es/m/CACMiCkXyC4au74kvE2g6Y=mCEF8X6r-Ne_ty4r7qWkUjRE4+oQ@mail.gmail.com
Result nodes now include an RTI set, which is only non-NULL when they
have no subplan, and is taken from the relid set of the RelOptInfo that
the Result is generating. ExplainPreScanNode now takes notice of these
RTIs, which means that a few things get schema-qualified in the
regression tests that previously did not. This makes the output more
consistent between cases where some part of the plan tree is replaced by
a Result node and those where this does not happen.
Likewise, pg_overexplain's EXPLAIN (RANGE_TABLE) now displays the RTIs
stored in a Result node just as it already does for other RTI-bearing
node types.
Result nodes also now include a result_reason, which tells us something
about why the Result node was inserted. Using that information, EXPLAIN
now emits, where relevant, a "Replaces" line describing the origin of
a Result node.
The purpose of these changes is to allow code that inspects a Plan
tree to understand the origin of Result nodes that appear therein.
Discussion: http://postgr.es/m/CA+TgmoYeUZePZWLsSO+1FAN7UPePT_RMEZBKkqYBJVCF1s60=w@mail.gmail.com
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
This adds support for base64url encoding and decoding, a base64
variant which is safe to use in filenames and URLs. base64url
replaces '+' in the base64 alphabet with '-' and '/' with '_',
thus making it safe for URL addresses and file systems.
Support for base64url was originally suggested by Przemysław Sztoch.
Author: Florents Tselai <florents.tselai@gmail.com>
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-by: David E. Wheeler <david@justatheory.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Chao Li (Evan) <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/70f2b6a8-486a-4fdb-a951-84cef35e22ab@sztoch.pl
The lossy-counting algorithm that ANALYZE uses to identify most-common
array elements has a notion of cutoff frequency: elements with
frequency greater than that are guaranteed to be collected, elements
with smaller frequencies are not. In cases where we find fewer MCEs
than the stats target would permit us to store, the cutoff frequency
provides valuable additional information, to wit that there are no
non-MCEs with frequency greater than that. What the selectivity
estimation functions actually use the "minfreq" entry for is as a
ceiling on the possible frequency of non-MCEs, so using the cutoff
rather than the lowest stored MCE frequency provides a tighter bound
and more accurate estimates.
Therefore, instead of redundantly storing the minimum observed MCE
frequency, store the cutoff frequency when there are fewer tracked
values than we want. (When there are more, then of course we cannot
assert that no non-stored elements are above the cutoff frequency,
since we're throwing away some that are; so we still use the
minimum stored frequency in that case.)
Notably, this works even when none of the values are common enough
to be called MCEs. In such cases we previously stored nothing in
the STATISTIC_KIND_MCELEM pg_statistic slot, which resulted in the
selectivity functions falling back to default estimates. So in that
case we want to construct a STATISTIC_KIND_MCELEM entry that contains
no "values" but does have "numbers", to wit the three extra numbers
that the MCELEM entry type defines. A small obstacle is that
update_attstats() has traditionally stored a null, not an empty array,
when passed zero "values" for a slot. That gives rise to an MCELEM
entry that get_attstatsslot() will spit up on. The least risky
solution seems to be to adjust update_attstats() so that it will emit
a non-null (but possibly empty) array when the passed stavalues array
pointer isn't NULL, rather than conditioning that on numvalues > 0.
In other existing cases I don't believe that that changes anything.
For consistency, handle the stanumbers array the same way.
In passing, improve the comments in routines that use
STATISTIC_KIND_MCELEM data. Particularly, explain why we use
minfreq / 2 not minfreq as the estimate for non-MCE values.
Thanks to Matt Long for the suggestion that we could apply this
idea even when there are more than zero MCEs.
Reported-by: Mark Frost <FROSTMAR@uk.ibm.com>
Reported-by: Matt Long <matt@mattlong.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/PH3PPF1C905D6E6F24A5C1A1A1D8345B593E16FA@PH3PPF1C905D6E6.namprd15.prod.outlook.com
Commit a391ff3c3, which added the ability for a function's support
function to provide a custom selectivity estimate for "WHERE f(...)",
unintentionally removed the possibility of applying expression
statistics after finding there's no applicable support function.
That happened because we no longer fell through to boolvarsel()
as before. Refactor to do so again, putting the 0.3333333 default
back into boolvarsel() where it had been (cf. commit 39df0f150).
I surely wouldn't have made this error if 39df0f150 had included
a test case, so add one now. At the time we did not have the
"extended statistics" infrastructure, but we do now, and it is
also unable to work in this scenario because of this error.
So make use of that for the test case.
This is very clearly a bug fix, but I'm afraid to put it into
released branches because of the likelihood of altering plan
choices, which we avoid doing in minor releases. So, master only.
Reported-by: Frédéric Yhuel <frederic.yhuel@dalibo.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/a8b99dce-1bfb-4d97-af73-54a32b85c916@dalibo.com
Initially this was to fix the "catched" typo, but I (David) wasn't quite
clear on what the previous comment meant about being "effective". I
expect this means efficiency, so I've reworded the comment to indicate
that.
While this is only a comment fixup, for the sake of possibly minimizing
possible future backpatching pain, I've opted to backpatch to 18 since
this code is new to that version and the release isn't out the door yet.
Author: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNmSYWPud1sfBvpKbCJeRkWeZYuqatxtV9U9LvAFXBEiBw@mail.gmail.com
Backpatch-through: 18
Commit 216a784829 introduced parallel apply workers, allowing multiple
processes to share a replication origin. To support this,
replorigin_session_setup() was extended to accept a pid argument
identifying the process using the origin.
This commit exposes that capability through the SQL interface function
pg_replication_origin_session_setup() by adding an optional pid parameter.
This enables multiple processes to coordinate replication using the same
origin when using SQL-level replication functions.
This change allows the non-builtin logical replication solutions to
implement parallel apply for large transactions.
Additionally, an existing internal error was made user-facing, as it can
now be triggered via the exposed SQL API.
Author: Doruk Yilmaz <doruk@mixrank.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CAMPB6wfe4zLjJL8jiZV5kjjpwBM2=rTRme0UCL7Ra4L8MTVdOg@mail.gmail.com
Discussion: https://postgr.es/m/CAE2gYzyTSNvHY1+iWUwykaLETSuAZsCWyryokjP6rG46ZvRgQA@mail.gmail.com
When deciding which code path to use depending on the state of recovery,
XLogFlush() and XLogNeedsFlush() have been relying on different
criterias:
- XLogFlush() relied on XLogInsertAllowed().
- XLogNeedsFlush() relied on RecoveryInProgress().
Currently, the checkpointer is allowed to insert WAL records while
RecoveryInProgress() returns true for an end-of-recovery checkpoint,
where XLogInsertAllowed() matters. Using RecoveryInProgress() in
XLogNeedsFlush() did not really matter for its existing callers, as the
checkpointer only called XLogFlush(). However, a feature under
discussion, by Melanie Plageman, needs XLogNeedsFlush() to be able to
work in more contexts, the end-of-recovery checkpoint being one.
This commit changes XLogNeedsFlush() to use XLogInsertAllowed() instead
of RecoveryInProgress(), making the checks in both routines more
consistent. While on it, an assertion based on XLogNeedsFlush() is
added at the end of XLogFlush(), triggered when flushing a physical
position (not for the normal recovery patch that checks for updates of
the minimum recovery point). This assertion would fail for example in
the recovery test 015_promotion_pages if XLogNeedsFlush() is changed to
use RecoveryInProgress(). This should be hopefully enough to ensure
that the checks done in both routines remain consistent.
Author: Melanie Plageman <melanieplageman@gmail.com>
Co-authored-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g@mail.gmail.com
Commit bb3ec16e14 moved partition pruning metadata into PlannedStmt.
At executor startup this metadata is used to initialize the EState
fields es_part_prune_infos, es_part_prune_states, and
es_part_prune_results. EvalPlanQualStart() failed to copy those
fields into the child EState, causing NULL dereference when Append
ran partition pruning during a recheck. This can occur with DELETE
or UPDATE on partitioned tables that use runtime pruning, e.g. with
generic plans.
Fix by copying all partition pruning state into the EPQ estate.
Add an isolation test that reproduces the crash with concurrent
UPDATE and DELETE on a partitioned table, where the DELETE session
hits the crash during its EPQ recheck after the UPDATE commits.
Bug: #19056
Reported-by: Fei Changhong <feichanghong@qq.com>
Diagnozed-by: Fei Changhong <feichanghong@qq.com>
Author: David Rowley <dgrowleyml@gmail.com>
Co-authored-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/19056-a677cef9b54d76a0%40postgresql.org
This change is a tighter rework of 7d85d87f4d, which tried to improve
the code so as it would work should PgStat_HashKey gain new fields that
create padding bytes. However, the previous change is proving to not be
enough as some code paths of pgstats do not pass PgStat_HashKey by
reference (valgrind would warn when padding is added to the structure,
through a new field).
Per discussion, let's document and check that PgStat_HashKey has no
padding rather than try to complicate the code of pgstats so as it is
able to work around that.
This removes a couple of memset(0) calls that should not be required.
While on it, this commit adds a static assertion checking that no
padding is introduced in the structure, by checking that the size of
PgStat_HashKey matches with the sum of the size of all its fields.
The object ID part of the hash key is already 8 bytes, which should be
plenty enough already. A comment is added to discourage the addition of
new fields.
Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0t9omat+HVSakJXwTMWvhpYFcAZb41RPWKwrKFUgmAFBQ@mail.gmail.com
When shared memory is re-initialized after a crash, the named
LWLock tranche request array that was copied to shared memory will
no longer be accessible. To fix, save the pointer to the original
array in postmaster's local memory, and switch to it when
re-initializing the LWLock-related shared memory.
Oversight in commit ed1aad15e0. Per buildfarm member batta.
Reported-by: Michael Paquier <michael@paquier.xyz>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aMoejB3iTWy1SxfF%40paquier.xyz
Discussion: https://postgr.es/m/f8ca018f-3479-49f6-a92c-e31db9f849d7%40gmail.com
StrategyInitialize() calls InitBufTable() with maximum number of entries that
the buffer lookup table can ever have. Thus there should not be any need to
allocate more element after initialization. Hence mark the hash table as fixed
sized.
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAExHW5v0jh3F_wj86yC=qBfWk0uiT94qy=Z41uzAHLHh0SerRA@mail.gmail.com
If an aggregate function call contains a sub-select that has
an RTE referencing a CTE outside the aggregate, we must treat
that reference like a Var referencing the CTE's query level
for purposes of determining the aggregate's level. Otherwise
we might reach the nonsensical conclusion that the aggregate
should be evaluated at some query level higher than the CTE,
ending in a planner error or a broken plan tree that causes
executor failures.
Bug: #19055
Reported-by: BugForge <dllggyx@outlook.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19055-6970cfa8556a394d@postgresql.org
Backpatch-through: 13
Commit 2a600a93 made Datum 8 bytes wide everywhere. It was no longer
appropriate to use TypeSizeT on 32 bit systems, and JIT compilation
would fail with various type check errors. Introduce a separate
LLVMTypeRef with the name TypeDatum. TypeSizeT is still used in some
places for actual size_t values.
Reported-by: Dmitry Mityugov <d.mityugov@postgrespro.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Dmitry Mityugov <d.mityugov@postgrespro.ru>
Discussion: https://postgr.es/m/0a9f0be59171c2e8f1b3bc10f4fcf267%40postgrespro.ru
The shared memory size was calculated based on an offset of io_handles,
which is itself a pointer included in the structure. We tend to
overestimate the shared memory size overall, so this was unlikely an
issue in practice, but let's be correct and use the full size of the
structure in the calculation, so as the pointer for io_handles is
included.
Oversight in da7226993f.
Author: Madhukar Prasad <madhukarprasad@google.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAKi+wrbC2dTzh_vKJoAZXV5wqTbhY0n4wRNpCjJ=e36aoo0kFw@mail.gmail.com
Backpatch-through: 18
The EvalPlanQual recheck for TID Range Scan wasn't rechecking the TID qual
still passed after following update chains. This could result in tuples
being updated or deleted by plans using TID Range Scans where the ctid of
the new (updated) tuple no longer matches the clause of the scan. This
isn't desired behavior, and isn't consistent with what would happen if the
chosen plan had used an Index or Seq Scan, and that could lead to hard to
predict behavior for scans that contain TID quals and other quals as the
planner has freedom to choose TID Range or some other non-TID scan method
for such queries, and the chosen plan could change at any moment.
Here we fix this by properly implementing the recheck function for TID
Range Scans.
Backpatch to 14, where TID Range Scans were added
Reported-by: Sophie Alpert <pg@sophiebits.com>
Author: Sophie Alpert <pg@sophiebits.com>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4a6268ff-3340-453a-9bf5-c98d51a6f729@app.fastmail.com
Backpatch-through: 14
The EvalPlanQual recheck for TID Scan wasn't rechecking the TID qual
still passed after following update chains. This could result in tuples
being updated or deleted by plans using TID Scans where the ctid of the
new (updated) tuple no longer matches the clause of the scan. This isn't
desired behavior, and isn't consistent with what would happen if the
chosen plan had used an Index or Seq Scan, and that could lead to hard to
predict behavior for scans that contain TID quals and other quals as the
planner has freedom to choose TID or some other scan method for such
queries, and the chosen plan could change at any moment.
Here we fix this by properly implementing the recheck function for TID
Scans.
Backpatch to 13, oldest supported version
Reported-by: Sophie Alpert <pg@sophiebits.com>
Author: Sophie Alpert <pg@sophiebits.com>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4a6268ff-3340-453a-9bf5-c98d51a6f729@app.fastmail.com
Backpatch-through: 13
This reverts commit 98fc31d649.
That change allowed DROP OWNED BY to drop grants of the target
role to other roles, arguing that nobody would need those
privileges anymore. But that's not so: if you're not superuser,
you still need admin privilege on the target role so you can
drop it.
It's not clear whether or how the dependency-based approach
to solving the original problem can be adapted to keep these
grants. Since v18 release is fast approaching, the sanest
thing to do seems to be to revert this patch for now. The
race-condition problem is low severity and not worth taking
risks for.
I didn't force a catversion bump in 98fc31d64, so I won't do
so here either.
Reported-by: Dipesh Dhameliya <dipeshdhameliya125@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CABgZEgczOFicCJoqtrH9gbYMe_BV3Hq8zzCBRcMgmU6LRsihUA@mail.gmail.com
Backpatch-through: 18
Up to now we've contented ourselves with a one-size-fits-all error
hint when we fail to find any match to a function or procedure call.
That was mostly okay in the beginning, but it was never great, and
since the introduction of named arguments it's really not adequate.
We at least ought to distinguish "function name doesn't exist" from
"function name exists, but not with those argument names". And the
rules for named-argument matching are arcane enough that some more
detail seems warranted if we match the argument names but the call
still doesn't work.
This patch creates a framework for dealing with these problems:
FuncnameGetCandidates and related code will now pass back a bitmask of
flags showing how far the match succeeded. This allows a considerable
amount of granularity in the reports. The set-bits-in-a-bitmask
approach means that when there are multiple candidate functions, the
report will reflect the match(es) that got the furthest, which seems
correct. Also, we can avoid mentioning "maybe add casts" unless
failure to match argument types is actually the issue.
Extend the same return-a-bitmask approach to OpernameGetCandidates.
The issues around argument names don't apply to operator syntax,
but it still seems worth distinguishing between "there is no
operator of that name" and "we couldn't match the argument types".
While at it, adjust these messages and related ones to more strictly
separate "detail" from "hint", following our message style guidelines'
distinction between those.
Reported-by: Dominique Devienne <ddevienne@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/1756041.1754616558@sss.pgh.pa.us
JsonConstructorExpr can produce non-NULL output with a NULL input, so
it should be treated as a non-strict construct. Failing to do so can
lead to incorrect query behavior.
For example, in the reported case, when pulling up a subquery that is
under an outer join, if the subquery's target list contains a
JsonConstructorExpr that uses subquery variables and it is mistakenly
treated as strict, it will be pulled up without being wrapped in a
PlaceHolderVar. As a result, the expression will be evaluated at the
wrong place and will not be forced to null when the outer join should
do so.
Back-patch to v16 where JsonConstructorExpr was introduced.
Bug: #19046
Reported-by: Runyuan He <runyuan@berkeley.edu>
Author: Tender Wang <tndrwang@gmail.com>
Co-authored-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/19046-765b6602b0a8cfdf@postgresql.org
Backpatch-through: 16
Previously we built the .map files for GB18030 (version 2000) from an
XML file. The 2022 version for this encoding is only available as a
Unicode Character Mapping (UCM) file, so as preparatory refactoring
switch to this format as the source for building version 2000.
As we do with most input files for the conversion mappings, download
the file on demand. In order to generate the same mappings we have
now, we must download from a previous upstream commit, rather than
the head since the latter contains a correction not present in our
current .map files.
The XML file is still used by EUC_CN, so we cannot delete it from our
repository. GB18030 is a superset of EUC_CN, so it may be possible to
build EUC_CN from the same UCM file, but that is left for future work.
Author: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/966d9fc.169.198741fe60b.Coremail.jiaoshuntian%40highgo.com
Commit 7202d72787 added in passing some const qualifiers, but the one
on the postmaster_child_launch() startup_data argument was incorrect,
because the function itself modifies the pointed-to data. This is
hidden from the compiler because of casts. The qualifiers on the
functions called by postmaster_child_launch() are still correct.
Add logic to _bt_set_startikey that determines whether row compare keys
are guaranteed to be satisfied by every tuple on a page that is about to
be read by _bt_readpage. This works in essentially the same way as the
existing scalar inequality logic. Testing has shown that the new logic
improves performance to about the same degree as the existing scalar
inequality logic (compared to the unoptimized case). In other words,
the new logic makes many row compare scans significantly faster.
Note that the new row compare inequality logic is only effective when
the same individual row member is the deciding subkey for all tuples on
the page (obviously, all tuples have to satisfy the row compare, too).
This is what makes the new row compare logic very similar to the
existing logic for scalar inequalities. Note, in particular, that this
makes it safe to ignore whether all row compare members are against
either ASC or DESC index attributes (i.e. it doesn't matter if
individual subkeys don't all use the same inequality strategy).
Also stop refusing to set pstate.startikey to an offset beyond any
nonrequired key (don't add logic that'll do that for an individual row
compare subkey, either). We can fully rely on our firstchangingattnum
tests instead. This will do the right thing when a page has a group of
tuples with NULLs in a lower-order attribute that makes the tuples fail
to satisfy a row compare key -- we won't incorrectly conclude that all
tuples must satisfy the row compare, just because firsttup and lasttup
happen to. Our firstchangingattnum test prevents that from happening.
(Note that the original "avoid evaluating nbtree scan keys" mechanism
added by commit e0b1ee17 couldn't support row compares due to issues
with tuples that contain NULLs in a lower-order subkey's attribute.
That original mechanism relied on requiredness markings, which the
replacement _bt_set_startikey mechanism never really needed.)
Follow up to commit 8a510275, which added the _bt_set_startikey
optimization. _bt_set_startikey is now feature complete; there's no
remaining kind of nbtree scan key that it still doesn't support.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznL6Z3H_GTQze9d8T_Ls=cYbnd-_9f-Jo7aYgTGRUD58g@mail.gmail.com
fmgr.h defined some types such as fmNodePtr which is just Node *, but
it made its own types to avoid having to include various header files.
With C11, we can now instead typedef the original names without fear
of conflicts.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/10d32190-f31b-40a5-b177-11db55597355@eisentraut.org
This commit resumes automatic retention of conflict-relevant data for a
subscription. Previously, retention would stop if the apply process failed
to advance its xmin (oldest_nonremovable_xid) within the configured
max_retention_duration and user needs to manually re-enable
retain_dead_tuples option. With this change, retention will resume
automatically once the apply worker catches up and begins advancing its
xmin (oldest_nonremovable_xid) within the configured threshold.
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
If extensions of equal names were installed in different directories
in the path, the views pg_available_extensions and
pg_available_extension_versions would show all of them, even though
only the first one was actually reachable by CREATE EXTENSION. To
fix, have those views skip extensions found later in the path if they
have names already found earlier.
Also add a bit of documentation that only the first extension in the
path can be used.
Reported-by: Pierrick <pierrick.chovelon@dalibo.com>
Discussion: https://www.postgresql.org/message-id/flat/8f5a0517-1cb8-4085-ae89-77e7454e27ba%40dalibo.com
The TimescaleDB extension expects to be able to change an nbtree scan's
keys across rescans. The issue arises in the extension's implementation
of loose index scan. This is arguably a misuse of the index AM API,
though apparently it worked until recently. It stopped working when the
skipScan flag was added to BTScanOpaqueData by commit 8a510275, though.
The flag wouldn't reliably track whether the scan (actually, the current
rescan) has any skip arrays, leading to confusion in _bt_set_startikey.
nbtree preprocessing will now defensively initialize the scan's skipScan
flag in all cases, including the case where _bt_preprocess_array_keys
returns early due to the (re)scan not using arrays. While nbtree isn't
obligated to support this use case (at least not according to my reading
of the index AM API), it still seems like a good idea to be consistent
here, on general robustness grounds.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Natalya Aksman <natalya@timescale.com>
Discussion: https://postgr.es/m/CAJumhcirfMojbk20+W0YimbNDkwdECvJprQGQ-XqK--ph09nQw@mail.gmail.com
Backpatch-through: 18
Commit e3ffc3e91 fixed the translation of character classes in
SIMILAR TO regular expressions. Unfortunately the fix broke a corner
case: if there is an escape character right after the opening bracket
(for example in "[\q]"), a closing bracket right after the escape
sequence would not be seen as closing the character class.
There were two more oversights: a backslash or a nested opening bracket
right at the beginning of a character class should remove the special
meaning from any following caret or closing bracket.
This bug suggests that this code needs to be more readable, so also
rename the variables "charclass_depth" and "charclass_start" to
something more meaningful, rewrite an "if" cascade to be more
consistent, and improve the commentary.
Reported-by: Dominique Devienne <ddevienne@gmail.com>
Reported-by: Stephan Springl <springl-psql@bfw-online.de>
Author: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFCRh-8NwJd0jq6P=R3qhHyqU7hw0BTor3W0SvUcii24et+zAw@mail.gmail.com
Backpatch-through: 13
If the database or user had no entry in pg_db_role_setting,
RESET silently did nothing --- including not checking the
validity of the given GUC name. This is quite inconsistent
and surprising, because you *would* get such an error if there
were any pg_db_role_setting entry, even though it contains
values for unrelated GUCs.
While this is clearly a bug, changing it in stable branches seems
unwise. The effect will be that some ALTER commands that formerly
were no-ops will now be errors, and people don't like that sort of
thing in minor releases.
Author: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/30783e-68c28a00-9-41004480@130449754
Commit a0b99fc12 caused pg_event_trigger_dropped_objects()
to not fill the object_name field for schemas, which it
should have; and caused it to fill the object_name field
for default values, which it should not have.
In addition, triggers and RLS policies really should behave
the same way as we're making column defaults do; that is,
they should have is_temporary = true if they belong to a
temporary table.
Fix those things, and upgrade event_trigger.sql's woefully
inadequate test coverage of these secondary output columns.
As before, back-patch only to v15.
Reported-by: Sergey Shinderuk <s.shinderuk@postgrespro.ru>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/bd7b4651-1c26-4d30-832b-f942fabcb145@postgrespro.ru
Backpatch-through: 15
A recently added nbtree preprocessing step failed to account for the
fact that DESC columns already had their B-Tree strategy number commuted
at this point in preprocessing. As a result, preprocessing could output
a set of scan keys where one or more keys had the correct strategy
number, but used the wrong comparison routine.
To fix, make the faulty code path that looks up a more restrictive
replacement operator/comparison routine commute its requested inequality
strategy (while outputting the transformed strategy number as before).
This makes the final transformed scan key comport with the approach
preprocessing has always used to deal with DESC columns (which is
described by comments above _bt_fix_scankey_strategy).
Oversight in commit commit b3f1a13f, which made nbtree preprocessing
perform transformations on skip array inequalities that can reduce the
total number of index searches.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Natalya Aksman <natalya@timescale.com>
Discussion: https://postgr.es/m/19049-b7df801e71de41b2@postgresql.org
Backpatch-through: 18
Users of logical decoding can encounter an unexpected change of
CurrentResourceOwner and CurrentMemoryContext. The problem is that,
unlike other call sites of RollbackAndReleaseCurrentSubTransaction(), in
reorderbuffer.c we fail to restore the original values of these global
variables after being clobbered by subtransaction abort. This patch
saves the values prior to the call and restores them eventually.
In addition, logical.c and logicalfuncs.c had a hack to restore resource
owner, presumably because of lack of this restore. Remove that.
Instead, because the test coverage here is not very consistent, add an
Assert() to ensure that the resowner is kept identical; this would make
it easy to detect other cases of bugs were we fail to restore resowner
properly. This could be removed later.
This is arguably an old bug, but there appears to be no reason to
backpatch it and it's risky to do so, so refrain for now.
Author: Antonin Houska <ah@cybertec.at>
Reported-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/119497.1756892972@localhost
If someone is stuck behind a lock for more than a second, that is
almost always a problem that is worth a log entry.
Author: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-By: Michael Banck <mbanck@gmx.net>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Christoph Berg <myon@debian.org>
Reviewed-By: Stephen Frost <sfrost@snowman.net>
Discussion: https://postgr.es/m/b8b8502915e50f44deb111bc0b43a99e2733e117.camel%40cybertec.at
Per discussion, this compiler suite is no longer maintained, and
it has not been able to compile PostgreSQL since at least PostgreSQL
17.
This removes all the remaining support code for this compiler.
Note that the Solaris operating system continues to be supported, but
using GCC as the compiler.
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/a0f817ee-fb86-483a-8a14-b6f7f5991b6e%40eisentraut.org
Clang 21 shows some new compiler warnings, for example:
warning: variable 'dstsize' is uninitialized when passed as a const pointer argument here [-Wuninitialized-const-pointer]
The fix is to initialize the variables when they are defined. This is
similar to, for example, the existing situation in gistKeyIsEQ().
Discussion: https://www.postgresql.org/message-id/flat/6604ad6e-5934-43ac-8590-15113d6ae4b1%40eisentraut.org
The typedef Relids (Bitmapset *) is intended to represent set of
relation identifiers, but was incorrectly used in several places to
store sets of attribute numbers. This is my oversight in e2debb643.
Fix that by replacing such usages with Bitmapset * to reflect the
correct semantics.
Author: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAEG8a3LJhp_xriXf39iCz0TsK+M-2biuhDhpLC6Baxw8+ZYT3A@mail.gmail.com
hashdesc.c was missing a couple of fields in its record descriptions, as
of:
- is_prev_bucket_same_wrt for SQUEEZE_PAGE.
- procid for INIT_META_PAGE.
- old_bucket_flag and new_bucket_flag for SPLIT_ALLOCATE_PAGE.
The author has noted the first hole, and I have spotted the others while
double-checking this area of the code. Note that the only data missing
now are the offsets stored in VACUUM_ONE_PAGE. We could perhaps add
them, if somebody sees value in this data, even if it makes the output
larger. These are discarded here.
Author: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CALdSSPjc-OVwtZH0Xrkvg7n=2ZwdbMJzqrm_ed_CfjiAzuKVGg@mail.gmail.com
In EXEC_BACKEND builds, GetNamedLWLockTranche() can segfault when
called outside of the postmaster process, as it might access
NamedLWLockTrancheRequestArray, which won't be initialized. Given
the lack of reports, this is apparently unusual, presumably because
it is usually called from a shmem_startup_hook like this:
mystruct = ShmemInitStruct(..., &found);
if (!found)
{
mystruct->locks = GetNamedLWLockTranche(...);
...
}
This genre of shmem_startup_hook evades the aforementioned
segfaults because the struct is initialized in the postmaster, so
all other callers skip the !found path.
We considered modifying the documentation or requiring
GetNamedLWLockTranche() to be called from the postmaster, but
ultimately we decided to simply move the request array to shared
memory (and add it to the BackendParameters struct), thereby
allowing calls outside postmaster on all platforms. Since the main
shared memory segment is initialized after accepting LWLock tranche
requests, postmaster builds the request array in local memory first
and then copies it to shared memory later.
Given the lack of reports, back-patching seems unnecessary.
Reported-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0v1_15QPg5Sqd2Qz5rh_qcsyCeHHmRDY89xVHcy2yt5BQ%40mail.gmail.com
pg_event_trigger_dropped_objects() would report a column default
object with is_temporary = false, even if it belongs to a
temporary table. This seems clearly wrong, so adjust it to
report the table's temp-ness.
While here, refactor EventTriggerSQLDropAddObject to make its
handling of namespace objects less messy and avoid duplication
of the schema-lookup code. And add some explicit test coverage
of dropped-object reports for dependencies of temp tables.
Back-patch to v15. The bug exists further back, but the
GetAttrDefaultColumnAddress function this patch depends on does not,
and it doesn't seem worth adjusting it to cope with the older code.
Author: Antoine Violin <violin.antuan@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFjUV9x3-hv0gihf+CtUc-1it0hh7Skp9iYFhMS7FJjtAeAptA@mail.gmail.com
Backpatch-through: 15
The Sun Studio compiler complains about an empty declaration here.
Note for future historians: This does not mean that this compiler is
still of current interest for anyone using PostgreSQL. But we can let
this small fix be its parting gift.
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/a0f817ee-fb86-483a-8a14-b6f7f5991b6e%40eisentraut.org
If sizeof(Pointer) is 4 then sizeof(SortItem) will be 12, so that
if data->numrows is odd then we placed the values array at a location
that's not a multiple of 8. That was fine when sizeof(Datum) was also
4, but in the wake of commit 2a600a93c it makes some alignment-picky
machines unhappy. (You need a 32-bit machine that nonetheless expects
8-byte alignment of 8-byte quantities, which is an odd-seeming
combination but it does exist outside the Intel universe.)
To fix, MAXALIGN the space allocated to the SortItem array.
In passing, let's make the "len" variable be Size not int,
just for paranoia's sake.
This code was arguably not too safe even before 2a600a93c, but at
present I don't see a strong argument for back-patching.
Reported-by: Tomas Vondra <tomas@vondra.me>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/87036018-8d70-40ad-a0ac-192b07bd7b04@vondra.me
Instead of building a separate memory context that's used just
for running hash functions, make the hash functions run in the
per-tuple context of the node's innerecontext. This saves a
little space at runtime, and it avoids needing to reset two
contexts instead of one inside buildSubPlanHash's main loop.
This largely reverts commit 133924e13. That's safe to do now
because bf6c614a2 decoupled the evaluation context used by
TupleHashTableMatch from that used for hash function evaluation,
so that there's no longer a risk of resetting the innerecontext
too soon.
Per discussion of bug #19040, although this is not directly
a fix for that.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Haiyang Li <mohen.lhy@alibaba-inc.com>
Reviewed-by: Fei Changhong <feichanghong@qq.com>
Discussion: https://postgr.es/m/19040-c9b6073ef814f48c@postgresql.org
If the hash functions used for hashing tuples leaked any memory,
we failed to clean that up, resulting in query-lifespan memory
leakage in queries using hashed subplans. One way that could
happen is if the values being hashed require de-toasting, since
most of our hash functions don't trouble to clean up de-toasted
inputs.
Prior to commit bf6c614a2, this leakage was largely masked
because TupleHashTableMatch would reset hashtable->tempcxt
(via execTuplesMatch). But it doesn't do that anymore, and
that's not really the right place for this anyway: doing it
there could reset the tempcxt many times per hash lookup,
or not at all. Instead put reset calls into ExecHashSubPlan
and buildSubPlanHash. Along the way to that, rearrange
ExecHashSubPlan so that there's just one place to call
MemoryContextReset instead of several.
This amounts to accepting the de-facto API spec that the caller
of the TupleHashTable routines is responsible for resetting the
tempcxt adequately often. Although the other callers seem to
get this right, it was not documented anywhere, so add a comment
about it.
Bug: #19040
Reported-by: Haiyang Li <mohen.lhy@alibaba-inc.com>
Author: Haiyang Li <mohen.lhy@alibaba-inc.com>
Reviewed-by: Fei Changhong <feichanghong@qq.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19040-c9b6073ef814f48c@postgresql.org
Backpatch-through: 13
autoconf builds have compiled this file with -ftree-vectorize since
commit 8870917623, but meson builds seem to have missed the memo.
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/aL85CeasM51-0D1h%40nathan
Backpatch-through: 16
All the calls replaced by this commit use 4-byte integers for their
variables used in input of my_log2(). Hence, the limit against
too-large inputs does not really apply. Thresholds are also applied, as
of:
- In nodeAgg.c, the number of partitions is limited by
HASHAGG_MAX_PARTITIONS.
- In nodeHash.c, ExecChooseHashTableSize() caps its maximum number of
buckets based on HashJoinTuple and palloc() allocation limit.
- In worker.c, the number of subxacts tracked by ApplySubXactData uses
uint32, making pg_ceil_log2_64() safe to use directly.
Several approaches have been discussed, like an integration with
thresholds in pg_bitutils.h, but it was found confusing. This uses
Dean's idea, which gives a simpler result than what I came up with to be
able to remove dynahash.h. dynahash.h will be removed in a follow-up
commit, removing some duplication with the ceil log2 routines.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/CAEZATCUJPQD_7sC-wErak2CQGNa6bj2hY-mr8wsBki=kX7f2_A@mail.gmail.com
The startup process does not process shared invalidation messages, only
sending them, and never calls AtEOXact_SMgr() which clean up any
unpinned SMgrRelations. Hence, it is never able to free SMgrRelations
on a periodic basis, bloating its hashtable over time.
Like the checkpointer and the bgwriter, this commit takes a conservative
approach by freeing periodically SMgrRelations when replaying a
checkpoint record, either online or shutdown, so as the startup process
has a way to perform a periodic cleanup.
Issue caused by 21d9c3ee4e, so backpatch down to v17.
Author: Jingtang Zhang <mrdrivingduck@gmail.com>
Reviewed-by: Yuhang Qiu <iamqyh@gmail.com>
Discussion: https://postgr.es/m/28C687D4-F335-417E-B06C-6612A0BD5A10@gmail.com
Backpatch-through: 17
This adds 3 new variants of the random() function:
random(min date, max date) returns date
random(min timestamp, max timestamp) returns timestamp
random(min timestamptz, max timestamptz) returns timestamptz
Each returns a random value x in the range min <= x <= max.
Author: Damien Clochard <damien@dalibo.info>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Vik Fearing <vik@postgresfriends.org>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/f524d8cab5914613d9e624d9ce177d3d@dalibo.info
Address a potential SIGSEGV that may occur when the tablesync worker
attempts to locate a deleted row while applying changes. This situation
arises during conflict detection for update-deleted scenarios.
To prevent this crash, ensure that the operation is errored out early if
the leader apply worker is unavailable. Since the leader worker maintains
the necessary conflict detection metadata, proceeding without it serves no
purpose and risks reporting incorrect conflict type.
In the passing, improve a nearby comment.
Reported by Tom Lane as per Coverity
Author: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/334468.1757280992@sss.pgh.pa.us
Commit fd6ec93bf8 and other previous work established the
principle that when an error is potentially reachable in case of on-disk
corruption but is not expected to be reached otherwise,
ERRCODE_DATA_CORRUPTED should be used. This allows log monitoring
software to search for evidence of corruption by filtering on the error
code.
Enhance the existing log messages emitted when the heap page is found to
be inconsistent with the VM by adding this error code.
Suggested-by: Andrey Borodin <x4mmm@yandex-team.ru>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/87DD95AA-274F-4F4F-BAD9-7738E5B1F905%40yandex-team.ru
Commit 161a3e8b68 taught pg_upgrade to use COPY for large object
metadata for upgrades from v12 and newer, which is much faster to
restore than the proper large object commands. For upgrades from
v16 and newer, we can take this a step further and transfer the
large object metadata files as if they were user tables. We can't
transfer the files from older versions because the aclitem data
type (needed by pg_largeobject_metadata.lomacl) changed its storage
format in v16 (see commit 7b378237aa). Note that this commit is
essentially a revert of commit 12a53c732c.
There are a couple of caveats. First, we still need to COPY the
corresponding pg_shdepend rows for large objects. Second, we need
to COPY anything in pg_largeobject_metadata with a comment or
security label, else restoring those will fail. This means that an
upgrade in which every large object has a comment or security label
won't gain anything from this commit, but it should at least avoid
making those unusual use-cases any worse.
pg_upgrade must also take care to transfer the relfilenodes of
pg_largeobject_metadata and its index, as was done for
pg_largeobject in commits d498e052b4 and bbe08b8869.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aJ3_Gih_XW1_O2HF%40nathan
rte->alias should point only to a user-written alias, but in these
cases that principle was violated. Fixing this causes some regression
test output changes: wherever rte->alias previously had a value and
is now NULL, rte->eref is now set to a generated name rather than to
rte->alias; and the scheme used to generate eref names differs from
what we were doing for aliases.
The upshot is that instead of "*SELECT*" or "*SELECT* %d",
EXPLAIN will now emit "unnamed_subquery" or "unnamed_subquery_%d".
But that's a reasonable descriptor, and we were already producing
that in yet other cases, so this seems not too objectionable.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Co-authored-by: Robert Haas <rhaas@postgresql.org>
Discussion: https://postgr.es/m/CA+TgmoYSYmDA2GvanzPMci084n+mVucv0bJ0HPbs6uhmMN6HMg@mail.gmail.com
Previously, heap_xlog_visible() called visibilitymap_pin() even after
getting a buffer from XLogReadBufferForRedoExtended() -- which returns a
pinned buffer containing the specified block of the visibility map.
This would just have resulted in visibilitymap_pin() returning early
since the specified page was already present and pinned, but it was
confusing extraneous code, so remove it. It doesn't seem worth
backporting, though.
It appears to be an oversight in 2c03216.
While we are at it, remove two VM-related redundant asserts in the COPY
FREEZE code path. visibilitymap_set() already asserts that
PD_ALL_VISIBLE is set on the heap page and checks that the vmbuffer
contains the bits corresponding to the specified heap block, so callers
do not also need to check this.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CALdSSPhu7WZd%2BEfQDha1nz%3DDC93OtY1%3DUFEdWwSZsASka_2eRQ%40mail.gmail.com
A test has been added to ensure that conflict-relevant data is not
prematurely removed when a concurrent prepared transaction is being
committed on the publisher.
This test introduces an injection point that simulates the presence of a
prepared transaction in the commit phase, validating that the system
correctly delays conflict slot advancement until the transaction is fully
committed.
Additionally, the test serves as a safeguard for developers, ensuring that
the acquisition of the commit timestamp does not occur before marking
DELAY_CHKPT_IN_COMMIT in RecordTransactionCommitPrepared.
Reported-by: Robert Haas <robertmhaas@gmail.com>
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS9PR01MB16913F67856B0DA2A909788129400A@OS9PR01MB16913.jpnprd01.prod.outlook.com
A new pgstats entry is created as a two-step process:
- The entry is looked at in the shared hashtable of pgstats, and is
inserted if not found.
- When not found and inserted, its fields are then initialized. This
part include a DSA chunk allocation for the stats data of the new entry.
As currently coded, if the DSA chunk allocation fails due to an
out-of-memory failure, an ERROR is generated, leaving in the pgstats
shared hashtable an inconsistent entry due to the first step, as the
entry has already been inserted in the hashtable. These broken entries
can then be found by other backends, crashing them.
There are only two callers of pgstat_init_entry(), when loading the
pgstats file at startup and when creating a new pgstats entry. This
commit changes pgstat_init_entry() so as we use dsa_allocate_extended()
with DSA_ALLOC_NO_OOM, making it return NULL on allocation failure
instead of failing. This way, a backend failing an entry creation can
take appropriate cleanup actions in the shared hashtable before throwing
an error. Currently, this means removing the entry from the shared
hashtable before throwing the error for the allocation failure.
Out-of-memory errors unlikely happen in the wild, and we do not bother
with back-patches when these are fixed, usually. However, the problem
dealt with here is a degree worse as it breaks the shared memory state
of pgstats, impacting other processes that may look at an inconsistent
entry that a different process has failed to create.
Author: Mikhail Kot <mikhail.kot@databricks.com>
Discussion: https://postgr.es/m/CAAi9E7jELo5_-sBENftnc2E8XhW2PKZJWfTC3i2y-GMQd2bcqQ@mail.gmail.com
Backpatch-through: 15
This commit fixes three issues:
1) When a disabled subscription is created with retain_dead_tuples set to true,
the launcher is not woken up immediately, which may lead to delays in creating
the conflict detection slot.
Creating the conflict detection slot is essential even when the subscription is
not enabled. This ensures that dead tuples are retained, which is necessary for
accurately identifying the type of conflict during replication.
2) Conflict-related data was unnecessarily retained when the subscription does
not have a table.
3) Conflict-relevant data could be prematurely removed before applying
prepared transactions on the publisher that are in the commit critical section.
This issue occurred because the backend executing COMMIT PREPARED was not
accounted for during the computation of oldestXid in the commit phase on
the publisher. As a result, the subscriber could advance the conflict
slot's xmin without waiting for such COMMIT PREPARED transactions to
complete.
We fixed this issue by identifying prepared transactions that are in the
commit critical section during computation of oldestXid in commit phase.
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS9PR01MB16913DACB64E5721872AA5C02943BA@OS9PR01MB16913.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/OS9PR01MB16913F67856B0DA2A909788129400A@OS9PR01MB16913.jpnprd01.prod.outlook.com
This commit allows to log the raw parse tree in the same way we
currently log the parse tree, rewritten tree, and plan tree.
To avoid unnecessary log noise for users not interested in this
detail, a new GUC option, "debug_print_raw_parse", has been added.
When starting the PostgreSQL process with "-d N", and N is 3 or higher,
debug_print_raw_parse is enabled automatically, alongside
debug_print_parse.
Author: Chao Li <lic@highgo.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2mcO0Gpo4vd8kPMAFWeJLSp0MeUUnaLdE1x0tSVd-VzUw%40mail.gmail.com
This set of changes removes the list of available buffers and instead simply
uses the clock-sweep algorithm to find and return an available buffer. This
also removes the have_free_buffer() function and simply caps the
pg_autoprewarm process to at most NBuffers.
While on the surface this appears to be removing an optimization it is in fact
eliminating code that induces overhead in the form of synchronization that is
problematic for multi-core systems.
The main reason for removing the freelist, however, is not the moderate
improvement in scalability, but that having the freelist would require
dedicated complexity in several upcoming patches. As we have not been able to
find a case benefiting from the freelist...
Author: Greg Burd <greg@burd.me>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/70C6A5B5-2A20-4D0B-BC73-EB09DD62D61C@getmailspring.com
Add an assert to visibilitymap_set() that the provided heap buffer is
exclusively locked, which is expected.
Also, enhance the debug logging message to specify which VM flags were
set.
Based on a related suggestion by Kirill Reshke on an in-progress
patchset.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CALdSSPhAU56g1gGVT0%2BwG8RrSWE6qW8TOfNJS1HNAWX6wPgbFA%40mail.gmail.com
When executing a MERGE UPDATE action, if there is more than one
concurrent update of the target row, the lock-and-retry code would
sometimes incorrectly identify the latest version of the target tuple,
leading to incorrect results.
This was caused by using the ctid field from the TM_FailureData
returned by table_tuple_lock() in a case where the result was TM_Ok,
which is unsafe because the TM_FailureData struct is not guaranteed to
be fully populated in that case. Instead, it should use the tupleid
passed to (and updated by) table_tuple_lock().
To reduce the chances of similar errors in the future, improve the
commentary for table_tuple_lock() and TM_FailureData to make it
clearer that table_tuple_lock() updates the tid passed to it, and most
fields of TM_FailureData should not be relied on in non-failure cases.
An exception to this is the "traversed" field, which is set in both
success and failure cases.
Reported-by: Dmitry <dsy.075@yandex.ru>
Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/1570d30e-2b95-4239-b9c3-f7bf2f2f8556@yandex.ru
Backpatch-through: 15
SlruRecentlyUsed() is an inline function since 53c2a97a92, not a
macro. The description of long_segment_names was missing at the top of
SimpleLruInit(), part forgotten in 4ed8f0913b.
Author: Julien Rouhaud <rjuju123@gmail.com>
Discussion: https://postgr.es/m/aLpBLMOYwEQkaleF@jrouhaud
Backpatch-through: 17
This commit changes some functions related to the data type numeric to
use the soft error reporting rather than a custom boolean flag (called
"have_error") that callers of these functions could rely on to bypass
the generation of ERROR reports, letting the callers do their own error
handling (timestamp, jsonpath and numeric_to_char() require them).
This results in the removal of some boilerplate code that was required
to handle both the ereport() and the "have_error" code paths bypassing
ereport(), unifying everything under the soft error reporting facility.
While on it, some duplicated error messages are removed. The function
upgraded in this commit were suffixed with "_opt_error" in their names.
They are renamed to "_safe" instead.
This change relies on d9f7f5d32f, that has introduced the soft error
reporting infrastructure.
Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/CAAJ_b96No5h5tRuR+KhcC44YcYUCw8WAHuLoqqyyop8_k3+JDQ@mail.gmail.com
pg_lsn includes pg_lsn_in_internal() for the purpose of parsing a LSN
position for the GUC recovery_target_lsn (21f428ebde). It relies on a
boolean called "have_error" that would be set when the LSN parsing
fails, then let its callers handle any errors.
d9f7f5d32f has added support for soft error reporting. This commit
removes some boilerplate code and switches the routine to use soft error
reporting directly, giving to the callers of pg_lsn_in_internal()
the possibility to be fed the error message generated on failure.
pg_lsn_in_internal() routine is renamed to pg_lsn_in_safe(), for
consistency with other similar routines that are given an escontext.
Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/CAAJ_b96No5h5tRuR+KhcC44YcYUCw8WAHuLoqqyyop8_k3+JDQ@mail.gmail.com
Commit 38b602b028 modified this function to allocate enough space
for MAX_NAMED_TRANCHES (256) requests, which is likely far more
than most clusters need. This commit reverts that change so that
it first allocates enough space for only 16 requests and resizes
the array when necessary. While at it, remove the check for too
many tranches from this function. We can now rely on
InitializeLWLocks() to do that check via its calls to
LWLockNewTrancheId() for the named tranches.
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/aLmzwC2dRbqk14y6%40nathan
When executing a MERGE, check that the target relation supports all
actions mentioned in the MERGE command. Specifically, check that it
has a REPLICA IDENTITY if it publishes updates or deletes and the
MERGE command contains update or delete actions. Failing to do this
can silently break replication.
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Tested-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/OS3PR01MB57180C87E43A679A730482DF94B62@OS3PR01MB5718.jpnprd01.prod.outlook.com
Backpatch-through: 15
If an INSERT has an ON CONFLICT DO UPDATE clause, the executor must
check that the target relation supports UPDATE as well as INSERT. In
particular, it must check that the target relation has a REPLICA
IDENTITY if it publishes updates. Formerly, it was not doing this
check, which could lead to silently breaking replication.
Fix by adding such a check to CheckValidResultRel(), which requires
adding a new onConflictAction argument. In back-branches, preserve ABI
compatibility by introducing a wrapper function with the original
signature.
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Tested-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/OS3PR01MB57180C87E43A679A730482DF94B62@OS3PR01MB5718.jpnprd01.prod.outlook.com
Backpatch-through: 13
The counters saved from pgWalUsage, used for the difference calculations
when flushing the backend WAL stats, are updated when calling
pgstat_flush_backend() under PGSTAT_BACKEND_FLUSH_WAL, and not
pgstat_report_wal(). The comment updated in this commit referenced the
latter, but it is perfectly OK to flush the backend stats independently
of the WAL stats.
Noticed while looking at this area of the code, introduced by
76def4cdd7 as a copy-pasto.
Backpatch-through: 18
There are two ways for shared libraries to allocate their own
LWLock tranches. One way is to call RequestNamedLWLockTranche() in
a shmem_request_hook, which requires the library to be loaded via
shared_preload_libraries. The other way is to call
LWLockNewTrancheId(), which is not subject to the same
restrictions. However, LWLockNewTrancheId() does require each
backend to store the tranche's name in backend-local memory via
LWLockRegisterTranche(). This API is a little cumbersome and leads
to things like unhelpful pg_stat_activity.wait_event values in
backends that haven't loaded the library.
This commit moves these LWLock tranche names to shared memory, thus
eliminating the need for each backend to call
LWLockRegisterTranche(). Instead, the tranche name must be
provided to LWLockNewTrancheId(), which immediately makes the name
available to all backends. Since the tranche name array is
append-only, lookups can ordinarily avoid locking as long as their
local copy of the LWLock counter is greater than the requested
tranche ID.
One downside of this approach is that we now have a hard limit on
both the length of tranche names (NAMEDATALEN-1 bytes) and the
number of dynamically-allocated tranches (256). Besides a limit of
NAMEDATALEN-1 bytes for tranche names registered via
RequestNamedLWLockTranche(), no such limits previously existed. We
could avoid these new limits by using dynamic shared memory, but
the complexity involved didn't seem worth it. We briefly
considered making the tranche limit user-configurable but
ultimately decided against that, too. Since there is still a lot
of time left in the v19 development cycle, it's possible we will
revisit this choice.
Author: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAA5RZ0vvED3naph8My8Szv6DL4AxOVK3eTPS0qXsaKi%3DbVdW2A%40mail.gmail.com
Commit 6359989654 had it so that the parameter "debug_discard_caches"
did not exist unless DISCARD_CACHES_ENABLED was defined (typically via
enabling asserts). This was a mistake, it did not correspond to the
prior setup. Several tests use this parameter, so they were now
failing if you did not have asserts enabled.
Store the information in guc_tables.c in a .dat file similar to the
catalog data in src/include/catalog/, and generate a part of
guc_tables.c from that. The goal is to make it easier to edit that
information, and to be able to make changes to the downstream data
structures more easily. (Essentially, those are the same reasons as
for the original adoption of the .dat format.)
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: David E. Wheeler <david@justatheory.com>
Discussion: https://www.postgresql.org/message-id/flat/dae6fe89-1e0c-4c3f-8d92-19d23374fb10%40eisentraut.org
SubPlan nodes are typically built very early, before any RelOptInfos
have been constructed for the parent query level. As a result, the
simple_rel_array in the parent root has not yet been initialized.
Currently, during cost estimation of a SubPlan's testexpr, we may call
examine_variable() to look up statistical data about the expressions.
This can lead to "no relation entry for relid" errors.
To fix, pass root as NULL to cost_qual_eval() in cost_subplan(), since
the root does not yet contain enough information to safely consult
statistics.
One exception is SubPlan nodes built for the initplans of MIN/MAX
aggregates from indexes. In this case, having a NULL root is safe
because testexpr will be NULL. Additionally, an initplan will by
definition not consult anything from the parent plan.
Backpatch to all supported branches. Although the reported call path
that triggers this error is not reachable prior to v17, there's no
guarantee that other code paths -- especially in extensions -- could
not encounter the same issue when cost_qual_eval() is called with a
root that lacks a valid simple_rel_array. The test case is not
included in pre-v17 branches though.
Bug: #19037
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19037-3d1c7bb553c7ce84@postgresql.org
Backpatch-through: 13
SLRU bank locks are referred as "bank locks" or "SLRU bank locks" in the
code comments. The comments updated in this commit use the latter term.
Oversight in 53c2a97a92, that has replaced the single ControlLock by
the bank control locks.
Author: Julien Rouhaud <julien.rouhaud@free.fr>
Discussion: https://postgr.es/m/aLUT2UO8RjJOzZNq@jrouhaud
Backpatch-through: 17
COPY TO does not support a WHERE clause, and currently fails with the error:
ERROR: WHERE clause not allowed with COPY TO
Since the intended behavior can be achieved by using
COPY (SELECT ... WHERE ...) TO, this commit adds a HINT
to the error message:
HINT: Try the COPY (SELECT ... WHERE ...) TO variant.
This makes the error more informative and helps users
quickly find the alternative usage.
Author: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Discussion: https://postgr.es/m/3520c224c5ffac0113aef84a9179f37e@oss.nttdata.com
Previously, duplicate labels in CREATE TYPE AS ENUM were caught by
the unique index on pg_enum, resulting in a generic error message.
While this was evidently intentional, it's not terribly user-friendly,
nor consistent with the ALTER TYPE cases which take more care with
such errors. This patch adds an explicit check to produce a more
user-friendly and descriptive error message.
A potential objection to this implementation is that it adds O(N^2)
work to the creation operation. However, quick testing finds that
that's pretty negligible below 1000 enum labels, and tolerable even
at 10000. So it doesn't really seem worth being smarter.
Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20250704000402.37e605ab0c59c300965a17ee@sraoss.co.jp
This change replaces seven functions definitions by macros, reducing a
bit some repetitive patterns in the code. An interesting side effect is
that this removes an inconsistency in the naming of SLRU increment
functions with the field names.
This change is similar to 850f4b4c8c, 8018ffbf58 or 83a1a1b566.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aLHA//gr4dTpDHHC@ip-10-97-1-34.eu-west-3.compute.internal
This commit introduces a new subscription parameter,
max_retention_duration, aimed at mitigating excessive accumulation of dead
tuples when retain_dead_tuples is enabled and the apply worker lags behind
the publisher.
When the time spent advancing a non-removable transaction ID exceeds the
max_retention_duration threshold, the apply worker will stop retaining
conflict detection information. In such cases, the conflict slot's xmin
will be set to InvalidTransactionId, provided that all apply workers
associated with the subscription (with retain_dead_tuples enabled) confirm
the retention duration has been exceeded.
To ensure retention status persists across server restarts, a new column
subretentionactive has been added to the pg_subscription catalog. This
prevents unnecessary reactivation of retention logic after a restart.
The conflict detection slot will not be automatically re-initialized
unless a new subscription is created with retain_dead_tuples = true, or
the user manually re-enables retain_dead_tuples.
A future patch will introduce support for automatic slot re-initialization
once at least one apply worker confirms that the retention duration is
within the configured max_retention_duration.
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Constraint expressions and statistics expressions loaded from the
system catalogs need to be run through const-simplification, because
the planner will be comparing them to similarly-processed qual
clauses. Without this step, the planner may fail to detect valid
matches.
Currently, NullTest clauses in these expressions may not be reduced
correctly during const-simplification. This happens because their Var
nodes do not yet have the correct varno when eval_const_expressions is
applied. Since eval_const_expressions relies on varno to reduce
NullTest quals, incorrect varno can cause problems.
Additionally, for statistics expressions, eval_const_expressions is
called with root set to NULL, which also inhibits NullTest reduction.
This patch fixes the issue by ensuring that Vars are updated to have
the correct varno before const-simplification, and that a valid root
is passed to eval_const_expressions when needed.
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/19007-4cc6e252ed8aa54a@postgresql.org
A proposed patch would place a limit of NAMEDATALEN-1 (i.e., 63)
bytes on the names of dynamically-allocated LWLock tranches, but
GetNamedDSA() and GetNamedDSHash() may register tranches with
longer names. This commit lowers the maximum DSM registry entry
name length to NAMEDATALEN-1 bytes and modifies GetNamedDSHash() to
create only one tranche, thereby allowing us to keep the DSM
registry's tranche names below NAMEDATALEN bytes.
Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/aKzIg1JryN1qhNuy%40nathan
Show the requested lock level and the object being waited on,
in the same format we use for deadlock reports and similar errors.
This is particularly helpful for debugging lock-timeout errors,
since otherwise the user has very little to go on about which
lock timed out. The performance cost of setting up the callback
should be negligible compared to the other tracing support already
present in WaitOnLock.
As in the deadlock-report case, we just show numeric object OIDs,
because it seems too scary to try to perform catalog lookups
in this context.
Reported-by: Steve Baldwin <steve.baldwin@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1602369.1752167154@sss.pgh.pa.us
Using the LWLockCounter requires first calculating its address in
shared memory like this:
LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
Commit 82e861fbe1 started this trend in order to fix EXEC_BACKEND
builds, but it could also be fixed by adding it to the
BackendParameters struct. The current approach is somewhat
difficult to follow, so this commit switches to the latter. While
at it, swap around the code in LWLockShmemSize() to match the order
of assignments in CreateLWLocks() for added readability.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aLDLnan9gNCS9fHx%40nathan
It's possible that if the only live partition is concurrently dropped
and try_table_open() fails, that the bms_del_member() will pfree the
live_parts Bitmapset. Since the bms_del_member() call does not assign
the result back to the live_parts local variable, the while loop could
segfault as that variable would still reference the pfree'd Bitmapset.
Backpatch to 15. 52f3de874 was backpatched to 14, but there's no
bms_del_member() there due to live_parts not yet existing in RelOptInfo in
that version. Technically there's no bug in version 15 as
bms_del_member() didn't pfree when the set became empty prior to
00b41463c (from v16). Applied to v15 anyway to keep the code similar and
to avoid the bad coding pattern.
Author: Bernd Reiß <bd_reiss@gmx.at>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/6b88f27a-c45c-4826-8e37-d61a04d90182@gmx.at
Backpatch-through: 15
I think the error message for a different condition was inadvertently
copied.
This problem seems to have been introduced by commit a4d75c86bf.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: jian he <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 14
Discussion: https://postgr.es/m/CACJufxEZ48toGH0Em_6vdsT57Y3L8pLF=DZCQ_gCii6=C3MeXw@mail.gmail.com
BufferGetPage() already returns type Page, so casting it to Page
doesn't achieve anything. A sizable number of call sites does this
casting; remove that.
This was already done inconsistently in the code in the first import
in 1996 (but didn't exist in the pre-1995 code), and it was then
apparently just copied around.
Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CALdSSPgFhc5=vLqHdk-zCcnztC0zEY3EU_Q6a9vPEaw7FkE9Vw@mail.gmail.com
For a child relation, we should not assume that its parent's
unique-ified relation (or unique-ified path in v18) always exists. In
cases where all RHS columns that need to be unique-ified are equated
to constants, the unique-ified relation/path for the parent table is
not built, as there are no columns left to unique-ify. Failing to
account for this can result in a SIGSEGV crash during planning.
This patch checks whether the parent's unique-ified relation or path
exists and skips unique-ification of the child relation if it does
not.
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49MOdLW2c+qbLHHBt8VBu=4ONpM91D19=AWeW93eFUF6A@mail.gmail.com
Backpatch-through: 18
Previously, we used LW_EXCLUSIVE in several places despite only reading
WalSummarizerCtl fields. This patch reduces the lock level to LW_SHARED
where we are only reading the shared fields.
Backpatch to 17, where wal summarization was introduced.
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CAD21AoDdKhf_9oriEYxY-JCdF+Oe_muhca3pcdkMEdBMzyHyKw@mail.gmail.com
Backpatch-through: 17
One mechanism we have for implementing semi-joins is to de-duplicate
the output of the RHS and then treat the join as a plain inner join.
Initial construction of the join's SpecialJoinInfo identifies the
RHS columns that need to be de-duplicated, but later we may find that
some of those don't need to be handled explicitly, either because
they're known to be constant or because they are redundant with some
previous column.
Up to now, while sort-based de-duplication handled such cases well,
hash-based de-duplication didn't: we'd still hash on all of the
originally-identified columns. This is probably not a very big
deal performance-wise, but in the wake of commit a3179ab69 it can
cause planner errors. That happens when join elimination causes
recalculation of variables' attr_needed bitmapsets, and we decide
that a variable mentioned in a semijoin clause doesn't need to be
propagated up to the join level anymore.
There are a number of ways we could slice the blame for this, but the
only fix that doesn't result in pessimizing plans for loosely-related
cases is to be more careful about not hashing columns we don't
actually need to de-duplicate. We can install that consideration
into create_unique_paths in master, or the predecessor code in
create_unique_path in v18, without much refactoring.
(As follow-up work, it might be a good idea to look at more-invasive
refactoring, in hopes of preventing other bugs in this area. But
with v18 release so close, there's not time for that now, nor would
we be likely to want to put such refactoring into v18 anyway.)
Reported-by: Sergey Soloviev <sergey.soloviev@tantorlabs.ru>
Diagnosed-by: Richard Guo <guofenglinux@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/1fd1a421-4609-4d46-a1af-ab74d5de504a@tantorlabs.ru
Backpatch-through: 18
This has been done historically because of get_database_name (which
since commit cb98e6fb8f belongs in lsyscache.c/h, so let's move it
there) and get_database_oid (which is in the right place, but whose
declaration should appear in pg_database.h rather than dbcommands.h).
Clean this up.
Also, xlogreader.h and stringinfo.h are no longer needed by dbcommands.h
since commit f1fd515b39, so remove them.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/202508191031.5ipojyuaswzt@alvherre.pgsql
During an investigation into rather odd aio related errors on macos, observed
by Alexander and Konstantin, we started to wonder if bitfield access is
related to the error. At the moment it looks like it is related, we cannot
reproduce the failures when replacing the bitfields. In addition, the problem
can only be reproduced with some compiler [versions] and not everyone has been
able to reproduce the issue.
The observed problem is that, very rarely, PgAioHandle->{state,target} are in
an inconsistent state, after having been checked to be in a valid state not
long before, triggering an assertion failure. Unfortunately, this could be
caused by wrong compiler code generation or somehow of missing memory barriers
- we don't really know. In theory there should not be any concurrent write
access to the handle in the state the bug is triggered, as the handle was idle
and is just being initialized.
Separately from the bug, we observed that at least gcc and clang generate
rather terrible code for the bitfield access. Even if it's not clear if the
observed assertion failure is actually caused by the bitfield somehow, the bad
code generation alone is sufficient reason to stop using bitfields.
Therefore, replace the enum bitfields with uint8s and instead cast in each
switch statement.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reported-by: Konstantin Knizhnik <knizhnik@garret.ru>
Discussion: https://postgr.es/m/1500090.1745443021@sss.pgh.pa.us
Backpatch-through: 18
Commit d31bbfb659 removed the comment at objectNamesToOids() that
there is no locking, because that commit added locking. But to fix
all the problems, we'd still need a stronger lock. So put the comment
back with more a detailed explanation.
Co-authored-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/bf72b82c-124d-4efa-a484-bb928e9494e4@eisentraut.org
This patch fixes a bug in how 'load_external_function' handles
'$libdir/ prefixes in module paths.
Previously, 'load_external_function' would unconditionally strip
'$libdir/' from the beginning of the 'filename' string. This caused
an issue when the path was nested, such as "$libdir/nested/my_lib".
Stripping the prefix resulted in a path of "nested/my_lib", which
would fail to be found by the expand_dynamic_library_name function
because the original '$libdir' macro was removed.
To fix this, the code now checks for the presence of an additional
directory separator ('/' or '\') after the '$libdir/' prefix. The
prefix is only stripped if the remaining string does not contain a
separator. This ensures that simple filenames like '"$libdir/my_lib"'
are correctly handled, while nested paths are left intact for
'expand_dynamic_library_name' to process correctly.
Reported-by: Dilip Kumar <dilipbalaut@gmail.com>
Co-authored-by: Matheus Alcantara <matheusssilv97@gmail.com>
Co-authored-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAFiTN-uKNzAro4tVwtJhF1UqcygfJ%2BR%2BRL%3Db-_ZMYE3LdHoGhA%40mail.gmail.com
Commit 4b754d6c1 introduced the concept of an excludeOnly scan key,
which cannot select matching index entries but can reject
non-matching tuples, for example a tsquery such as '!term'. There are
poorly-documented assumptions that such scan keys do not appear as the
first scan key. ginNewScanKey did nothing to ensure that, however,
with the result that certain GIN index searches could go into an
infinite loop while apparently-equivalent queries with the clauses in
a different order were fine.
Fix by teaching ginNewScanKey to place all excludeOnly scan keys
after all not-excludeOnly ones. So far as we know at present,
it might be sufficient to avoid the case where the very first
scan key is excludeOnly; but I'm not very convinced that there
aren't other dependencies on the ordering.
Bug: #19031
Reported-by: Tim Wood <washwithcare@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19031-0638148643d25548@postgresql.org
Backpatch-through: 13
The CHECK_FOR_INTERRUPTS call in gingetbitmap turns out to be
inadequate to prevent a long uninterruptible loop, because
we now know a case where looping occurs within scanGetItem.
While the next patch will fix the bug that caused that, it
seems foolish to assume that no similar patterns are possible.
Let's do the CFI within scanGetItem's retry loop, instead.
This demonstrably allows canceling out of the loop exhibited
in bug #19031.
Bug: #19031
Reported-by: Tim Wood <washwithcare@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19031-0638148643d25548@postgresql.org
Backpatch-through: 13
The Self-Join Elimination SJE feature messes up keeping and removing RowMark's
in remove_self_joins_one_group(). That didn't lead to user-level error,
because the planned RowMark is only used to reference a rtable entry in later
execution stages. An RTE entry for keeping and removing relations is
identical and refers to the same relation OID.
To reduce confusion and prevent future issues, this commit cleans up the code
and fixes the incorrect behaviour. Furthermore, it includes sanity checks in
setrefs.c on existing non-null RTE and RelOptInfo entries for each RowMark.
Discussion: https://postgr.es/m/18c6bd6c-6d2a-419a-b0da-dfedef34b585%40gmail.com
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Backpatch-through: 18
Rename inner and outer to rrel and krel, respectively, to highlight their
connection to r and k indexes. For the same reason, rename imark and omark
to rmark and kmark.
Discussion: https://postgr.es/m/18c6bd6c-6d2a-419a-b0da-dfedef34b585%40gmail.com
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Backpatch-through: 18
This reverts commit bc22dc0e0d.
It appears that conditional variables are not suitable for use inside
critical sections. If WaitLatch()/WaitEventSetWaitBlock() face postmaster
death, they exit, releasing all locks instead of PANIC. In certain
situations, this leads to data corruption.
Reported-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Backpatch-through: 18
Commit 1585ff7387 changed GetTransactionSnapshot() to throw an error
if it's called during logical decoding, instead of returning the
historic snapshot. I made that change for extra protection, because a
historic snapshot can only be used to access catalog tables while
GetTransactionSnapshot() is usually called when you're executing
arbitrary queries. You might get very subtle visibility problems if
you tried to use the historic snapshot for arbitrary queries.
There's no built-in code in PostgreSQL that calls
GetTransactionSnapshot() during logical decoding, but it turns out
that the pglogical extension does just that, to evaluate row filter
expressions. You would get weird results if the row filter runs
arbitrary queries, but it is sane as long as you don't access any
non-catalog tables. Even though there are no checks to enforce that in
pglogical, a typical row filter expression does not access any tables
and works fine. Accessing tables marked with the user_catalog_table =
true option is also OK.
To fix pglogical with row filters, and any other extensions that might
do similar things, revert GetTransactionSnapshot() to return a
historic snapshot during logical decoding.
To try to still catch the unsafe usage of historic snapshots, add
checks in heap_beginscan() and index_beginscan() to complain if you
try to use a historic snapshot to scan a non-catalog table. We're very
close to the version 18 release however, so add those new checks only
in master.
Backpatch-through: 18
Reported-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/20250809222338.cc.nmisch@google.com
Reduce from ShareLock to ShareUpdateExclusivelock. Validation during
ALTER DOMAIN ... ADD CONSTRAINT keeps using ShareLock.
Example:
create domain d1 as int;
create table t (a d1);
alter domain d1 add constraint cc10 check (value > 10) not valid;
begin;
alter domain d1 validate constraint cc10;
-- another session
insert into t values (8);
Now we should still be able to perform DML operations on table t while
the domain constraint is being validated. The equivalent works
already on table constraints.
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CACJufxHz92A88NLRTA2msgE2dpXpE-EoZ2QO61od76-6bfqurA%40mail.gmail.com
This code was relying on "long", which is signed 8 bytes everywhere
except on Windows where it is 4 bytes, that could potentially expose it
to overflows, even if the current uses in the code are fine as far as I
know. This code is now able to rely on the same sizeof() variable
everywhere, with int64. long was used for sizes, partition counts and
entry counts.
Some callers of the dynahash.c routines used long declarations, that can
be cleaned up to use int64 instead. There was one shortcut based on
SIZEOF_LONG, that can be removed. long is entirely removed from
dynahash.c and hsearch.h.
Similar work was done in b1e5c9fa9a.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aKQYp-bKTRtRauZ6@paquier.xyz
Temporary relations may share the same RelFileNumber with a permanent
relation, or other temporary relations associated with other sessions.
Being able to uniquely identify a temporary relation would require
RelidByRelfilenumber() to know about the proc number of the temporary
relation it wants to identify, something it is not designed for since
its introduction in f01d1ae3a1.
There are currently three callers of RelidByRelfilenumber():
- autoprewarm.
- Logical decoding, reorder buffer.
- pg_filenode_relation(), that attempts to find a relation OID based on
a tablespace OID and a RelFileNumber.
This makes the situation problematic particularly for the first two
cases, leading to the possibility of random ERRORs due to
inconsistencies that temporary relations can create in the cache
maintained by RelidByRelfilenumber(). The third case should be less of
an issue, as I suspect that there are few direct callers of
pg_filenode_relation().
The window where the ERRORs are happen is very narrow, requiring an OID
wraparound to create a lookup conflict in RelidByRelfilenumber() with a
temporary table reusing the same OID as another relation already cached.
The problem is easier to reach in workloads with a high OID consumption
rate, especially with a higher number of temporary relations created.
We could get pg_filenode_relation() and RelidByRelfilenumber() to work
with temporary relations if provided the means to identify them with an
optional proc number given in input, but the years have also shown that
we do not have a use case for it, yet. Note that this could not be
backpatched if pg_filenode_relation() needs changes. It is simpler to
ignore temporary relations.
Reported-by: Shenhao Wang <wangsh.fnst@fujitsu.com>
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-By: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-By: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-By: Michael Paquier <michael@paquier.xyz>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-By: Shenhao Wang <wangsh.fnst@fujitsu.com>
Discussion: https://postgr.es/m/bbaaf9f9-ebb2-645f-54bb-34d6efc7ac42@fujitsu.com
Backpatch-through: 13
The result of pgaio_io_get_id() was being assigned to a mix of int and
uint32 variables. This fixes it to use int consistently, which seems
the most correct. Also change the queue empty special value in
method_worker.c to -1 from UINT32_MAX.
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/70c784b3-f60b-4652-b8a6-75e5f051243e%40eisentraut.org
Replication slot synchronization (sync_replication_slots = on)
requires wal_level to be logical. This commit prevents the server
from starting if sync_replication_slots is enabled but wal_level
is set to minimal or replica.
Failing early during startup helps users catch invalid configurations
immediately, which is important because changing wal_level requires
a server restart.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAH0PTU_pc3oHi__XESF9ZigCyzai1Mo3LsOdFyQA4aUDkm01RA@mail.gmail.com
If we error out during execution of a SQL-language function, we will
often leave behind non-null pointers in its SQLFunctionCache's cplan
and eslist fields. This is problematic if the SQLFunctionCache is
re-used, because those pointers will point at resources that were
released during error cleanup. This problem escaped detection so far
because ordinarily we won't re-use an FmgrInfo+SQLFunctionCache struct
after a query error. However, in the rather improbable case that
someone implements an opclass support function in SQL language, there
will be long-lived FmgrInfos for it in the relcache, and then the
problem is reachable after the function throws an error.
To fix, add a flag to SQLFunctionCache that tracks whether execution
escapes out of fmgr_sql, and clear out the relevant fields during
init_sql_fcache if so. (This is going to need more thought if we ever
try to share FMgrInfos across threads; but it's very far from being
the only problem such a project will encounter, since many functions
regard fn_extra as being query-local state.)
This broke at commit 0313c5dc6; before that we did not try to re-use
SQLFunctionCache state across calls. Hence, back-patch to v18.
Bug: #19026
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19026-90aed5e71d0c8af3@postgresql.org
Backpatch-through: 18
In refuseDupeIndexAttach(), change from
errdetail("Another index is already attached for partition \"%s\"."...)
to
errdetail("Another index \"%s\" is already attached for partition \"%s\"."...)
so we can easily understand which index is already attached for
partition \"%s\".
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CACJufxGBfykJ_1ztk9T%2BL_gLmkOSOF%2BmL9Mn4ZPydz-rh%3DLccQ%40mail.gmail.com
Some replication slot manipulations (logical decoding via SQL,
advancing) were failing an assertion when releasing a slot in
single-user mode, because active_pid was not set in a ReplicationSlot
when its slot is acquired.
ReplicationSlotAcquire() has some logic to be able to work with the
single-user mode. This commit sets ReplicationSlot->active_pid to
MyProcPid, to let the slot-related logic fall-through, considering the
single process as the one holding the slot.
Some TAP tests are added for various replication slot functions with the
single-user mode, while on it, for slot creation, drop, advancing, copy
and logical decoding with multiple slot types (temporary, physical vs
logical). These tests are skipped on Windows, as direct calls of
postgres --single would fail on permission failures. There is no
platform-specific behavior that needs to be checked, so living with this
restriction should be fine. The CI is OK with that, now let's see what
the buildfarm tells.
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Mutaamba Maasha <maasha@gmail.com>
Discussion: https://postgr.es/m/OSCPR01MB14966ED588A0328DAEBE8CB25F5FA2@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 13
This comment mentions that pg_buffercache locks all buffer
partitions simultaneously, but it hasn't done so since v10.
Oversight in commit 6e654546fb.
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/aKTuAHVEuYCUmmIy%40nathan
The DROP SUBSCRIPTION command performs several operations: it stops the
subscription workers, removes subscription-related entries from system
catalogs, and deletes the replication slot on the publisher server.
Previously, this command acquired an AccessExclusiveLock on
pg_subscription before initiating these steps.
However, while holding this lock, the command attempts to connect to the
publisher to remove the replication slot. In cases where the connection is
made to a newly created database on the same server as subscriber, the
cache-building process during connection tries to acquire an
AccessShareLock on pg_subscription, resulting in a self-deadlock.
To resolve this issue, we reduce the lock level on pg_subscription during
DROP SUBSCRIPTION from AccessExclusiveLock to RowExclusiveLock. Earlier,
the higher lock level was used to prevent the launcher from starting a new
worker during the drop operation, as a restarted worker could become
orphaned.
Now, instead of relying on a strict lock, we acquire an AccessShareLock on
the specific subscription being dropped and re-validate its existence
after acquiring the lock. If the subscription is no longer valid, the
worker exits gracefully. This approach avoids the deadlock while still
ensuring that orphan workers are not created.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 13
Discussion: https://postgr.es/m/18988-7312c868be2d467f@postgresql.org
This provides a single entry point to access some information about the
state of MultiXacts, able to return some data about multixacts offsets
and counts. Originally this function was only able to return some
information about the number of multixacts and multixact members,
extended here to provide some data about the oldest multixact ID in use
and the oldest offset, if known.
This change has been proposed in a patch that aims at providing more
monitoring capabilities for multixacts, and it is useful on its own.
GetMultiXactInfo() is added to multixact.h, becoming available for
out-of-core code.
Extracted from a larger patch by the same author.
Author: Naga Appani <nagnrik@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com
This pointer was not used after its last update. This variable
assignment was most likely a vestige artifact of the earlier versions of
the patch set that have led to 5891c7a8ed.
This pointer update is useless, so let's remove it. It removes one call
to pgstat_dsa_init_size(), making the code slightly easier to grasp.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aKLsu2sdpnyeuSSc@ip-10-97-1-34.eu-west-3.compute.internal
Now that the only call to relation_has_unique_index_for() that
supplied an exprlist and oprlist has been removed, the loop handling
those lists is effectively dead code. This patch removes that loop
and simplifies the function accordingly.
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-EBnaRvEs7frTLbsXiweSTUXifsteF-d3rvv01FKO86w@mail.gmail.com
There are two implementation techniques for semijoins: one uses the
JOIN_SEMI jointype, where the executor emits at most one matching row
per left-hand side (LHS) row; the other unique-ifies the right-hand
side (RHS) and then performs a plain inner join.
The latter technique currently has some drawbacks related to the
unique-ification step.
* Only the cheapest-total path of the RHS is considered during
unique-ification. This may cause us to miss some optimization
opportunities; for example, a path with a better sort order might be
overlooked simply because it is not the cheapest in total cost. Such
a path could help avoid a sort at a higher level, potentially
resulting in a cheaper overall plan.
* We currently rely on heuristics to choose between hash-based and
sort-based unique-ification. A better approach would be to generate
paths for both methods and allow add_path() to decide which one is
preferable, consistent with how path selection is handled elsewhere in
the planner.
* In the sort-based implementation, we currently pay no attention to
the pathkeys of the input subpath or the resulting output. This can
result in redundant sort nodes being added to the final plan.
This patch improves semijoin planning by creating a new RelOptInfo for
the RHS rel to represent its unique-ified version. It then generates
multiple paths that represent elimination of distinct rows from the
RHS, considering both a hash-based implementation using the cheapest
total path of the original RHS rel, and sort-based implementations
that either exploit presorted input paths or explicitly sort the
cheapest total path. All resulting paths compete in add_path(), and
those deemed worthy of consideration are added to the new RelOptInfo.
Finally, the unique-ified rel is joined with the other side of the
semijoin using a plain inner join.
As a side effect, most of the code related to the JOIN_UNIQUE_OUTER
and JOIN_UNIQUE_INNER jointypes -- used to indicate that the LHS or
RHS path should be made unique -- has been removed. Besides, the
T_Unique path now has the same meaning for both semijoins and upper
DISTINCT clauses: it represents adjacent-duplicate removal on
presorted input. This patch unifies their handling by sharing the
same data structures and functions.
This patch also removes the UNIQUE_PATH_NOOP related code along the
way, as it is dead code -- if the RHS rel is provably unique, the
semijoin should have already been simplified to a plain inner join by
analyzejoins.c.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-EBnaRvEs7frTLbsXiweSTUXifsteF-d3rvv01FKO86w@mail.gmail.com
Two header declarations were related to SQL-callable functions, that
should have been cleaned up in df9133fa63. Some more includes can be
removed on closer inspection, so let's clean up these as well, while on
it.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/345438.1755524834@sss.pgh.pa.us
This existed in a semi broken stated from be0a66666 until 296cba276.
Recent discussion has questioned the value of having this at all as it
only outputs static information from various of the hash table's
properties when the hash table is created.
Author: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/OSCPR01MB1496650D03FA0293AB9C21416F534A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Previously this was being output to stderr. This commit adjusts things
to use elog(DEBUG4). Here we also adjust the format of the message to
add the hash table name and also put the message on a single line. This
should make grepping the logs for this information easier.
Also get rid of the global hash table statistics. This seems very dated
and didn't fit very well with trying to put all the statistics for a
specific hash table on a single log line.
The main aim here is to allow it so we can have at least one buildfarm
member build with HASH_STATISTICS to help prevent future changes from
breaking things in that area. ca3891251 recently fixed some issues
here.
In passing, switch to using uint64 data types rather than longs for the
usage counters. The long type is 32 bits on some platforms we support.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAApHDvoccvJ9CG5zx+i-EyCzJbcL5K=CzqrnL_YN59qaL5hiaw@mail.gmail.com
Input with zero length can result in a buffer underflow when
accessing *(num + (len - 1)), as (len - 1) would produce a negative
index. Add an assertion for zero-length input to prevent it.
This was found by ALT Linux Team.
Reviewing the call sites shows that get_th() currently cannot be
applied to an empty string: it is always called on a string containing
a number we've just printed. Therefore, an assertion rather than a
user-facing error message is sufficient.
Co-authored-by: Alexander Kuznetsov <kuznetsovam@altlinux.org>
Discussion: https://www.postgresql.org/message-id/flat/e22df993-cdb4-4d0a-b629-42211ebed582@altlinux.org
A patch is under discussion to add more SQL capabilities related to
multixacts, and this move avoids bloating the file more than necessary.
This affects pg_get_multixact_members(). A side effect of this move is
the requirement to add mxstatus_to_string() to multixact.h.
Extracted from a larger patch by the same author, tweaked by me.
Author: Naga Appani <nagnrik@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com
init_params() sets up "last_value" and "is_called" for a sequence
relation holdind its metadata, based on the sequence properties in
pg_sequences. "log_cnt" is the third property that can be updated in
this routine for FormData_pg_sequence_data, tracking when WAL records
should be generated for a sequence after nextval() iterations. This
routine is called when creating or altering a sequence.
This commit refactors init_params() to not depend anymore on
FormData_pg_sequence_data, removing traces of it in sequence.c, making
easier the manipulation of metadata related to sequences. The knowledge
about "log_cnt" is replaced with a more general "reset_state" flag, to
let the caller know if the sequence state should be reset. In the case
of in-core sequences, this relates to WAL logging. We still need to
depend on FormData_pg_sequence.
Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/ZWlohtKAs0uVVpZ3@paquier.xyz
When extracting a timestamp from a UUIDv7, a conversion from
milliseconds to microseconds was using the incorrect constant
NS_PER_US instead of US_PER_MS. Although both constants have the same
value, this fix improves code clarity by using the semantically
correct constant.
Backpatch to v18, where UUIDv7 was introduced.
Author: Erik Nordström <erik@tigerdata.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CACAa4V+i07eaP6h4MHNydZeX47kkLPwAg0sqe67R=M5tLdxNuQ@mail.gmail.com
Backpatch-through: 18
bms_prev_member() could attempt to access memory outside of the words[]
array in cases where the prevbit was a number < -1 or > a->nwords *
BITS_PER_BITMAPWORD + 1.
Here we add the Asserts to help draw attention to bogus callers so we're
more likely to catch them during development.
In passing, fix wording of bms_prev_member's header comment which talks
about how we expect the callers to ensure only valid prevbit values are
used.
Author: Greg Burd <greg@burd.me>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2000A717-1FFE-4031-827B-9330FB2E9065%40getmailspring.com
Doing that seems rather random and unnecessary. This commit removes
those and fixes fallout, which is pretty minimal. We do need to add a
forward declaration of struct TM_IndexDeleteOp (whose full definition
appears in tableam.h) so that _bt_delitems_delete_check()'s declaration
can use it.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/202508051109.lzk3lcuzsaxo@alvherre.pgsql
Make sure the memory allocated by make_absolute_path() is freed
when SelectConfigFiles() fails. Since all the callers will exit
immediately in that case, there's no practical gain here, but
silencing Valgrind leak complaints seems useful. In any case,
it was inconsistent that only one of the failure exits did this.
Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAJ7c6TMByXE8dc7zDvDWTQjk6o-XXAdRg_RAg5CBaUOgFPV3LQ%40mail.gmail.com
This function uses an argument named "maxsize" that is only used in
assertions, being set once outside the assertion area. Recent gcc
versions with -Wunused-but-set-parameter complain about a warning when
building without assertions enabled, because of that.
In order to fix this issue, PG_USED_FOR_ASSERTS_ONLY is added to the
function argument of SerializeClientConnectionInfo(), which is the first
time we are doing so in the tree. The CI is fine with the change, but
let's see what the buildfarm has to say on the matter.
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Jacob Champion <jchampion@postgresql.org>
Discussion: https://postgr.es/m/pevajesswhxafjkivoq3yvwxga77tbncghlf3gq5fvchsvfuda@6uivg25sb3nx
Backpatch-through: 16
Commit 2633dae2e4 standardized LSN formatting but mistakenly changed
the logical snapshot filename format in SnapBuildSnapshotExists() from
"%X-%X.snap" to "%08X-%08X.snap". Other code still used the original
"%X-%X.snap" format, causing the replication slot synchronization worker
to fail to find existing snapshot files and produce excessive log messages.
This commit restores the original "%X-%X.snap" format
in SnapBuildSnapshotExists() to resolve the issue.
Author: Shveta Malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHuHPB-ucAk_Tq3uSs4Fdziu1Jp_AA_RD3m5Ycky7m48w@mail.gmail.com
Remove conditionally-compiled code for the other case.
Replace uses of FLOAT8PASSBYVAL with constant "true", mainly because
it was quite confusing in cases where the type we were dealing with
wasn't float8.
I left the associated pg_control and Pg_magic_struct fields in place.
Perhaps we should get rid of them, but it would save little, so it
doesn't seem worth thinking hard about the compatibility implications.
I just labeled them "vestigial" in places where that seemed helpful.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
Remove conditionally-compiled code for smaller Datum widths,
and simplify comments that describe cases no longer of interest.
I also fixed up a few more places that were not using
DatumGetIntXX where they should, and made some cosmetic
adjustments such as using sizeof(int64) not sizeof(Datum)
in places where that fit better with the surrounding code.
One thing I remembered while preparing this part is that SP-GiST
stores pass-by-value prefix keys as Datums, so that the on-disk
representation depends on sizeof(Datum). That's even more
unfortunate than the existing commentary makes it out to be,
because now there is a hazard that the change of sizeof(Datum)
will break SP-GiST indexes on 32-bit machines. It appears that
there are no existing SP-GiST opclasses that are actually
affected; and if there are some that I didn't find, the number
of installations that are using them on 32-bit machines is
doubtless tiny. So I'm proceeding on the assumption that we
can get away with this, but it's something to worry about.
(gininsert.c looks like it has a similar problem, but it's okay
because the "tuples" it's constructing are just transient data
within the tuplesort step. That's pretty poorly documented
though, so I added some comments.)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
This patch makes sizeof(Datum) be 8 on all platforms including
32-bit ones. The objective is to allow USE_FLOAT8_BYVAL to be true
everywhere, and in consequence to remove a lot of code that is
specific to pass-by-reference handling of float8, int8, etc. The
code for abbreviated sort keys can be simplified similarly. In this
way we can reduce the maintenance effort involved in supporting 32-bit
platforms, without going so far as to actually desupport them. Since
Datum is strictly an in-memory concept, this has no impact on on-disk
storage, though an initdb or pg_upgrade will be needed to fix affected
catalog entries.
We have required platforms to support [u]int64 for ages, so this
breaks no supported platform. We can expect that this change will
make 32-bit builds a bit slower and more memory-hungry, although being
able to use pass-by-value handling of 8-byte types may buy back some
of that. But we stopped optimizing for 32-bit cases a long time ago,
and this seems like just another step on that path.
This initial patch simply forces the correct type definition and
USE_FLOAT8_BYVAL setting, and cleans up a couple of minor compiler
complaints that ensued. This is sufficient for testing purposes.
In the wake of a bunch of Datum-conversion cleanups by Peter
Eisentraut, this now compiles cleanly with gcc on a 32-bit platform.
(I'd only tested the previous version with clang, which it turns out
is less picky than gcc about width-changing coercions.)
There is a good deal of now-dead code that I'll remove in separate
follow-up patches.
A catversion bump is required because this affects initial catalog
contents (on 32-bit machines) in two ways: pg_type.typbyval changes
for some built-in types, and Const nodes in stored views/rules will
now have 8 bytes not 4 for pass-by-value types.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
It turns out that on some platforms (at least current macOS, NetBSD,
OpenBSD) semget(2) will return EINVAL if there is a pre-existing
semaphore set with the same key and too few semaphores. Our code
expects EEXIST in that case and treats EINVAL as a hard failure,
resulting in failure during initdb or postmaster start.
POSIX does document EINVAL for too-few-semaphores-in-set, and is
silent on its priority relative to EEXIST, so this behavior arguably
conforms to spec. Nonetheless it's quite problematic because EINVAL
is also documented to mean that nsems is greater than the system's
limit on the number of semaphores per set (SEMMSL). If that is
where the problem lies, retrying would just become an infinite loop.
To resolve this contradiction, retry after EINVAL, but also install a
loop limit that will make us give up regardless of the specific errno
after trying 1000 different keys. (1000 is a pretty arbitrary number,
but it seems like it should be sufficient.) I like this better than
the previous infinite-looping behavior, since it will also keep us out
of trouble if (say) we get EACCES due to a system-level permissions
problem rather than anything to do with a specific semaphore set.
This problem has only been observed in the field in PG 17, which uses
a higher nsems value than other branches (cf. 38da05346, 810a8b1c8).
That makes it possible to get the failure if a new v17 postmaster
has a key collision with an existing postmaster of another branch.
In principle though, we might see such a collision against a semaphore
set created by some other application, in which case all branches are
vulnerable on these platforms. Hence, backpatch.
Reported-by: Gavin Panella <gavinpanella@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CALL7chmzY3eXHA7zHnODUVGZLSvK3wYCSP0RmcDFHJY8f28Q3g@mail.gmail.com
Backpatch-through: 13
fb9f955025 optimized code generation by using specialized variants of
ExecSeqScan* for [not] having a qual, projection etc. This allowed the
compiler to optimize the code out the code for qual / projection. However, as
observed by David Rowley at the time, the compiler couldn't prove the
opposite, i.e. that the qual etc *are* present.
By using pg_assume(), introduced in d65eb5b1b8, we can tell the compiler that
the relevant variables are non-null.
This reduces the code size to a surprising degree and seems to lead to a small
but reproducible performance gain.
Reviewed-by: Amit Langote <amitlangote09@gmail.com> Discussion:
https://postgr.es/m/CA+HiwqFk-MbwhfX_kucxzL8zLmjEt9MMcHi2YF=DyhPrSjsBEA@mail.gmail.com
Without using stamp files, meson lists the generated headers as the dependency
for every .c file, bloating build.ninja by more than 2x. Processing all the
dependencies also increases the time to generate build.ninja.
The immediate benefit is that this makes re-configuring and clean builds a bit
faster. The main motivation however is that I have other patches that
introduce additional build targets that further would increase the size of
build.ninja, making re-configuring more noticeably slower.
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/cgkdgvzdpinkacf4v33mky7tbmk467oda5dd4dlmucjjockxzi@xkqfvjoq4uiy
Commit e2d4ef8de8 (the fix for CVE-2017-7484) added security checks
to the selectivity estimation functions to prevent them from running
user-supplied operators on data obtained from pg_statistic if the user
lacks privileges to select from the underlying table. In cases
involving inheritance/partitioning, those checks were originally
performed against the child RTE (which for plain inheritance might
actually refer to the parent table). Commit 553d2ec271 then extended
that to also check the parent RTE, allowing access if the user had
permissions on either the parent or the child. It turns out, however,
that doing any checks using the child RTE is incorrect, since
securityQuals is set to NULL when creating an RTE for an inheritance
child (whether it refers to the parent table or the child table), and
therefore such checks do not correctly account for any RLS policies or
security barrier views. Therefore, do the security checks using only
the parent RTE. This is consistent with how RLS policies are applied,
and the executor's ACL checks, both of which use only the parent
table's permissions/policies. Similar checks are performed in the
extended stats code, so update that in the same way, centralizing all
the checks in a new function.
In addition, note that these checks by themselves are insufficient to
ensure that the user has access to the table's data because, in a
query that goes via a view, they only check that the view owner has
permissions on the underlying table, not that the current user has
permissions on the view itself. In the selectivity estimation
functions, there is no easy way to navigate from underlying tables to
views, so add permissions checks for all views mentioned in the query
to the planner startup code. If the user lacks permissions on a view,
a permissions error will now be reported at planner-startup, and the
selectivity estimation functions will not be run.
Checking view permissions at planner-startup in this way is a little
ugly, since the same checks will be repeated at executor-startup.
Longer-term, it might be better to move all the permissions checks
from the executor to the planner so that permissions errors can be
reported sooner, instead of creating a plan that won't ever be run.
However, such a change seems too far-reaching to be back-patched.
Back-patch to all supported versions. In v13, there is the added
complication that UPDATEs and DELETEs on inherited target tables are
planned using inheritance_planner(), which plans each inheritance
child table separately, so that the selectivity estimation functions
do not know that they are dealing with a child table accessed via its
parent. Handle that by checking access permissions on the top parent
table at planner-startup, in the same way as we do for views. Any
securityQuals on the top parent table are moved down to the child
tables by inheritance_planner(), so they continue to be checked by the
selectivity estimation functions.
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Noah Misch <noah@leadboat.com>
Backpatch-through: 13
Security: CVE-2025-8713
The internal queue of buffers could become corrupted in a rare edge case
that failed to invalidate an entry, causing a stale buffer to be
"forwarded" to StartReadBuffers(). This is a simple fix for the
immediate problem.
A small API change might be able to remove this and related fragility
entirely, but that will have to wait a bit.
Defect in commit ed0b87ca.
Bug: 19006
Backpatch-through: 18
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/19006-80fcaaf69000377e%40postgresql.org
Fix a couple more places where an explicit Datum conversion
is needed (not clear how we missed these in ff89e182d and
previous commits).
Replace the minority usage "(Datum) NULL" with "(Datum) 0".
The former depends on the assumption that Datum is the same
width as Pointer, the latter doesn't. Anyway consistency
is a good thing.
This is, I believe, the last of the notational mop-up needed
before we can consider changing Datum to uint64 everywhere.
It's also important cleanup for more aggressive ideas such
as making Datum a struct.
Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
Discussion: https://postgr.es/m/8246d7ff-f4b7-4363-913e-827dadfeb145@eisentraut.org
Add various missing conversions from and to Datum. The previous code
mostly relied on implicit conversions or its own explicit casts
instead of using the correct DatumGet*() or *GetDatum() functions.
We think these omissions are harmless. Some actual bugs that were
discovered during this process have been committed
separately (80c758a2e1, fd2ab03fea).
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org
This function is called from ATExecAttachPartition/ATExecAddInherit,
which prevent tables with row-level triggers with transition tables from
becoming partitions or inheritance children, to check if there is such a
trigger on the given table, but failed to check if a found trigger is
row-level, causing the caller functions to needlessly prevent a table
with only a statement-level trigger with transition tables from becoming
a partition or inheritance child. Repair.
Oversight in commit 501ed02cf.
Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Discussion: https://postgr.es/m/CAPmGK167mXzwzzmJ_0YZ3EZrbwiCxtM1vogH_8drqsE6PtxRYw%40mail.gmail.com
Backpatch-through: 13
Commit 9e6104c66 disallowed transition tables on foreign tables, but
failed to account for cases where a foreign table is a child table of a
partitioned/inherited table on which transition tables exist, leading to
incorrect transition tuples collected from such foreign tables for
queries on the parent table triggering transition capture. This
occurred not only for inherited UPDATE/DELETE but for partitioned INSERT
later supported by commit 3d956d956, which should have handled it at
least for the INSERT case, but didn't.
To fix, modify ExecAR*Triggers to throw an error if the given relation
is a foreign table requesting transition capture. Also, this commit
fixes make_modifytable so that in case of an inherited UPDATE/DELETE
triggering transition capture, FDWs choose normal operations to modify
child foreign tables, not DirectModify; which is needed because they
would otherwise skip the calls to ExecAR*Triggers at execution, causing
unexpected behavior.
Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CAPmGK14QJYikKzBDCe3jMbpGENnQ7popFmbEgm-XTNuk55oyHg%40mail.gmail.com
Backpatch-through: 13
Dropping twice a pgstats entry should not happen, and the error report
generated was missing the "generation" counter (tracking when an entry
is reused) that has been added in 818119afcc.
Like d92573adcb, backpatch down to v15 where this information is
useful to have, to gather more information from instances where the
problem shows up. A report has shown that this error path has been
reached on a standby based on 17.3, for a relation stats entry and an
OID close to wraparound.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAN4RuQvYth942J2+FcLmJKgdpq6fE5eqyFvb_PuskxF2eL=Wzg@mail.gmail.com
Backpatch-through: 15
This adds a few more functions to int128.h, allowing more of numeric.c
to use 128-bit integers on all platforms.
Specifically, int64_div_fast_to_numeric() and the following aggregate
functions can now use 128-bit integers for improved performance on all
platforms, rather than just platforms with native support for int128:
- SUM(int8)
- AVG(int8)
- STDDEV_POP(int2 or int4)
- STDDEV_SAMP(int2 or int4)
- VAR_POP(int2 or int4)
- VAR_SAMP(int2 or int4)
In addition to improved performance on platforms lacking native
128-bit integer support, this significantly simplifies this numeric
code by allowing a lot of conditionally compiled code to be deleted.
A couple of numeric functions (div_var_int64() and sqrt_var()) still
contain conditionally compiled 128-bit integer code that only works on
platforms with native 128-bit integer support. Making those work more
generally would require rolling our own higher precision 128-bit
division, which isn't supported for now.
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CAEZATCWgBMc9ZwKMYqQpaQz2X6gaamYRB+RnMsUNcdMcL2Mj_w@mail.gmail.com
Use Min(NBuffers, MAX_CHECKPOINT_REQUESTS) instead of NBuffers in
CheckpointerShmemSize() to match the actual array size limit set in
CheckpointerShmemInit(). This prevents wasting shared memory when
NBuffers > MAX_CHECKPOINT_REQUESTS. Also, fix the comment.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1439188.1754506714%40sss.pgh.pa.us
Author: Xuneng Zhou <xunengzhou@gmail.com>
Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>
The code used a PG_RETURN_TIMESTAMPTZ() where the return type is
TimestampTz and not a Datum.
On 64-bit systems, there is no effect since this just ends up casting
64-bit integers back and forth. On 32-bit systems, timestamptz is
pass-by-reference. PG_RETURN_TIMESTAMPTZ() allocates new memory and
returns the address, meaning that the caller could interpret this as a
timestamp value.
The effect is using "date_trunc(..., 'infinity'::timestamptz) will
return random values (instead of the correct return value 'infinity').
Bug introduced in commit d85ce012f9.
Author: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2d320b6f-b4af-4fbc-9eec-5d0fa15d187b@eisentraut.org
Discussion: https://postgr.es/m/4bf60a84-2862-4a53-acd5-8eddf134a60e@eisentraut.org
Backpatch-through: 18
This commit makes use of the existing PqMsg_* macros in more places
and adds new PqReplMsg_* and PqBackupMsg_* macros for use in
special replication and backup messages, respectively.
Author: Dave Cramer <davecramer@gmail.com>
Co-authored-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/aIECfYfevCUpenBT@nathan
Discussion: https://postgr.es/m/CAFcNs%2Br73NOUb7%2BqKrV4HHEki02CS96Z%2Bx19WaFgE087BWwEng%40mail.gmail.com
The name "namspace" looks like a typo, but it was presumably meant
to avoid using the "namespace" C++ keyword. This commit renames
the parameter to "nameSpace" to prevent future confusion while
still avoiding the keyword.
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/aJJxpfsDfiQ1VbJ5%40nathan
These were introduced (commit efdc7d7475) at the same time as we were
moving to using the standard inttypes.h format macros (commit
a0ed19e0a9). It doesn't seem useful to keep a new already-deprecated
interface like this with only a few users, so remove the new symbols
again and have the callers use PRIx64.
(Also, INT64_HEX_FORMAT was kind of a misnomer, since hex formats all
use unsigned types.)
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/0ac47b5d-e5ab-4cac-98a7-bdee0e2831e4%40eisentraut.org
Following commit e035863c9a, building with -O0 began triggering
warnings about potentially uninitialized 'workbuf' usage. While
theoretically the initialization isn't necessary since VARDATA()
doesn't access the contents of the pointed-to object, this commit
explicitly initializes the workbuf variable to suppress the warning.
Buildfarm members adder and flaviventris have shown the warning.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAD21AoCOZxfqnNgfM5yVKJZYnOq5m2Q96fBGy1fovEqQ9V4OZA@mail.gmail.com
The result of "DirectFunctionCall1(numeric_float8, d)" is already in
Datum form, but the code was incorrectly applying PG_RETURN_FLOAT8()
to it. On machines where float8 is pass-by-reference, this would
result in complete garbage, since an unpredictable pointer value
would be treated as an integer and then converted to float. It's not
entirely clear how much of a problem would ensue on 64-bit hardware,
but certainly interpreting a float8 bitpattern as uint64 and then
converting that to float isn't the intended behavior.
As luck would have it, even the complete-garbage case doesn't break
BRIN indexes, since the results are only used to make choices about
how to merge values into ranges: at worst, we'd make poor choices
resulting in an inefficient index. Doubtless that explains the lack
of field complaints. However, users with BRIN indexes that use the
numeric_minmax_multi_ops opclass may wish to reindex in hopes of
making their indexes more efficient.
Author: Peter Eisentraut <peter@eisentraut.org>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2093712.1753983215@sss.pgh.pa.us
Backpatch-through: 14
This commit introduces a new column backup_type that indicates the
type of backup being performed: either 'full' or 'incremental'.
Bump catalog version.
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp>
Discussion: https://postgr.es/m/CAOzEurQuzbHwTj1ehk1a+eeQDidJPyrE5s6mYumkjwjZnurhkQ@mail.gmail.com
During CREATE DATABASE, if changing the locale provider, require that
a new locale is specified rather than trying to reinterpret the
template's locale using the new provider.
This only affects the behavior when the template uses the builtin
provider and CREATE DATABASE specifies the ICU provider without
specifying the locale. Previously, that may have succeeded due to
loose validation by ICU, whereas now that will cause an error. Because
it can cause an error, backport only to unreleased versions.
Discussion: https://postgr.es/m/5038b33a6dc639009f4b3d43fa6ae0c5ba9e04f7.camel@j-davis.com
Backpatch-through: 18
Macros like VARDATA() and VARSIZE() should be thought of as taking
values of type pointer to struct varlena or some other related struct.
The way they are implemented, you can pass anything to it and it will
cast it right. But this is in principle incorrect. To fix, add the
required DatumGetPointer() calls. Or in a couple of cases, remove
superfluous PointerGetDatum() calls.
It is planned in a subsequent patch to change macros like VARDATA()
and VARSIZE() to inline functions, which will enforce stricter typing.
This is in preparation for that.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/928ea48f-77c6-417b-897c-621ef16685a6%40eisentraut.org
Previously, specifying the publication option 'publish_generated_columns'
without an explicit value would incorrectly default to 'stored', which is
not the intended behavior.
This patch fixes the issue by raising an ERROR when no value is provided
for 'publish_generated_columns', ensuring that users must explicitly
specify a valid option.
Author: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Backpatch-through: 18, where it was introduced
Discussion: https://postgr.es/m/CAHut+PsCUCWiEKmB10DxhoPfXbF6jw5RD9ib2LuaQeA_XraW7w@mail.gmail.com
Previously, enabling sync_replication_slots while wal_level was not set
to logical could cause the server to shut down. This was because
the postmaster performed a configuration check before launching
the slot synchronization worker and raised an ERROR if the settings
were incompatible. Since ERROR is treated as FATAL in the postmaster,
this resulted in the entire server shutting down unexpectedly.
This commit changes the postmaster to log that message with a LOG-level
instead of raising an ERROR, allowing the server to continue running
even with the misconfiguration.
Back-patch to v17, where slot synchronization was introduced.
Reported-by: Hugo DUBOIS <hdubois@scaleway.com>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Hugo DUBOIS <hdubois@scaleway.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAH0PTU_pc3oHi__XESF9ZigCyzai1Mo3LsOdFyQA4aUDkm01RA@mail.gmail.com
Backpatch-through: 17
This enhancement builds upon the infrastructure introduced in commit
228c370868, which enables the preservation of deleted tuples and their
origin information on the subscriber. This capability is crucial for
handling concurrent transactions replicated from remote nodes.
The update introduces support for detecting update_deleted conflicts
during the application of update operations on the subscriber. When an
update operation fails to locate the target row-typically because it has
been concurrently deleted-we perform an additional table scan. This scan
uses the SnapshotAny mechanism and we do this additional scan only when
the retain_dead_tuples option is enabled for the relevant subscription.
The goal of this scan is to locate the most recently deleted tuple-matching
the old column values from the remote update-that has not yet been removed
by VACUUM and is still visible according to our slot (i.e., its deletion
is not older than conflict-detection-slot's xmin). If such a tuple is
found, the system reports an update_deleted conflict, including the origin
and transaction details responsible for the deletion.
This provides a groundwork for more robust and accurate conflict
resolution process, preventing unexpected behavior by correctly
identifying cases where a remote update clashes with a deletion from
another origin.
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Coverity complained that the "errtrace" string is leaked if we return
early because backtrace_symbols fails. Another criticism that could
be leveled at this is that not providing any hint of what happened is
user-unfriendly. Fix that.
The odds of a leak here are small, and typically it wouldn't matter
anyway since the leak will be in ErrorContext which will soon get
reset. So I'm not feeling a need to back-patch.
If ndatums is zero, the code would allocate zero-length boundKinds
and boundDatums chunks, which would have nothing pointing to them,
leading to Valgrind complaints. Rearrange the code to avoid the
useless pallocs, and also to not bother computing byval/typlen when
they aren't used.
I'm unsure why I didn't see this in my Valgrind testing back in May.
This code hasn't changed since then, but maybe we added a regression
test that reaches this edge case. Or possibly I just failed to
notice the reports, which do say "0 bytes lost".
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
CompleteCachedPlan intentionally doesn't worry about small
leaks from PlanCacheComputeResultDesc. However, Valgrind
knows nothing of engineering tradeoffs and complains anyway.
Silence it by doing things the hard way if USE_VALGRIND.
I don't really love this patch, because it makes the handling
of plansource->resultDesc different from the handling of query
dependencies and search_path just above, which likewise are willing
to accept small leaks into the cached plan's context. However,
those cases aren't provoking Valgrind complaints. (Perhaps in a
CLOBBER_CACHE_ALWAYS build, they would?) For the moment, this
makes the src/pl/plpgsql tests leak-free according to Valgrind.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
Like the situation with function cache loading, text search
dictionary loading functions tend to leak some cruft into the
dictionary's long-lived cache context. To judge by the examples in
the core regression tests, not very many bytes are at stake.
Moreover, I don't see a way to prevent such leaks without changing the
API for TS template initialization functions: right now they do not
have to worry about making sure that their results are long-lived.
Hence, I think we should install a suppression rule rather than trying
to fix this completely. However, I did grab some low-hanging fruit:
several places were leaking the result of get_tsearch_config_filename.
This seems worth doing mostly because they are inconsistent with other
dictionaries that were freeing it already.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
These changes don't actually fix any leaks. They just make sure that
Valgrind will find pointers to data structures that remain allocated
at process exit, and thus not falsely complain about leaks. In
particular, we are trying to avoid situations where there is no
pointer to the beginning of an allocated block (except possibly
within the block itself, which Valgrind won't count).
* Because dynahash.c never frees hashtable storage except by deleting
the whole hashtable context, it doesn't bother to track the individual
blocks of elements allocated by element_alloc(). This results in
"possibly lost" complaints from Valgrind except when the first element
of each block is actively in use. (Otherwise it'll be on a freelist,
but very likely only reachable via "interior pointers" within element
blocks, which doesn't satisfy Valgrind.)
To fix, if we're building with USE_VALGRIND, expend an extra pointer's
worth of space in each element block so that we can chain them all
together from the HTAB header. Skip this in shared hashtables though:
Valgrind doesn't track those, and we'd need additional locking to make
it safe to manipulate a shared chain.
While here, update a comment obsoleted by 9c911ec06.
* Put the dlist_node fields of catctup and catclist structs first.
This ensures that the dlist pointers point to the starts of these
palloc blocks, and thus that Valgrind won't consider them
"possibly lost".
* The postmaster's PMChild structs and the autovac launcher's
avl_dbase structs also have the dlist_node-is-not-first problem,
but putting it first still wouldn't silence the warning because we
bulk-allocate those structs in an array, so that Valgrind sees a
single allocation. Commonly the first array element will be pointed
to only from some later element, so that the reference would be an
interior pointer even if it pointed to the array start. (This is the
same issue as for dynahash elements.) Since these are pretty simple
data structures, I don't feel too bad about faking out Valgrind by
just keeping a static pointer to the array start.
(This is all quite hacky, and it's not hard to imagine usages where
we'd need some other idea in order to have reasonable leak tracking of
structures that are only accessible via dlist_node lists. But these
changes seem to be enough to silence this class of leakage complaints
for the moment.)
* Free a couple of data structures manually near the end of an
autovacuum worker's run when USE_VALGRIND, and ensure that the final
vac_update_datfrozenxid() call is done in a non-permanent context.
This doesn't have any real effect on the process's total memory
consumption, since we're going to exit as soon as that last
transaction is done. But it does pacify Valgrind.
* Valgrind complains about the postmaster's socket-files and
lock-files lists being leaked, which we can silence by just
not nulling out the static pointers to them.
* Valgrind seems not to consider the global "environ" variable as
a valid root pointer; so when we allocate a new environment array,
it claims that data is leaked. To fix that, keep our own
statically-allocated copy of the pointer, similarly to the previous
item.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
In the current system architecture, none of these are worth obsessing
over; most are once-per-process leaks. However, Valgrind complains
about all of them, and if we get to using threads rather than
processes for backend sessions, it will become more interesting to
avoid per-session leaks.
* Fix leaks in StartupXLOG() and ShutdownWalRecovery().
* Fix leakage of pq_mq_handle in a parallel worker.
While at it, move mq_putmessage's "Assert(pq_mq_handle != NULL)"
to someplace where it's not trivially useless.
* Fix leak in logicalrep_worker_detach().
* Don't leak the startup-packet buffer in ProcessStartupPacket().
* Fix leak in evtcache.c's DecodeTextArrayToBitmapset().
If the presented array is toasted, this neglected to free the
detoasted copy, which was then leaked into EventTriggerCacheContext.
* I'm distressed by the amount of code that BuildEventTriggerCache
is willing to run while switched into a long-lived cache context.
Although the detoasted array is the only leak that Valgrind reports,
let's tighten things up while we're here. (DecodeTextArrayToBitmapset
is still run in the cache context, so doing this doesn't remove the
need for the detoast fix. But it reduces the surface area for other
leaks.)
* load_domaintype_info() intentionally leaked some intermediate cruft
into the long-lived DomainConstraintCache's memory context, reasoning
that the amount of leakage will typically not be much so it's not
worth doing a copyObject() of the final tree to avoid that. But
Valgrind knows nothing of engineering tradeoffs and complains anyway.
On the whole, the copyObject doesn't cost that much and this is surely
not a performance-critical code path, so let's do it the clean way.
* MarkGUCPrefixReserved didn't bother to clean up removed placeholder
GUCs at all, which shows up as a leak in one regression test.
It seems appropriate for it to do as much cleanup as
define_custom_variable does when replacing placeholders, so factor
that code out into a helper function. define_custom_variable's logic
was one brick shy of a load too: it forgot to free the separate
allocation for the placeholder's name.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
Arrange that only the "aligned chunk" part of the allocated space is
included in a Valgrind vchunk. This suppresses complaints about that
vchunk being possibly lost because PG is retaining only pointers to
the aligned chunk. Also make sure that trailing wasted space is
marked NOACCESS.
As a tiny performance improvement, arrange that MCXT_ALLOC_ZERO zeroes
only the returned "aligned chunk", not the wasted padding space.
In passing, fix GetLocalBufferStorage to use MemoryContextAllocAligned
instead of rolling its own implementation, which was equally broken
according to Valgrind.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
When determining whether an allocated chunk is still reachable,
Valgrind will consider only pointers within what it believes to be
allocated chunks. Normally, all of a block obtained from malloc()
would be considered "allocated" --- but it turns out that if we use
VALGRIND_MEMPOOL_ALLOC to designate sub-section(s) of a malloc'ed
block as allocated, all the rest of that malloc'ed block is ignored.
This leads to lots of false positives of course. In particular,
in any multi-malloc-block context, all but the primary block were
reported as leaked. We also had a problem with context "ident"
strings, which were reported as leaked unless there was some other
pointer to them besides the one in the context header.
To fix, we need to use VALGRIND_MEMPOOL_ALLOC to designate
a context's management structs (the context struct itself and
any per-block headers) as allocated chunks. That forces moving
the VALGRIND_CREATE_MEMPOOL/VALGRIND_DESTROY_MEMPOOL calls into
the per-context-type code, so that the pool identifier can be
made as soon as we've allocated the initial block, but otherwise
it's fairly straightforward. Note that in Valgrind's eyes there
is no distinction between these allocations and the allocations
that the mmgr modules hand out to user code. That's fine for
now, but perhaps someday we'll want to do better yet.
When reading this patch, it's helpful to start with the comments
added at the head of mcxt.c.
Author: Andres Freund <andres@anarazel.de>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
Discussion: https://postgr.es/m/20210317181531.7oggpqevzz6bka3g@alap3.anarazel.de
In ReorderBufferProcessTXN(), used to send the data of a transaction to
an output plugin, INSERT ON CONFLICT changes (INTERNAL_SPEC_INSERT) are
delayed until a confirmation record arrives (INTERNAL_SPEC_CONFIRM),
updating the change being processed.
8c58624df4 has added an extra step after processing a change to update
the progress of the transaction, by calling the callback
update_progress_txn() based on the LSN stored in a change after a
threshold of CHANGES_THRESHOLD (100) is reached. This logic has missed
the fact that for an INSERT ON CONFLICT change the data is freed once
processed, hence update_progress_txn() could be called pointing to a LSN
value that's already been freed. This could result in random crashes,
depending on the workload.
Per discussion, this issue is fixed by reusing in update_progress_txn()
the LSN from the change processed found at the beginning of the loop,
meaning that for a INTERNAL_SPEC_CONFIRM change the progress is updated
using the LSN of the INTERNAL_SPEC_CONFIRM change, and not the LSN from
its INTERNAL_SPEC_INSERT change. This is actually more correct, as we
want to update the progress to point to the INTERNAL_SPEC_CONFIRM
change.
Masahiko Sawada has found a nice trick to reproduce the issue: hardcode
CHANGES_THRESHOLD at 1 and run test_decoding (test "ddl" being enough)
on an instance running valgrind. The bug has been analyzed by Ethan
Mertz, who also originally suggested the solution used in this patch.
Issue introduced by 8c58624df4, so backpatch down to v16.
Author: Ethan Mertz <ethan.mertz@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/aIsQqDZ7x4LAQ6u1@paquier.xyz
Backpatch-through: 16
Currently, ALTER DATABASE/ROLE/SYSTEM RESET [ALL] with an unknown
custom GUC with a prefix reserved by MarkGUCPrefixReserved() errors
(unless a superuser runs a RESET ALL variant). This is problematic
for cases such as an extension library upgrade that removes a GUC.
To fix, simply make sure the relevant code paths explicitly allow
it. Note that we require superuser or privileges on the parameter
to reset it. This is perhaps a bit more restrictive than is
necessary, but it's not clear whether further relaxing the
requirements is safe.
Oversight in commit 88103567cb. The ALTER SYSTEM fix is dependent
on commit 2d870b4aef, which first appeared in v17. Unfortunately,
back-patching that commit would introduce ABI breakage, and while
that breakage seems unlikely to bother anyone, it doesn't seem
worth the risk. Hence, the ALTER SYSTEM part of this commit is
omitted on v15 and v16.
Reported-by: Mert Alev <mert@futo.org>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Discussion: https://postgr.es/m/18964-ba09dea8c98fccd6%40postgresql.org
Backpatch-through: 15
A deadlock can occur when the DDL command and the apply worker acquire
catalog locks in different orders while dropping replication origins.
The issue is rare in PG16 and higher branches because, in most cases, the
tablesync worker performs the origin drop in those branches, and its
locking sequence does not conflict with DDL operations.
This patch ensures consistent lock acquisition to prevent such deadlocks.
As per buildfarm.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Ajin Cherian <itsajin@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 14, where it was introduced
Discussion: https://postgr.es/m/bab95e12-6cc5-4ebb-80a8-3e41956aa297@gmail.com
Commit 719dcf3c42 introduced a field called CachedPlanType in
PlannedStmt to allow extensions to determine whether a cached plan is
generic or custom.
After discussion, the concepts that we want to track are a bit wider
than initially anticipated, as it is closer to knowing from which
"source" or "origin" a PlannedStmt has been generated or retrieved.
Custom and generic cached plans are a subset of that.
Based on the state of HEAD, we have been able to define two more
origins:
- "standard", for the case where PlannedStmt is generated in
standard_planner(), the most common case.
- "internal", for the fake PlannedStmt generated internally by some
query patterns.
This could be tuned in the future depending on what is needed. This
looks like a good starting point, at least. The default value is called
"UNKNOWN", provided as fallback value. This value is not used in the
core code, the idea is to let extensions building their own PlannedStmts
know about this new field.
Author: Michael Paquier <michael@paquier.xyz>
Co-authored-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/aILaHupXbIGgF2wJ@paquier.xyz
If the client sent a query cancel request with backend PID 0, it
tripped an assertion. With assertions disabled, you got this in the
log instead:
LOG: invalid cancel request with PID 0
LOG: wrong key in cancel request for process 0
Query cancellations don't even require authentication, so we better
tolerate bogus requests. Fix by turning the assertion into a regular
runtime check.
Spotted while testing libpq behavior with a modified server that
didn't send BackendKeyData to the client.
Backpatch-through: 18
For many optional libraries, we extract the -L and -l switches needed
to link the library from a helper program such as llvm-config. In
some cases we put the resulting -L switches into LDFLAGS ahead of
-L switches specified via --with-libraries. That risks breaking
the user's intention for --with-libraries.
It's not such a problem if the library's -L switch points to a
directory containing only that library, but on some platforms a
library helper may "helpfully" offer a switch such as -L/usr/lib
that points to a directory holding all standard libraries. If the
user specified --with-libraries in hopes of overriding the standard
build of some library, the -L/usr/lib switch prevents that from
happening since it will come before the user-specified directory.
To fix, avoid inserting these switches directly into LDFLAGS during
configure, instead adding them to LIBDIRS or SHLIB_LINK. They will
still eventually get added to LDFLAGS, but only after the switches
coming from --with-libraries.
The same problem exists for -I switches: those coming from
--with-includes should appear before any coming from helper programs
such as llvm-config. We have not heard field complaints about this
case, but it seems certain that a user attempting to override a
standard library could have issues.
The changes for this go well beyond configure itself, however,
because many Makefiles have occasion to manipulate CPPFLAGS to
insert locally-desirable -I switches, and some of them got it wrong.
The correct ordering is any -I switches pointing at within-the-
source-tree-or-build-tree directories, then those from the tree-wide
CPPFLAGS, then those from helper programs. There were several places
that risked pulling in a system-supplied copy of libpq headers, for
example, instead of the in-tree files. (Commit cb36f8ec2 fixed one
instance of that a few months ago, but this exercise found more.)
The Meson build scripts may or may not have any comparable problems,
but I'll leave it to someone else to investigate that.
Reported-by: Charles Samborski <demurgos@demurgos.net>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/70f2155f-27ca-4534-b33d-7750e20633d7@demurgos.net
Backpatch-through: 13
When I prepared 71c0921b6 et al yesterday, I was thinking that the
logic involving explicitly freeing the node_list output was still
needed to dodge leakage bugs in libxml2. But I was misremembering:
we introduced that only because with early 2.13.x releases we could
not trust xmlParseBalancedChunkMemory's result code, so we had to
look to see if a node list was returned or not. There's no reason
to believe that xmlParseBalancedChunkMemory will fail to clean up
the node list when required, so simplify. (This essentially
completes reverting all the non-cosmetic changes in 6082b3d5d.)
Reported-by: Jim Jones <jim.jones@uni-muenster.de>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/997668.1753802857@sss.pgh.pa.us
Backpatch-through: 13
There've been a few complaints that it can be overly difficult to figure
out why the planner picked a Memoize plan. To help address that, here we
adjust the EXPLAIN output to display the following additional details:
1) The estimated number of cache entries that can be stored at once
2) The estimated number of unique lookup keys that we expect to see
3) The number of lookups we expect
4) The estimated hit ratio
Technically #4 can be calculated using #1, #2 and #3, but it's not a
particularly obvious calculation, so we opt to display it explicitly.
The original patch by Lukas Fittl only displayed the hit ratio, but
there was a fear that might lead to more questions about how that was
calculated. The idea with displaying all 4 is to be transparent which
may allow queries to be tuned more easily. For example, if #2 isn't
correct then maybe extended statistics or a manual n_distinct estimate can
be used to help fix poor plan choices.
Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAP53Pky29GWAVVk3oBgKBDqhND0BRBN6yTPeguV_qSivFL5N_g%40mail.gmail.com
This mostly reverts commit 6082b3d5d, "Use xmlParseInNodeContext
not xmlParseBalancedChunkMemory". It turns out that
xmlParseInNodeContext will reject text chunks exceeding 10MB, while
(in most libxml2 versions) xmlParseBalancedChunkMemory will not.
The bleeding-edge libxml2 bug that we needed to work around a year
ago is presumably no longer a factor, and the argument that
xmlParseBalancedChunkMemory is semi-deprecated is not enough to
justify a functionality regression. Hence, go back to doing it
the old way.
Reported-by: Michael Paquier <michael@paquier.xyz>
Author: Michael Paquier <michael@paquier.xyz>
Co-authored-by: Erik Wienhold <ewie@ewie.name>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/aIGknLuc8b8ega2X@paquier.xyz
Backpatch-through: 13
Commit ffae5cc5a6 added this hint in 2006,
but it's now obsolete and doesn't reflect what users should really check
in this situation. We were not able to agree on a new hint, so just delete
the existing one and update the comments to mention one possibility that
is known to cause problems of this kind: something other than PostgreSQL
is modifying files in the PostgreSQL data directory.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Robert Haas <rhaas@postgresql.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/CAKZiRmxNbcaL76x=09Sxf7aUmrRQJBf8drzDdUHo+j9_eM+VMg@mail.gmail.com
Commit 473a575e05 purported to make this
function stash the error message in *syncrep_parse_result_p, but
it didn't actually.
As a result, an attempt to set synchronous_standby_names to any value
that does not parse resulted in a generic "parser failed." message
rather than anything more specific. This fixes that.
Discussion: http://postgr.es/m/CA+TgmoYF9wPNZ-Q_EMfib_espgHycY-eX__6Tzo2GpYpVXqCdQ@mail.gmail.com
Backpatch-through: 18
The callback added in fc415edf8c used to check if there is any pending
data to flush for fixed-numbered statistics, done by looping across all
the builtin and custom stats kinds with a call to have_fixed_pending_cb,
is proving to able to show in workloads that do not report any stats
(read-only, no function calls, no WAL, no IO, etc). The code used in
v17 was cheaper than that what HEAD has introduced, relying on three
boolean checks for WAL, SLRU and IO stats.
This commit switches the code to use a more efficient approach than
fc415edf8c, with a single boolean flag that can be switched to "true"
by any fixed-numbered stats kinds to force pgstat_report_stat() to go
through one round of reports. The flag is reset by pgstat_report_stat()
once a full round of reports is done. The flag being false means that
fixed-numbered stats kinds saw no activity, and that there is no pending
data to flush.
ac000fca74 took one step in improving the performance by reducing the
number of stats kinds that the backend can hold. This commit takes a
more drastic step by bringing back the code efficiency to what it was
before v18 with a cheap check at the beginning of pgstat_report_stat()
for its fast-exit path.
The callback have_static_pending_cb is removed as an effect of all that.
Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/eb224uegsga2hgq7dfq3ps5cduhpqej7ir2hjxzzozjthrekx5@dysei6buqthe
Backpatch-through: 18
If the number of sync requests is big enough, the palloc() call in
AbsorbSyncRequests() will attempt to allocate more than 1 GB of memory,
resulting in failure. This can lead to an infinite loop in the checkpointer
process, as it repeatedly fails to absorb the pending requests.
This commit introduces the following changes to cope with this problem:
1. Turn pending checkpointer requests array in shared memory into a bounded
ring buffer.
2. Limit maximum ring buffer size to 10M items.
3. Make AbsorbSyncRequests() process requests incrementally in 10K batches.
Even #2 makes the whole queue size fit the maximum palloc() size of 1 GB.
of continuous lock holding.
This commit is for master only. Simpler fix, which just limits a request
queue size to 10M, will be backpatched.
Reported-by: Ekaterina Sokolova <e.sokolova@postgrespro.ru>
Discussion: https://postgr.es/m/db4534f83a22a29ab5ee2566ad86ca92%40postgrespro.ru
Author: Maxim Orlov <orlovmg@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Similar checks are done for the mandatory table AM callbacks. A portion
of the index AM callbacks are optional and can be NULL; the rest is
mandatory and is documented as such in the documentation and in amapi.h.
These checks are useful to detect quickly if all the mandatory callbacks
are defined when implementing a new index access method, as the
assertions are run when loading the AM.
Author: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/ME0P300MB0445795D31CEAB92C58B41FDB651A@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM
Remove a bunch of PG_TRY constructs, de-volatilize related
variables, remove some PQclear calls in error paths.
Aside from making the code simpler and shorter, this should
provide some marginal performance gains.
For ease of review, I did not re-indent code within the removed
PG_TRY constructs. That'll be done in a separate patch.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/2976982.1748049023@sss.pgh.pa.us
Commit 232d8caea fixed a case where postgres_fdw could lose track
of a PGresult object, resulting in a process-lifespan memory leak.
But I have little faith that there aren't other potential PGresult
leakages, now or in future, in the backend modules that use libpq.
Therefore, this patch proposes infrastructure that makes all
PGresults returned from libpq act as though they are palloc'd
in the CurrentMemoryContext (with the option to relocate them to
another context later). This should greatly reduce the risk of
careless leaks, and it also permits removal of a bunch of code
that attempted to prevent such leaks via PG_TRY blocks.
This patch adds infrastructure that wraps each PGresult in a
"libpqsrv_PGresult" that provides a memory context reset callback
to PQclear the PGresult. Code using this abstraction is inherently
memory-safe to the same extent as we are accustomed to in most backend
code. Furthermore, we add some macros that automatically redirect
calls of the libpq functions concerned with PGresults to use this
infrastructure, so that almost no source-code changes are needed to
wheel this infrastructure into place in all the backend code that
uses libpq.
Perhaps in future we could create similar infrastructure for
PGconn objects, but there seems less need for that.
This patch just creates the infrastructure and makes relevant code
use it, including reverting 232d8caea in favor of this mechanism.
A good deal of follow-on simplification is possible now that we don't
have to be so cautious about freeing PGresults, but I'll put that in
a separate patch.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/2976982.1748049023@sss.pgh.pa.us
This flag was effectively a no-op in EXEC_BACKEND (ie, Windows)
builds, because it was kept in the process-local HTAB struct,
and it could only ever become set in the postmaster's copy.
The simplest fix is to move it to the shared HASHHDR struct.
We could keep a copy in HTAB as well, as we do with keysize
and some other fields, but the "too much contention" argument
doesn't seem to apply here: we only examine isfixed during
element_alloc(), which had better not get hit very often for
a shared hashtable.
This oversight dates to 7c797e719 which invented the option.
But back-patching doesn't seem appropriate given the lack of
field complaints. If there is anyone running an affected
workload on Windows, they might be unhappy about the behavior
changing in a minor release.
Author: Aidar Imamov <a.imamov@postgrespro.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/4d0cb35ff01c5c74d2b9a582ecb73823@postgrespro.ru
This changes the grammar for REINDEX, CHECKPOINT, CLUSTER, ANALYZE/ANALYSE;
they still accept the same options as before, but the grammar is written
differently for convenience of future development.
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/202507231538.ir7pjzoow6oe@alvherre.pgsql
Previously, if a background worker crashed (e.g., due to a SIGKILL) and
the server restarted due to restart_after_crash being enabled,
the worker was not restarted as expected. Background workers without
the never-restart flag should automatically restart in this case.
This issue was introduced in commit 28a520c0b7, which failed to reset
the rw_pid field in the RegisteredBgWorker struct for the crashed worker.
This commit fixes the problem by resetting rw_pid for all eligible
background workers during the crash-and-restart cycle.
Back-patched to v18, where the bug was introduced.
Bug fix patches were proposed by Andrey Rudometov and ChangAo Chen,
but this commit uses a different approach.
Reported-by: Andrey Rudometov <unlimitedhikari@gmail.com>
Reported-by: ChangAo Chen <cca5507@qq.com>
Author: Andrey Rudometov <unlimitedhikari@gmail.com>
Author: ChangAo Chen <cca5507@qq.com>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: ChangAo Chen <cca5507@qq.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAF6JsWiO=i24qYitWe6ns1sXqcL86rYxdyU+pNYk-WueKPSySg@mail.gmail.com
Discussion: https://postgr.es/m/tencent_E00A056B3953EE6440F0F40F80EC30427D09@qq.com
Backpatch-through: 18
LatchWaitSetPostmasterDeathPos, the latch event position for the
postmaster death event, is initialized under IsUnderPostmaster.
WaitLatch() considered it as a valid wait target in single-user mode
(!IsUnderPostmaster), which was incorrect.
One code path found to fail with an assertion failure is a database drop
in single-user mode while waiting in WaitForProcSignalBarrier() after
the drop.
Oversight in commit 84e5b2f07a.
Author: Patrick Stählin <me@packi.ch>
Co-authored-by: Ronan Dunklau <ronan.dunklau@aiven.io>
Discussion: https://postgr.es/m/18996-3a2744c8140488de@postgresql.org
Backpatch-through: 18
This function is declared as returning a uint8, but it returns a
bool in one code path. To fix, return (uint8) 0 instead of false
there. This should behave exactly the same as before, but it might
prevent future compiler complaints.
Oversight in commit a892234f83.
Author: Julien Rouhaud <rjuju123@gmail.com>
Discussion: https://postgr.es/m/aIHluT2isN58jqHV%40jrouhaud
PlannedStmt gains a new field, called CachedPlanType, able to track if a
given plan tree originates from the cache and if we are dealing with a
generic or custom cached plan.
This field can be used for monitoring or statistical purposes, in the
executor hooks, for example, based on the planned statement attached to
a QueryDesc. A patch is under discussion for pg_stat_statements to
provide an equivalent of the counters in pg_prepared_statements for
custom and generic plans, to provide a more global view of such data, as
this data is now restricted to the current session.
The concept introduced in this commit is useful on its own, and has been
extracted from a larger patch by the same author.
Author: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAA5RZ0uFw8Y9GCFvafhC=OA8NnMqVZyzXPfv_EePOt+iv1T-qQ@mail.gmail.com
Solaris has never bothered to add "const" to the second argument
of PAM conversation procs, as all other Unixen did decades ago.
This resulted in an "incompatible pointer" compiler warning when
building --with-pam, but had no more serious effect than that,
so we never did anything about it. However, as of GCC 14 the
case is an error not warning by default.
To complicate matters, recent OpenIndiana (and maybe illumos
in general?) *does* supply the "const" by default, so we can't
just assume that platforms using our solaris template need help.
What we can do, short of building a configure-time probe,
is to make solaris.h #define _PAM_LEGACY_NONCONST, which
causes OpenIndiana's pam_appl.h to revert to the traditional
definition, and hopefully will have no effect anywhere else.
Then we can use that same symbol to control whether we include
"const" in the declaration of pam_passwd_conv_proc().
Bug: #18995
Reported-by: Andrew Watkins <awatkins1966@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18995-82058da9ab4337a7@postgresql.org
Backpatch-through: 13
lwlock.c, lwlock.h, and wait_event_names.txt each contain a list of
built-in LWLock tranches. It is easy to miss one or the other when
adding or removing tranches, and discrepancies have adverse effects
(e.g., breaking JOINs between pg_stat_activity and pg_wait_events).
This commit moves the lists of built-in tranches in lwlock.{c,h} to
lwlocklist.h and adds a cross-check to the script that generates
lwlocknames.h. If the lists do not match exactly, building will
fail.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aHpOgwuFQfcFMZ/B%40ip-10-97-1-34.eu-west-3.compute.internal
Logical replication requires reliable conflict detection to maintain data
consistency across nodes. To achieve this, we must prevent premature
removal of tuples deleted by other origins and their associated commit_ts
data by VACUUM, which could otherwise lead to incorrect conflict reporting
and resolution.
This patch introduces a mechanism to retain deleted tuples on the
subscriber during the application of concurrent transactions from remote
nodes. Retaining these tuples allows us to correctly ignore concurrent
updates to the same tuple. Without this, an UPDATE might be misinterpreted
as an INSERT during resolutions due to the absence of the original tuple.
Additionally, we ensure that origin metadata is not prematurely removed by
vacuum freeze, which is essential for detecting update_origin_differs and
delete_origin_differs conflicts.
To support this, a new replication slot named pg_conflict_detection is
created and maintained by the launcher on the subscriber. Each apply
worker tracks its own non-removable transaction ID, which the launcher
aggregates to determine the appropriate xmin for the slot, thereby
retaining necessary tuples.
Conflict information retention (deleted tuples and commit_ts) can be
enabled per subscription via the retain_conflict_info option. This is
disabled by default to avoid unnecessary overhead for configurations that
do not require conflict resolution or logging.
During upgrades, if any subscription on the old cluster has
retain_conflict_info enabled, a conflict detection slot will be created to
protect relevant tuples from deletion when the new cluster starts.
This is a foundational work to correctly detect update_deleted conflict
which will be done in a follow-up patch.
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Commit 112faf1378 introduced a translation marker in libpq-be-fe-helpers.h,
but this caused build failures on some platforms—such as the one reported
by buildfarm member indri—due to linker issues with dblink. This is the same
problem previously addressed in commit 213c959a29.
To fix the issue, this commit removes the translation marker from
libpq-be-fe-helpers.h, following the approach used in 213c959a29.
It also removes the associated gettext_noop() calls added in commit
112faf1378, as they are no longer needed.
While reviewing this, a gettext_noop() call was also found in
contrib/basic_archive. Since contrib modules don't support translation,
this call has been removed as well.
Per buildfarm member indri.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/0e6299d9-608a-4ffa-aeb1-40cb8a99000b@oss.nttdata.com
Previously, NOTICE, WARNING, and similar messages received from remote
servers over replication, postgres_fdw, or dblink connections were printed
directly to stderr on the local server (e.g., the subscriber). As a result,
these messages lacked log prefixes (e.g., timestamp), making them harder
to trace and correlate with other log entries.
This commit addresses the issue by introducing a custom notice receiver
for replication, postgres_fdw, and dblink connections. These messages
are now logged via ereport(), ensuring they appear in the logs with proper
formatting and context, which improves clarity and aids in debugging.
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CALDaNm2xsHpWRtLm-VL_HJCsaE3+1Y_n-jDEAr3-suxVqc3xoQ@mail.gmail.com
In commit b262ad440, we introduced an optimization that reduces an IS
[NOT] NULL qual on a NOT NULL column to constant true or constant
false, provided we can prove that the input expression of the NullTest
is not nullable by any outer joins or grouping sets. This deduction
happens quite late in the planner, during the distribution of quals to
rels in query_planner. However, this approach has some drawbacks: we
can't perform any further folding with the constant, and it turns out
to be prone to bugs.
Ideally, this deduction should happen during constant folding.
However, the per-relation information about which columns are defined
as NOT NULL is not available at that point. This information is
currently collected from catalogs when building RelOptInfos for base
or "other" relations.
This patch moves the collection of NOT NULL attribute information for
relations before pull_up_sublinks, storing it in a hash table keyed by
relation OID. It then uses this information to perform the NullTest
deduction for Vars during constant folding. This also makes it
possible to leverage this information to pull up NOT IN subqueries.
Note that this patch does not get rid of restriction_is_always_true
and restriction_is_always_false. Removing them would prevent us from
reducing some IS [NOT] NULL quals that we were previously able to
reduce, because (a) the self-join elimination may introduce new IS NOT
NULL quals after constant folding, and (b) if some outer joins are
converted to inner joins, previously irreducible NullTest quals may
become reducible.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMbWs4-bFJ1At4btk5wqbezdu8PLtQ3zv-aiaY3ry9Ymm=jgFQ@mail.gmail.com
There are several pieces of catalog information that need to be
retrieved for a relation during the early stage of planning. These
include relhassubclass, which is used to clear the inh flag if the
relation has no children, as well as a column's attgenerated and
default value, which are needed to expand virtual generated columns.
More such information may be required in the future.
Currently, these pieces of catalog data are collected in multiple
places, resulting in repeated table_open/table_close calls for each
relation in the rangetable. This patch centralizes the collection of
all required early-stage catalog information into a single loop over
the rangetable, allowing each relation to be opened and closed only
once.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMbWs4-bFJ1At4btk5wqbezdu8PLtQ3zv-aiaY3ry9Ymm=jgFQ@mail.gmail.com
Currently, we expand virtual generated columns after we have pulled up
any SubLinks within the query's quals. This ensures that the virtual
generated column references within SubLinks that should be transformed
into joins are correctly expanded. This approach works well and has
posed no issues.
In an upcoming patch, we plan to centralize the collection of catalog
information needed early in the planner. This will help avoid
repeated table_open/table_close calls for relations in the rangetable.
Since this information is required during sublink pull-up, we are
moving the expansion of virtual generated columns to occur beforehand.
To achieve this, if any EXISTS SubLinks can be pulled up, their
rangetables are processed just before pulling them up.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMbWs4-bFJ1At4btk5wqbezdu8PLtQ3zv-aiaY3ry9Ymm=jgFQ@mail.gmail.com
The only practical effect of these changes is to avoid a useless
list_copy() operation when there is a single hashclause. That's
never going to make any noticeable performance difference, but
the code is arguably clearer this way, especially if we take the
opportunity to add some comments so that readers don't have to
reverse-engineer the usage of these local variables. Also add
some braces for better/more consistent style.
Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAHewXNnHBOO9NEa=NBDYOrwZL4oHu2NOcTYvqyNyWEswo8f5OQ@mail.gmail.com
If a crash occurs while writing a WAL record that spans multiple pages, the
recovery process marks the page with the XLP_FIRST_IS_OVERWRITE_CONTRECORD
flag. However, logical decoding currently attempts to read the full WAL
record based on its expected size before checking this flag, which can lead
to an infinite wait if the remaining data is never written (e.g., no activity
after crash).
This patch updates the logic first to read the page header and check for
the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to reconstruct
the full WAL record. If the flag is set, decoding correctly identifies
the record as incomplete and avoids waiting for WAL data that will never
arrive.
Discussion: https://postgr.es/m/CAAKRu_ZCOzQpEumLFgG_%2Biw3FTa%2BhJ4SRpxzaQBYxxM_ZAzWcA%40mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm34m36PDHzsU_GdcNXU0gLTfFY5rzh9GSQv%3Dw6B%2BQVNRQ%40mail.gmail.com
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Backpatch-through: 13
Instead of laboriously computing the exact output length, use strlen
to get an upper bound cheaply. (This is still O(N) of course, but
the constant factor is a lot less.) This will typically result in
overallocating the output datum, but that's of little concern since
it's a short-lived allocation in just about all use-cases.
A simple microbenchmark showed about 40% speedup for long input
strings.
While here, make some cosmetic cleanups and add a test case that
covers the double-backslash code path in byteain and byteaout.
Author: Steven Niu <niushiji@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Stepan Neretin <slpmcf@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/ca315729-140b-426e-81a6-6cd5cfe7ecc5@gmail.com
If a MERGE inside a CTE attempts an UPDATE or DELETE on a table with
BEFORE ROW triggers, and a concurrent UPDATE or DELETE happens, the
merge code would fail (crashing in the case of an UPDATE action, and
potentially executing the wrong action for a DELETE action).
This is the same issue that 9321c79c86 attempted to fix, except now
for a MERGE inside a CTE. As noted in 9321c79c86, what needs to happen
is for the trigger code to exit early, returning the TM_Result and
TM_FailureData information to the merge code, if a concurrent
modification is detected, rather than attempting to do an EPQ
recheck. The merge code will then do its own rechecking, and rescan
the action list, potentially executing a different action in light of
the concurrent update. In particular, the trigger code must never call
ExecGetUpdateNewTuple() for MERGE, since that is bound to fail because
MERGE has its own per-action projection information.
Commit 9321c79c86 did this using estate->es_plannedstmt->commandType
in the trigger code to detect that a MERGE was being executed, which
is fine for a plain MERGE command, but does not work for a MERGE
inside a CTE. Fix by passing that information to the trigger code as
an additional parameter passed to ExecBRUpdateTriggers() and
ExecBRDeleteTriggers().
Back-patch as far as v17 only, since MERGE cannot appear inside a CTE
prior to that. Additionally, take care to preserve the trigger ABI in
v17 (though not in v18, which is still in beta).
Bug: #18986
Reported-by: Yaroslav Syrytsia <me@ys.lc>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/18986-e7a8aac3d339fa47@postgresql.org
Backpatch-through: 17
We have an assertion to ensure that a command tag has been assigned by
the time we're done executing, but if we happen to execute a command
with no queries, the assertion would fail. Per discussion, rather than
contort things to get a tag assigned, just remove the assertion.
Oversight in 2f9661311b. That commit also retained a comment that
explained logic that had been adjacent to it but diffused into various
places, leaving none apt to keep part of the comment. Remove that part,
and rewrite what remains for extra clarity.
Bug: #18984
Backpatch-through: 13
Reported-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michaël Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/18984-0f4778a6599ac3ae@postgresql.org
The terms used in wait_event_names.txt and lwlock.c were inconsistent
for MultiXactOffsetSLRU and MultiXactMemberSLRU, which could cause joins
between pg_wait_events and pg_stat_activity to fail. lwlock.c is
adjusted in this commit to what the historical name of the event has
always been, and what is documented.
Oversight in 53c2a97a92. 08b9b9e043 has fixed a similar
inconsistency some time ago.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/aHdxN0D0hKXzHFQG@ip-10-97-1-34.eu-west-3.compute.internal
Backpatch-through: 17
Avoid dependence on setlocale().
strcoll(), etc., are not called directly; all collation-sensitive
calls should go through pg_locale.c and use the appropriate
provider. By setting LC_COLLATE to C, we avoid accidentally depending
on libc behavior when using a different provider.
No behavior change in the backend, but it's possible that some
extensions will be affected. Such extensions should be updated to use
the pg_locale_t APIs.
Discussion: https://postgr.es/m/9875f7f9-50f1-4b5d-86fc-ee8b03e8c162@eisentraut.org
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
_bt_first need only store one ScanKeyData struct on the stack for the
purposes of building an IS NOT NULL key based on an implied NOT NULL
constraint. We don't need INDEX_MAX_KEYS-many ScanKeyData structs.
This saves us a little over 2KB in stack space. It's possible that this
has some performance benefit. It also seems simpler and more direct.
It isn't possible for more than a single index attribute to need its own
implied IS NOT NULL key: the first such attribute/IS NOT NULL key always
makes _bt_first stop adding additional boundary keys to startKeys[].
Using INDEX_MAX_KEYS-many ScanKeyData entries was (at best) misleading.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Mircea Cadariu <cadariu.mircea@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm=1kJMSZhhTLoM5BPbwQNWxUj-ynOEh=89ptDZAVgauw@mail.gmail.com
This code used a NO_LZ4_SUPPORT() macro to issue an error in the code
paths where LZ4 [de]compression is attempted but the build does not
support it. This commit refactors the code to use a more flexible error
message so as it can be used for other compression methods, where the
method is given in input of macro.
Extracted from a larger patch by the same author.
Author: Nikhil Kumar Veldanda <veldanda.nikhilkumar17@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CAFAfj_HX84EK4hyRYw50AOHOcdVi-+FFwAAPo7JHx4aShCvunQ@mail.gmail.com
The pgoutput plugin initializes optional parameters like "binary" with
default values at the start of processing. However, the "origin"
parameter was previously missed and left without explicit initialization.
Although the PGOutputData struct, which holds these settings,
is zero-initialized at allocation (resulting in publish_no_origin field
for "origin" parameter being false by default), this default was not
set explicitly, unlike other parameters.
This commit adds explicit initialization of the "origin" parameter to
ensure consistency and clarity in how defaults are handled.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/d2790f10-238d-4cb5-a743-d9d2a9dd900f@oss.nttdata.com
The grammar was a little shaky and confusing here, so word-smith it
a bit. Also, adjust the comments in pg_ident.conf.sample to use the
same terminology as the SGML docs, in particular "DATABASE-USERNAME"
not "PG-USERNAME".
Back-patch appropriate subsets. I did not risk changing
pg_ident.conf.sample in released branches, but it still seems OK
to change it in v18.
Reported-by: Alexey Shishkin <alexey.shishkin@enterprisedb.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Discussion: https://postgr.es/m/175206279327.3157504.12519088928605422253@wrigleys.postgresql.org
Backpatch-through: 13
It's impossible to reach this case with either ra or rb being
WJB_DONE, because our earlier checks that the structure and
length of the inputs match should guarantee that we reach their
ends simultaneously. However, the comment completely fails to
explain this, and the Asserts don't cover it either. The comment
is pretty obscure anyway, so rewrite it, and extend the Asserts
to reject WJB_DONE.
This is only cosmetic, so no need for back-patch.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/0c623e8a204187b87b4736792398eaf1@postgrespro.ru
Because not every path through JsonbIteratorNext() sets val->type,
some compilers complain that compareJsonbContainers() is comparing
possibly-uninitialized values. The paths that don't set it return
WJB_DONE, WJB_END_ARRAY, or WJB_END_OBJECT, so it's clear by
manual inspection that the "(ra == rb)" code path is safe, and
indeed we aren't seeing warnings about that. But the (ra != rb)
case is much less obviously safe. In Assert-enabled builds it
seems that the asserts rejecting WJB_END_ARRAY and WJB_END_OBJECT
persuade gcc 15.x not to warn, which makes little sense because
it's impossible to believe that the compiler can prove of its
own accord that ra/rb aren't WJB_DONE here. (In fact they never
will be, so the code isn't wrong, but why is there no warning?)
Without Asserts, the appearance of warnings is quite unsurprising.
We discussed fixing this by converting those two Asserts into
pg_assume, but that seems not very satisfactory when it's so unclear
why the compiler is or isn't warning: the warning could easily
reappear with some other compiler version. Let's fix it in a less
magical, more future-proof way by changing JsonbIteratorNext()
so that it always does set val->type. The cost of that should be
pretty negligible, and it makes the function's API spec less squishy.
Reported-by: Erik Rijkers <er@xs4all.nl>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/988bf1bc-3f1f-99f3-bf98-222f1cd9dc5e@xs4all.nl
Discussion: https://postgr.es/m/0c623e8a204187b87b4736792398eaf1@postgrespro.ru
Backpatch-through: 13
If the system-name field of a pg_ident.conf line is a regex
containing capturing parentheses, you can write \1 in the
user-name field to represent the captured part of the system
name. But what happens if you write \1 more than once?
The only reasonable expectation IMO is that each \1 gets
replaced, but presently our code replaces only the first.
Fix that.
Also, improve the tests for this feature to exercise cases
where a non-empty string needs to be substituted for \1.
The previous testing didn't inspire much faith that it
was verifying correct operation of the substitution code.
Given the lack of field complaints about this, I don't
feel a need to back-patch.
Reported-by: David G. Johnston <david.g.johnston@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAKFQuwZu6kZ8ZPvJ3pWXig+6UX4nTVK-hdL_ZS3fSdps=RJQQQ@mail.gmail.com
The values of the "result" variables in these functions are
always integers; using a float8 variable accomplishes nothing
except to incur useless conversions to and from float. While
that wastes a few nanoseconds, these functions aren't all that
time-critical. But it seems worth fixing to remove possible
reader confusion.
Also, in the case of date2isoyear(), "result" is a very poorly
chosen variable name because it is *not* the function's result.
Rename it to "week", and do the same in date2isoweek() for
consistency.
Since this is mostly cosmetic, there seems little need
for back-patch.
Author: Sergey Fukanchik <s.fukanchik@postgrespro.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6323a-68726500-1-7def9d00@137821581
method_worker.c installed SignalHandlerForConfigReload, but it failed to
actually process reload requests. That hasn't yet produced any concrete
problem reports in terms of GUC changes it should have cared about in
v18, but it was inconsistent.
It did cause problems for a couple of patches in development that need
IO workers to react to ALTER SYSTEM + pg_reload_conf(). Fix extracted
from one of those patches.
Back-patch to 18.
Reported-by: Dmitry Dolgov <9erthalion6@gmail.com>
Discussion: https://postgr.es/m/sh5uqe4a4aqo5zkkpfy5fobe2rg2zzouctdjz7kou4t74c66ql%40yzpkxb7pgoxf
In an ancient ancestor of this code, the postmaster assigned IDs to IO
workers. Now it tracks them in an unordered array and doesn't know
their IDs, so it might be confusing to readers that it still referred to
their indexes as IDs.
No change in behavior, just variable name and error message cleanup.
Back-patch to 18.
Discussion: https://postgr.es/m/CA%2BhUKG%2BwbaZZ9Nwc_bTopm4f-7vDmCwLk80uKDHj9mq%2BUp0E%2Bg%40mail.gmail.com
Adopt PgAioXXX convention for pgaio module type names. Rename a
function that didn't use a pgaio_worker_ submodule prefix. Rename the
internal submit function's arguments to match the indirectly relevant
function pointer declaration and nearby examples. Rename the array of
handle IDs in PgAioSubmissionQueue to sqes, a term of art seen in the
systems it emulates, also clarifying that they're not IO handle
pointers as the old name might imply.
No change in behavior, just type, variable and function name cleanup.
Back-patch to 18.
Discussion: https://postgr.es/m/CA%2BhUKG%2BwbaZZ9Nwc_bTopm4f-7vDmCwLk80uKDHj9mq%2BUp0E%2Bg%40mail.gmail.com
getid() and putid(), which parse and deparse role names within ACL
input/output, applied isalnum() to see if a character within a role
name requires quoting. They did this even for non-ASCII characters,
which is problematic because the results would depend on encoding,
locale, and perhaps even platform. So it's possible that putid()
could elect not to quote some string that, later in some other
environment, getid() will decide is not a valid identifier, causing
dump/reload or similar failures.
To fix this in a way that won't risk interoperability problems
with unpatched versions, make getid() treat any non-ASCII as a
legitimate identifier character (hence not requiring quotes),
while making putid() treat any non-ASCII as requiring quoting.
We could remove the resulting excess quoting once we feel that
no unpatched servers remain in the wild, but that'll be years.
A lesser problem is that getid() did the wrong thing with an input
consisting of just two double quotes (""). That has to represent an
empty string, but getid() read it as a single double quote instead.
The case cannot arise in the normal course of events, since we don't
allow empty-string role names. But let's fix it while we're here.
Although we've not heard field reports of problems with non-ASCII
role names, there's clearly a hazard there, so back-patch to all
supported versions.
Reported-by: Peter Eisentraut <peter@eisentraut.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3792884.1751492172@sss.pgh.pa.us
Backpatch-through: 13
This option, which is disabled by default, can be used to request
the checkpoint also flush dirty buffers of unlogged relations. As
with the MODE option, the server may consolidate the options for
concurrently requested checkpoints. For example, if one session
uses (FLUSH_UNLOGGED FALSE) and another uses (FLUSH_UNLOGGED TRUE),
the server may perform one checkpoint with FLUSH_UNLOGGED enabled.
Author: Christoph Berg <myon@debian.org>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
This option may be set to FAST (the default) to request the
checkpoint be completed as fast as possible, or SPREAD to request
the checkpoint be spread over a longer interval (based on the
checkpoint-related configuration parameters). Note that the server
may consolidate the options for concurrently requested checkpoints.
For example, if one session requests a "fast" checkpoint and
another requests a "spread" checkpoint, the server may perform one
"fast" checkpoint.
Author: Christoph Berg <myon@debian.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
This commit adds the boilerplate code for supporting a list of
options in CHECKPOINT commands. No actual options are supported
yet, but follow-up commits will add support for MODE and
FLUSH_UNLOGGED. While at it, this commit refactors the code for
executing CHECKPOINT commands to its own function since it's about
to become significantly larger.
Author: Christoph Berg <myon@debian.org>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
The new name more accurately reflects the effects of this flag on a
requested checkpoint. Checkpoint-related log messages (i.e., those
controlled by the log_checkpoints configuration parameter) will now
say "fast" instead of "immediate", too. Likewise, references to
"immediate" checkpoints in the documentation have been updated to
say "fast". This is preparatory work for a follow-up commit that
will add a MODE option to the CHECKPOINT command.
Author: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
The new name more accurately relects the effects of this flag on a
requested checkpoint. Checkpoint-related log messages (i.e., those
controlled by the log_checkpoints configuration parameter) will now
say "flush-unlogged" instead of "flush-all", too. This is
preparatory work for a follow-up commit that will add a
FLUSH_UNLOGGED option to the CHECKPOINT command.
Author: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
Previously, the check_hook functions for max_slot_wal_keep_size and
idle_replication_slot_timeout would incorrectly raise an ERROR for values
set in postgresql.conf during upgrade, even though those values were not
actively used in the upgrade process.
To prevent logical slot invalidation during upgrade, we used to set
special values for these GUCs. Now, instead of relying on those values, we
directly prevent WAL removal and logical slot invalidation caused by
max_slot_wal_keep_size and idle_replication_slot_timeout.
Note: PostgreSQL 17 does not include the idle_replication_slot_timeout
GUC, so related changes were not backported.
BUG #18979
Reported-by: jorsol <jorsol@gmail.com>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed by: vignesh C <vignesh21@gmail.com>
Reviewed by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Backpatch-through: 17, where it was introduced
Discussion: https://postgr.es/m/219561.1751826409@sss.pgh.pa.us
Discussion: https://postgr.es/m/18979-a1b7fdbb7cd181c6@postgresql.org
This commit updates the documentation to clarify that "idle" in
idle_replication_slot_timeout means the replication slot is inactive,
that is, not currently used by any replication connection.
Without this clarification, "idle" could be misinterpreted to mean
that the slot is not advancing or that no data is being streamed,
even if a connection exists.
Back-patch to v18 where idle_replication_slot_timeout was added.
Author: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Gunnar Morling <gunnar.morling@googlemail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CADGJaX_0+FTguWpNSpgVWYQP_7MhoO0D8=cp4XozSQgaZ40Odw@mail.gmail.com
Backpatch-through: 18
Previously, the idle_replication_slot_timeout parameter used minutes
as its unit, based on the assumption that values would typically exceed
one minute in production environments. However, this caused unexpected
behavior: specifying a value below 30 seconds would round down to 0,
effectively disabling the timeout. This could be surprising to users.
To allow finer-grained control and avoid such confusion, this commit changes
the unit of idle_replication_slot_timeout to seconds. Larger values can
still be specified easily using standard time suffixes, for example,
'24h' for 24 hours.
Back-patch to v18 where idle_replication_slot_timeout was added.
Reported-by: Gunnar Morling <gunnar.morling@googlemail.com>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CADGJaX_0+FTguWpNSpgVWYQP_7MhoO0D8=cp4XozSQgaZ40Odw@mail.gmail.com
Backpatch-through: 18
This commit adds a new system view that provides information about
entries in the dynamic shared memory (DSM) registry. Specifically,
it returns the name, type, and size of each entry. Note that since
we cannot discover the size of dynamic shared memory areas (DSAs)
and hash tables backed by DSAs (dshashes) without first attaching
to them, the size column is left as NULL for those.
Bumps catversion.
Author: Florents Tselai <florents.tselai@gmail.com>
Reviewed-by: Sungwoo Chang <swchangdev@gmail.com>
Discussion: https://postgr.es/m/4D445D3E-81C5-4135-95BB-D414204A0AB4%40gmail.com
xmltotext_with_options() did not consider the possibility that
pg_xml_init() could fail --- most likely due to OOM. If that
happened, the already-parsed xmlDoc structure would be leaked.
Oversight in commit 483bdb2af.
Bug: #18981
Author: Dmitry Kovalenko <d.kovalenko@postgrespro.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18981-9bc3c80f107ae925@postgresql.org
Backpatch-through: 16
By default io_uring creates a shared memory mapping for each io_uring
instance, leading to a large number of memory mappings. Unfortunately a large
number of memory mappings slows things down, backend exit is particularly
affected. To address that, newer kernels (6.5) support using user-provided
memory for the memory. By putting the relevant memory into shared memory we
don't need any additional mappings.
On a system with a new enough kernel and liburing, there is no discernible
overhead when doing a pgbench -S -C anymore.
Reported-by: MARK CALLAGHAN <mdcallag@gmail.com>
Reviewed-by: "Burd, Greg" <greg@burd.me>
Reviewed-by: Jim Nasby <jnasby@upgrade.com>
Discussion: https://postgr.es/m/CAFbpF8OA44_UG+RYJcWH9WjF7E3GA6gka3gvH6nsrSnEe9H0NA@mail.gmail.com
Backpatch-through: 18
For an ordered Append or MergeAppend, we need to inject an explicit
sort into any subpath that is not already well enough ordered.
Currently, only explicit full sorts are considered; incremental sorts
are not yet taken into account.
In this patch, for subpaths of an ordered Append or MergeAppend, we
choose to use explicit incremental sort if it is enabled and there are
presorted keys.
The rationale is based on the assumption that incremental sort is
always faster than full sort when there are presorted keys, a premise
that has been applied in various parts of the code. In addition, the
current cost model tends to favor incremental sort as being cheaper
than full sort in the presence of presorted keys, making it reasonable
not to consider full sort in such cases.
No backpatch as this could result in plan changes.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4_V7a2enTR+T3pOY_YZ-FU8ZsFYym2swOz4jNMqmSgyuw@mail.gmail.com
Functions to bootstrap and zero pages in various SLRU callers were
fairly duplicative. We can slash almost two hundred lines with a couple
of simple helpers:
- SimpleLruZeroAndWritePage: Does the equivalent of SimpleLruZeroPage
followed by flushing the page to disk
- XLogSimpleInsertInt64: Does a XLogBeginInsert followed by XLogInsert
of a trivial record whose data is just an int64.
Author: Evgeny Voropaev <evgeny.voropaev@tantorlabs.com>
Reviewed by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/flat/97820ce8-a1cd-407f-a02b-47368fadb14b%40tantorlabs.com
This commit standardizes the output format for LSNs to ensure consistent
representation across various tools and messages. Previously, LSNs were
inconsistently printed as `%X/%X` in some contexts, while others used
zero-padding. This often led to confusion when comparing.
To address this, the LSN format is now uniformly set to `%X/%08X`,
ensuring the lower 32-bit part is always zero-padded to eight
hexadecimal digits.
Author: Japin Li <japinli@hotmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/ME0P300MB0445CA53CA0E4B8C1879AF84B641A@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM
This refactoring is a follow-up of the work done in 5a1dfde833, that
has switched 2PC file names to use FullTransactionIds when written on
disk. This will help with the integration of a follow-up solution
related to the handling of two-phase files during recovery, to address
older defects while reading these from disk after a crash.
This change is useful in itself as it reduces the need to build the
file names from epoch numbers and TransactionIds, because we can use
directly FullTransactionIds from which the 2PC file names are guessed.
So this avoids a lot of back-and-forth between the FullTransactionIds
retrieved from the file names and how these are passed around in the
internal 2PC logic.
Note that the core of the change is the use of a FullTransactionId
instead of a TransactionId in GlobalTransactionData, that tracks 2PC
file information in shared memory. The change in TwoPhaseCallback makes
this commit unfit for stable branches.
Noah has contributed a good chunk of this patch. I have spent some time
on it as well while working on the issues with two-phase state files and
recovery.
Author: Noah Misch <noah@leadboat.com>
Co-Authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/Z5sd5O9JO7NYNK-C@paquier.xyz
Discussion: https://postgr.es/m/20250116205254.65.nmisch@google.com
Attempting to use commit timestamps during bootstrapping leads to an
assertion failure, that can be reached for example with an initdb -c
that enables track_commit_timestamp. It makes little sense to register
a commit timestamp for a BootstrapTransactionId, so let's disable the
activation of the module in this case.
This problem has been independently reported once by each author of this
commit. Each author has proposed basically the same patch, relying on
IsBootstrapProcessingMode() to skip the use of commit_ts during
bootstrap. The test addition is a suggestion by me, and is applied down
to v16.
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Author: Andy Fan <zhihuifan1213@163.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/OSCPR01MB14966FF9E4C4145F37B937E52F5102@OSCPR01MB14966.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/87plejmnpy.fsf@163.com
Backpatch-through: 13
Previously, truncating a temporary relation required scanning the entire
local buffer pool once per relation fork to invalidate buffers. This could
be slow, especially with a large local buffers, as the scan was repeated
multiple times.
A similar issue with regular tables (shared buffers) was addressed in
commit 6d05086c0a by scanning the buffer pool only once for all forks.
This commit applies the same optimization to temporary relations,
improving truncation performance.
Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://postgr.es/m/CAJDiXggNqsJOH7C5co4jA8nDk8vw-=sokyh5s1_TENWnC6Ofcg@mail.gmail.com
If, after removal of useless null-constant arguments, a CoalesceExpr
has exactly one remaining argument, we can just take that argument as
the result, without bothering to wrap a new CoalesceExpr around it.
This isn't likely to produce any great improvement in runtime per se,
but it can lead to better plans since the planner no longer has to
treat the expression as non-strict.
However, there were a few regression test cases that intentionally
wrote COALESCE(x) as a shorthand way of creating a non-strict
subexpression. To avoid ruining the intent of those tests, write
COALESCE(x,x) instead. (If anyone ever proposes de-duplicating
COALESCE arguments, we'll need another iteration of this arms race.
But it seems pretty unlikely that such an optimization would be
worthwhile.)
Author: Maksim Milyutin <maksim.milyutin@tantorlabs.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/8e8573c3-1411-448d-877e-53258b7b2be0@tantorlabs.ru
Previous commits invented timestamp2timestamptz_opt_overflow,
date2timestamp_opt_overflow, and date2timestamptz_opt_overflow
functions to perform non-error-throwing conversions between
datetime types. This patch completes the set by adding
timestamp2date_opt_overflow, timestamptz2date_opt_overflow,
and timestamptz2timestamp_opt_overflow.
In addition, adjust timestamp2timestamptz_opt_overflow so that it
doesn't throw error if timestamp2tm fails, but treats that as an
overflow case. The situation probably can't arise except with an
invalid timestamp value, and I can't think of a way that that would
happen except data corruption. However, it's pretty silly to have a
function whose entire reason for existence is to not throw errors for
out-of-range inputs nonetheless throw an error for out-of-range input.
The new APIs are not used in this patch, but will be needed in
upcoming btree_gin changes.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Discussion: https://postgr.es/m/262624.1738460652@sss.pgh.pa.us
Commits 8319e5cb5 et al missed the fact that ATPostAlterTypeCleanup
contains three calls to ATPostAlterTypeParse, and the other two
also need protection against passing a relid that we don't yet
have lock on. Add similar logic to those code paths, and add
some test cases demonstrating the need for it.
In v18 and master, the test cases demonstrate that there's a
behavioral discrepancy between stored generated columns and virtual
generated columns: we disallow changing the expression of a stored
column if it's used in any composite-type columns, but not that of
a virtual column. Since the expression isn't actually relevant to
either sort of composite-type usage, this prohibition seems
unnecessary; but changing it is a matter for separate discussion.
For now we are just documenting the existing behavior.
Reported-by: jian he <jian.universality@gmail.com>
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: CACJufxGKJtGNRRSXfwMW9SqVOPEMdP17BJ7DsBf=tNsv9pWU9g@mail.gmail.com
Backpatch-through: 13
If certain constraint characteristic clauses (NO INHERIT, NOT VALID, NOT
ENFORCED) are given to CREATE CONSTRAINT TRIGGER, the resulting error
message is
ERROR: TRIGGER constraints cannot be marked NO INHERIT
which is a bit silly, because these aren't "constraints of type
TRIGGER". Hardcode a better error message to prevent it. This is a
cosmetic fix for quite a fringe problem with no known complaints from
users, so no backpatch.
While at it, silently accept ENFORCED if given.
Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAAJ_b97hd-jMTS7AjgU6TDBCzDx_KyuKxG+K-DtYmOieg+giyQ@mail.gmail.com
Discussion: https://postgr.es/m/CACJufxHSp2puxP=q8ZtUGL1F+heapnzqFBZy5ZNGUjUgwjBqTQ@mail.gmail.com
AlterDomainStmt.subtype used characters for its subtypes of commands,
SET|DROP DEFAULT|NOT NULL and ADD|DROP|VALIDATE CONSTRAINT, which were
hardcoded in a couple of places of the code. The code is improved by
using an enum instead, with the same character values as the original
code.
Note that the field was documented in parsenodes.h and that it forgot to
mention 'V' (VALIDATE CONSTRAINT).
Author: Quan Zongliang <quanzongliang@yeah.net>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/41ff310b-16bd-44b9-a3ef-97e20f14b709@yeah.net
The COPY FROM command now accepts a non-negative integer for the HEADER option,
allowing multiple header lines to be skipped. This is useful when the input
contains multi-line headers that should be ignored during data import.
Author: Shinya Kato <shinya11.kato@gmail.com>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp>
Discussion: https://postgr.es/m/CAOzEurRPxfzbxqeOPF_AGnAUOYf=Wk0we+1LQomPNUNtyZGBZw@mail.gmail.com
Currently check_recovery_target_timeline() converts any value that is
not "current", "latest", or a valid integer to 0. So, for example, the
following configuration added to postgresql.conf followed by a startup:
recovery_target_timeline = 'bogus'
recovery_target_timeline = '9999999999'
... results in the following error patterns:
FATAL: 22023: recovery target timeline 0 does not exist
FATAL: 22023: recovery target timeline 1410065407 does not exist
This is confusing, because the server does not reflect the intention of
the user, and just reports incorrect data unrelated to the GUC.
The origin of the problem is that we do not perform a range check in the
GUC value passed-in for recovery_target_timeline. This commit improves
the situation by using strtou64() and by providing stricter range
checks. Some test cases are added for the cases of an incorrect, an
upper-bound and a lower-bound timeline value, checking the sanity of the
reports based on the contents of the server logs.
Author: David Steele <david@pgmasters.net>
Discussion: https://postgr.es/m/e5d472c7-e9be-4710-8dc4-ebe721b62cea@pgbackrest.org
Currently, we do not support Memoize for SEMI and ANTI joins because
nested loop SEMI/ANTI joins do not scan the inner relation to
completion, which prevents Memoize from marking the cache entry as
complete. One might argue that we could mark the cache entry as
complete after fetching the first inner tuple, but that would not be
safe: if the first inner tuple and the current outer tuple do not
satisfy the join clauses, a second inner tuple matching the parameters
would find the cache entry already marked as complete.
However, if the inner side is provably unique, this issue doesn't
arise, since there would be no second matching tuple. That said, this
doesn't help in the case of SEMI joins, because a SEMI join with a
provably unique inner side would already have been reduced to an inner
join by reduce_unique_semijoins.
Therefore, in this patch, we check whether the inner relation is
provably unique for ANTI joins and enable the use of Memoize in such
cases.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48FdLiMNrmJL-g6mDvoQVt0yNyJAqMkv4e2Pk-5GKCZLA@mail.gmail.com
This routine has come as a useful piece to be able to know the list of
injection points currently attached in a system. One area would be to
use it in a set-returning function, or just let out-of-core code play
with it.
This hides the internals of the shared memory array lookup holding the
information about the injection points (point name, library and function
name), allocating the result in a palloc'd List consumable by the
caller.
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Discussion: https://postgr.es/m/Z_xYkA21KyLEHvWR@paquier.xyz
Discussion: https://postgr.es/m/aBG2rPwl3GE7m1-Q@paquier.xyz
Presently, the dynamic shared memory (DSM) registry only provides
GetNamedDSMSegment(), which allocates a fixed-size segment. To use
the DSM registry for more sophisticated things like dynamic shared
memory areas (DSAs) or a hash table backed by a DSA (dshash), users
need to create a DSM segment that stores various handles and LWLock
tranche IDs and to write fairly complicated initialization code.
Furthermore, there is likely little variation in this
initialization code between libraries.
This commit introduces functions that simplify allocating a DSA or
dshash within the DSM registry. These functions are very similar
to GetNamedDSMSegment(). Notable differences include the lack of
an initialization callback parameter and the prohibition of calling
the functions more than once for a given entry in each backend
(which should be trivially avoidable in most circumstances). While
at it, this commit bumps the maximum DSM registry entry name length
from 63 bytes to 127 bytes.
Also note that even though one could presumably detach/destroy the
DSAs and dshashes created in the registry, such use-cases are not
yet well-supported, if for no other reason than the associated DSM
registry entries cannot be removed. Adding such support is left as
a future exercise.
The test_dsm_registry test module contains tests for the new
functions and also serves as a complete usage example.
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Florents Tselai <florents.tselai@gmail.com>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Discussion: https://postgr.es/m/aEC8HGy2tRQjZg_8%40nathan
Restore nbtree preprocessing comments describing how we mark nbtree row
compare members required to how they were prior to 2016 bugfix commit
a298a1e0.
Oversight in commit bd3f59fd, which made nbtree preprocessing revert to
the original 2006 rules, but neglected to revert these comments.
Backpatch-through: 18
The array-based variant of width_bucket() has always accepted NaN
inputs, treating them as equal but larger than any non-NaN,
as we do in ordinary comparisons. But up to now, the four-argument
variants threw errors for a NaN operand. This is inconsistent
and unnecessary, since we can perfectly well regard NaN as falling
after the last bucket.
We do still throw error for NaN or infinity histogram-bound inputs,
since there's no way to compute sensible bucket boundaries.
Arguably this is a bug fix, but given the lack of field complaints
I'm content to fix it in master.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/2822872.1750540911@sss.pgh.pa.us
Trying to alter a constraint so that it becomes NOT VALID results in an
error that assumes the constraint is a foreign key. This is potentially
wrong, so give a more generic error message.
While at it, give CREATE CONSTRAINT TRIGGER a better error message as
well.
Co-authored-by: jian he <jian.universality@gmail.com>
Co-authored-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Co-authored-by: Amul Sul <sulamul@gmail.com>
Discussion: https://postgr.es/m/CACJufxHSp2puxP=q8ZtUGL1F+heapnzqFBZy5ZNGUjUgwjBqTQ@mail.gmail.com
Recent nbtree bugfix commit 5f4d98d4 added a special case to the code
that sets up a page-level prefix of keys that are definitely satisfied
by every tuple on the page: whenever _bt_set_startikey reached a row
compare key, we'd refuse to apply the pstate.forcenonrequired behavior
in scans where that usually happens (scans with a higher-order array
key). That hack made the scan avoid essentially the same infinite
cycling behavior that also affected nbtree scans with redundant keys
(keys that preprocessing could not eliminate) prior to commit f09816a0.
There are now serious doubts about this row compare workaround.
Testing has shown that a scan with a row compare key and an array key
could still read the same leaf page twice (without the scan's direction
changing), which isn't supposed to be possible following the SAOP
enhancements added by Postgres 17 commit 5bf748b8. Also, we still
allowed a required row compare key to be used with forcenonrequired mode
when its header key happened to be beyond the pstate.ikey set by
_bt_set_startikey, which was complicated and brittle.
The underlying problem was that row compares had inconsistent rules
around how scans start (which keys can be used for initial positioning
purposes) and how scans end (which keys can set continuescan=false).
Quals with redundant keys that could not be eliminated by preprocessing
also had that same quality to them prior to today's bugfix f09816a0. It
now seems prudent to bring row compare keys in line with the new charter
for required keys, by making the start and end rules symmetric.
This commit fixes two points of disagreement between _bt_first and
_bt_check_rowcompare. Firstly, _bt_check_rowcompare was capable of
ending the scan at the point where it needed to compare an ISNULL-marked
row compare member that came immediately after a required row compare
member. _bt_first now has symmetric handling for NULL row compares.
Secondly, _bt_first had its own ideas about which keys were safe to use
for initial positioning purposes. It could use fewer or more keys than
_bt_check_rowcompare. _bt_first now uses the same requiredness markings
as _bt_check_rowcompare for this.
Now that _bt_first and _bt_check_rowcompare agree on how to start and
end scans, we can get rid of the forcenonrequired special case, without
any risk of infinite cycling. This approach also makes row compare keys
behave more like regular scalar keys, particularly within _bt_first.
Fixing these inconsistencies necessitates dealing with a related issue
with the way that row compares were marked required by preprocessing: we
didn't mark any lower-order row members required following 2016 bugfix
commit a298a1e0. That approach was over broad. The bug in question was
actually an oversight in how _bt_check_rowcompare dealt with tuple NULL
values that failed to satisfy a scan key marked required in the opposite
scan direction (it was a bug in 2011 commits 6980f817 and 882368e8, not
a bug in 2006 commit 3a0a16cb). Go back to marking row compare members
as required using the original 2006 rules, and fix the 2016 bug in a
more principled way: by limiting use of the "set continuescan=false with
a key required in the opposite scan direction upon encountering a NULL
tuple value" optimization to the first/most significant row member key.
While it isn't safe to use an implied IS NOT NULL qualifier to end the
scan when it comes from a required lower-order row compare member key,
it _is_ generally safe for such a required member key to end the scan --
provided the key is marked required in the _current_ scan direction.
This fixes what was arguably an oversight in either commit 5f4d98d4 or
commit 8a510275. It is a direct follow-up to today's commit f09816a0.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=pcijHL_mA0_TJ5LiTB28QpQ0cGtT-ccFV=KzuunNDDQ@mail.gmail.com
Backpatch-through: 18
nbtree preprocessing's handling of redundant (and contradictory) keys
created problems for scans with = arrays. It was just about possible
for a scan with an = array key and one or more redundant keys (keys that
preprocessing could not eliminate due an incomplete opfamily and a
cross-type key) to get stuck. Testing has shown that infinite cycling
where the scan never manages to make forward progress was possible.
This could happen when the scan's arrays were reset in _bt_readpage's
forcenonrequired=true path (added by bugfix commit 5f4d98d4) when the
arrays weren't at least advanced up to the same point that they were in
at the start of the _bt_readpage call. Earlier redundant keys prevented
the finaltup call to _bt_advance_array_keys from reaching lower-order
keys that needed to be used to sufficiently advance the scan's arrays.
To fix, make preprocessing leave the scan's keys in a state that is as
close as possible to how it'll usually leave them (in the common case
where there's no redundant keys that preprocessing failed to eliminate).
Now nbtree preprocessing _reliably_ leaves behind at most one required
>/>= key per index column, and at most one required </<= key per index
column. Columns that have one or more = keys that are eligible to be
marked required (based on the traditional rules) prioritize the = keys
over redundant inequality keys; they'll _reliably_ be left with only one
of the = keys as the index column's only required key.
Keys that are not marked required (whether due to the new preprocessing
step running or for some other reason) are relocated to the end of the
so->keyData[] array as needed. That way they'll always be evaluated
after the scan's required keys, and so cannot prevent code in places
like _bt_advance_array_keys and _bt_first from reaching a required key.
Also teach _bt_first to decide which initial positioning keys to use
based on the same requiredness markings that have long been used by
_bt_checkkeys/_bt_advance_array_keys. This is a necessary condition for
reliably avoiding infinite cycling. _bt_advance_array_keys expects to
be able to reason about what'll happen in the next _bt_first call should
it start another primitive index scan, by evaluating inequality keys
that were marked required in the opposite-to-scan scan direction only.
Now everybody (_bt_first, _bt_checkkeys, and _bt_advance_array_keys)
will always agree on which exact key will be used on each index column
to start and/or end the scan (except when row compare keys are involved,
which have similar problems not addressed by this commit).
An upcoming commit will finish off the work started by this commit by
harmonizing how _bt_first, _bt_checkkeys, and _bt_advance_array_keys
apply row compare keys to start and end scans.
This fixes what was arguably an oversight in either commit 5f4d98d4 or
commit 8a510275.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=ds4M+3NXMgwxYxqU8MULaLf696_v5g=9WNmWL2=Uo2A@mail.gmail.com
Backpatch-through: 18
The previous minimum was to maintain support for Python 3.5, but we
now require Python 3.6 anyway (commit 45363fca63), so that reason is
obsolete. A small raise to Meson 0.57 allows getting rid of a fair
amount of version conditionals and silences some future-deprecated
warnings.
With the version bump, the following deprecation warnings appeared and
are fixed:
WARNING: Project targets '>=0.57' but uses feature deprecated since '0.55.0': ExternalProgram.path. use ExternalProgram.full_path() instead
WARNING: Project targets '>=0.57' but uses feature deprecated since '0.56.0': meson.build_root. use meson.project_build_root() or meson.global_build_root() instead.
It turns out that meson 0.57.0 and 0.57.1 are buggy for our use, so
the minimum is actually set to 0.57.2. This is specific to this
version series; in the future we won't necessarily need to be this
precise.
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/42e13eb0-862a-441e-8d84-4f0fd5f6def0%40eisentraut.org
Commit c120550edb optimized the vacuuming of relations without
indexes (a.k.a. one-pass strategy) by directly marking dead item IDs
as LP_UNUSED. However, the periodic FSM vacuum was still checking if
dead item IDs had been marked as LP_DEAD when attempting to vacuum the
FSM every VACUUM_FSM_EVERY_PAGES blocks. This condition was never met
due to the optimization, resulting in missed FSM vacuum
opportunities.
This commit modifies the periodic FSM vacuum condition to use the
number of tuples deleted during HOT pruning. This count includes items
marked as either LP_UNUSED or LP_REDIRECT, both of which are expected
to result in new free space to report.
Back-patch to v17 where the vacuum optimization for tables with no
indexes was introduced.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAD21AoBL8m6B9GSzQfYxVaEgvD7-Kr3AJaS-hJPHC+avm-29zw@mail.gmail.com
Backpatch-through: 17
This commit moves all the routines related to the bytea data type into
its own new file, called bytea.c, clearing some of the bloat in
varlena.c. This includes the routines for:
- Input, output, receive and send
- Comparison
- Casts to integer types
- bytea-specific functions
The internals of the routines moved here are unchanged, with one
exception. This comes with a twist in bytea_string_agg_transfn(), where
the call to makeStringAggState() is replaced by the internals of this
routine, still located in varlena.c. This simplifies the move to the
new file by not having to expose makeStringAggState().
Author: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/CAJ7c6TMPVPJ5DL447zDz5ydctB8OmuviURtSwd=PHCRFEPDEAQ@mail.gmail.com
Prior to this patch, every FETCH call would generate a unique queryId
with a different size specified. Depending on the workloads, this could
lead to a significant bloat in pg_stat_statements, as repeatedly calling
a specific cursor would result in a new queryId each time. For example,
FETCH 1 c1; and FETCH 2 c1; would produce different queryIds.
This patch improves the situation by normalizing the fetch size, so as
semantically similar statements generate the same queryId. As a result,
statements like the below, which differ syntactically but have the same
effect, will now share a single queryId:
FETCH FROM c1
FETCH NEXT c1
FETCH 1 c1
In order to do a normalization based on the keyword used in FETCH,
FetchStmt is tweaked with a new FetchDirectionKeywords. This matters
for "howMany", which could be set to a negative value depending on the
direction, and we want to normalize the queries with enough information
about the direction keywords provided, including RELATIVE, ABSOLUTE or
all the ALL variants.
Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0tA6LbHCg2qSS+KuM850BZC_+ZgHV7Ug6BXw22TNyF+MA@mail.gmail.com
The existing code assumed that O_RDONLY is defined as 0, but this is
not required by POSIX and is not true on GNU Hurd. We can avoid
the assumption by relying on O_ACCMODE to mask the fcntl() result.
(Hopefully, all supported platforms define that.)
Author: Michael Banck <mbanck@gmx.net>
Co-authored-by: Samuel Thibault
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6862e8d1.050a0220.194b8d.76fa@mx.google.com
Discussion: https://postgr.es/m/68480868.5d0a0220.1e214d.68a6@mx.google.com
Backpatch-through: 13
Previously, pattern matching and case mapping behavior branched based
on the provider. Refactor to use a method table, which is less
error-prone.
This is also a step toward multiple provider versions, which we may
want to support in the future.
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/2830211e1b6e6a2e26d845780b03e125281ea17b.camel%40j-davis.com
When querying NUMA status of pages in shared memory, we need to touch
the memory first to get valid results. This may trigger valgrind
reports, because some of the memory (e.g. unpinned buffers) may be
marked as noaccess.
Solved by adding a valgrind suppresion. An alternative would be to
adjust the access/noaccess status before touching the memory, but that
seems far too invasive. It would require all those places to have
detailed knowledge of what the shared memory stores.
The pg_numa_touch_mem_if_required() macro is replaced with a function.
Macros are invisible to suppressions, so it'd have to suppress reports
for the caller - e.g. pg_get_shmem_allocations_numa(). So we'd suppress
reports for the whole function, and that seems to heavy-handed. It might
easily hide other valid issues.
Reviewed-by: Christoph Berg <myon@debian.org>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aEtDozLmtZddARdB@msg.df7cb.de
Backpatch-through: 18
This commit fixes some defects in the backend's xml.c, found upon
inspection of the internals of libxml2:
- xmlEncodeSpecialChars() can fail on malloc(), returning NULL back to
the caller. xmltext() assumed that this could never happen. Like other
code paths, a TRY/CATCH block is added there, covering also the fact
that cstring_to_text_with_len() could fail a memory allocation, where
the backend would miss to free the buffer allocated by
xmlEncodeSpecialChars().
- Some libxml2 routines called in xmlelement() can return NULL, like
xmlAddChildList() or xmlTextWriterStartElement(). Dedicated errors are
added for them.
- xml_xmlnodetoxmltype() missed that xmlXPathCastNodeToString() can fail
on an allocation failure. In this case, the call can just be moved to
the existing TRY/CATCH block.
All these code paths would cause the server to crash. As this is
unlikely a problem in practice, no backpatch is done. Jim and I have
caught these defects, not sure who has scored the most. The contrib
module xml2/ has similar defects, which will be addressed in a separate
change.
Reported-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Discussion: https://postgr.es/m/aEEingzOta_S_Nu7@paquier.xyz
This provides a convenient way to look up a database's OID. For
example, the query
SELECT * FROM pg_shdepend
WHERE dbid = (SELECT oid FROM pg_database
WHERE datname = current_database());
can now be simplified to
SELECT * FROM pg_shdepend
WHERE dbid = current_database()::regdatabase;
Like the regrole type, regdatabase has cluster-wide scope, so we
disallow regdatabase constants from appearing in stored
expressions.
Bumps catversion.
Author: Ian Lawrence Barwick <barwick@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/aBpjJhyHpM2LYcG0%40nathan
This commit refactors the vacuum routines that rely on VacuumParams,
adding const markers where necessary to force a new policy in the code.
This structure should not use a pointer as it may be used across
multiple relations, and its contents should never be updated.
vacuum_rel() stands as an exception as it touches the "index_cleanup"
and "truncate" options.
VacuumParams has been introduced in 0d83138974, and 661643deda has
fixed a bug impacting VACUUM operating on multiple relations. The
changes done in tableam.h break ABI compatibility, so this commit can
only happen on HEAD.
Author: Shihao Zhong <zhong950419@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAGRkXqTo+aK=GTy5pSc-9cy8H2F2TJvcrZ-zXEiNJj93np1UUw@mail.gmail.com
In the wake of commit a16ef313f, we need to deal with more cases
involving PlaceHolderVars in NestLoopParams than we did before.
For one thing, a16ef313f was incorrect to suppose that we could
rely on the required-outer relids of the lefthand path to decide
placement of nestloop-parameter PHVs. As Richard Guo argued at
the time, we must look at the required-outer relids of the join
path itself.
For another, we have to apply replace_nestloop_params() to such
a PHV's expression, in case it contains references to values that
will be supplied from NestLoopParams of higher-level nestloops.
For another, we need to be more careful about the phnullingrels
of the PHV than we were being. identify_current_nestloop_params
only bothered to ensure that the phnullingrels didn't contain
"too many" relids, but now it has to be exact, because setrefs.c
will apply both NRM_SUBSET and NRM_SUPERSET checks in different
places. We can compute the correct relids by determining the
set of outer joins that should be able to null the PHV and then
subtracting whatever's been applied at or below this join.
Do the same for plain Vars, too. (This should make it possible
to use NRM_EQUAL to process nestloop params in setrefs.c, but
I won't risk making such a change in v18 now.)
Lastly, if a nestloop parameter PHV was pulled up out of a subquery
and it contains a subquery that was originally pushed down from this
query level, then that will still be represented as a SubLink, because
SS_process_sublinks won't recurse into outer PHVs, so it didn't get
transformed during expression preprocessing in the subquery. We can
substitute the version of the PHV's expression appearing in its
PlaceHolderInfo to ensure that that preprocessing has happened.
(Seems like this processing sequence could stand to be redesigned,
but again, late in v18 development is not the time for that.)
It's not very clear to me why the old have_dangerous_phv join-order
restriction prevented us from seeing the last three of these problems.
But given the lack of field complaints, it must have done so.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18953-1c9883a9d4afeb30@postgresql.org
Sometimes a table's constraint may depend on a column of another
table, so that we have to update the constraint when changing the
referenced column's type. We need to have lock on the constraint's
table to do that. ATPostAlterTypeCleanup believed that this case
was only possible for FOREIGN KEY constraints, but it's wrong at
least for CHECK and EXCLUDE constraints; and in general, we'd
probably need exclusive lock to alter any sort of constraint.
So just remove the contype check and acquire lock for any other
table. This prevents a "you don't have lock" assertion failure,
though no ill effect is observed in production builds.
We'll error out later anyway because we don't presently support
physically altering column types within stored composite columns.
But the catalog-munging is basically all there, so we may as well
make that part work.
Bug: #18970
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Diagnosed-by: jian he <jian.universality@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18970-a7d1cfe1f8d5d8d9@postgresql.org
Backpatch-through: 13
ca307d5cec made CheckPointReplicationSlots() unconditionally call
ReplicationSlotsComputeRequiredLSN(). It causes an assertion trap when
max_replication_slots equals 0. This commit makes
CheckPointReplicationSlots() call ReplicationSlotsComputeRequiredLSN() only
when at least one slot gets its last_saved_restart_lsn updated. That avoids
an assert trap and also saves some cycles when no one slot has
last_saved_restart_lsn updated.
Based on ideas from Dilip Kumar <dilipbalaut@gmail.com> and
Hayato Kuroda <kuroda.hayato@fujitsu.com>.
Reported-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Discussion: https://postgr.es/m/OS0PR01MB5716BB506AF934376FF3A8BB947BA%40OS0PR01MB5716.jpnprd01.prod.outlook.com
Commit 7d6d2c4bbd dropped opcintype from the index AM strategy
translation API. But some error messages about failed lookups still
mentioned it, even though it was not used for the lookup. Fix by
removing ipcintype from the error messages as well.
8e03eb92e9 reverted the commit 39b66a91bd which allowed freezing
in the heap_insert() code path but forgot to remove the corresponding
check in heap_xlog_insert(). This code is extraneous but not harmful.
However, cleaning it up makes it very clear that, as of now, we do not
support any freezing of pages in the heap_insert() path.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/flat/CAAKRu_Zp4Pi-t51OFWm1YZ-cctDfBhHCMZ%3DEx6PKxv0o8y2GvA%40mail.gmail.com
Backpatch-through: 14
We can simplify the VM counters added in dc6acfd910 to
lazy_vacuum_heap_page() and lazy_scan_new_or_empty().
We won't invoke lazy_vacuum_heap_page() unless there are dead line
pointers, so we know the page can't be all-visible.
In lazy_scan_new_or_empty(), we only update the VM if the page-level
hint PD_ALL_VISIBLE is clear, and the VM bit cannot be set if the page
level bit is clear because a subsequent page update would fail to clear
the visibility map bit.
Simplify the logic for determining which log counters to increment based
on this knowledge. Doing so is worthwhile because the old logic was
confusing and misguided.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_a9w_n2mwY%3DG4LjfWTvRTJtjbfvnYAKi4WjO8QXHHrA0g%40mail.gmail.com
Commit 14e87ffa5c introduced support for adding comments to NOT NULL
constraints. However, CREATE TABLE LIKE INCLUDING COMMENTS did not copy
these comments to the new table. This was an oversight in that commit.
This commit corrects the behavior by ensuring CREATE TABLE LIKE to also copy
the comments on NOT NULL constraints when INCLUDING COMMENTS is specified.
Author: Jian He <jian.universality@gmail.com>
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/127debef-e558-4784-9e24-0d5eaf91e2d1@oss.nttdata.com
For the subcommand ALTER COLUMN TYPE of the ALTER TABLE command, the
USING expression may reference virtual generated columns. These
columns must be expanded before the expression is fed through
expression_planner and the expression-execution machinery. Failing to
do so can result in incorrect rewrite decisions, and can also lead to
"ERROR: unexpected virtual generated column reference".
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/b5f96b24-ccac-47fd-9e20-14681b894f36@gmail.com
Just like selecting from a view is exploitable (CVE-2024-7348),
selecting from a table with virtual generated columns is exploitable.
Users who are concerned about this can avoid selecting from views, but
telling them to avoid selecting from tables is less practical.
To address this, this changes it so that generation expressions for
virtual generated columns are restricted to using built-in functions
and types, and the columns are restricted to having a built-in type.
We assume that built-in functions and types cannot be exploited for
this purpose.
In the future, this could be expanded by some new mechanism to declare
other functions and types as safe or trusted for this purpose, but
that is to be designed.
(An alternative approach might have been to expand the
restrict_nonsystem_relation_kind GUC to handle this, like the fix for
CVE-2024-7348. But that is kind of an ugly approach. That fix had to
fit in the constraints of fixing an ancient vulnerability in all
branches. Since virtual generated columns are new, we're free from
the constraints of the past, and we can and should use cleaner
options.)
Reported-by: Feike Steenbergen <feikesteenbergen@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAK_s-G2Q7de8Q0qOYUR%3D_CTB5FzzVBm5iZjOp%2BmeVWpMpmfO0w%40mail.gmail.com
This fixes two issues with the handling of VacuumParams in vacuum_rel().
This code path has the idea to change the passed-in pointer of
VacuumParams for the "truncate" and "index_cleanup" options for the
relation worked on, impacting the two following scenarios where
incorrect options may be used because a VacuumParams pointer is shared
across multiple relations:
- Multiple relations in a single VACUUM command.
- TOAST relations vacuumed with their main relation.
The problem is avoided by providing to the two callers of vacuum_rel()
copies of VacuumParams, before the pointer is updated for the "truncate"
and "index_cleanup" options.
The refactoring of the VACUUM option and parameters done in 0d83138974
did not introduce an issue, but it has encouraged the problem we are
dealing with in this commit, with b84dbc8eb8 for "truncate" and
a96c41feec for "index_cleanup" that have been added a couple of years
after the initial refactoring. HEAD will be improved with a different
patch that hardens the uses of VacuumParams across the tree. This
cannot be backpatched as it introduces an ABI breakage.
The backend portion of the patch has been authored by Nathan, while I
have implemented the tests. The tests rely on injection points to check
the option values, making them faster, more reliable than the tests
originally proposed by Shihao, and they also provide more coverage.
This part can only be backpatched down to v17.
Reported-by: Shihao Zhong <zhong950419@gmail.com>
Author: Nathan Bossart <nathandbossart@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAGRkXqTo+aK=GTy5pSc-9cy8H2F2TJvcrZ-zXEiNJj93np1UUw@mail.gmail.com
Backpatch-through: 13
The logical replication launcher process would sometimes sleep
for as much as 3 minutes before noticing that it is supposed
to launch a new worker. This could happen if
(1) WaitForReplicationWorkerAttach absorbed a process latch wakeup
that was meant to cause ApplyLauncherMain to do work, or
(2) logicalrep_worker_launch reported failure, either because of
resource limits or because the new worker terminated immediately.
In case (2), the expected behavior is that we retry the launch after
wal_retrieve_retry_interval, but that didn't reliably happen.
It's not clear how often such conditions would occur in the field,
but in our subscription test suite they are somewhat common,
especially in tests that exercise cases that cause quick worker
failure. That causes the tests to take substantially longer than
they ought to do on typical setups.
To fix (1), make WaitForReplicationWorkerAttach re-set the latch
before returning if it cleared it while looping. To fix (2), ensure
that we reduce wait_time to no more than wal_retrieve_retry_interval
when logicalrep_worker_launch reports failure. In passing, fix a
couple of perhaps-hypothetical race conditions, e.g. examining
worker->in_use without a lock.
Backpatch to v16. Problem (2) didn't exist before commit 5a3a95385
because the previous code always set wait_time to
wal_retrieve_retry_interval when launching a worker, regardless of
success or failure of the launch. That behavior also greatly
mitigated problem (1), so I'm not excited about adapting the remainder
of the patch to the substantially-different code in older branches.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/817604.1750723007@sss.pgh.pa.us
Backpatch-through: 16
Commit 62d712ecfd made query jumbling squash lists of Consts as a
single element, but there's no reason not to treat PARAM_EXTERN
parameters the same. For these purposes, these values are indeed
constants for any particular execution of a query.
In particular, this should make list squashing more useful for
applications using extended query protocol, which would use parameters
extensively.
A complication arises: if a query has both external parameters and
squashable lists, then the parameter number used as placeholder for the
squashed list might be inconsistent with regards to the parameter
numbers used by the query literal. To reduce the surprise factor, all
parameters are renumbered starting from 1 in that case.
Author: Sami Imseih <samimseih@gmail.com>
Author: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAA5RZ0tRXoPG2y6bMgBCWNDt0Tn=unRerbzYM=oW0syi1=C1OA@mail.gmail.com
There's no principled reason for query jumbling to only remove the first
layer of RelabelType and CoerceViaIO. Change it to see through as many
layers as there are.
ca307d5cec introduced keeping WAL segments by slot's last saved restart LSN.
It also added an assertion that the slot's restart LSN never goes backward.
However, situations when the restart LSN goes backward have been spotted by
buildfarm animals and investigated in the thread.
When pg_receivewal starts the replication, it sets the last replayed LSN to
the beginning of the segment, which is older than what
ReplicationSlotReserveWal() set for the slot. A similar situation can happen
to pg_basebackup. When standby reconnects to the primary, it sends the last
replayed LSN, which might be older than the last confirmed flush LSN. In
both these situations, a concurrent checkpoint may trigger an assert trap.
Based on ideas from Vitaly Davydov <v.davydov@postgrespro.ru>,
Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>,
Vignesh C <vignesh21@gmail.com>,
Amit Kapila <amit.kapila16@gmail.com>.
Reported-by: Vignesh C <vignesh21@gmail.com>
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CALDaNm3s-jpQTe1MshsvQ8GO%3DTLj233JCdkQ7uZ6pwqRVpxAdw%40mail.gmail.com
Reviewed-by: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Specify whether the bucket bounds are inclusive or exclusive,
and improve some other vague language. Explain the behavior that
occurs when the "low" bound is greater than the "high" bound.
Make width_bucket_numeric's comment more like that for
width_bucket_float8, in particular noting that infinite
bounds are rejected (since they became possible in v14).
Reported-by: Ben Peachey Higdon <bpeacheyhigdon@gmail.com>
Author: Robert Treat <rob@xzilla.net>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/2BD74F86-5B89-4AC1-8F13-23CED3546AC1@gmail.com
Backpatch-through: 13
Commit 85e5e222b, which added (a forerunner of) this logic,
argued that
Adding the necessary complexity to make this work doesn't seem like
it would be repaid in significantly better plans, because in cases
where such a PHV exists, there is probably a corresponding join order
constraint that would allow a good plan to be found without using the
star-schema exception.
The flaw in this claim is that there may be other join-order
restrictions that prevent us from finding a join order that doesn't
involve a "dangerous" PHV. In particular we now recognize that
small join_collapse_limit or from_collapse_limit could prevent it.
Therefore, let's bite the bullet and make the case work.
We don't have to extend the executor's support for nestloop parameters
as I thought at the time, because we can instead push the evaluation
of the placeholder's expression into the left-hand input of the
NestLoop node. So there's not really a lot of downside to this
solution, and giving the planner more join-order flexibility should
have value beyond just avoiding failure.
Having said that, there surely is a nonzero risk of introducing
new bugs. Since this failure mode escaped detection for ten years,
such cases don't seem common enough to justify a lot of risk.
Therefore, let's put this fix into master but leave the back branches
alone (for now anyway).
Bug: #18953
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Diagnosed-by: Richard Guo <guofenglinux@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18953-1c9883a9d4afeb30@postgresql.org
While choosing an autogenerated name for an index, look for
pre-existing relations using a SnapshotDirty snapshot, instead of the
previous behavior that considered only committed-good pg_class rows.
This allows us to detect and avoid conflicts against indexes that are
still being built.
It's still possible to fail due to a race condition, but the window
is now just the amount of time that it takes DefineIndex to validate
all its parameters, call smgrcreate(), and enter the index's pg_class
row. Formerly the race window covered the entire time needed to
create and fill an index, which could be very long if the table is
large. Worse, if the conflicting index creation is part of a larger
transaction, it wouldn't be visible till COMMIT.
So this isn't a complete solution, but it should greatly ameliorate
the problem, and the patch is simple enough to be back-patchable.
It might at some point be useful to do the same for pg_constraint
entries (cf. ChooseConstraintName, ConstraintNameExists, and related
functions). However, in the absence of field complaints, I'll leave
that alone for now. The relation-name test should be good enough for
index-based constraints, while foreign-key constraints seem to be okay
since they require exclusive locks to create.
Bug: #18959
Reported-by: Maximilian Chrzan <maximilian.chrzan@here.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/18959-f63b53b864bb1417@postgresql.org
Backpatch-through: 13
Improve the clarity of LOG messages when a failover logical slot
synchronization fails, making the reasons more explicit for easier
debugging.
Update the documentation to outline scenarios where slot synchronization
can fail, especially during the initial sync, and emphasize that
pg_sync_replication_slot() is primarily intended for testing and
debugging purposes.
We also discussed improving the functionality of
pg_sync_replication_slot() so that it can be used reliably, but we would
take up that work for next version after some more discussion and review.
Reported-by: Suraj Kharage <suraj.kharage@enterprisedb.com>
Author: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 17, where it was introduced
Discussion: https://postgr.es/m/CAF1DzPWTcg+m+x+oVVB=y4q9=PYYsL_mujVp7uJr-_oUtWNGbA@mail.gmail.com
Commit 8492feb98f added support for parallel CREATE INDEX on GIN indexes.
However, previously two places in the documentation and two in the source
code comments still stated that only B-tree and BRIN indexes support
parallel builds.
This commit updates those references to correctly include GIN indexes.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Discussion: https://postgr.es/m/7d27d068-90e2-4022-9bd7-09b0fd3d4f47@oss.nttdata.com
This parameter has been introduced in 325f2ec555, and it was not
documented contrary to all the other arguments of
heap_create_with_catalog().
Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Steven Niu <niushiji@gmail.com>
Discussion: https://postgr.es/m/aE--bmEv-gJUTH5v@paquier.xyz
Commit 4909b38af0 introduced logic to distribute invalidation messages
from catalog-modifying transactions to all concurrent in-progress
transactions. However, since each transaction distributes not only its
original invalidation messages but also previously distributed
messages to other transactions, this leads to an exponential increase
in allocation request size for invalidation messages, ultimately
causing memory allocation failure.
This commit fixes this issue by tracking distributed invalidation
messages separately per decoded transaction and not redistributing
these messages to other in-progress transactions. The maximum size of
distributed invalidation messages that one transaction can store is
limited to MAX_DISTR_INVAL_MSG_PER_TXN (8MB). Once the size of the
distributed invalidation messages exceeds this threshold, we
invalidate all caches in locations where distributed invalidation
messages need to be executed.
Back-patch to all supported versions where we introduced the fix by
commit 4909b38af0.
Note that this commit adds two new fields to ReorderBufferTXN to store
the distributed transactions. This change breaks ABI compatibility in
back branches, affecting third-party extensions that depend on the
size of the ReorderBufferTXN struct, though this scenario seems
unlikely.
Additionally, it adds a new flag to the txn_flags field of
ReorderBufferTXN to indicate distributed invalidation message
overflow. This should not affect existing implementations, as it is
unlikely that third-party extensions use unused bits in the txn_flags
field.
Bug: #18938#18942
Author: vignesh C <vignesh21@gmail.com>
Reported-by: Duncan Sands <duncan.sands@deepbluecap.com>
Reported-by: John Hutchins <john.hutchins@wicourts.gov>
Reported-by: Laurence Parry <greenreaper@hotmail.com>
Reported-by: Max Madden <maxmmadden@gmail.com>
Reported-by: Braulio Fdo Gonzalez <brauliofg@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/680bdaf6-f7d1-4536-b580-05c2760c67c6@deepbluecap.com
Discussion: https://postgr.es/m/18942-0ab1e5ae156613ad@postgresql.org
Discussion: https://postgr.es/m/18938-57c9a1c463b68ce0@postgresql.org
Discussion: https://postgr.es/m/CAD1FGCT2sYrP_70RTuo56QTizyc+J3wJdtn2gtO3VttQFpdMZg@mail.gmail.com
Discussion: https://postgr.es/m/CANO2=B=2BT1hSYCE=nuuTnVTnjidMg0+-FfnRnqM6kd23qoygg@mail.gmail.com
Backpatch-through: 13
Sometimes the TupleDesc used in verify_compact_attribute() is shared
among backends, and since CompactAttribute.attcacheoff gets updated
during tuple deformation, it was possible that another backend would
set attcacheoff on a given CompactAttribute in the small window of time
from when the attcacheoff from the live CompactAttribute was being set
in the 'tmp' CompactAttribute and before the Assert verifying that the
live and tmp CompactAttributes matched.
Here we adjust the code to make a copy of the live CompactAttribute so
that we're not trying to Assert against a shared copy of it.
Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/7195e408-758c-4031-8e61-4f842c716ac0@gmail.com
Previously there was no memory barrier enforcing correct memory ordering when
waiting for a free IO handle. However, in the much more common case of waiting
for IO to complete, memory barriers already were present.
On strongly ordered architectures like x86 this had no negative consequences,
but on some armv8 hardware (observed on Apple hardware), it was possible for
the update, in the IO worker, to PgAioHandle->state to become visible before
->distilled_result becoming visible, leading to rather confusing assertion
failures. The failures were rare enough that the bug sometimes took days to
reproduce when running 027_stream_regress in a loop.
Once finally debugged, it was easy enough to come up with a much quicker
repro: Trigger a lot of very fast IO by limiting io_combine_limit to 1 and
ensure that we always have to wait for a free handle by setting
io_max_concurrency to 1. Triggering lots of concurrent seqscans in that setup
triggers the issue within seconds.
One reason this was hard to debug was that the assertion failure most commonly
happened in WaitReadBuffers(), rather than in the AIO subsystem itself. The
assertions added in this commit make problems like this easier to understand.
Also add a comment to the IO worker explaining that we rely on the lwlock
acquisition for correct memory ordering.
I think it'd be good to add a tap test that stress tests buffer IO, but that's
material for a separate patch.
Thanks a lot to Alexander and Konstantin for all the debugging help.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Investigated-by: Andres Freund <andres@anarazel.de>
Investigated-by: Alexander Lakhin <exclusion@gmail.com>
Investigated-by: Konstantin Knizhnik <knizhnik@garret.ru>
Discussion: https://postgr.es/m/2dkz7azclpeiqcmouamdixyn5xhlzy4rvikxrbovyzvi6rnv5c@pz7o7osv2ahf
Our maintenance of typedefs.list has been a little haphazard
(and apparently we can't alphabetize worth a darn). Replace
the file with the authoritative list from our buildfarm, and
run pgindent using that.
I also updated the additions/exclusions lists in pgindent where
necessary to keep pgindent from messing things up significantly.
Notably, now that regex_t and some related names are macros not real
typedefs, we have to whitelist them explicitly. The exclusions list
has also drifted noticeably, presumably due to changes of system
headers on the buildfarm animals that contribute to the list.
Unlike in prior years, I've not manually added typedef names that
are missing from the buildfarm's list because they are not used to
declare any variables or fields. So there are a few places where
the typedef declaration itself is formatted worse than before,
e.g. typedef enum IoMethod. I could preserve the names that were
manually added to the list previously, but I'd really prefer to find
a less manual way of dealing with these cases. A quick grep finds
about 75 such symbols, most of which have never gotten any special
treatment.
Per discussion among pgsql-release, doing this now seems appropriate
even though we're still a week or two away from making the v18 branch.
The new tests verify that logical and physical replication slots are still
valid after an immediate restart on checkpoint completion when the slot was
advanced during the checkpoint.
This commit introduces two new injection points to make these tests possible:
* checkpoint-before-old-wal-removal - triggered in the checkpointer process
just before old WAL segments cleanup;
* logical-replication-slot-advance-segment - triggered in
LogicalConfirmReceivedLocation() when restart_lsn was changed enough to
point to the next WAL segment.
Discussion: https://postgr.es/m/flat/1d12d2-67235980-35-19a406a0%4063439497
Author: Vitaly Davydov <v.davydov@postgrespro.ru>
Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 17
The patch fixes the issue with the unexpected removal of old WAL segments
after checkpoint, followed by an immediate restart. The issue occurs when
a slot is advanced after the start of the checkpoint and before old WAL
segments are removed at the end of the checkpoint.
The patch introduces a new in-memory state for slots: last_saved_restart_lsn,
which is used to calculate the oldest LSN for removing WAL segments. This
state is updated every time with the current restart_lsn at the moment when
the slot is saved to disk.
This fix changes the shared memory layout. It's applied to HEAD only because
we don't have to preserve ABI compatibility during the beta stage. Another
fix that doesn't affect the ABI is committed to back branches.
Discussion: https://postgr.es/m/1d12d2-67235980-35-19a406a0%4063439497
Author: Vitaly Davydov <v.davydov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
_bt_readnextpage expects so->currPos.buf to be InvalidBuffer (and for
the position's page to be unlocked) when called. However, it does not
expect there to be no pins held on any page. In particular, so->markPos
might hold a separate pin, both before and after the call. Fix some
comments that seemed to suggest otherwise.
Follow-up commit to commit 7c319f54, which made _bt_killitems drop pins
it acquired itself.
This is a continuation of 15a79c7311, cleaning up the AIO io_uring
code that has been committed after that while still using %llu.
The code changed here is new in v18, so cleaning things now means less
conflicts if this area of the code changes on backpatch once the 18
stable branch is created.
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/aEZcGCnYFq642q8k@paquier.xyz
The algorithm to squash lists of constants added by commit 62d712ecfd
was a bit too simplistic; we wanted to avoid adding unnecessary
complexity, but cases like direct function calls of typecasting
functions (and others) were missed, and bogus SQL syntax was being shown
in pg_stat_statements normalized query text field. To fix normalization
for those cases, we need the parser to transmit information about were
each list of constant values starts and ends, so add that to a couple of
nodes. Also add a few more test cases to make sure we're doing the
right thing.
The patch initially submitted by Sami added a new private struct in
gram.y to carry the start/end information for A_Expr, but I (Álvaro)
decided that a better fix was to remove the parser indirection via the
in_expr production, and instead create separate components in the a_expr
rule. I'm surprised that this works and doesn't require more changes,
but I assume (without checking) that the grammar used to be more complex
and got simplified at some point.
Bump catversion.
Author: Sami Imseih <samimseih@gmail.com>
Author: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAA5RZ0tRXoPG2y6bMgBCWNDt0Tn=unRerbzYM=oW0syi1=C1OA@mail.gmail.com
This commit reverts the two following commits:
- 499edb0974, track more precisely query locations for nested
statements.
- 06450c7b8c, a follow-up fix of 499edb0974 with query locations.
The test introduced in this commit is not reverted. This is proving
useful to track a problem that only pgaudit was able to detect.
These prove to have issues with the tracking of SELECT statements, when
these use multiple parenthesis which is something supported by the
grammar. Incorrect location and lengths are causing pg_stat_statements
to become confused, failing its job in query normalization with
potential out-of-bound writes because the location and the length may
not match with what can be handled. A lot of the query patterns
discussed when this issue was reported have no test coverage in the main
regression test suite, or the recovery test 027_stream_regress.pl would
have caught the problems as pg_stat_statements is loaded by the node
running the regression tests. A first step would be to improve the test
coverage to stress more the query normalization logic.
A different portion of this work was done in 45e0ba30fc, with the
addition of tests for nested queries. These can be left in the tree.
They are useful to track the way inner queries are currently tracked by
PGSS with non-top-level entries, and will be useful when reconsidering
in the future the work reverted here.
Reported-by: Alexander Kozhemyakin <a.kozhemyakin@postgrespro.ru>
Discussion: https://postgr.es/m/18947-cdd2668beffe02bf@postgresql.org
Teach nbtree's _bt_killitems to leave the so->currPos page that it sets
LP_DEAD items on in whatever state it was in when _bt_killitems was
called. In particular, make sure that so->dropPin scans don't acquire a
pin whose reference is saved in so->currPos.buf.
Allowing _bt_killitems to change so->currPos.buf like this is wrong.
The immediate consequence of allowing it is that code in _bt_steppage
(that copies so->currPos into so->markPos) will behave as if the scan is
a !so->dropPin scan. so->markPos will therefore retain the buffer pin
indefinitely, even though _bt_killitems only needs to acquire a pin
(along with a lock) for long enough to mark known-dead items LP_DEAD.
This issue came to light following a report of a failure of an assertion
from recent commit e6eed40e. The test case in question involves the use
of mark and restore. An initial call to _bt_killitems takes place that
leaves so->currPos.buf in a state that is inconsistent with the scan
being so->dropPin. A subsequent call to _bt_killitems for the same
position (following so->currPos being saved in so->markPos, and then
restored as so->currPos) resulted in the failure of an assertion that
tests that so->currPos.buf is InvalidBuffer when the scan is so->dropPin
(non-assert builds got a "resource was not closed" WARNING instead).
The same problem exists on earlier releases, though the issue is far
more subtle there. Recent commit e6eed40e introduced the so->dropPin
field as a partial replacement for testing so->currPos.buf directly.
Earlier releases won't get an assertion failure (or buffer pin leak),
but they will allow the second _bt_killitems call from the test case to
behave as if a buffer pin was consistently held since the original call
to _bt_readpage. This is wrong; there will have been an initial window
during which no pin was held on the so->currPos page, and yet the second
_bt_killitems call will neglect to check if so->currPos.lsn continues to
match the page's now-current LSN.
As a result of all this, it's just about possible that _bt_killitems
will set the wrong items LP_DEAD (on release branches). This could only
happen with merge joins (the sole user of nbtree mark/restore support),
when a concurrently inserted index tuple used a recently-recycled TID
(and only when the new tuple was inserted onto the same page as a
distinct concurrently-removed tuple with the same TID). This is exactly
the scenario that _bt_killitems' check of the page's now-current LSN
against the LSN stashed in currPos was supposed to prevent.
A follow-up commit will make nbtree completely stop conditioning whether
or not a position's pin needs to be dropped on whether the 'buf' field
is set. All call sites that might need to drop a still-held pin will be
taught to rely on the scan-level so->dropPin field recently introduced
by commit e6eed40e. That will make bugs of the same general nature as
this one impossible (or make them much easier to detect, at least).
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/545be1e5-3786-439a-9257-a90d30f8b849@gmail.com
Backpatch-through: 13
Traditionally, libpq's pqPutMsgEnd has rounded down the amount-to-send
to be a multiple of 8K when it is eagerly writing some data. This
still seems like a good idea when sending through a Unix socket, as
pipes typically have a buffer size of 8K or some fraction/multiple of
that. But there's not much argument for it on a TCP connection, since
(a) standard MTU values are not commensurate with that, and (b) the
kernel typically applies its own packet splitting/merging logic.
Worse, our SSL and GSSAPI code paths both have API stipulations that
if they fail to send all the data that was offered in the previous
write attempt, we mustn't offer less data in the next attempt; else
we may get "SSL error: bad length" or "GSSAPI caller failed to
retransmit all data needing to be retried". The previous write
attempt might've been pqFlush attempting to send everything in the
buffer, so pqPutMsgEnd can't safely write less than the full buffer
contents. (Well, we could add some more state to track exactly how
much the previous write attempt was, but there's little value evident
in such extra complication.) Hence, apply the round-down only on
AF_UNIX sockets, where we never use SSL or GSSAPI.
Interestingly, we had a very closely related bug report before,
which I attempted to fix in commit d053a879b. But the test case
we had then seemingly didn't trigger this pqFlush-then-pqPutMsgEnd
scenario, or at least we failed to recognize this variant of the bug.
Bug: #18907
Reported-by: Dorjpalam Batbaatar <htgn.dbat.95@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18907-d41b9bcf6f29edda@postgresql.org
Backpatch-through: 13
We concluded that commit e5a3c9d9b is a feature rather than a fix; since
it was added after feature freeze, revert it.
Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reported-by: Michael Paquier <michael@paquier.xyz>
Reported-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/ed2296f1-1a6b-4932-b870-5bb18c2591ae%40oss.nttdata.com
Commit 5fe08c006c fixed this for calls to dshash_create(). This
commit fixes calls to dshash_attach() and dsa_create_in_place().
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aECi_gSD9JnVWQ8T%40nathan
Delay calling BufferGetLSNAtomic() until we finish reading a page that
actually contains items that btgettuple will return to the executor.
This reduces the number of calls during plain index scans (we'll only
call BufferGetLSNAtomic() when _bt_readpage returns true), and totally
eliminates calls during index-only scans, bitmap index scans, and plain
index scans of an unlogged relation.
Currently, when checksums (or wal_log_hints) are enabled, acquiring a
page's LSN in BufferGetLSNAtomic() involves locking the buffer header
(which involves the use of spinlocks). Testing has shown that enabling
page-level checksums causes large regressions with certain workloads,
especially on larger multi-socket systems.
The regression isn't tied to any Postgres 18 commit. However, Postgres
18 commit 04bec894 made initdb use checksums by default, so it seems
prudent to address the problem now.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/941f0190-e3c6-4622-9ac7-c04e936e5fdb@vondra.me
Discussion: https://postgr.es/m/CAH2-Wzk-Dg5XWs_jDuiHt4_7ryrSY+n=vxmHY51EVqPDFsKXmg@mail.gmail.com
Use of a RowCompare key makes nbtree index scans ineligible to use
pstate.forcenonrequired following recent bugfix commit 5f4d98d4.
There's no longer any need for _bt_check_rowcompare to accept a
forcenonrequired argument, so remove it.
Similar to commit cc733ed164: when an unenforced foreign key that
references a partitioned table is altered to be enforced, we scan
the constrained table based on each partition on the referenced
partitioned table. This is bogus and likely to cause the ALTER TABLE to
fail: we must only scan the constrained table as pointing to the
top-level partitioned table. Oversight in commit eec0040c4b. Fix by
eliding those scans.
Author: Amul Sul <sulamul@gmail.com>
Reported-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxF1e_gPOLtsDoaE4VCgQPC8KZW_kPAjPR5Rvv4Ew=fb2A@mail.gmail.com
Validating an unvalidated foreign key that references a partitioned
table would try to queue validations for each individual partition of
the referenced table, but this is wrong: each individual partition would
not necessarily have all the referenced rows, so errors would be raised.
Avoid doing that. The pg_constraint rows that cause this to happen are
only there to support the action triggers that implement the DELETE/
UPDATE actions of the FK, so no validating scan is necessary.
This was an oversight in commit b663b9436e.
An equivalent oversight exists for NOT ENFORCED constraints, which is
not fixed in this commit.
Author: Amul Sul <sulamul@gmail.com>
Reported-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/26983.1748418675@localhost
This commit replaces the formula used for "TotalProcs" with a call to
pgaio_uring_procs() in pgaio_uring_shmem_init() for the shared memory
initialization, which is exactly the same, removing a duplication.
pgaio_uring_procs() is used for shared memory sizing and a sanity check,
and it has some documentation explaining some reasoning behind the
formula.
Author: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/ME0P300MB044521067A1EDDA9EDEC3793B66DA@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM
Commit 4f7f7b0375 implemented the extension_control_path GUC, and to
make it work it was decided that we should strip the $libdir/ on
module_pathname from .control files, so that extensions don't need to
worry about this change.
This strip logic was implemented on expand_dynamic_library_name()
which works fine when executing the SQL functions from extensions, but
this function is also called when the LOAD command is executed, and
since the user may explicitly pass the $libdir prefix on LOAD
parameter, we should not strip in this case.
This commit fixes this issue by moving the strip logic from
expand_dynamic_library_name() to load_external_function() that is
called when the running the SQL script from extensions.
Reported-by: Evan Si <evsi@amazon.com>
Author: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Bug: #18920
Discussion: https://www.postgresql.org/message-id/flat/18920-b350b1c0a30af006%40postgresql.org
This commit renames the GUC log_lock_failure to log_lock_failures
to align with the existing similar setting log_lock_waits, which uses
the plural form. This improves naming consistency across related GUCs.
Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Author: Fujii Masao <masao.fujii@gmail.com
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/7a8198b6-d5b8-4910-b41e-8d3efcbb015d@eisentraut.org
We store values for these options as array elements with the syntax
"name=value", hence a name containing "=" confuses matters when
it's time to read the array back in. Since validation of the
options is often done (long) after this conversion to array format,
that leads to confusing and off-point error messages. We can
improve matters by rejecting names containing "=" up-front.
(Probably a better design would have involved pairs of array
elements, but it's too late now --- and anyway, there's no
evident use-case for option names like this. We already
reject such names in some other contexts such as GUCs.)
Reported-by: Chapman Flack <jcflack@acm.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chapman Flack <jcflack@acm.org>
Discussion: https://postgr.es/m/6830EB30.8090904@acm.org
Backpatch-through: 13
052026c9b9 mistakenly reordered setup steps in heap_vacuum_rel(),
incorrectly moving RelationGetNumberOfBlocks() before
vacuum_get_cutoffs().
OldestXmin must be determined before RelationGetNumberOfBlocks()
calculates the number of blocks in the relation that will be vacuumed.
Otherwise tuples older than OldestXmin may be inserted into the end of
the relation into blocks that are not vacuumed. If additional tuples
newer than those inserted into unscanned blocks but older than
OldestXmin are inserted into free space earlier in the relation, the
result could be advancing pg_class.relfrozenxid to a newer value than an
unfrozen XID in one of the unscanned heap pages.
Assigning an incorrect relfrozenxid can lead to data loss, so it is
imperative that it correctly reflect the oldest unfrozen xid.
Reported-by: Peter Geoghegan <pg@bowt.ie>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzntqvVEdbbpqG5JqSZGuLWmy4PBfUO-OswfivKchr2gvw%40mail.gmail.com
Commit 7406ab623f added a gist support function that we internally
refer to by the symbol GIST_STRATNUM_PROC. This translated from
"well-known" strategy numbers to opfamily-specific strategy numbers.
However, we later (commit 630f9a43ce) changed this to fit into
index-AM-level compare type mapping, so this function actually now
maps from compare type to opfamily-specific strategy numbers. So this
name is no longer fitting.
Moreover, the index AM level also supports the opposite, a function to
map from strategy number to compare type. This is currently not
supported in gist, but one might wonder what this function is supposed
to be called when it is added.
This patch changes the naming of the gist-level functionality to be
more in line with the index-AM-level functionality. This makes sense
because these are essentially the same thing on different levels.
This also changes the names of the externally visible functions that
are provided for use as such a support function.
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://www.postgresql.org/message-id/37ebb1d9-9036-485f-a215-e55435689917%40eisentraut.org
A cascading WAL sender doing logical decoding (as known as doing its
work on a standby) has been using as flush LSN the value returned by
GetStandbyFlushRecPtr() (last position safely flushed to disk). This is
incorrect as such processes are only able to decode changes up to the
LSN that has been replayed by the startup process.
This commit changes cascading logical WAL senders to use the replay LSN,
as returned by GetXLogReplayRecPtr(). This distinction is important
particularly during shutdown, when WAL senders need to send any
remaining available data to their clients, switching WAL senders to a
caught-up state. Using the latest flush LSN rather than the replay LSN
could cause the WAL senders to be stuck in an infinite loop preventing
them to shut down, as the startup process does not run when WAL senders
attempt to catch up, so they could keep waiting for work that would
never happen.
Backpatch down to v16, where logical decoding on standbys has been
introduced.
Author: Alexey Makhmutov <a.makhmutov@postgrespro.ru>
Reviewed-by: Ajin Cherian <itsajin@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/52138028-7246-421c-9161-4fa108b88070@postgrespro.ru
Backpatch-through: 16
Previously, postgres_fdw always 1) opened a remote transaction in READ
WRITE mode even when the local transaction was READ ONLY, causing a READ
ONLY transaction using it that references a foreign table mapped to a
remote view executing a volatile function to write in the remote side,
and 2) opened the remote transaction in NOT DEFERRABLE mode even when
the local transaction was DEFERRABLE, causing a SERIALIZABLE READ ONLY
DEFERRABLE transaction using it to abort due to a serialization failure
in the remote side.
To avoid these, modify postgres_fdw to open a remote transaction in the
same access/deferrable modes as the local transaction. This commit also
modifies it to open a remote subtransaction in the same access mode as
the local subtransaction.
Although these issues exist since the introduction of postgres_fdw,
there have been no reports from the field. So it seems fine to just fix
them in master only.
Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAPmGK16n_hcUUWuOdmeUS%2Bw4Q6dZvTEDHb%3DOP%3D5JBzo-M3QmpQ%40mail.gmail.com
When a MERGE's target table is the parent of an inheritance tree, any
INSERT actions insert into the parent table using ModifyTableState's
rootResultRelInfo. However, there are two bugs in the way is
initialized:
1. ExecInitMerge() incorrectly uses a different ResultRelInfo entry
from ModifyTableState's resultRelInfo array to build the insert
projection, which may not be compatible with rootResultRelInfo.
2. ExecInitModifyTable() does not fully initialize rootResultRelInfo.
Specifically, ri_WithCheckOptions, ri_WithCheckOptionExprs,
ri_returningList, and ri_projectReturning are not initialized.
This can lead to crashes, or incorrect query results due to failing to
check WCO's or process the RETURNING list for INSERT actions.
Fix both these bugs in ExecInitMerge(), noting that it is only
necessary to fully initialize rootResultRelInfo if the MERGE has
INSERT actions and the target table is a plain inheritance parent.
Backpatch to v15, where MERGE was introduced.
Reported-by: Andres Freund <andres@anarazel.de>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/4rlmjfniiyffp6b3kv4pfy4jw3pciy6mq72rdgnedsnbsx7qe5@j5hlpiwdguvc
Backpatch-through: 15
uint64 was chosen to be consistent with the type used by the query ID,
but the conclusion of a recent discussion for the query ID is that int64
is a better fit as the signed form is shown to the user, for PGSS or
EXPLAIN outputs.
This commit changes the plan ID to use int64, following c3eda50b06
that has done the same for the query ID.
The plan ID is new to v18, introduced in 2a0cd38da5.
Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/aCvzJNwetyEI3Sgo@paquier.xyz
A few places that access system catalogs don't set up an active
snapshot before potentially accessing their TOAST tables. To fix,
push an active snapshot just before each section of code that might
require accessing one of these TOAST tables, and pop it shortly
afterwards. While at it, this commit adds some rather strict
assertions in an attempt to prevent such issues in the future.
Commit 16bf24e0e4 recently removed pg_replication_origin's TOAST
table in order to fix the same problem for that catalog. On the
back-branches, those bugs are left in place. We cannot easily
remove a catalog's TOAST table on released major versions, and only
replication origins with extremely long names are affected. Given
the low severity of the issue, fixing older versions doesn't seem
worth the trouble of significantly modifying the patch.
Also, on v13 and v14, the aforementioned strict assertions have
been omitted because commit 2776922201, which added
HaveRegisteredOrActiveSnapshot(), was not back-patched. While we
could probably back-patch it now, I've opted against it because it
seems unlikely that new TOAST snapshot issues will be introduced in
the oldest supported versions.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/18127-fe54b6a667f29658%40postgresql.org
Discussion: https://postgr.es/m/18309-c0bf914950c46692%40postgresql.org
Discussion: https://postgr.es/m/ZvMSUPOqUU-VNADN%40nathan
Backpatch-through: 13
Our GSSAPI code only allows packet sizes up to 16kB. However it
emerges that during authentication, larger packets might be needed;
various authorities suggest 48kB or 64kB as the maximum packet size.
This limitation caused login failure for AD users who belong to many
AD groups. To add insult to injury, we gave an unintelligible error
message, typically "GSSAPI context establishment error: The routine
must be called again to complete its function: Unknown error".
As noted in code comments, the 16kB packet limit is effectively a
protocol constant once we are doing normal data transmission: the
GSSAPI code splits the data stream at those points, and if we change
the limit then we will have cross-version compatibility problems
due to the receiver's buffer being too small in some combinations.
However, during the authentication exchange the packet sizes are
not determined by us, but by the underlying GSSAPI library. So we
might as well just try to send what the library tells us to.
An unpatched recipient will fail on a packet larger than 16kB,
but that's not worse than the sender failing without even trying.
So this doesn't introduce any meaningful compatibility problem.
We still need a buffer size limit, but we can easily make it be
64kB rather than 16kB until transport negotiation is complete.
(Larger values were discussed, but don't seem likely to add
anything.)
Reported-by: Chris Gooch <cgooch@bamfunds.com>
Fix-suggested-by: Jacob Champion <jacob.champion@enterprisedb.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Discussion: https://postgr.es/m/DS0PR22MB5971A9C8A3F44BCC6293C4DABE99A@DS0PR22MB5971.namprd22.prod.outlook.com
Backpatch-through: 13
Previously, XactLockTableWait() and ConditionalXactLockTableWait() could enter
a non-interruptible loop when they successfully acquired a lock on a transaction
but the transaction still appeared to be running. Since this loop continued
until the transaction completed, it could result in long, uninterruptible waits.
Although this scenario is generally unlikely since XactLockTableWait() and
ConditionalXactLockTableWait() can basically acquire a transaction lock
only when the transaction is not running, it can occur in a hot standby.
In such cases, the transaction may still appear active due to
the KnownAssignedXids list, even while no lock on the transaction exists.
For example, this situation can happen when creating a logical replication
slot on a standby.
The cause of the non-interruptible loop was the absence of CHECK_FOR_INTERRUPTS()
within it. This commit adds CHECK_FOR_INTERRUPTS() to the loop in both functions,
ensuring they can be interrupted safely.
Back-patch to all supported branches.
Author: Kevin K Biju <kevinkbiju@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAM45KeELdjhS-rGuvN=ZLJ_asvZACucZ9LZWVzH7bGcD12DDwg@mail.gmail.com
Backpatch-through: 13
uint64 was perhaps chosen in cff440d36 as the type was uint32 prior to
that widening work.
Having this as uint64 doesn't make much sense and just adds the overhead of
having to remember that we always output this in its signed form. Let's
remove that overhead.
The signed form output is seemingly required since we have no way to
represent the full range of uint64 in an SQL type. We use BIGINT in places
like pg_stat_statements, which maps directly to int64.
The release notes "Source Code" section may want to mention this
adjustment as some extensions may wish to adjust their code.
Author: David Rowley <dgrowleyml@gmail.com>
Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/50cb0c8b-994b-48f9-a1c4-13039eb3536b@eisentraut.org
ParseFraction only expects to deal with fields that contain a decimal
point and digit(s). However it's possible in some edge cases for it
to be passed input that doesn't look like that. In particular the
input could look like a valid floating-point number, such as ".123e6".
strtod() will happily eat that, possibly producing a result that is
not within the expected range 0..1, which can result in integer
overflow in the callers. That doesn't have any security consequences,
but it's still not very desirable. Fix by checking that the input
has the expected form.
Similarly, DecodeNumberField only expects to deal with fields that
contain a decimal point and digit(s), but it's sometimes abused to
parse strings that might not look like that. This could result in
failure to reject bogus input, yielding silly results. Again, fix
by rejecting input that doesn't look as-expected. That decision
also means that we can affirmatively answer the very old comment
questioning whether we couldn't save some duplicative code by
using ParseFractionalSecond here.
While these changes should only reject input that nobody would
consider valid, it still doesn't seem like a change to make in
stable branches. Apply to HEAD only.
Reported-by: Evgeniy Gorbanev <gorbanev.es@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1328335.1748371099@sss.pgh.pa.us
In pl_comp.c, initially create the plpgsql function's cache context
under the assumed-short-lived caller's context, and reparent it under
CacheMemoryContext only upon success. This avoids a process-lifespan
leak of 8kB or more if the function contains syntax errors. (This
leakage has existed for a long time without many complaints, but as
we move towards a possibly multi-threaded future, getting rid of
process-lifespan leaks grows more important.)
In funccache.c, arrange to reclaim the CachedFunction struct in case
the language-specific compile callback function throws an error;
previously, that resulted in an independent process-lifespan leak.
This is arguably a new bug in v18, since the leakage now occurred
for SQL-language functions as well as plpgsql.
Also, don't fill fn_xmin/fn_tid/dcallback until after successful
completion of the compile callback. This avoids a scenario where a
partially-built function cache might appear already valid upon later
inspection, and another scenario where dcallback might fail upon being
presented with an incomplete cache entry. We would have to reach such
a faulty cache entry via a pre-existing fn_extra pointer, so I'm not
sure these scenarios correspond to any live bug. (The predecessor
code in pl_comp.c never took any care about this, and we've heard no
complaints about that.) Still, it's better to be careful.
Given the lack of field complaints, I'm not very excited about
back-patching any of this; but it seems still in-scope for v18.
Discussion: https://postgr.es/m/999171.1748300004@sss.pgh.pa.us
The code that translates SIMILAR TO pattern matching expressions to
POSIX-style regular expressions did not consider that square brackets
can be nested. For example, in an expression like [[:alpha:]%_], the
logic replaced the placeholders '_' and '%' but it should not.
This commit fixes the conversion logic by tracking the nesting level of
square brackets marking character class areas, while considering that
in expressions like []] or [^]] the first closing square bracket is a
regular character. Multiple tests are added to show how the conversions
should or should not apply applied while in a character class area, with
specific cases added for all the characters converted outside character
classes like an opening parenthesis '(', dollar sign '$', etc.
Author: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/16ab039d1af455652bdf4173402ddda145f2c73b.camel@cybertec.at
Backpatch-through: 13
Previously, we asserted that the eager scan's success and failure
counters were positive before decrementing them. However, this
assumption was incorrect, as it's possible that some blocks have
already been eagerly scanned by the time eager scanning is disabled.
This commit replaces the assertions with guards to handle this
scenario gracefully.
With this change, we continue to allow read-ahead operations by the
read stream that exceed the success and failure caps. While there is a
possibility that overruns will trigger eager scans of additional
pages, this does not pose a practical concern as the overruns will not
be substantial and remain within an acceptable range.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAD21AoConf6tkVCv-=JhQJj56kYsDwo4jG5+WqgT+ukSkYomSQ@mail.gmail.com
PgStat_StatTabEntry and AutoVacOpts structs were leaked until
the end of the autovacuum worker's run, which is bad news if
there are a lot of relations in the database.
Note: pfree'ing the PgStat_StatTabEntry structs here seems a bit
risky, because pgstat_fetch_stat_tabentry_ext does not guarantee
anything about whether its result is long-lived. It appears okay
so long as autovacuum forces PGSTAT_FETCH_CONSISTENCY_NONE, but
I think that API could use a re-think.
Also ensure that the VacuumRelation structure passed to
vacuum() is in recoverable storage.
Back-patch to v15 where we started to manage table statistics
this way. (The AutoVacOpts leakage is probably older, but
I'm not excited enough to worry about just that part.)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
Backpatch-through: 15
If the inner allocation call returns NULL, we should restore the
previous state and return NULL. Previously this code pfree'd
the old chunk anyway, which is surely wrong.
Also, make it call MemoryContextAllocationFailure rather than
summarily returning NULL. The fact that we got control back from the
inner call proves that MCXT_ALLOC_NO_OOM was passed, so this change
is just cosmetic, but someday it might be less so.
This is just a latent bug at present: AFAICT no in-core callers use
this function at all, let alone call it with MCXT_ALLOC_NO_OOM.
Still, it's the kind of bug that might bite back-patched code pretty
hard someday, so let's back-patch to v17 where the bug was introduced
(by commit 743112a2e).
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
Backpatch-through: 17
Move oauth_validator_libraries in postgresql.conf.sample to be grouped
with the other CONN_AUTH_AUTH settings, rather than making up a new
ad-hoc category. This matches the internal categorization and also
how it is listed in the documentation.
9219093cab modularized log_connections output to allow more
granular control over which aspects of connection establishment are
logged. It converted the boolean log_connections GUC into a list of strings
and deprecated previously supported boolean-like values on, off, true,
false, 1, 0, yes, and no. Those values still work, but they are
supported mainly for backwards compatability. As such, documented
examples of log_connections should not use these deprecated values.
Update references in the docs to deprecated log_connections values. Many
of the tests use log_connections. This commit also updates the tests to
use the new values of log_connections. In some of the tests, the updated
log_connections value covers a narrower set of aspects (e.g. the
'authentication' aspect in the tests in src/test/authentication and the
'receipt' aspect in src/test/postmaster). In other cases, the new value
for log_connections is a superset of the previous included aspects (e.g.
'all' in src/test/kerberos/t/001_auth.pl).
Reported-by: Peter Eisentraut <peter@eisentraut.org>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Discussion: https://postgr.es/m/e1586594-3b69-4aea-87ce-73a7488cdc97%40eisentraut.org
The code carelessly modified mtstate->ps.plan->targetlist,
which it's not supposed to do. Fortunately, there's not
really any need to do that because the planner already
set up a perfectly acceptable targetlist for the plan node.
We just need to remove the erroneous assignments and update some
relevant comments.
As it happens, the erroneous assignments caused the targetlist to
point to a different part of the source plan tree, so that there
isn't really a risk of the pointer becoming dangling after executor
termination. The only visible effect of this change we can find is
that EXPLAIN will show upper references to the ModifyTable's output
expressions using different variables. Formerly it showed Vars from
the first target relation that survived executor-startup pruning.
Now it always shows such references using the first relation appearing
in the planner output, independently of what happens during executor
pruning. On the whole that seems like a good thing.
Also make a small tweak in ExplainPreScanNode to ensure that the first
relation will receive a refname assignment in set_rtable_names, even
if it got pruned at startup. Previously the Vars might be shown
without any table qualification, which is confusing in a multi-table
query.
I considered back-patching this, but since the bug doesn't seem to
have any really terrible consequences in existing branches, it
seems better to not change their EXPLAIN output. It's not too late
for v18 though, especially since v18 already made other changes in
the EXPLAIN output for these cases.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Andres Freund <andres@anarazel.de>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/213261.1747611093@sss.pgh.pa.us
xmltotext_with_options sometimes tries to replace the existing
root node of a libxml2 document. In that case xmlDocSetRootElement
will unlink and return the old root node; if we fail to free it,
it's leaked for the remainder of the session. The amount of memory
at stake is not large, a couple hundred bytes per occurrence, but
that could still become annoying in heavy usage.
Our only other xmlDocSetRootElement call is not at risk because
it's working on a just-created document, but let's modify that
code too to make it clear that it's dependent on that.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Discussion: https://postgr.es/m/1358967.1747858817@sss.pgh.pa.us
Backpatch-through: 16
As pointed out by Tom Lane, the patch introduced fragile and invasive
design around plan invalidation handling when locking of prunable
partitions was deferred from plancache.c to the executor. In
particular, it violated assumptions about CachedPlan immutability and
altered executor APIs in ways that are difficult to justify given the
added complexity and overhead.
This also removes the firstResultRels field added to PlannedStmt in
commit 28317de72, which was intended to support deferred locking of
certain ModifyTable result relations.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/605328.1747710381@sss.pgh.pa.us
pg_aios used the terms "read" and "write" for vectored I/O read and
write operations, respectively. The documentation refers to them as
"readv" and "writev", and the code uses internally the terms
PGAIO_OP_READV and PGAIO_OP_WRITEV for them, as of "vectored".
This commit adjusts these operation names to match with the code and the
documentation.
Oversight in 8e293e689b.
Author: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Discussion: https://postgr.es/m/6df1e949d1d759ad2767c18e5845963e@oss.nttdata.com
Since commit 8b1dccd37c, the PREPARE TRANSACTION WAL record includes
information about dropped statistics entries. However, the WAL resource
manager description function for PREPARE TRANSACTION record failed to
parse this information correctly and always assumed there were
no such entries.
As a result, for example, pg_waldump could not display the dropped
statistics entries stored in PREPARE TRANSACTION records.
The root cause was that ParsePrepareRecord() did not set the number of
statistics entries to drop on commit or abort. These values remained
zero-initialized and were never updated from the parsed record.
This commit fixes the issue by properly setting those values during parsing.
With this fix, pg_waldump can now correctly report dropped statistics
entries in PREPARE TRANSACTION records.
Back-patch to v15, where commit 8b1dccd37c was introduced.
Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAJDiXgh-6Epb2XiJe4uL0zF-cf0_s_7Lw1TfEHDMLzYjEmfGOw@mail.gmail.com
Backpatch-through: 15
The statement location calculated for some nested query cases was wrong
when multiple queries are sent as a single string, these being separated
by semicolons. As pointed by Sami Imseih, the location calculation was
incorrect when the last query of nested statement with multiple queries
does **NOT** finish with a semicolon for the last statement. In this
case, the statement length tracked by RawStmt is 0, which is equivalent
to say that the string should be used until its end. The code
previously discarded this case entirely, causing the location to remain
at 0, the same as pointing at the beginning of the string. This caused
pg_stat_statements to store incorrect query strings.
This issue has been introduced in 499edb0974. I have looked at the
diffs generated by pgaudit back then, and noticed the difference
generated for this nested query case, but I have missed the point that
it was an actual regression with an existing case. A test case is added
in pg_stat_statements to provide some coverage, restoring the pre-17
behavior for the calculation of the query locations. Special thanks to
David Steele, who, through an analysis of the test diffs generated by
pgaudit with the new v18 logic, has poked me about the fact that my
original analysis of the matter was wrong.
The test output of pg_overexplain is updated to reflect the new logic,
as the new locations refer to the beginning of the argument passed to
the function explain_filter(). When the module was introduced in
8d5ceb113e, which was after 499edb0974 (for the new calculation
method), the locations of the test were not actually right: the plan
generated for the query string given in input of the function pointed to
the top-level query, not the nested one.
Reported-by: David Steele <david@pgbackrest.org>
Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: David Steele <david@pgbackrest.org>
Discussion: https://postgr.es/m/844a3b38-bbf1-4fb2-9fd6-f58c35c09917@pgbackrest.org
elog()/ereport() process interrupts, iff the log message is < ERROR and the
log message will be emitted. aio's debug messages are emitted via ereport(),
but in some places the code is not ready for interrupts to be processed.
Fix the issue using a few different methods:
1) handle interrupts arriving concurrently - in some places it's easy to
detect that by fetching the handle's generation a bit earlier
2) Check if interrupts made the work needing to be done obsolete
3) Disallow interrupts, as there's no sane way to make interrupt processing
safe
To prevent some similar issues from being re-introduced, assert that
interrupts are held in pgaio_io_update_state().
This commit also fixes the contents of a debug message I added in 039bfc457e.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/mvpm7ga3dfgz7bvum22hmuz26cariylmcppb3irayftc7bwk3r@l7gb6gr7azhc
In the grammar, <expr> is a c_expr, which accepts only a limited set
of integer literals and simple expressions without parens. The
deparsing logic didn't quite match the grammar rule, and failed to use
parens e.g. for "5::bigint".
To fix, always surround the expression with parens. Would be nice to
omit the parens in simple cases, but unfortunately it's non-trivial to
detect such simple cases. Even if the expression is a simple literal
123 in the original query, after parse analysis it becomes a FuncExpr
with COERCE_IMPLICIT_CAST rather than a simple Const.
Reported-by: yonghao lee
Backpatch-through: 13
Discussion: https://www.postgresql.org/message-id/18929-077d6b7093b176e2@postgresql.org
Prevent moving the confirmed_flush backwards, as this could lead to data
duplication issues caused by replicating already replicated changes.
This can happen when a client acknowledges an LSN it doesn't have to do
anything for, and thus didn't store persistently. After a restart, the
client can send the prior LSN that it stored persistently as an
acknowledgement, but we need to ignore such an LSN to avoid retreating
confirm_flush LSN.
Diagnosed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Author: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Tested-by: Nisha Moond <nisha.moond412@gmail.com>
Backpatch-through: 13
Discussion: https://postgr.es/m/CAJpy0uDZ29P=BYB1JDWMCh-6wXaNqMwG1u1mB4=10Ly0x7HhwQ@mail.gmail.com
Discussion: https://postgr.es/m/OS0PR01MB57164AB5716AF2E477D53F6F9489A@OS0PR01MB5716.jpnprd01.prod.outlook.com
6b94e7a6da adjusted generate_orderedappend_paths() to consider fractional
paths. However, it didn't manage to interpret the tuple_fraction value
correctly. According to the header comment of grouping_planner(), the
tuple_fraction >= 1 specifies the absolute number of expected tuples. That
number must be divided by the expected total number of tuples to get the
actual fraction.
Even though this is a bug fix, we don't backpatch it. The risks of the side
effects of plan changes on stable branches are too high.
Reported-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/3ca271fa-ca5c-458c-8934-eb148622b270%40gmail.com
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
In an XMLTABLE expression, columns can be marked NOT NULL, and the
parser internally fabricates an option named "is_not_null" to
represent this. However, the parser also allows users to specify
arbitrary option names. This creates a conflict: a user can
explicitly use "is_not_null" as an option name and assign it a
non-Boolean value, which violates internal assumptions and triggers an
assertion failure.
To fix, this patch checks whether a user-supplied name collides with
the internally reserved option name and raises an error if so.
Additionally, the internal name is renamed to "__pg__is_not_null" to
further reduce the risk of collision with user-defined names.
Reported-by: Евгений Горбанев <gorbanyoves@basealt.ru>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/6bac9886-65bf-4cec-96bd-e304159f28db@basealt.ru
Backpatch-through: 15
When creating a new PlannerGlobal node in standard_planner(), most
fields are explicitly initialized, but a few are not. This doesn't
cause any functional issues, as makeNode() zeroes all fields by
default. However, the inconsistency is undesirable from a clarity and
maintenance perspective.
This patch explicitly initializes the remaining fields to improve
consistency and readability.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-TgQHNOiouqGcuHoBqbJjWyx4UxGKxUY3FrF4trGbcPA@mail.gmail.com
As complained about by Valgrind, in commit a379061a22 I failed to
realize that I was causing rd_att->constr->check to become allocated
when no CHECK constraints exist; previously it'd remain NULL. (This was
my bug, not the mentioned commit author's). Fix by making the
allocation conditional, and set ->check to NULL if unallocated.
Reported-by: Yasir <yasir.hussain.shah@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/202505082025.57ijx3qrbx7u@alvherre.pgsql
This must be "return MemoryContextAllocationFailure(context, size, flags)"
instead. The effect of this oversight is that if we got a malloc
failure right here, the code would act as though MCXT_ALLOC_NO_OOM
had been specified, whether it was or not. That would likely lead
to a null-pointer-dereference crash at the unsuspecting call site.
Noted while messing with a patch to improve our Valgrind leak
detection support. Back-patch to v17 where this code came in.
Commit f4ece891fc added the assertion in
an attempt to catch some defects even after VACUUM FULL or REINDEX.
However, IsCatalogTextUniqueIndexOid(tag.relNumber) always returns false
after a relfilenode change, provoking unintended assertion failures.
Reported-by: Adam Guo <adamguo@amazon.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Bug: #18912
Discussion: https://postgr.es/m/18912-a41c9bd0e0ad19b1@postgresql.org
This cleans up the code related to the testing infrastructure of AIO
that used injection points, switching the test code to use the new
facility for injection points added by 371f2db8b0 rather than tweaks
to pass and reset arguments to the callbacks run.
This removes all the dependencies to USE_INJECTION_POINTS in the AIO
code. pgaio_io_call_inj(), pgaio_inj_io_get() and pgaio_inj_cur_handle
are now gone.
Reviewed-by: Greg Burd <greg@burd.me>
Discussion: https://postgr.es/m/Z_y9TtnXubvYAApS@paquier.xyz
The macros INJECTION_POINT() and INJECTION_POINT_CACHED() are extended
with an optional argument that can be passed down to the callback
attached when an injection point is run, giving to callbacks the
possibility to manipulate a stack state given by the caller. The
existing callbacks in modules injection_points and test_aio have their
declarations adjusted based on that.
da7226993f (core AIO infrastructure) and 93bc3d75d8 (test_aio) and
been relying on a set of workarounds where a static variable called
pgaio_inj_cur_handle is used as runtime argument in the injection point
callbacks used by the AIO tests, in combination with a TRY/CATCH block
to reset the argument value. The infrastructure introduced in this
commit will be reused for the AIO tests, simplifying them.
Reviewed-by: Greg Burd <greg@burd.me>
Discussion: https://postgr.es/m/Z_y9TtnXubvYAApS@paquier.xyz
A 'void *' argument suggests that the caller might pass an arbitrary
struct, which is appropriate for functions like libc's read/write, or
pq_sendbytes(). 'uint8 *' is more appropriate for byte arrays that
have no structure, like the cancellation keys or SCRAM tokens. Some
places used 'char *', but 'uint8 *' is better because 'char *' is
commonly used for null-terminated strings. Change code around SCRAM,
MD5 authentication, and cancellation key handling to follow these
conventions.
Discussion: https://www.postgresql.org/message-id/61be9e31-7b7d-49d5-bc11-721800d89d64@eisentraut.org
When creating an explicit Sort node for the outer path of a mergejoin,
we need to determine the number of presorted keys of the outer path to
decide whether explicit incremental sort can be applied. Currently,
this is done by repeatedly calling pathkeys_count_contained_in.
This patch caches the number of presorted outer pathkeys in MergePath,
allowing us to save several calls to pathkeys_count_contained_in. It
can be considered a complement to the changes in commit 828e94c9d.
Reported-by: David Rowley <dgrowleyml@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAApHDvqvBireB_w6x8BN5txdvBEHxVgZBt=rUnpf5ww5P_E_ww@mail.gmail.com
When building a ForeignPath for a joinrel, if there's a possibility
that EvalPlanQual will be executed, we must identify a suitable path
for EPQ checks. If the outer or inner path of the chosen path is a
ForeignPath representing a pushed-down join, we replace it with its
fdw_outerpath to ensure that the EPQ check path consists entirely of
local joins.
If the chosen path is a MergePath, and its outer or inner path is a
ForeignPath that is not already well enough ordered, the MergePath
will have non-NIL outersortkeys or innersortkeys indicating the
desired ordering to be created by an explicit Sort node. If we then
replace the outer or inner path with its corresponding fdw_outerpath,
and that path is already sufficiently ordered, we end up in an
inconsistent state: the MergePath has non-NIL outersortkeys or
innersortkeys, and its input path is already properly ordered. This
inconsistency can result in an Assert failure or the addition of a
redundant Sort node.
To fix, check if the new outer or inner path of a MergePath is already
properly sorted, and set its outersortkeys or innersortkeys to NIL if
so.
Bug: #18902
Reported-by: Nikita Kalinin <n.kalinin@postgrespro.ru>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/18902-71c1bed2b9f7c46f@postgresql.org
A few places that access this catalog don't set up an active
snapshot before potentially accessing its TOAST table. However,
roname (the replication origin name) is the only varlena column, so
this is only a problem if the name requires out-of-line storage.
This commit removes its TOAST table to avoid needing to set up a
snapshot. It also places a limit on replication origin names so
that attempts to set long names will fail with a more user-friendly
error. Those chosen limit of 512 bytes should be sufficient to
avoid "row is too big" errors independent of BLCKSZ, but it should
also be lenient enough for all reasonable use-cases.
Bumps catversion.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/ZvMSUPOqUU-VNADN%40nathan
nbtree array index scans could fail to return matching tuples in rare
cases where the missed tuples cover key space that the scan's arrays
incorrectly indicate has already been read. These cases involved nearby
tuples with NULL values that were evaluated using a skip array key while
in pstate.forcenonrequired mode.
To fix, prevent forcenonrequired mode from prematurely advancing the
scan's array keys beyond key space that the scan has yet to read tuples
from: reset the scan's array keys (to the first elements in the current
scan direction) before the _bt_checkkeys call for pstate.finaltup. That
way _bt_checkkeys starts from a clean slate, which ensures that it will
call _bt_advance_array_keys (while passing it sktrig_required=true).
This reliably restores the invariant that the scan's arrays always
accurately track its progress through the index's key space (at least
when the scan is "between pages").
Oversight in commit 8a510275, which optimized nbtree search scan key
comparisons.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-WzmodSE+gpTd1CRGU9ez8ytyyDS+Kns2r9NzgUp1s56kpw@mail.gmail.com
Be more conservative when performing a scheduled recheck of an nbtree
scan's array keys once on the next page, having set so->scanBehind: back
out of reading the page (perform another primitive scan instead) when
the next page's high key/finaltup has an untruncated prefix of matching
values and truncated suffix attributes associated with lower-order keys.
In other words, stop assuming that the lower-order keys have been
satisfied by the truncated suffix attributes in this context (only do so
when considering scheduling a recheck within _bt_advance_array_keys).
The new behavior is more logical: if the next page read after setting
so->scanBehind can only contain tuples that are themselves "behind the
scan", that's reason enough to cut our losses. In general, when we set
so->scanBehind, we only expect to perform one recheck on the next page
to make a final decision about whether or not to continue the current
primitive index scan. It seems unprincipled for the recheck to allow a
_bt_readpage to continue unless the scan's arrays will advance/unless
the page might actually contain relevant tuples.
In practice it is highly unlikely that things will line up like this
(the untruncated prefix of attribute values from the next page's high
key is seldom an exact match for their corresponding array's current
element following array advancement on the original/previous page).
That gives us all the more reason to keep things simple and consistent.
This was arguably an oversight in commit 9a2e2a285a, which improved
nbtree array primitive scan scheduling.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkXzJajgyW-pCQ7vaDPhaT3huU+Zw_j448rpCBEsu2YOQ@mail.gmail.com
fc069a3a63 implemented Self-Join Elimination (SJE) and put related logic
to ChangeVarNodes_walker(). This commit provides refactoring to remove the
SJE-related logic from ChangeVarNodes_walker() but adds a custom callback to
ChangeVarNodesExtended(), which has a chance to process a node before
ChangeVarNodes_walker(). Passing this callback to ChangeVarNodesExtended()
allows SJE-related node handling to be kept within the analyzejoins.c.
Reported-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49PE3CvnV8vrQ0Dr%3DHqgZZmX0tdNbzVNJxqc8yg-8kDQQ%40mail.gmail.com
Author: Andrei Lepikhov <lepihov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
IO workers are treated as auxiliary processes. The comments fixed in
this commit stated that there could be only one auxiliary process of
each BackendType at the same time. This is not true for IO workers, as
up to MAX_IO_WORKERS of them can co-exist at the same time.
Author: Cédric Villemain <Cedric.Villemain@data-bene.io>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/e4a3ac45-abce-4b58-a043-b4a31cd11113@Data-Bene.io
With GB18030 as source encoding, applications could crash the server via
SQL functions convert() or convert_from(). Applications themselves
could crash after passing unterminated GB18030 input to libpq functions
PQescapeLiteral(), PQescapeIdentifier(), PQescapeStringConn(), or
PQescapeString(). Extension code could crash by passing unterminated
GB18030 input to jsonapi.h functions. All those functions have been
intended to handle untrusted, unterminated input safely.
A crash required allocating the input such that the last byte of the
allocation was the last byte of a virtual memory page. Some malloc()
implementations take measures against that, making the SIGSEGV hard to
reach. Back-patch to v13 (all supported versions).
Author: Noah Misch <noah@leadboat.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Backpatch-through: 13
Security: CVE-2025-4207
fc069a3a63 implemented Self-Join Elimination (SJE) and put related logic
to ChangeVarNodes_walker(). This commit provides refactoring to remove the
SJE-related logic from ChangeVarNodes_walker() but adds a custom callback to
ChangeVarNodesExtended(), which has a chance to process a node before
ChangeVarNodes_walker(). Passing this callback to ChangeVarNodesExtended()
allows SJE-related node handling to be kept within the analyzejoins.c.
Reported-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49PE3CvnV8vrQ0Dr%3DHqgZZmX0tdNbzVNJxqc8yg-8kDQQ%40mail.gmail.com
Author: Andrei Lepikhov <lepihov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
The previous code was allocating more memory and copying more data than
necessary because it specified the wrong PgStat_KindInfo member as the
size argument for MemoryContextAlloc and memcpy, respectively.
Although these issues exist since 5891c7a8e, there have been no reports
from the field. So for now, it seems sufficient to fix them in master.
Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Gurjeet Singh <gurjeet@singh.im>
Discussion: https://postgr.es/m/CAPmGK15eTRCZTnfgQ4EuBNo%3DQLYGFEbXS_7m2dXqtkcT7L8qrQ%40mail.gmail.com
Also adjust the phrasing in the comments.
Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Gurjeet Singh <gurjeet@singh.im>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAPmGK17%3DPHSDZ%2B0G6jcj12buyyE1bQQc3sbp1Wxri7tODT-SDw%40mail.gmail.com
Backpatch-through: 15
Consistently prevent nbtree array advancement from treating a scankey as
required when operating in pstate.forcenonrequired mode. Otherwise, we
risk a NULL pointer dereference. This was possible in the path where
_bt_check_compare is called to recheck a tuple that advanced all of the
scan's arrays to matching values: its continuescan=false handling
expects _bt_advance_array_keys to have been called with a valid pstate,
but it'll always be NULL during sktrig_required=false calls (which is
how _bt_advance_array_keys must be called when pstate.forcenonrequired).
Oversight in commit 8a510275, which optimized nbtree search scan key
comparisons.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/CAHgHdKsn2W=gPBmj7p6MjQFvxB+zZDBkwTSg0o3f5Hh8rkRrsA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmodSE+gpTd1CRGU9ez8ytyyDS+Kns2r9NzgUp1s56kpw@mail.gmail.com
To insert the merged GIN entries in _gin_parallel_merge, the leader
calls ginEntryInsert(). This may allocate memory, e.g. for a new leaf
tuple. This was allocated in the PortalContext, and kept until the end
of the index build. For most GIN indexes the amount of leaked memory is
negligible, but for custom opclasses with large keys it may cause OOMs.
Fixed by calling ginEntryInsert() in a temporary memory context, reset
after each insert. Other ginEntryInsert() callers do this too, except
that the context is reset after batches of inserts. More frequent resets
don't seem to hurt performance, it may even help it a bit.
Report and fix by Vinod Sridharan.
Author: Vinod Sridharan <vsridh90@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAFMdLD4p0VBd8JG=Nbi=BKv6rzFAiGJ_sXSFrw-2tNmNZFO5Kg@mail.gmail.com
We only need a tuplestore if we're actually going to accumulate
multiple result tuples. Obviously then we don't need one for non-set-
returning functions; but even a SRF doesn't need one if we decide to
use "lazyEval" (one row at a time) mode. In these cases, it's
sufficient to use the junkfilter's result slot to hold the single row
that's due to be returned. We just need to "materialize" that slot
to ensure it holds onto the data past shutdown of the sub-executor.
The original intent of this patch was partially to save a few cycles
(by not putting tuples into a tuplestore only to pull them back out
immediately), but mostly to ensure that we don't use a tuplestore
in non-set-returning functions. That's because I had concerns
about whether a tuplestore is safe to keep across queries,
which was possible for functions invoked via long-lived FmgrInfos
such as those kept in the typcache. There are no cases where SRFs
are called that way, so getting rid of the tuplestore in non-SRFs
should make things safer.
However, it emerges that running fmgr_sql in a short-lived context
(as 595d1efed made it do) makes the existing coding unsafe anyway:
we can end up with a long-lived TupleTableSlot holding a freeable
reference to a short-lived tuple, resulting in a double-free crash.
Not trying to pull tuples out of the tuplestore using that slot
dodges the problem, so I'm going to commit this now rather than
invent a band-aid solution for v18.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2443532.1744919968@sss.pgh.pa.us
Discussion: https://postgr.es/m/9f975803-1a1c-4f21-b987-f572e110e860@gmail.com
For self-referencing foreign keys in partitioned tables, we weren't
handling creation of pg_constraint rows during CREATE TABLE PARTITION AS
as well as ALTER TABLE ATTACH PARTITION. This is an old bug -- mostly,
we broke this in 614a406b4f while trying to fix it (so 12.13, 13.9,
14.6 and 15.0 and up all behave incorrectly). This commit reverts part
of that with additional fixes for full correctness, and installs more
tests to verify the parts we broke, not just the catalog contents but
also the user-visible behavior.
Backpatch to all live branches. In branches 13 and 14, commit
46a8c27a72 changed the behavior during DETACH to drop a FK
constraint rather than trying to repair it, because the complete fix of
repairing catalog constraints was problematic due to lack of previous
fixes. For this reason, the test behavior in those branches is a bit
different. However, as best as I can tell, the fix works correctly
there.
In release notes we have to recommend that all self-referencing foreign
keys on partitioned tables be recreated if partitions have been created
or attached after the FK was created, keeping in mind that violating
rows might already be present on the referencing side.
Reported-by: Guillaume Lelarge <guillaume@lelarge.info>
Reported-by: Matthew Gabeler-Lee <fastcat@gmail.com>
Reported-by: Luca Vallisa <luca.vallisa@gmail.com>
Discussion: https://postgr.es/m/CAECtzeWHCA+6tTcm2Oh2+g7fURUJpLZb-=pRXgeWJ-Pi+VU=_w@mail.gmail.com
Discussion: https://postgr.es/m/18156-a44bc7096f0683e6@postgresql.org
Discussion: https://postgr.es/m/CAAT=myvsiF-Attja5DcWoUWh21R12R-sfXECY2-3ynt8kaOqjw@mail.gmail.com
The extension_control_path setting (commit 4f7f7b0375) did not
support extensions that set a custom "directory" setting in their
control file. Very few extensions use that and during the discussion
on the previous commit it was suggested to maybe remove that
functionality. But a fix was easier than initially thought, so this
just adds that support. The fix is to use the control->control_dir as
a share dir to return the path of the extension script files.
To make this work more sensibly overall, the directory suffix
"extension" is no longer to be included in the extension_control_path
value. To quote the patch, it would be
-extension_control_path = '/usr/local/share/postgresql/extension:/home/my_project/share/extension:$system'
+extension_control_path = '/usr/local/share/postgresql:/home/my_project/share:$system'
During the initial patch, there was some discussion on which of these
two approaches would be better, and the committed patch was a 50/50
decision. But the support for the "directory" setting pushed it the
other way, and also it seems like many people didn't like the previous
behavior much.
Author: Matheus Alcantara <mths.dev@pm.me>
Reviewed-by: Christoph Berg <myon@debian.org>
Reviewed-by: David E. Wheeler <david@justatheory.com>
Discussion: https://www.postgresql.org/message-id/flat/aAi1VACxhjMhjFnb%40msg.df7cb.de#0cdf7b7d727cc593b029650daa3c4fbc
Make an nbtree array preprocessing assertion account for scans that add
fewer skip arrays than initially expected due to preprocessing finding
an unsatisfiable array qual.
Oversight in commit 92fe23d9.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/CAHgHdKtQMhHy5qcB3KqCcGiW-Rp8P7KzUFRa9ZMKUiv6zen7LQ@mail.gmail.com
We try to avoid using strncpy() due to the ease of which it can
be misused. Convert this callsite to use strlcpy() instead to
match similar codepaths in this file.
Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/2a796830-de2d-4030-b480-d673f6cc5d94@eisentraut.org
When copying the string strncpy won't add nul termination since
the string length is equal to the length specified. Explicitly
set a nul terminator after copying to properly terminate. Found
via post-commit review.
Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CAH2L28vt16C9xTuK+K7QZvtA3kCNWXOEiT=gEekUw3Xxp9LVQw@mail.gmail.com
This reverts commit 38da053463, which
attempted to preserve our ability to start with only 60 semaphores.
Subsequent changes (particularly 55b454d0e) have put that idea pretty
much permanently out of reach: people wishing to use Postgres v18 on
OpenBSD or NetBSD will have no choice but to increase those platforms'
default values of SEMMNI and SEMMNS.
Hence, revert 38da05346's changes in SEMAS_PER_SET and the minimum
tested value of max_connections. Adjust a comment from the subsequent
patch 6d0154196, and tweak the wording in runtime.sgml to make it
clear that changing SEMMNI/SEMMNS is no longer even a little bit
optional on these platforms.
Although 38da05346 was later back-patched into v17, leave that branch
alone: it's still capable of starting with 60 semaphores, and there's
no reason to break that.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/E1tuZNv-0037Gs-34@gemulon.postgresql.org
Discussion: https://postgr.es/m/1052019.1745947915@sss.pgh.pa.us
This commit fixes two bug in ChangeVarNodes_walker() function.
* When considering RestrictInfo, walk down to its clauses based on the
presense of relid to be deleted not just in clause_relids but also in
required_relids.
* Incrementally adjust num_base_rels based on the change of clause_relids
instead of recalculating it using clause_relids, which could contain
outer-join relids.
Reported-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49PE3CvnV8vrQ0Dr%3DHqgZZmX0tdNbzVNJxqc8yg-8kDQQ%40mail.gmail.com
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
The slot synchronization skips updating the confirmed_flush LSN of the
local slot if the local slot has a newer catalog_xmin or restart_lsn, but
still allows updating the two_phase and two_phase_at fields of the slot.
This opens up a window for the prepared transactions between old
confirmed_flush LSN and two_phase_at to unexpectedly get decoded and sent
to the downstream after promotion. Then, while decoding the commit
prepared the assert will fail, which expects that the prepare hasn't been
sent to the downstream.
The fix is to skip updating the other slot fields when we are skipping to
update the confirmed_flush LSN of the slot.
We didn't backpatch this commit as two_phase_at was not synced in back
branches, which means prepared transactions won't be unexpectedly sent to
downstream.
We discovered this problem while analyzing BF failure reported in the
discussion link.
Reliably reproducing this issue without a debugger is difficult. Given
its rarity, adding specific injection point to test it doesn't seem
worthwhile, so we won't be adding a dedicated test case.
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716B44052000EB91EFAE60E94BC2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Index vacuuming and [auto]prewarm AIO concurrency should be governed by
maintenance_io_concurrency. As such, pass those read stream users the
READ_STREAM_MAINTENANCE flag which will calculate their read stream
distance with maintenance_io_concurrency instead of
effective_io_concurrency. This was an oversight in the original commits
making those operations use the read stream API.
Discussion: https://postgr.es/m/flat/CAAKRu_aopDxTo4b41Mt_7Zc-z0_ngocrY8SFCCY6Aph1HgwuNw%40mail.gmail.com
Checking if another primitive scan is required after all once the next
leaf page was moved from _bt_checkkeys to its _bt_readpage caller by
commit 9a2e2a28. Update a comment that incorrectly described the
recheck mechanism as something that takes place in _bt_checkkeys.
Also fix an older typo in related code comments.
_bt_check_compare neglected to handle a case that can arise when the
scan's keys are temporarily treated as nonrequired, as an optimization:
whenever a NULL tuple value was encountered that had a skip array whose
current element wasn't already NULL, _bt_check_compare failed to advance
the array to the NULL element. This allowed _bt_check_compare to fail
to return matching tuples containing a NULL value (though only with an
array column that came before a skip array column with NULLs, and only
during _bt_readpage calls that set pstate.forcenonrequired=true on a
page where the higher-order column also had to advance).
To fix, teach _bt_check_compare to handle this case just like any other
case where a skip array key is unsatisfied and must be advanced directly
(due to the key being considered a nonrequired key).
Oversight in commit 8a510275, which optimized nbtree search scan key
comparisons with skip arrays.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/CAHgHdKtLFWZcjr87hMH0hYDHgcifu4Tj7iHz-xh8qsJREt5cqA@mail.gmail.com
During logical decoding, we advance catalog_xmin of logical too early in
fast_forward mode, resulting in required catalog data being removed by
vacuum. This mode is normally used to advance the slot without processing
the changes, but we still can't let the slot's xmin to advance to an
incorrect value.
Commit f49a80c481 fixed a similar issue where the logical slot's
catalog_xmin was getting advanced prematurely during non-fast-forward
mode. During xl_running_xacts processing, instead of directly advancing
the slot's xmin to the oldest running xid in the record, it allowed the
xmin to be held back for snapshots that can be used for
not-yet-replayed transactions, as those might consider older txns as
running too. However, it missed the fact that the same problem can happen
during fast_forward mode decoding, as we won't build a base snapshot in
that mode, and the future call to get_changes from the same slot can miss
seeing the required catalog changes leading to incorrect reslts.
This commit allows building the base snapshot even in fast_forward mode to
prevent the early advancement of xmin.
Reported-by: Amit Kapila <amit.kapila16@gmail.com>
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 13
Discussion: https://postgr.es/m/CAA4eK1LqWncUOqKijiafe+Ypt1gQAQRjctKLMY953J79xDBgAg@mail.gmail.com
Discussion: https://postgr.es/m/OS0PR01MB57163087F86621D44D9A72BF94BB2@OS0PR01MB5716.jpnprd01.prod.outlook.com
wait_event_types.h is generated by the code, and included wait_event.h.
wait_event.h did the opposite move, including wait_event_types.h,
causing a circular dependency between both.
wait_event_types.h only needs to now about the wait event classes, so
this information is moved into its own file, and wait_event_types.h uses
this new header so as it does not depend anymore on wait_event.h.
Note that such errors can be found with clang-tidy, with commands like
this one:
clang-tidy source_file.c --checks=misc-header-include-cycle -- \
-I/install/path/include/ -I/install/path/include/server/
Issue introduced by fa88928470.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/350192.1745768770@sss.pgh.pa.us
fc069a3a63 implements Self-Join Elimination (SJE), which can remove base
relations when appropriate. However, regressions tests for SJE only cover
the case when placeholder variables (PHVs) are evaluated and needed only
in a single base rel. If this baserel is removed due to SJE, its clauses,
including PHVs, will be transferred to the keeping relation. Removing these
PHVs may trigger an error on plan creation -- thanks to the b3ff6c742f for
detecting that.
This commit skips removal of PHVs during SJE. This might also happen that
we skip the removal of some PHVs that could be removed. However, the overhead
of extra PHVs is small compared to the complexity of analysis needed to remove
them.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Alena Rybakina <a.rybakina@postgrespro.ru>
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
This cleans up some loose ends left by commit e8ca9ed1d. I hadn't
looked closely enough at these places before, but now I have.
The use of double-quoted #includes for Perl headers in plperl_system.h
seems to be simply a mistake introduced in 6c944bf3c and faithfully
copied forward since then. (I had thought possibly it was required
by some weird Windows build setup, but there's no evidence of that in
our history.)
The occurrences in SectionMemoryManager.h and SectionMemoryManager.cpp
evidently stem from those files' origin as LLVM code. It's
understandable that LLVM would treat their own files as needing
double-quoted #includes; but they're still system headers to us.
I also applied the same check to *.c files, and found a few other
random incorrect usages in both directions.
Our ECPG headers and test files routinely use angle brackets to refer
to ECPG headers. I left those usages alone, since it seems reasonable
for an ECPG user to regard those headers as system headers.
c4d5cb71d2 adjusted the fast-path locking code to allow some
configuration of the number of fast-path locking slots via the
max_locks_per_transaction GUC. In that commit the FAST_PATH_REL_GROUP()
macro used integer division to determine the fast-path locking group slot
to use for the lock.
The divisor in this case is always a power-of-two value. Here we swap
out the divide by a bitwise-AND, which is a significantly faster
operation to perform.
In passing, adjust the code that's setting FastPathLockGroupsPerBackend
so that it's more clear that the value being set is a power-of-two.
Also, adjust some comments in the area which contained some magic
numbers. It seems better to justify the 1024 upper limit in the
location where the #define is made instead of where it is used.
Author: David Rowley <drowleyml@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAApHDvodr3bcnpxcs7+k-3cFwYR0tP-BYhyd2PpDhe-bCx9i=g@mail.gmail.com
10f6646847 intended to limit the value of io_combine_limit to the minimum of
io_combine_limit and io_max_combine_limit. To avoid issues with interdependent
GUCs, it introduced io_combine_limit_guc and set io_combine_limit in assign
hooks. That plan was thwarted by guc_tables.c accidentally still referencing
io_combine_limit, instead of io_combine_limit_guc. That lead to the GUC
machinery overriding the work done in the assign hooks, potentially leaving
io_combine_limit with a too high value.
The consequence of this bug was that when running with io_combine_limit >
io_combine_limit_guc the AIO machinery would not have reserved large enough
iovec and IO data arrays, with one IO's arrays overlapping with another IO's,
leading to total confusion.
To make such a problem easier to detect in the future, add assertions to
pgaio_io_set_handle_data_* checking the length is smaller than
io_max_combine_limit (not just PG_IOV_MAX).
It'd be nice to have a few tests for this, but it's not entirely obvious how
to do so portably.
As remarked upon by Tom, the GUC assignment hooks really shouldn't set the
underlying variable, that's the job of the GUC machinery. Change that as well.
Discussion: https://postgr.es/m/c5jyqnuwrpigd35qe7xdypxsisdjrdba5iw63mhcse4mzjogxo@qdjpv22z763f
pgaio_io_reclaim() reset the fields in PgAioHandle before updating the state
to IDLE or incrementing the generation. For most things that's OK, but for
pg_get_aios() it is not - if it copied the PgAioHandle while fields were being
reset, we wouldn't detect that and could call
pgaio_io_get_target_description() with ioh->target == PGAIO_TID_INVALID,
leading to a crash.
Fix this issue by incrementing the generation and state earlier, before
resetting.
Also add an assertion to pgaio_io_get_target_description() for the target to
be valid - that'd have made this case a bit easier to debug. While at it,
add/update a few related assertions.
Author: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/062daca9-dfad-4750-9da8-b13388301ad9@gmail.com
Not having this check would produce a core dump at startup when running
pgstat_read_statsfile(), in the case where the information of a stats
kind for an entry in the dshash could not be found. The same check
already happens for fixed-numbered stats and entries that are stored
with their names. This issue can be seen with custom stats kinds.
Note that this problem can be reproduced what what is in the core code:
- Tweak the test module injection_points to not load the fixed-numbered
stats part, leaving only the variable-numbered stats.
- Create an instance with injection_points defined in
shared_preload_libraries.
- Create a pgstats entry by attaching and running a point.
- Restart the server without shared_preload_libraries. The startup
process detects that something is wrong and reports a WARNING.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aAieZAvM+K1d89R2@ip-10-97-1-34.eu-west-3.compute.internal
One place in hash_create() used DynaHashAlloc() as a convenient
shorthand for MemoryContextAlloc(). That was fine when it was
written, but it stopped being fine when 9c911ec06 changed
DynaHashAlloc() to use MCXT_ALLOC_NO_OOM (mea culpa). Change
the code to call plain MemoryContextAlloc() as intended.
I think that this bug may be unreachable in practice, since we now
always create AllocSets with some space already allocated, so that
an OOM failure here for a non-shared hash table should be impossible
(with a hash table name of reasonable length anyway). And there
aren't enough shared hash tables to make a crash for one of those
probable. Nonetheless it's clearly not operating as designed, so
back-patch to v16 where 9c911ec06 came in.
Reported-by: Maksim Korotkov <m.korotkov@postgrespro.ru>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/219bdccd460510efaccf90b57e5e5ef2@postgrespro.ru
Backpatch-through: 16
b85a9d046e introduced a new RelIdToTypeIdCacheHash, whose entries should
exist for typecache entries with TCFLAGS_HAVE_PG_TYPE_DATA flag set or any
of TCFLAGS_OPERATOR_FLAGS set or tupDesc set. However, TypeCacheOpcCallback(),
which resets TCFLAGS_OPERATOR_FLAGS, was forgotten to update
RelIdToTypeIdCacheHash.
This commit adds a delete_rel_type_cache_if_needed() call to the
TypeCacheOpcCallback() function to maintain RelIdToTypeIdCacheHash after
resetting TCFLAGS_OPERATOR_FLAGS.
Also, this commit fixes the name of the delete_rel_type_cache_if_needed()
function in its mentions in the comments.
Reported-by: Noah Misch
Discussion: https://postgr.es/m/20250411203241.e9.nmisch%40google.com
To estimate with extended statistics, we need to clear the varnullingrels
field in the expression, and duplicates are not allowed in the GroupVarInfo
list. We might re-use add_unique_group_var(), but we don't do so for two
reasons.
1) We must keep the origin_rinfos list ordered exactly the same way as
varinfos.
2) add_unique_group_var() is designed for estimate_num_groups(), where a
larger number of groups is worse. While estimating the number of hash
buckets, we have the opposite: a lesser number of groups is worse.
Therefore, we don't have to remove "known equal" vars: the removed var
may valuably contribute to the multivariate statistics to grow the number
of groups.
This commit adds custom code to estimate_multivariate_bucketsize() to
initialize varinfos properly.
Reported-by: Robins Tharakan <tharakan@gmail.com>
Discussion: https://postgr.es/m/18885-da51324078588253%40postgresql.org
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
When a foreign key constraint is placed on a partitioned table, we
actually make two pg_constraint entries associated with that table.
(I have my doubts about the wisdom of that, but it's been like that
since v12 and post-feature-freeze is no time to be messing with such
entrenched decisions.) The second "child" entry always had a name
generated according to the default rule, "table_column(s)_fkey[nnn]",
even if the primary entry had an unrelated user-specified name. The
trouble with doing that is that the default name could collide with
the user-specified name of some other constraint on the same table.
While we were willing to adjust the generated name to avoid
collisions, that only helps if it's made second; if it's made first
then creation of the other constraint would fail, potentially causing
dump/reload or pg_upgrade failures.
The core of the problem here is that we're infringing on user
namespace, so I doubt that there's any 100% solution other than to
find a way to not need the "child" entry. In the meantime, it seems
like it'd be an improvement to make the child's name be the name of
the parent constraint with an underscore and digit(s) appended as
necessary to make it unique. This rule can in theory fail in the same
way, but it seems much less probable; for one thing, this rule is
guaranteed not to match primary entries having auto-generated names.
(While an auto-generated primary name isn't user-specified to begin
with, it acts like that during dump/reload, so collisions against such
names are definitely possible.)
An additional bonus, visible in some of the regression test cases
that change here, arises from the fact that some error messages
cite the child constraint's name not the parent's. In the
previous approach the two names could be completely unrelated,
leading to user confusion --- the more so since psql's \d command
hides child constraints. With this approach it's hopefully much
clearer which constraint-the-user-knows-about is failing.
However, that does mean that there's user-visible behavior change
occurring here, making it seem like not something to back-patch.
I feel it's not too late for v18, though.
Reported-by: Kirill Reshke <reshkekirill@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CALdSSPhGitjpTfzEMJN-Y2x+Q-5QChSxAsmSJ1-E8mQJLkHOqQ@mail.gmail.com
Commit 3f28b2fcac tried to ensure that the replication origin shouldn't be
advanced in case of an ERROR in the apply worker, so that it can request
the same data again after restart. However, it is possible that an ERROR
was caught and handled by a (say PL/pgSQL) function, and the apply worker
continues to apply further changes, in which case, we shouldn't reset the
replication origin.
Ensure to reset the origin only when the apply worker exits after an
ERROR.
Commit 3f28b2fcac added new function geterrlevel, which we removed in HEAD
as part of this commit, but kept it in backbranches to avoid breaking any
applications. A separate case can be made to have such a function even for
HEAD.
Reported-by: Shawn McCoy <shawn.the.mccoy@gmail.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 16, where it was introduced
Discussion: https://postgr.es/m/CALsgZNCGARa2mcYNVTSj9uoPcJo-tPuWUGECReKpNgTpo31_Pw@mail.gmail.com
This assertion, based on pending_since (timestamp used to prevent stats
reports to be too frequent or should a partial flush happen), is reached
when it is found that no data can be flushed but a previous call of
pgstat_report_stat() determined that some stats data has been found as
in need of a flush. So pending_since is set when some stats data is
pending (in non-force mode) or if report attempts are too frequent, and
reset to 0 once all stats have been flushed.
Since 5cbbe70a9c, WAL senders have begun to report their stats on a
periodic basis for IO stats in v16~ and backend stats on HEAD, creating
some friction with the concurrent pgstat_report_stat() calls that can
happen in the context of a WAL sender (shutdown callback doing a final
report or backend-related code paths). This problem is the cause of
spurious failures in the TAP tests.
In theory, this assertion can be also reached in v15, even if that's
very unlikely. For example, a process, say a background worker, could
do periodic and direct stats flushes with concurrent calls of
pgstat_report_stat() that could cause conflicting values of
pending_since. This can be done with WAL or SLRU stats flushes using
pgstat_flush_wal() or pgstat_slru_flush(). HEAD makes this situation
easier to happen with custom cumulative stats.
This commit removes the assertion altogether, per discussion, as it is
more useful to keep the state of things as they are for the WAL sender.
The assertion could use a special state based on for example
am_walsender, but I doubt that this would be meaningful in the long run
based on the other arguments raised while discussing this issue.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/1489124.1744685908@sss.pgh.pa.us
Discussion: https://postgr.es/m/dwrkeszz6czvtkxzr5mqlciy652zau5qqnm3cp5f3p2po74ppk@omg4g3cc6dgq
Backpatch-through: 15
This error message was 'runaway "struct_name"', which isn't all
that clear; I think 'could not find closing brace for "struct_name"'
is better. Also, provide the location of the struct start using the
script's usual '$file:$lineno' style.
Bug: #18901
Reported-by: Clemens Ruck <clemens.ruck@t-online.de>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18901-424272abe01357e6@postgresql.org
This injection point was named "AtEOXact_Inval-with-transInvalInfo", not
respecting the implied naming convention that injection points should
use lower-case characters, with terms separated by dashes. All the
other points defined in the tree follow this style, so let's be more
consistent.
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://postgr.es/m/OSCPR01MB14966E14C1378DEE51FB7B7C5F5B32@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 17
Word boundaries are based on whether a character is alphanumeric or
not. For the PG_UNICODE_FAST collation, alphanumeric includes
non-ASCII digits; whereas for the PG_C_UTF8 collation, it only
includes digits 0-9. Pass down the right information from the
pg_locale_t into initcap_wbnext to differentiate the behavior.
Reported-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250417135841.33.nmisch@google.com
exec_replication_command created a cmd_context to work in and
then deleted it on exit. This is pretty dangerous because
some replication commands start/finish transactions. In the
wake of commit 1afe31f03, that could lead to re-selecting a
CurrentMemoryContext that's already been deleted, leading to
hilarity such as a memory context that is its own parent.
To fix, let's make the cmd_context persist across
exec_replication_command calls; instead of deleting it, we'll just
reset it each time. In this way it retains the same identity and
there's no problem if transaction abort restores it as the working
context. It probably even saves a few microseconds to do this.
This fix also ensures that exec_replication_command returns to the
caller (PostgresMain) with the same context active that had been
when it was called (probably MessageContext). The previous
coding could get that wrong too.
Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAO6_XqoJA7-_G6t7Uqe5nWF3nj+QBGn4F6Ptp=rUGDr0zo+KvA@mail.gmail.com
The case of "node == parent" might seem impossible, since we just
allocated the new node. But it's possible if parent is a dangling
reference to a recently-deleted context. In fact, given aset.c's
habit of recycling contexts, it's actually rather likely if that's so.
If we'd had this assertion before, it would have simplified debugging
a recently-identified walsender issue.
Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAO6_XqoJA7-_G6t7Uqe5nWF3nj+QBGn4F6Ptp=rUGDr0zo+KvA@mail.gmail.com
Blocking checkpoint phase 2 requires MarkBufferDirty() and
BUFFER_LOCK_EXCLUSIVE; neither suffices by itself. transam/README documents
this, citing SyncOneBuffer(). Update the DELAY_CHKPT_START documentation to
say this. Expand the heap_inplace_update_and_unlock() comment that cites
XLogSaveBufferForHint() as precedent, since heap_inplace_update_and_unlock()
could have opted not to use DELAY_CHKPT_START.
Commit 8e7e672cda added DELAY_CHKPT_START to
heap_inplace_update_and_unlock(). Since commit
bc6bad8857 reverted it in non-master branches,
no back-patch.
Discussion: https://postgr.es/m/20250406180054.26.nmisch@google.com
Commit 7102070329 fixed a similar bug, but
it missed the case of database-wide ANALYZE ("use_own_xacts" mode).
Commit a07e03fd8f changed consequences
from silent discard of a pg_class stats (relpages et al.) update to
ERROR "tuple to be updated was already modified". Losing a relpages
update of an ON COMMIT DELETE ROWS table was negligible, but a
COMMIT-time error isn't negligible. Back-patch to v13 (all supported
versions).
Reported-by: Richard Guo <guofenglinux@gmail.com
Reported-by: Robins Tharakan <tharakan@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-XwMKMKJ_GT=p3_-_=j9rQSEs1FbDFUnW9zHuKPsPNEQ@mail.gmail.com
Backpatch-through: 13
1349d2790 added support so that aggregate functions with an ORDER BY or
DISTINCT clause could make use of presorted inputs to avoid an implicit
sort within nodeAgg.c. That commit failed to consider that a FILTER
clause may exist that filters rows before the aggregate function
arguments are evaluated. That can be problematic if an aggregate
argument contains an expression which could error out during evaluation.
It's perfectly valid to want to have a FILTER clause which eliminates
such values, and with the pre-sorted path added in 1349d2790, it was
possible that the planner would produce a plan with a Sort node above
the Aggregate to perform the sort on the aggregate's arguments long before
the Aggregate node would filter out the non-matching values.
Here we fix this by inspecting ORDER BY / DISTINCT aggregate functions
which have a FILTER clause to see if the aggregate's arguments are
anything more complex than a Var or a Const. Evaluating these isn't
going to cause an error. If we find any non-Var, non-Const parameters
then the planner will now opt to perform the sort in the Aggregate node
for these aggregates, i.e. disable the presorted aggregate optimization.
An alternative fix would have been to completely disallow the presorted
optimization for Aggrefs with any FILTER clause, but that wasn't done as
that could cause large performance regressions for queries that see
significant gains from 1349d2790 due to presorted results coming in from
an Index Scan.
Backpatch to 16, where 1349d2790 was introduced
Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Kaimeh <kkaimeh@gmail.com>
Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAK-%2BJz9J%3DQ06-M7cDJoPNeYbz5EZDqkjQbJnmRyQyzkbRGsYkA%40mail.gmail.com
Backpatch-through: 16
The format of the injection point names used by the AIO code does not
match the existing naming convention used everywhere else in the code,
so let's be consistent. These points are used in test_aio.
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/Z_yTB80bdu1sYDqJ@paquier.xyz
Both pg_get_process_memory_contexts() and pg_backend_memory_contexts
have 1-based levels, whereas pg_log_backend_memory_contexts() was using
0-based levels. Align these.
This results in slightly saner behavior from MemoryContextStatsDetail()
in regards to the max_level. Previously it would stop at 1 level before
the maximum requested level rather than at that level.
Reported-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Author: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Author: David Rowley <drowleyml@gmail.com
Reviewed-by: Melih Mutlu <m.melihmutlu@gmail.com>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Discussion: https://postgr.es/m/395ea5d4fe190480efa95bf533485c70@oss.nttdata.com
The "children" list won't be used until "got_children" has been set
true, but older compilers don't get that; about half a dozen
buildfarm animals are warning about this. Issue added by 11ff192b5.
While here, improve slightly-shaky grammar in comment.
Discussion: https://postgr.es/m/2057835.1744833309@sss.pgh.pa.us
This gets rid of repetitive get_typlen calls in postquel_sub_params,
which show up as costing a few percent of the runtime in simple test
cases (more with more parameters).
In combination with the preceding patches, this gets us most of the
way back down to the amount of per-call overhead that functions.c
had before commit 0dca5d68d. There are some more things that could
be done, but this seems like an okay place to stop for v18.
At this point, the only data structures we allocate directly in
fcontext are the SQLFunctionCache struct itself, the ParamListInfo
struct, and the execution_state array, all of which are small and
perfectly capable of being re-used across executions of the same
FmgrInfo. Hence, let's give them the same lifespan as the FmgrInfo.
This step gets rid of the separate SQLFunctionLink struct and makes
fn_extra point to SQLFunctionCache again. We also get rid of the
separate fcontext memory context and allocate these items directly
in fn_mcxt.
For notational simplicity, SQLFunctionCache still has an fcontext
field, but it's just a copy of fn_mcxt.
The motivation for this is to allow these structures to live as
long as the FmgrInfo and be re-used across calls, restoring the
original design without its propensity for memory leaks. This
gets rid of some per-call overhead that we added in 0dca5d68d.
We also make an effort to re-use the JunkFilter and result slot.
Those might need to change if the function definition changes,
so we compromise by rebuilding them if the cached plan changes.
This also moves the tuplestore into fn_mcxt so that it can be
re-used across calls, again undoing a change made in 0dca5d68d.
Put the JunkFilter and its result slot (and thence also
some subsidiary data such as the result tupledesc) into a
separate subcontext "jfcontext". This doesn't accomplish
a lot at this point, because we make a new JunkFilter each
time through the SQL function. However, the plan is to make
the fcontext long-lived, and that raises the possibility
that we'll need a new JunkFilter because the plan for the
result-generating query changes. A separate context makes
it easy to free the obsoleted data when that happens.
Also, instead of always running the sub-executor in fcontext,
make a separate context for it if we're doing lazy eval of
a SRF, and otherwise just run it inside CurrentMemoryContext.
Previously, much of this code ran with CurrentMemoryContext set
to be the function's fcontext, so that we tended to leak a lot of
stuff there. Commit 0dca5d68d dealt with that by releasing the
fcontext at the completion of each SQL function call, but we'd
like to go back to the previous approach of allowing the fcontext
to be query-lifespan. To control the leakage problem, rearrange
the code so that we mostly run in the memory context that fmgr_sql
is called in (which we expect to be short-lived). Notably, this
means that parsing/planning is all done in the short-lived context
and doesn't leak cruft into fcontext.
This patch also fixes the allocation of execution_state records
so that we don't leak them across executions. I set that up
with a re-usable array that contains at least as many
execution_state structs as we need for the current querytree.
The chain structure is still there, but it's not really doing
much for us, and maybe somebody will be motivated to get rid
of it. I'm not though.
This incidentally also moves the call of BlessTupleDesc to be
with the code that creates the JunkFilter. That doesn't make
much difference now, but a later patch will reduce the number
of times the JunkFilter gets made, and we needn't bless the
results any more often than that.
We still leak a fair amount in fcontext, particularly when
executing utility statements, but that's material for a
separate patch step; the point here is only to get rid of
unintentional allocations in fcontext.
Late in the development of commit 0dca5d68d, I added a step to copy
the result tlist we extract from the cached final query, because
I was afraid that that might not last as long as the JunkFilter that
we're passing it off to. However, that turns out to cost a noticeable
number of cycles, and it's really quite unnecessary because the
JunkFilter will not examine that tlist after it's been created.
(ExecFindJunkAttribute would use it, but we don't use that function
on this JunkFilter.) Hence, remove the copy step. For safety,
reset the might-become-dangling jf_targetList pointer to NIL.
In passing, remove DR_sqlfunction.cxt, which we don't use anymore;
it's confusing because it's not entirely clear which context it
ought to point at.
Commit 0bada39c83 fixed a bug of this kind,
which existed in all branches for six days before detection. While the
probability of reaching the trouble was low, the disruption was extreme. No
new backends could start, and service restoration needed an immediate
shutdown. Hence, add this to catch the next bug like it.
The new check in RelationIdGetRelation() suffices to make autovacuum detect
the bug in commit 243e9b40f1 that led to commit
0bada39. This also checks in a number of similar places. It replaces each
Assert(IsTransactionState()) that pertained to a conditional catalog read.
No back-patch for now, but a back-patch of commit 243e9b4 should back-patch
this, too. A back-patch could omit the src/test/regress changes, since back
branches won't gain new index columns.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/20250410191830.0e.nmisch@google.com
Discussion: https://postgr.es/m/10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com
estimate_multivariate_ndistinct() is coded to assume the caller handles
passing it a list of GroupVarInfos with unique 'var' fields over the
entire list. 6bb6a62f3 added code which didn't ensure this and that
could result in estimate_multivariate_ndistinct() erroring out with:
ERROR: corrupt MVNDistinct entry
This occurred because estimate_multivariate_ndistinct() first searches
for a set of stats that match to at least two of the given GroupVarInfos
and then later assumes that the MVNDistinctItem.items array of the
best matching stats will have an entry for those two columns. If the
GroupVarInfos List contained a duplicate entry then the same column could
be matched to twice and that could trick the code into thinking we have
>= 2 columns matched in cases where only a single distinct column has been
matched. This could result in a failure to find the correct
MVNDistinctItem in the stats as the array containing those never
contains an item for single columns.
Here we make it more clear that the function needs a distinct set of
GroupVarInfos and also tidy up a few other comments to make things a bit
easier to follow.
Author: David Rowley <drowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvocZCUhM9W9mJ39d6oQz7ePKoqFnao_347mvC-A7QatcQ@mail.gmail.com
Buildfarm member drongo complained because the definitions of these
functions used "const Oid foo" where the forward declarations just
had "Oid foo". (I'm a bit surprised that drongo seems to be the only
complainant.) I chose to fix this by removing the "consts" because
(a) I'm generally not a fan of using const that way, and (b) it was
a minority usage even within these two functions, let alone compared
to the rest of our code base.
Oversight in commit eec0040c4, so no need for back-patch.
We were unnecessarily acquiring AccessExclusiveLock on all child tables
when "ALTER TABLE ONLY sometab ADD PRIMARY KEY" was run on their parent
table, an oversight in commit 14e87ffa5c. This caused deadlocks
during pg_restore of partitioned tables.
The reason to acquire the AEL was that we need to verify that child
tables have the involved columns already marked as not-null; but if the
parent table has an inheritable not-null constraint, then all children
must necessarily be in the correct state already, so we can skip the
check, which avoids acquiring the lock. Reorder the code so that it
works that way. This doesn't change things in the case where the
constraint doesn't exist, but that case is of lesser importance because
it doesn't occur during parallel pg_restore.
While at it, reword some errmsg() and add errhint() to similar cases in
related but not adjacent code.
Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/67469c1c-38bc-7d94-918a-67033f5dd731@gmx.net
Discussion: https://postgr.es/m/2045026.1743801143@sss.pgh.pa.us
Discussion: https://postgr.es/m/1280408.1744650810@sss.pgh.pa.us
Memoize typically marks cache entries as complete after fully scanning
the inner side of a join. However, in the case of unique joins, we
skip to the next outer tuple as soon as the first matching inner tuple
is found, leaving no opportunity to scan the inner side to completion.
To work around that, we mark cache entries as complete after fetching
the first matching inner tuple in unique joins.
This approach is only safe when all of the join's restriction clauses
are parameterized; otherwise, there is no guarantee that reading just
one tuple from the inner side is sufficient.
Currently, we check for this by verifying that the number of clauses
in ppi_clauses is no less than the number of the join's restriction
clauses. However, this check isn't entirely reliable, as ppi_clauses
includes join clauses available from all outer rels, not just the
current outer rel. This means the check could pass even if a
restriction clause isn't parameterized, as long as another join
clause, which doesn't belong to the current join, is included in
ppi_clauses.
To fix this, we explicitly check whether each restriction clause of
the current join is present in ppi_clauses.
While we're here, remove the XXX comment from the modified code, as
it's not justified; in certain cases, it's not possible to move a join
clause to the inner side.
This is arguably a bugfix, but no backpatch given the lack of field
reports.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-8JPouj=wBDj4DhK-WO4+Xdx=A2jbjvvyyTBQneJ1=BQ@mail.gmail.com
If a GENERATED column is declared to have a domain data type where
the domain's constraints disallow null values, INSERT commands failed
because we built a targetlist that included coercing a null constant
to the domain's type. The failure occurred even when the generated
value would have been perfectly OK. This is adjacent to the issues
fixed in 0da39aa76, but we didn't notice for lack of testing a domain
with such a constraint.
We aren't going to use the result of the targetlist entry for the
generated column --- ExecComputeStoredGenerated will overwrite it.
So it's not really necessary that it have the exact datatype of
the generated column. This patch fixes the problem by changing
the targetlist entry to be a null Const of the domain's base type,
which should be sufficiently legal. (We do have to tweak
ExecCheckPlanOutput to accept the situation, though.)
This has been broken since we implemented generated columns.
However, this patch only applies easily as far back as v14, partly
because I (tgl) only carried 0da39aa76 back that far, but mostly
because v14 significantly refactored the handling of INSERT/UPDATE
targetlists. Given the lack of field complaints and the short
remaining support lifetime of v13, I judge the cost-benefit ratio
not good for devising a version that would work in v13.
Reported-by: jian he <jian.universality@gmail.com>
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACJufxG59tip2+9h=rEv-ykOFjt0cbsPVchhi0RTij8bABBA0Q@mail.gmail.com
Backpatch-through: 14
Commit 0f21db36d made an assumption that GIN triConsistentFns
would not modify their input entryRes[] arrays. But in fact,
the "shim" triConsistentFn that we use for opclasses that don't
supply their own did exactly that, potentially leading to wrong
answers from a GIN index search. Through bad luck, none of the
test cases that we have for such opclasses exposed the bug.
One response to this could be that the assumption of consistency check
functions not modifying entryRes[] arrays is a bad one, but it still
seems reasonable to me. Notably, shimTriConsistentFn is itself
assuming that with respect to the underlying boolean consistentFn,
so it's sure being self-centered in supposing that it gets to do so.
Fortunately, it's quite simple to fix shimTriConsistentFn to restore
the entry-time state of entryRes[], so let's do that instead.
This issue doesn't affect any core GIN opclasses, since they all
supply their own triConsistentFns. It does affect contrib modules
btree_gin, hstore, and intarray.
Along the way, I (tgl) noticed that shimTriConsistentFn failed to
pick up on a "recheck" flag returned by its first call to the boolean
consistentFn. This may be only a latent problem, since it would be
unlikely for a consistentFn to set recheck for the all-false case
and not any other cases. (Indeed, none of our contrib modules do
that.) Nonetheless, it's formally wrong.
Reported-by: Vinod Sridharan <vsridh90@gmail.com>
Author: Vinod Sridharan <vsridh90@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFMdLD7XzsXfi1+DpTqTgrD8XU0i2C99KuF=5VHLWjx4C1pkcg@mail.gmail.com
Backpatch-through: 13
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in a few places. These
inconsistencies were all introduced during Postgres 18 development.
This commit was written with help from clang-tidy, by mechanically
applying the same rules as similar clean-up commits (the earliest such
commit was commit 035ce1fe).
synchronous_standby_names cannot be reloaded safely by backends, and the
checkpointer is in charge of updating a state in shared memory if the
GUC is enabled in WalSndCtl, to let the backends know if they should
wait or not for a given LSN. This provides a strict control on the
timing of the waiting queues if the GUC is enabled or disabled, then
reloaded. The checkpointer is also in charge of waking up the backends
that could be waiting for a LSN when the GUC is disabled.
This logic had a race condition at startup, where it would be possible
for backends to not wait for a LSN even if synchronous_standby_names is
enabled. This would cause visibility issues with transactions that we
should be waiting for but they were not. The problem lasts until the
checkpointer does its initial update of the shared memory state when it
loads synchronous_standby_names.
In order to take care of this problem, the shared memory state in
WalSndCtl is extended to detect if it has been initialized by the
checkpointer, and not only check if synchronous_standby_names is
defined. In WalSndCtlData, sync_standbys_defined is renamed to
sync_standbys_status, a bits8 able to know about two states:
- If the shared memory state has been initialized. This flag is set by
the checkpointer at startup once, and never removed.
- If synchronous_standby_names is known as defined in the shared memory
state. This is the same as the previous sync_standbys_defined in
WalSndCtl.
This method gives a way for backends to decide what they should do until
the shared memory area is initialized, and they now ultimately fall back
to a check on the GUC value in this case, which is the best thing that
can be done.
Fortunately, SyncRepUpdateSyncStandbysDefined() is called immediately by
the checkpointer when this process starts, so the window is very narrow.
It is possible to enlarge the problematic window by making the
checkpointer wait at the beginning of SyncRepUpdateSyncStandbysDefined()
with a hardcoded sleep for example, and doing so has showed that a 2PC
visibility test is indeed failing. On machines slow enough, this bug
would cause spurious failures.
In 17~, we have looked at the possibility of adding an injection point
to have a reproducible test, but as the problematic window happens at
early startup, we would need to invent a way to make an injection point
optionally persistent across restarts when attached, something that
would be fine for this case as it would involve the checkpointer. This
issue is quite old, and can be reproduced on all the stable branches.
Author: Melnikov Maksim <m.melnikov@postgrespro.ru>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/163fcbec-900b-4b07-beaa-d2ead8634bec@postgrespro.ru
Backpatch-through: 13
Sami complained that there's a discrepancy between n_mod_since_analyze
and n_ins_since_vacuum, as the former only accounts for committed changes
and the latter tracks committed and aborted inserts. Nobody seemed
overly concerned that this would cause any concerning issues. The
repercussions, from what I can tell, are limited to causing an
autovacuum to trigger for inserts sooner than it otherwise might. For
typical ratios of commits to aborts, it's unlikely to ever be noticed.
Fixing things to make it so n_ins_since_vacuum only displays committed
inserts would require an additional field in PgStat_TableCounts, which
does not quite seem worthwhile at this stage. This commit just adds a
comment with some details to mention that we know about it, which will
hopefully prevent repeat discussions.
Reported-by: Sami Imseih <samimseih@gmail.com>
Author: David Rowley <drowleyml@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpgV3a-R2EGmPOh0L-x3pHbZpM3y4dySWfy+UqUazwDQA@mail.gmail.com
By inspection, ip_addrsize() can't return a negative result.
(If it could, we'd have way bigger problems elsewhere.)
So delete useless check in network_send(). Most C compilers
are probably perfectly capable of removing this code by
themselves, but it's confusing/misleading.
Bug: #18889
Reported-by: Daniel Elishakov <dan-eli@mail.ru>
Discussion: https://postgr.es/m/18889-73d4f19e953a629e@postgresql.org
Data loss can happen when the DDLs like ALTER PUBLICATION ... ADD TABLE ...
or ALTER TYPE ... that don't take a strong lock on table happens
concurrently to DMLs on the tables involved in the DDL. This happens
because logical decoding doesn't distribute invalidations to concurrent
transactions and those transactions use stale cache data to decode the
changes. The problem becomes bigger because we keep using the stale cache
even after those in-progress transactions are finished and skip the
changes required to be sent to the client.
This commit fixes the issue by distributing invalidation messages from
catalog-modifying transactions to all concurrent in-progress transactions.
This allows the necessary rebuild of the catalog cache when decoding new
changes after concurrent DDL.
We observed performance regression primarily during frequent execution of
*publication DDL* statements that modify the published tables. The
regression is minor or nearly nonexistent for DDLs that do not affect the
published tables or occur infrequently, making this a worthwhile cost to
resolve a longstanding data loss issue.
An alternative approach considered was to take a strong lock on each
affected table during publication modification. However, this would only
address issues related to publication DDLs (but not the ALTER TYPE ...)
and require locking every relation in the database for publications
created as FOR ALL TABLES, which is impractical.
The bug exists in all supported branches, but we are backpatching till 14.
The fix for 13 requires somewhat bigger changes than this fix, so the fix
for that branch is still under discussion.
Reported-by: hubert depesz lubaczewski <depesz@depesz.com>
Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Tested-by: Benoit Lobréau <benoit.lobreau@dalibo.com>
Backpatch-through: 14
Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
Discussion: https://postgr.es/m/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com
d69d45a5a changed how em_is_child members are stored in
EquivalenceClasses. Children are no longer stored in the ec_members
list. optimizer/README mentioned that most operations "should ignore
child members", but that felt a little untrue now since child members
are now stored in a separate place, they simply won't be found by the
normal means of looking (a foreach loop over ec_members), and if you don't
find them, there's technically no need to "ignore" them.
Here we tweak the wording slightly to reflect the new storage location
for child members.
Reported-by: Amit Langote <amitlangote09@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqE8v=EuAP_3F_A2xn8zWx+nG_etW_Fe_DvKO-Fkx=+DdQ@mail.gmail.com
This moves/renames some of the functions defined in pg_numa.c:
* pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and
moved to src/backend/storage/ipc/shmem.c. The new name better reflects
that the page size is not related to NUMA, and it's specifically about
the page size used for the main shared memory segment.
* move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into
the backend (which more appropriate for functions callable from SQL).
While at it, improve the comment to explain what page size it returns.
* remove unnecessary includes from src/port/pg_numa.c, adding
unnecessary dependencies (src/port should be suitable for frontent).
These were either leftovers or unnecessary thanks to the other changes
in this commit.
This eliminates unnecessary dependencies on backend symbols, which we
don't want in src/port.
Reported-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
Change MyCancelKeyLength's type from uint8 to int. While it always
fits in a uint8, plain int is less surprising, as there's no
particular reason for it to be uint8.
Fix one ProcSignalInit caller that passed 'false' instead of NULL for
the pointer argument.
Author: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/61be9e31-7b7d-49d5-bc11-721800d89d64@eisentraut.org
These examples fail to account for join clauses generated by
EquivalenceClasses, but since we haven't mentioned EquivalenceClasses
yet it seems like it'd just add confusion to make them fully accurate.
Instead, parenthetically note that they're oversimplified.
Reported-by: Zeyuan Hu <ferrishu3886@gmail.com>
Co-authored-by: David Rowley <dgrowleyml@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACvHWmYFo+60yMqKJajDDvKN5EM41YHrCT3oxukwXmGAqpWvyw@mail.gmail.com
The issue happens when building conflict information during apply of
INSERT or UPDATE operations that violate unique constraints on leaf
partitions.
The problem was introduced in commit 9ff68679b5, which removed the
redundant calls to ExecOpenIndices/ExecCloseIndices. The previous code was
relying on the redundant ExecOpenIndices call in
apply_handle_tuple_routing() to build the index information required for
unique key conflict detection.
The fix is to delay building the index information until a conflict is
detected instead of relying on ExecOpenIndices to do the same. The
additional benefit of this approach is that it avoids building index
information when there is no conflict.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by:Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/TYAPR01MB57244ADA33DDA57119B9D26494A62@TYAPR01MB5724.jpnprd01.prod.outlook.com
It can be set to either COPY (the default) or CLONE if the system
supports it. CLONE causes callers of copydir(), currently CREATE
DATABASE ... STRATEGY=FILE_COPY and ALTER DATABASE ... SET TABLESPACE =
..., to use copy_file_range (Linux, FreeBSD) or copyfile (macOS) to copy
files instead of a read-write loop over the contents.
CLONE gives the kernel the opportunity to share block ranges on
copy-on-write file systems and push copying down to storage on others,
depending on configuration. On some systems CLONE can be used to clone
large databases quickly with CREATE DATABASE ... TEMPLATE=source
STRATEGY=FILE_COPY.
Other operating systems could be supported; patches welcome.
Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Ranier Vilela <ranier.vf@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
This adds a function for retrieving memory context statistics
and information from backends as well as auxiliary processes.
The intended usecase is cluster debugging when under memory
pressure or unanticipated memory usage characteristics.
When calling the function it sends a signal to the specified
process to submit statistics regarding its memory contexts
into dynamic shared memory. Each memory context is returned
in detail, followed by a cumulative total in case the number
of contexts exceed the max allocated amount of shared memory.
Each process is limited to use at most 1Mb memory for this.
A summary can also be explicitly requested by the user, this
will return the TopMemoryContext and a cumulative total of
all lower contexts.
In order to not block on busy processes the caller specifies
the number of seconds during which to retry before timing out.
In the case where no statistics are published within the set
timeout, the last known statistics are returned, or NULL if
no previously published statistics exist. This allows dash-
board type queries to continually publish even if the target
process is temporarily congested. Context records contain a
timestamp to indicate when they were submitted.
Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Discussion: https://postgr.es/m/CAH2L28v8mc9HDt8QoSJ8TRmKau_8FM_HKS41NeO9-6ZAkuZKXw@mail.gmail.com
Before, BAS_BULKREAD was always of size 256kB. With the default
io_combine_limit of 16, that only allowed 1-2 IOs to be in flight -
insufficient even on very low latency storage.
We don't just want to increase the size to a much larger hardcoded value, as
very large rings (10s of MBs of of buffers), appear to have negative
performance effects when reading in data that the OS has cached (but not when
actually needing to do IO).
To address this, increase the size of BAS_BULKREAD to allow for
io_combine_limit * effective_io_concurrency buffers getting read in. To
prevent the ring being much larger than useful, limit the increased size with
GetPinLimit().
The formula outlined above keeps the ring size to sizes for which we have not
observed performance regressions, unless very large effective_io_concurrency
values are used together with large shared_buffers setting.
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt
In addition to the added functions, the pg_buffercache_evict() function now
shows whether the buffer was flushed.
pg_buffercache_evict_relation(): Evicts all shared buffers in a
relation at once.
pg_buffercache_evict_all(): Evicts all shared buffers at once.
Both functions provide mechanism to evict multiple shared buffers at
once. They are designed to address the inefficiency of repeatedly calling
pg_buffercache_evict() for each individual buffer, which can be time-consuming
when dealing with large shared buffer pools. (e.g., ~477ms vs. ~2576ms for
16GB of fully populated shared buffers).
These functions are intended for developer testing and debugging
purposes and are available to superusers only.
Minimal tests for the new functions are included. Also, there was no test for
pg_buffercache_evict(), test for this added too.
No new extension version is needed, as it was already increased this release
by ba2a3c2302.
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Aidar Imamov <a.imamov@postgrespro.ru>
Reviewed-by: Joseph Koshakow <koshy44@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com
When planning queries to partitioned tables, we clone all
EquivalenceMembers belonging to the partitioned table into em_is_child
EquivalenceMembers for each non-pruned partition. For partitioned tables
with large numbers of partitions, this meant the ec_members list could
become large and code searching that list would become slow. Effectively,
the more partitions which were present, the more searches needed to be
performed for operations such as find_ec_member_matching_expr() during
create_plan() and the more partitions present, the longer these searches
would take, i.e., a quadratic slowdown.
To fix this, here we adjust how we store EquivalenceMembers for
em_is_child members. Instead of storing these directly in ec_members,
these are now stored in a new array of Lists in the EquivalenceClass,
which is indexed by the relid. When we want to find EquivalenceMembers
belonging to a certain child relation, we can narrow the search to the
array element for that relation.
To make EquivalenceMember lookup easier and to reduce the amount of code
change, this commit provides a pair of functions to allow iteration over
the EquivalenceMembers of an EC which also handles finding the child
members, if required. Callers that never need to look at child members
can remain using the foreach loop over ec_members, which will now often
be faster due to only parent-level members being stored there.
The actual performance increases here are highly dependent on the number
of partitions and the query being planned. Performance increases can be
visible with as few as 8 partitions, but the speedup is marginal for
such low numbers of partitions. The speedups become much more visible
with a few dozen to hundreds of partitions. With some tested queries
using 56 partitions, the planner was around 3x faster than before. For
use cases with thousands of partitions, these are likely to become
significantly faster. Some testing has shown planner speedups of 60x or
more with 8192 partitions.
Author: Yuya Watari <watari.yuya@gmail.com>
Co-authored-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Tested-by: Thom Brown <thom@linux.com>
Tested-by: newtglobal postgresql_contributors <postgresql_contributors@newtglobalcorp.com>
Discussion: https://postgr.es/m/CAJ2pMkZNCgoUKSE%2B_5LthD%2BKbXKvq6h2hQN8Esxpxd%2Bcxmgomg%40mail.gmail.com
Some tests try to invalidate logical slots on the standby server by
running VACUUM on the primary. The problem is that xl_running_xacts was
getting generated and replayed before the VACUUM command, leading to the
advancement of the active slot's catalog_xmin. Due to this, active slots
were not getting invalidated, leading to test failures.
We fix it by skipping the generation of xl_running_xacts for the required
tests with the help of injection points. As the required interface for
injection points was not present in back branches, we fixed the failing
tests in them by disallowing the slot to become active for the required
cases (where rows_removed conflict could be generated).
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 16, where it was introduced
Discussion: https://postgr.es/m/Z6oQXc8LmiTLfwLA@ip-10-97-1-34.eu-west-3.compute.internal
A PG 17 optimization allowed columns with NOT NULL constraints to skip
table scans for IS NULL queries, and to skip IS NOT NULL checks for IS
NOT NULL queries. This didn't work for domain types, since domain types
don't follow the IS NULL/IS NOT NULL constraint logic. To fix, disable
this optimization for domains for PG 17+.
Reported-by: Jan Behrens
Diagnosed-by: Tom Lane
Discussion: https://postgr.es/m/Z37p0paENWWUarj-@momjian.us
Backpatch-through: 17
WAL senders do not flush their statistics until they exit, limiting the
monitoring possible for live processes. This is penalizing when WAL
senders are running for a long time, like in streaming or logical
replication setups, because it is not possible to know the amount of IO
they generate while running.
This commit makes WAL senders more aggressive with their statistics
flush, using an internal of 1 second, with the flush timing calculated
based on the existing GetCurrentTimestamp() done before the sleeps done
to wait for some activity. Note that the sleep done for logical and
physical WAL senders happens in two different code paths, so the stats
flushes need to happen in these two places.
One test is added for the physical WAL sender case, and one for the
logical WAL sender case. This can be done in a stable fashion by
relying on the WAL generated by the TAP tests in combination with a
stats reset while a server is running, but only on HEAD as WAL data has
been added to pg_stat_io in a051e71e28.
This issue exists since a9c70b46db and the introduction of pg_stat_io,
so backpatch down to v16.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/Z73IsKBceoVd4t55@ip-10-97-1-34.eu-west-3.compute.internal
Backpatch-through: 16
Introduce new pg_shmem_alloctions_numa view with information about how
shared memory is distributed across NUMA nodes. For each shared memory
segment, the view returns one row for each NUMA node backing it, with
the total amount of memory allocated from that node.
The view may be relatively expensive, especially when executed for the
first time in a backend, as it has to touch all memory pages to get
reliable information about the NUMA node. This may also force allocation
of the shared memory.
Unlike pg_shmem_allocations, the view does not show anonymous shared
memory allocations. It also does not show memory allocated using the
dynamic shared memory infrastructure.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with the NUMA library.
The main function introduced is pg_numa_query_pages(), which allows
determining the NUMA node for individual memory pages. Internally the
function uses move_pages(2) syscall, as it allows batching, and is more
efficient than get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
In some edge cases valgrind flags issues with the memory referenced by
IOs. All of the cases addressed in this change are false positives.
Most of the false positives are caused by UnpinBuffer[NoOwner] marking buffer
data as inaccessible. This happens even though the AIO subsystem still holds a
pin. That's good, there shouldn't be accesses to the buffer outside of AIO
related code until it is pinned by "user" code again. But it requires some
explicit work - if the buffer is not pinned by the current backend, we need to
explicitly mark the buffer data accessible/inaccessible while executing
completion callbacks.
That however causes a cascading issue in IO workers: After the completion
callbacks for a buffer is executed, the page is marked as inaccessible. If
subsequently the same worker is executing IO targeting the same buffer, we
would get an error, as the memory is still marked inaccessible. To avoid that,
we need to explicitly mark the memory as accessible in IO workers.
Another issue is that IO executed in workers or via io_uring will not mark
memory as DEFINED. In the case of workers that is because valgrind does not
track memory definedness across processes. For io_uring that is because
valgrind does not understand io_uring, and therefore its IOs never mark memory
as defined, whether the completions are processed in the defining process or
in another context. It's not entirely clear how to best solve that. The
current user of AIO is not affected, as it explicitly marks buffers as DEFINED
& NOACCESS anyway. Defer solving this issue until we have a user with
different needs.
Per buildfarm animal skink.
Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6
This mirrors 1e0dfd166b (+ 46ef520b95), for temporary table buffers. This
is mainly interesting right now because the AIO work currently triggers
spurious valgrind errors, and the fix for that is cleaner if temp buffers
behave the same as shared buffers.
This requires one change beyond the annotations themselves, namely to pin
local buffers while writing them out in FlushRelationBuffers().
Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6
The list of transform objects that a function should use is specified
in CREATE FUNCTION's TRANSFORM clause, and then represented indirectly
in pg_proc.protrftypes. However, ProcedureCreate completely ignored
that for purposes of constructing pg_depend entries, and instead made
the function depend on any transforms that exist for its parameter or
return data types. This is bad in both directions: the function could
be made dependent on a transform it does not actually use, or it
could try to use a transform that's since been dropped. (The latter
scenario would require use of a transform that's not for any of the
parameter or return types, but that seems legit for cases where the
function performs SQL operations internally.)
To fix, pass in the list of transform objects that CreateFunction
identified, and build pg_depend entries from that not from the
parameter/return types. This results in changes in the expected
test outputs in contrib/bool_plperl, which I guess are due to
different ordering of pg_depend entries -- that test case is
surely not exercising either of the problem scenarios.
This fix is not back-patchable as-is: changing the signature of
ProcedureCreate seems too risky in stable branches. We could
do something like making ProcedureCreate a wrapper around
ProcedureCreateExt or so. However, I'm more inclined to do
nothing in the back branches. We had no field complaints up to
now, so the hazards don't seem to be a big issue in practice.
And we couldn't do anything about existing pg_depend entries,
so a back-patched fix would result in a mishmash of dependencies
created according to different rules. That cure could be worse
than the disease, perhaps.
I bumped catversion just to lay down a marker that the expected
contents of pg_depend are a bit different than before.
Reported-by: Chapman Flack <jcflack@acm.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3112950.1743984111@sss.pgh.pa.us
This allows them to be added without scanning the table, and validating
them afterwards without holding access exclusive lock on the table after
any violating rows have been deleted or fixed.
Doing ALTER TABLE ... SET NOT NULL for a column that has an invalid
not-null constraint validates that constraint. ALTER TABLE .. VALIDATE
CONSTRAINT is also supported. There are various checks on whether an
invalid constraint is allowed in a child table when the parent table has
a valid constraint; this should match what we do for enforced/not
enforced constraints.
pg_attribute.attnotnull is now only an indicator for whether a not-null
constraint exists for the column; whether it's valid or invalid must be
queried in pg_constraint. Applications can continue to query
pg_attribute.attnotnull as before, but now it's possible that NULL rows
are present in the column even when that's set to true.
For backend internal purposes, we cache the nullability status in
CompactAttribute->attnullability that each tuple descriptor carries
(replacing CompactAttribute.attnotnull, which was a mirror of
Form_pg_attribute.attnotnull). During the initial tuple descriptor
creation, based on the pg_attribute scan, we set this to UNRESTRICTED if
pg_attribute.attnotnull is false, or to UNKNOWN if it's true; then we
update the latter to VALID or INVALID depending on the pg_constraint
scan. This flag is also copied when tupledescs are copied.
Comparing tuple descs for equality must also compare the
CompactAttribute.attnullability flag and return false in case of a
mismatch.
pg_dump deals with these constraints by storing the OIDs of invalid
not-null constraints in a separate array, and running a query to obtain
their properties. The regular table creation SQL omits them entirely.
They are then dealt with in the same way as "separate" CHECK
constraints, and dumped after the data has been loaded. Because no
additional pg_dump infrastructure was required, we don't bump its
version number.
I decided not to bump catversion either, because the old catalog state
works perfectly in the new world. (Trying to run with new catalog state
and the old server version would likely run into issues, however.)
System catalogs do not support invalid not-null constraints (because
commit 14e87ffa5c didn't allow them to have pg_constraint rows
anyway.)
Author: Rushabh Lathia <rushabh.lathia@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Tested-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAGPqQf0KitkNack4F5CFkFi-9Dqvp29Ro=EpcWt=4_hs-Rt+bQ@mail.gmail.com
This escape shows the numeric server IP address that the client
has connected to. Unix-socket connections will show "[local]".
Non-client processes (e.g. background processes) will show "[none]".
We expect that this option will be of interest to only a fairly
small number of users. Therefore the implementation is optimized
for the case where it's not used (that is, we don't do the string
conversion until we have to), and we've not added the field to
csvlog or jsonlog formats.
Author: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Cary Huang <cary.huang@highgo.ca>
Reviewed-by: David Steele <david@pgmasters.net>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAKAnmmK-U+UicE-qbNU23K--Q5XTLdM6bj+gbkZBZkjyjrd3Ow@mail.gmail.com
If the limit returned by GetAdditionalPinLimit() is large, the buffer_limit
variable in read_stream_start_pending_read() can overflow. While the code is
careful to limit buffer_limit PG_INT16_MAX, we subsequently add the number of
forwarded buffers.
The overflow can lead to assertion failures, crashes or wrong query results
when using large shared buffers.
It seems easier to avoid this if we make the buffer_limit variable an int,
instead of an int16. Do so, and clamp buffer_limit after adding the number of
forwarded buffers.
It's possible we might want to address this and related issues more widely by
changing to int instead of int16 more widely, but since the consequences of
this bug can be confusing, it seems better to fix it now.
This bug was introduced in ed0b87caac.
Discussion: https://postgr.es/m/ewvz3cbtlhrwqk7h6ca6cctiqh7r64ol3pzb3iyjycn2r5nxk5@tnhw3a5zatlr
fc069a3a63 implements Self-Join Elimination (SJE) and provides a new GUC
variable: enable_self_join_elimination. This new GUC variable was marked
as GUC_NOT_IN_SAMPLE. However, enable_self_join_elimination is documented
and is not different from any other enable_* GUCs. Thus, remove
GUC_NOT_IN_SAMPLE from it and add it to the postgresql.conf.sample.
Discussion: https://postgr.es/m/CAPpHfdsqMTEsmxk3aQwt6xPz%2BKpUELO%3D6fzmER9ZRGrbs4uMfA%40mail.gmail.com
Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
palloc() is invoked with a specific formula for its allocation size in
quote_literal_cstr(). This wastes some memory, but the size is large
enough to cover even the worst-case scenarios.
No explanations were given about the reasons behind these numbers. This
commit adds more documentation about all that.
Author: Steve Chavez <steve@supabase.io>
Discussion: https://postgr.es/m/CAGRrpzZ9bToRWS+fAnjxDJrxwZN1QcJ-y1Pn2yg=Hst6rydLtw@mail.gmail.com
stats_fetch_consistency set to "snapshot" causes the backend entry
"beentry" retrieved by pgstat_get_beentry_by_proc_number() to be reset
at the beginning of pgstat_fetch_stat_backend() when fetching the
backend pgstats entry. As coded, "beentry" was being accessed after
being freed. This commit moves all the accesses to "beentry" to happen
before calling pgstat_fetch_stat_backend(), fixing the problem.
This problem could be reached by calling the SQL functions
pg_stat_get_backend_io() or pg_stat_get_backend_wal().
Issue caught by valgrind.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/f1788cc0-253a-4a3a-aee0-1b8ab9538736@gmail.com
The XLOG_CONTROL_FILE macro (defined in access/xlog_internal.h)
represents the control file name. While some parts of the codebase already
use this macro, others previously hardcoded the file name as a string.
This commit replaces those hardcoded strings with the macro,
ensuring consistent usage throughout the code. This makes future
maintenance easier and improves searchability, for example when
grepping for control file usage.
Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Masao Fujii <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/0841ec77-47e5-452a-adb4-c6fa55d605fc@postgrespro.ru
PgAioResult.result is never accessed in the relevant path, but coverity
complains about an uninitialized access anyway. So just zero-initialize the
whole thing. While at it, reduce the scope of the variable.
Reported-by: Ranier Vilela <ranier.vf@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/CAEudQApsKqd-s+fsUQ0OmxJAMHmBSXxrAz3dCs+uvqb3iRtjSw@mail.gmail.com
This should silence Coverity's complaints about the result being
sometimes ignored.
I'm inclined to think that these routines are simply misdesigned,
because sometimes it's okay to ignore the result and sometimes it
isn't, and we have no way to enforce the latter. But for now
I just added a comment.
There were several places in ordering-related planning where a
requirement for btree was hardcoded but an amcanorder index could
suffice. This fixes that. We just need to do the necessary mapping
between strategy numbers and compare types and adjust some related
APIs so that this works independent of btree strategy numbers. For
instance, non-btree amcanorder indexes can now be used to support
sorting and merge joins. Also, predtest.c works independent of btree
strategy numbers now.
To avoid performance regressions, some details on btree and other
built-in index types are still hardcoded as shortcuts, but other index
types now have access to the same features by providing the required
flags and callbacks.
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
makeDependencyGraphWalker thought that only SelectStmt nodes could
contain a WithClause. Which was true in our original implementation
of WITH, but astonishingly we missed updating this code when we added
the ability to attach WITH to INSERT/UPDATE/DELETE (and later MERGE).
Moreover, since it was coded to deliberately block recursion to a
WithClause, even updating raw_expression_tree_walker didn't save it.
The upshot of this was that we didn't see references to outer CTE
names appearing within an inner WITH, and would neither complain about
disallowed recursion nor account for such references when sorting CTEs
into a usable order. The lack of complaints about this is perhaps not
so surprising, because typical usage of WITH wouldn't hit either case.
Still, it's pretty broken; failing to detect recursion here leads to
assert failures or worse later on.
Fix by factoring out the processing of sub-WITHs into a new function
WalkInnerWith, and invoking that for all the statement types that
can have WITH.
Bug: #18878
Reported-by: Yu Liang <luy70@psu.edu>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18878-a26fa5ab6be2f2cf@postgresql.org
Backpatch-through: 13
transformJsonArrayQueryConstructor() applied transformStmt() to
the same subquery tree twice. While this causes no issue in many
cases, there are some where it causes a coredump, thanks to the
parser's habit of scribbling on its input.
Fix by making a copy before the first transformation (compare
0f43083d1). This is quite brute-force, but then so is the
whole business of transforming the input twice. Per discussion
in the bug thread, this implementation of json_array() parsing
should be replaced completely. But that will take some work
and will surely not be back-patchable, so for the moment let's
take the easy way out.
Oversight in 7081ac46a. Back-patch to v16 where that came in.
Bug: #18877
Reported-by: Yu Liang <luy70@psu.edu>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18877-c3c3ad75845833bb@postgresql.org
Backpatch-through: 16
Since v15 we've had an option to apply a foreign key constraint's
ON DELETE SET DEFAULT or SET NULL action to just some of the
referencing columns. There was not a check for duplicate entries in
the list of columns-to-set, though. That caused a potential memory
stomp in CreateConstraintEntry(), which incautiously assumed that
the list of columns-to-set couldn't be longer than the number of key
columns. Even after fixing that, the case doesn't work because you
get an error like "multiple assignments to same column" from the SQL
command that is generated to do the update.
We could either raise an error for duplicate columns or silently
suppress the dups, and after a bit of thought I chose to do the
latter. This is motivated by the fact that duplicates in the FK
column list are legal, so it's not real clear why duplicates
in the columns-to-set list shouldn't be. Of course there's no
need to actually set the column more than once.
I left in the fix in CreateConstraintEntry() too, just because
it didn't seem like such low-level code ought to be making
assumptions about what it's handed.
Bug: #18879
Reported-by: Yu Liang <luy70@psu.edu>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18879-259fc59d072bd4d7@postgresql.org
Backpatch-through: 15
This is yet another bit of fallout from the fact that backend/parser
(like other code) feels free to scribble on the parse tree it's
handed. In this case that resulted in modifying the
relatively-short-lived copy in the cached function's source_list.
That would be fine since we only need each source_list tree once
... except that if the parser fails after making some changes,
the function cache entry remains as-is and will still be there
if the user tries to execute the function again. Then we have
problems because we're feeding a non-pristine tree to the parser.
The most expedient fix is a quick copyObject(). I considered
other answers like somehow marking the cache entry invalid
temporarily, but that would add complexity and I'm not sure
it's worth it. In typical scenarios we'd only do this once
per function query per session.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6d442183-102c-498a-81d1-eeeb086cdc5a@gmail.com
Transform low_compare and high_compare nbtree skip array inequalities
(with opclasses that offer skip support) in such a way as to allow
_bt_first to consistently apply later keys when it descends the tree.
This can lower the number of index searches for multi-column scans that
use a ">" key on one of the index's prefix columns (or use a "<" key,
when scanning backwards) when it precedes some later lower-order key.
For example, an index qual "WHERE a > 5 AND b = 2" will now be converted
to "WHERE a >= 6 AND b = 2" by a new preprocessing step that takes place
after low_compare and high_compare have been finalized. That way, the
initial call to _bt_first can use "WHERE a >= 6 AND b = 2" to find an
initial position, rather than just using "WHERE a > 5" -- "b = 2" can be
applied during every _bt_first call. There's a decent chance that this
will allow such a scan to avoid the extra search that might otherwise be
needed to determine the lowest "a" value still satisfying "WHERE a > 5".
The transformation process can only lower the total number of index
pages read when the use of a more restrictive set of initial positioning
keys in _bt_first actually allows the scan to land on some later leaf
page directly, relative to the unoptimized case (or on an earlier leaf
page directly, when scanning backwards). But the savings can really add
up in cases where an affected skip array comes after some other array.
For example, a scan indexqual "WHERE x IN (1, 2, 3) AND y > 5 AND z = 2"
can save as many as 3 _bt_first calls by applying the new transformation
to its "y" array (up to 1 extra search can be avoided per "x" element).
Follow-up to commit 92fe23d9, which added nbtree skip scan.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=FJ78K3WsF3iWNxWnUCY9f=Jdg3QPxaXE=uYUbmuRz5Q@mail.gmail.com
Don't allow nbtree scans with skip arrays to end any primitive scan on
its first leaf page without giving some consideration to how many times
the scan's arrays advanced while changing at least one skip array
(though continue not caring about the number of array advancements that
only affected SAOP arrays, even during skip scans with SAOP arrays).
Now when a scan performs more than 3 such array advancements in the
course of reading a single leaf page, it is taken as a signal that the
next page is unlikely to be skippable. We'll therefore continue the
ongoing primitive index scan, at least until we can perform a recheck
against the next page's finaltup.
Testing has shown that this new heuristic occasionally makes all the
difference with skip scans that were expected to rely on the "passed
first page" heuristic added by commit 9a2e2a28. Without it, there is a
remaining risk that certain kinds of skip scans will never quite manage
to clear the initial hurdle of performing a primitive scan that lasts
beyond its first leaf page (or that such a skip scan will only clear
that initial hurdle when it has already wasted noticeably-many cycles
due to inefficient primitive scan scheduling).
Follow-up to commits 92fe23d9 and 9a2e2a28.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=RVdG3zWytFWBsyW7fWH7zveFvTHed5JKEsuTT0RCO_A@mail.gmail.com
Postgres 17 commit e0b1ee17 added two complementary optimizations to
nbtree: the "prechecked" and "firstmatch" optimizations. _bt_readpage
was made to avoid needlessly evaluating keys that are guaranteed to be
satisfied by applying page-level context. "prechecked" did this for
keys required in the current scan direction, while "firstmatch" did it
for keys required in the opposite-to-scan direction only.
The "prechecked" design had a number of notable issues. It didn't
account for the fact that an = array scan key's sk_argument field might
need to advance at the point of the page precheck (it didn't check the
precheck tuple against the key's array, only the key's sk_argument,
which needlessly made it ineffective in cases involving stepping to a
page having advanced the scan's arrays using a truncated high key).
"prechecked" was also completely ineffective when only one scan key
wasn't guaranteed to be satisfied by every tuple (it didn't recognize
that it was still safe to avoid evaluating other, earlier keys).
The "firstmatch" optimization had similar limitations. It could only be
applied after _bt_readpage found its first matching tuple, regardless of
why any earlier tuples failed to satisfy the scan's index quals. This
allowed unsatisfied non-required scan keys to impede the optimization.
Replace both optimizations with a new optimization, without any of these
limitations: the "startikey" optimization. Affected _bt_readpage calls
generate a page-level key offset ("startikey"), that their _bt_checkkeys
calls can then start at. This is an offset to the first key that isn't
known to be satisfied by every tuple on the page.
Although this is independently useful work, its main goal is to avoid
performance regressions with index scans that use skip arrays, but still
never manage to skip over irrelevant leaf pages. We must avoid wasting
CPU cycles on overly granular skip array maintenance in these cases.
The new "startikey" optimization helps with this by selectively
disabling array maintenance for the duration of a _bt_readpage call.
This has no lasting consequences for the scan's array keys (they'll
still reliably track the scan's progress through the index's key space
whenever the scan is "between pages").
Skip scan adds skip arrays during preprocessing using simple, static
rules, and decides how best to navigate/apply the scan's skip arrays
dynamically, at runtime. The "startikey" optimization enables this
approach. As a result of all this, the planner doesn't need to generate
distinct, competing index paths (one path for skip scan, another for an
equivalent traditional full index scan). The overall effect is to make
scan runtime close to optimal, even when the planner works off an
incorrect cardinality estimate. Scans will also perform well given a
skipped column with data skew: individual groups of pages with many
distinct values (in respect of a skipped column) can be read about as
efficiently as before -- without the scan being forced to give up on
skipping over other groups of pages that are provably irrelevant.
Many scans that cannot possibly skip will still benefit from the use of
skip arrays, since they'll allow the "startikey" optimization to be as
effective as possible (by allowing preprocessing to mark all the scan's
keys as required). A scan that uses a skip array on "a" for a qual
"WHERE a BETWEEN 0 AND 1_000_000 AND b = 42" is often much faster now,
even when every tuple read by the scan has its own distinct "a" value.
However, there are still some remaining regressions, affecting certain
trickier cases.
Scans whose index quals have several range skip arrays, each on some
high cardinality column, can still be slower than they were before the
introduction of skip scan -- even with the new "startikey" optimization.
There are also known regressions affecting very selective index scans
that use a skip array. The underlying issue with such selective scans
is that they never get as far as reading a second leaf page, and so will
never get a chance to consider applying the "startikey" optimization.
In principle, all regressions could be avoided by teaching preprocessing
to not add skip arrays whenever they aren't expected to help, but it
seems best to err on the side of robust performance.
Follow-up to commit 92fe23d9, which added nbtree skip scan.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com
Teach nbtree multi-column index scans to opportunistically skip over
irrelevant sections of the index given a query with no "=" conditions on
one or more prefix index columns. When nbtree is passed input scan keys
derived from a predicate "WHERE b = 5", new nbtree preprocessing steps
output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.
That is, preprocessing generates a "skip array" (and an output scan key)
for the omitted prefix column "a", which makes it safe to mark the scan
key on "b" as required to continue the scan. The scan is therefore able
to repeatedly reposition itself by applying both the "a" and "b" keys.
A skip array has "elements" that are generated procedurally and on
demand, but otherwise works just like a regular ScalarArrayOp array.
Preprocessing can freely add a skip array before or after any input
ScalarArrayOp arrays. Index scans with a skip array decide when and
where to reposition the scan using the same approach as any other scan
with array keys. This design builds on the design for array advancement
and primitive scan scheduling added to Postgres 17 by commit 5bf748b8.
Testing has shown that skip scans of an index with a low cardinality
skipped prefix column can be multiple orders of magnitude faster than an
equivalent full index scan (or sequential scan). In general, the
cardinality of the scan's skipped column(s) limits the number of leaf
pages that can be skipped over.
The core B-Tree operator classes on most discrete types generate their
array elements with the help of their own custom skip support routine.
This infrastructure gives nbtree a way to generate the next required
array element by incrementing (or decrementing) the current array value.
It can reduce the number of index descents in cases where the next
possible indexable value frequently turns out to be the next value
stored in the index. Opclasses that lack a skip support routine fall
back on having nbtree "increment" (or "decrement") a skip array's
current element by setting the NEXT (or PRIOR) scan key flag, without
directly changing the scan key's sk_argument. These sentinel values
behave just like any other value from an array -- though they can never
locate equal index tuples (they can only locate the next group of index
tuples containing the next set of non-sentinel values that the scan's
arrays need to advance to).
A skip array's range is constrained by "contradictory" inequality keys.
For example, a skip array on "x" will only generate the values 1 and 2
given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip
array qual usually has near-identical performance characteristics to a
comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However,
improved performance isn't guaranteed. Much depends on physical index
characteristics.
B-Tree preprocessing is optimistic about skipping working out: it
applies static, generic rules when determining where to generate skip
arrays, which assumes that the runtime overhead of maintaining skip
arrays will pay for itself -- or lead to only a modest performance loss.
As things stand, these assumptions are much too optimistic: skip array
maintenance will lead to unacceptable regressions with unsympathetic
queries (queries whose scan can't skip over many irrelevant leaf pages).
An upcoming commit will address the problems in this area by enhancing
_bt_readpage's approach to saving cycles on scan key evaluation, making
it work in a way that directly considers the needs of = array keys
(particularly = skip array keys).
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
This commit implements the automatic conversion of 'x IN (VALUES ...)' into
ScalarArrayOpExpr. That simplifies the query tree, eliminating the appearance
of an unnecessary join.
Since VALUES describes a relational table, and the value of such a list is
a table row, the optimizer will likely face an underestimation problem due to
the inability to estimate cardinality through MCV statistics. The cardinality
evaluation mechanism can work with the array inclusion check operation.
If the array is small enough (< 100 elements), it will perform a statistical
evaluation element by element.
We perform the transformation in the convert_ANY_sublink_to_join() if VALUES
RTE is proper and the transformation is convertible. The conversion is only
possible for operations on scalar values, not rows. Also, we currently
support the transformation only when it ends up with a constant array.
Otherwise, the evaluation of non-hashed SAOP might be slower than the
corresponding Hash Join with VALUES.
Discussion: https://postgr.es/m/0184212d-1248-4f1f-a42d-f5cb1c1976d2%40tantorlabs.com
Author: Alena Rybakina <a.rybakina@postgrespro.ru>
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Ivan Kush <ivan.kush@tantorlabs.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
This commit extracts the code to generate ScalarArrayOpExpr on top of the list
of expressions from match_orclause_to_indexcol() into a separate function
make_SAOP_expr(). This function was extracted to be used in optimization for
conversion of 'x IN (VALUES ...)' to 'x = ANY ...'. make_SAOP_expr() is
placed in clauses.c file as only two additional headers were needed there
compared with other places.
Discussion: https://postgr.es/m/0184212d-1248-4f1f-a42d-f5cb1c1976d2%40tantorlabs.com
Author: Alena Rybakina <a.rybakina@postgrespro.ru>
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Ivan Kush <ivan.kush@tantorlabs.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Fix for commit 9ef1851685: We have to skip indexes where sortopfamily
is NULL. This takes the place of the previous btree check. Detected
by valgrind on the buildfarm.
Commit 28d3c2ddcf introduced an assertion that if the memorized
downlink location in the insertion stack isn't valid, the parent's
LSN should've changed too. Turns out that was too strict. In
gistFindCorrectParent(), if we walk right, we update the parent's
block number and clear its memorized 'downlinkoffnum'. That triggered
the assertion on next call to gistFindCorrectParent(), if the parent
needed to be split too. Relax the assertion, so that it's OK if
downlinkOffnum is InvalidOffsetNumber.
Backpatch to v13-, all supported versions. The assertion was added in
commit 28d3c2ddcf in v12.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://www.postgresql.org/message-id/18396-03cac9beb2f7aac3@postgresql.org
Previously, "COPY table TO" command worked only with plain tables and
did not support materialized views, even when they were populated and
had physical storage. To copy rows from materialized views,
"COPY (query) TO" command had to be used, instead.
This commit extends "COPY table TO" to support populated materialized
views directly, improving usability and performance, as "COPY table TO"
is generally faster than "COPY (query) TO". Note that copying from
unpopulated materialized views will still result in an error.
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CACJufxHVxnyRYy67hiPePNCPwVBMzhTQ6FaL9_Te5On9udG=yg@mail.gmail.com
This was previously not supported because the btree strategy numbers
were hardcoded. Now we can support this for any index that has the
required strategy mapping support and the required operators.
If an index scan used for get_actual_variable_range() requires
recheck, we now just ignore it instead of erroring out. With btree we
knew this couldn't happen, but now it might.
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
Previously, ALTER DEFAULT PRIVILEGES did not support large objects.
This meant that to grant privileges to users other than the owner,
permissions had to be manually assigned each time a large object
was created, which was inconvenient.
This commit extends ALTER DEFAULT PRIVILEGES to allow defining default
access privileges for large objects. With this change, specified privileges
will automatically apply to newly created large objects, making privilege
management more efficient.
As a side effect, this commit introduces the new keyword OBJECTS
since it's used in the syntax of ALTER DEFAULT PRIVILEGES.
Original patch by Haruka Takatsuka, with some fixes and tests by Yugo Nagata,
and rebased by Laurenz Albe.
Author: Takatsuka Haruka <harukat@sraoss.co.jp>
Co-authored-by: Yugo Nagata <nagata@sraoss.co.jp>
Co-authored-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Masao Fujii <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/20240424115242.236b499b2bed5b7a27f7a418@sraoss.co.jp
This gets rid of the bespoken ProcessWalRcvInterrupts() function,
which lets walreceiver terminate at any CHECK_FOR_INTERRUPTS() call.
And it's less code anyway.
We can now use the standard libpqsrv_connect_params() libpq wrapper
from libpq-be-fe-helpers.h, removing more code. We attempted to do
that earlier already in commit 728f86fec6, but that was reverted
because it didn't call ProcessWalRcvInterrupts() and therefore didn't
react to shutdown requests. Now that ProcessWalRcvInterrupts() is
gone, it works. As stated in that commit, this also leads to
libpqwalreceiver reserving file descriptors for libpq conncetions,
which is nice.
Author: Andres Freund <andres@anarazel.de> (the earlier commit)
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Yura Sokolov <y.sokolov@postgrespro.ru>
Change the PathKey struct to use CompareType to record the sort
direction instead of hardcoding btree strategy numbers. The
CompareType is then converted to the index-type-specific strategy when
the plan is created.
This reduces the number of places btree strategy numbers are
hardcoded, and it's a self-contained subset of a larger effort to
allow non-btree indexes to behave like btrees.
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
Derived clauses are stored in ec_derives, a List of RestrictInfos.
These clauses are later looked up by matching the left and right
EquivalenceMembers along with the clause's parent EC.
This linear search becomes expensive in queries with many joins or
partitions, where ec_derives may contain thousands of entries. In
particular, create_join_clause() can spend significant time scanning
this list.
To improve performance, introduce a hash table (ec_derives_hash) that
is built when the list reaches 32 entries -- the same threshold used
for join_rel_hash. The original list is retained alongside the hash
table to support EC merging and serialization
(_outEquivalenceClass()).
Each clause is stored in the hash table using a canonicalized key: the
EquivalenceMember with the lower memory address is placed in the key
before the one with the higher memory address. This avoids storing or
searching for both permutations of the same clause. For clauses
involving a constant EM, the key places NULL in the first slot and the
non-constant EM in the second.
The hash table is initialized using list_length(ec_derives_list) as
the size hint. simplehash internally adjusts this to the next power of
two after dividing by the fillfactor, so this typically results in at
least 64 buckets near the threshold -- avoiding immediate resizing
while adapting to the actual number of entries.
The lookup logic for derived clauses is now centralized in
ec_search_derived_clause_for_ems(), which consults the hash table when
available and falls back to the list otherwise.
The new ec_clear_derived_clauses() always frees ec_derives_list, even
though some of the original code paths that cleared the old
ec_derives field did not. This ensures consistent cleanup and avoids
leaking memory when large lists are discarded.
An assertion originally placed in find_derived_clause_for_ec_member()
is moved into ec_search_derived_clause_for_ems() so that it is
enforced consistently, regardless of whether the hash table or list is
used for lookup.
This design incorporates suggestions by David Rowley, who proposed
both the key canonicalization and the initial sizing approach to
balance memory usage and CPU efficiency.
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com>
Tested-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Tested-by: Amit Langote <amitlangote09@gmail.com>
Tested-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAExHW5vZiQtWU6moszLP5iZ8gLX_ZAUbgEX0DxGLx9PGWCtqUg@mail.gmail.com
find_derived_clause_for_ec_member() searches for a previously-derived
clause that equates a non-constant EquivalenceMember to a constant.
It is only called for EquivalenceClasses with ec_has_const set, and
with a non-constant member the EquivalenceMember to search for.
The matched clause is expected to have the non-constant member on the
left-hand side and the constant EquivalenceMember on the right.
Assert that the RHS is indeed a constant, to catch violations of this
structure and enforce assumptions made by
generate_base_implied_equalities_const().
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CAExHW5scMxyFRqOFE6ODmBiW2rnVBEmeEcA-p4W_CyuEikURdA@mail.gmail.com
Previously bitmap heap scan was not AIO batchmode safe because of the
visibility map reads potentially done for the "skip fetch" optimization
(which skipped fetching tuples from the heap if the pages were all
visible and none of the columns were used in the query).
The skip fetch optimization implementation was found to have bugs and
was removed in 459e7bf8e2, so we can safely enable batchmode for
bitmap heap scans.
Several read stream users asserted that the read stream was exhausted
after looping on that very condition. It was pointed out in an a
review of an as-of-yet uncommitted read stream user [1] that this was
confusing and could lead the reader to think there was a possibility of
some kind of race condition. Remove these asserts.
[1] https://postgr.es/m/F9ACE8D0-B807-4A17-B6BD-87EF0717983D%40yesql.se
As coded, fmgr_sql() would get an assertion failure for a SQL function
that has an empty body and is declared to return some type other than
VOID. Typically you'd never get that far because fmgr_sql_validator()
would reject such a definition (I suspect that's how come I managed to
miss the bug). But if check_function_bodies is off or the function is
polymorphic, the validation check wouldn't get made.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/0fde377a-3870-4d18-946a-ce008ee5bb88@gmail.com
Previously, invalidated logical and physical replication slots could
be copied using the pg_copy_logical_replication_slot and
pg_copy_physical_replication_slot functions. Replication slots that
were invalidated for reasons other than WAL removal retained their
restart_lsn. This meant that a new slot copied from an invalidated
slot could have a restart_lsn pointing to a WAL segment that might
have already been removed.
This commit restricts the copying of invalidated replication slots.
Backpatch to v16, where slots could retain their restart_lsn when
invalidated for reasons other than WAL removal.
For v15 and earlier, this check is not required since slots can only
be invalidated due to WAL removal, and existing checks already handle
this issue.
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CANhcyEU65aH0VYnLiu%3DOhNNxhnhNhwcXBeT-jvRe1OiJTo_Ayg%40mail.gmail.com
Backpatch-through: 16
The check for non-inheritable constraints is performed later, and the
same comment is included at that point.
While we're here, remove one extraneous blank line.
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CACJufxETi6x86S8EkH8mRfOcm2AenoE9t1pyCFVMpU34gVhF3w@mail.gmail.com
The issue is that the transactions prepared before two-phase decoding is
enabled can fail to replicate to the subscriber after being committed on a
promoted standby following a failover. This is because the two_phase_at
field of a slot, which tracks the LSN from which two-phase decoding
starts, is not synchronized to standby servers. Without two_phase_at, the
logical decoding might incorrectly identify prepared transaction as
already replicated to the subscriber after promotion of standby server,
causing them to be skipped.
To address the issue on HEAD, the two_phase_at field of the slot is
exposed by the pg_replication_slots view and allows the slot
synchronization to copy this value to the corresponding synced slot on the
standby server.
This bug is likely to occur if the user toggles the two_phase option to
true after initial slot creation. Given that altering the two_phase option
of a replication slot is not allowed in PostgreSQL 17, this bug is less
likely to occur. We can't change the view/function definition in
backbranch so we can't push the same fix but we are brainstorming an
appropriate solution for PG17.
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/TYAPR01MB5724CC7C288535BBCEEE65DA94A72@TYAPR01MB5724.jpnprd01.prod.outlook.com
compareentry() is declared to work on WordEntryIN structs, but
tsvectorrecv() is using it in two places to work on WordEntry
structs. This is almost okay, since WordEntry is the first
field of WordEntryIN. But on machines with 8-byte pointers,
WordEntryIN will have a larger alignment spec than WordEntry,
and it's at least theoretically possible that the compiler
could generate code that depends on the larger alignment.
Given the lack of field reports, this may be just a hypothetical bug
that upsets nothing except sanitizer tools. Or it may be real on
certain hardware but nobody's tried to use tsvectorrecv() on such
hardware. In any case we should fix it, and the fix is trivial:
just change compareentry() so that it works on WordEntry without any
mention of WordEntryIN. We can also get rid of the quite-useless
intermediate function WordEntryCMP.
Bug: #18875
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18875-07a29c49c825a608@postgresql.org
Backpatch-through: 13
The optimization does not take the removal of TIDs by a concurrent vacuum into
account. The concurrent vacuum can remove dead TIDs and make pages ALL_VISIBLE
while those dead TIDs are referenced in the bitmap. This can lead to a
skip_fetch scan returning too many tuples.
It likely would be possible to implement this optimization safely, but we
don't have the necessary infrastructure in place. Nor is it clear that it's
worth building that infrastructure, given how limited the skip_fetch
optimization is.
In the backbranches we just disable the optimization by always passing
need_tuples=true to table_beginscan_bm(). We can't perform API/ABI changes in
the backbranches and we want to make the change as minimal as possible.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reported-By: Konstantin Knizhnik <knizhnik@garret.ru>
Discussion: https://postgr.es/m/CAEze2Wg3gXXZTr6_rwC+s4-o2ZVFB5F985uUSgJTsECx6AmGcQ@mail.gmail.com
Backpatch-through: 13
In the historical implementation of SQL functions (if they don't get
inlined), we built plans for all the contained queries at first call
within an outer query, and then re-used those plans for the duration
of the outer query, and then forgot everything. This was not ideal,
not least because the plans could not be customized to specific values
of the function's parameters. Our plancache infrastructure seems
mature enough to be used here. That will solve both the problem with
not being able to build custom plans and the problem with not being
able to share work across successive outer queries.
Aside from those performance concerns, this change fixes a
longstanding bugaboo with SQL functions: you could not write DDL that
would affect later statements in the same function. That's mostly
still true with new-style SQL functions, since the results of parse
analysis are baked into the stored query trees (and protected by
dependency records). But for old-style SQL functions, it will now
work much as it does with PL/pgSQL functions, because we delay parse
analysis and planning of each query until we're ready to run it.
Some edge cases that require replanning are now handled better too;
see for example the new rowsecurity test, where we now detect an RLS
context change that was previously missed.
One other edge-case change that might be worthy of a release note
is that we now insist that a SQL function's result be generated
by the physically-last query within it. Previously, if the last
original query was deleted by a DO INSTEAD NOTHING rule, we'd be
willing to take the result from the preceding query instead.
This behavior was undocumented except in source-code comments,
and it seems hard to believe that anyone's relying on it.
Along the way to this feature, we needed a few infrastructure changes:
* The plancache can now take either a raw parse tree or an
analyzed-but-not-rewritten Query as the starting point for a
CachedPlanSource. If given a Query, it is caller's responsibility
that nothing will happen to invalidate that form of the query.
We use this for new-style SQL functions, where what's in pg_proc is
serialized Query(s) and we trust the dependency mechanism to disallow
DDL that would break those.
* The plancache now offers a way to invoke a post-rewrite callback
to examine/modify the rewritten parse tree when it is rebuilding
the parse trees after a cache invalidation. We need this because
SQL functions sometimes adjust the parse tree to make its output
exactly match the declared result type; if the plan gets rebuilt,
that has to be re-done.
* There is a new backend module utils/cache/funccache.c that
abstracts the idea of caching data about a specific function
usage (a particular function and set of input data types).
The code in it is moved almost verbatim from PL/pgSQL, which
has done that for a long time. We use that logic now for
SQL-language functions too, and maybe other PLs will have use
for it in the future.
Author: Alexander Pyhalov <a.pyhalov@postgrespro.ru>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://postgr.es/m/8216639.NyiUUSuA9g@aivenlaptop
For GiST, having a sortsupport function allows building the index
using the "sorted build" method, which is much faster.
For b-tree, the sortsupport routine doesn't give any new
functionality, but speeds up sorting a tiny bit. The difference is not
very significant, about 2% in cursory testing on my laptop, because
the range type comparison function has quite a lot of overhead from
detoasting. In any case, since we have the function for GiST anyway,
we might as well register it for the btree opfamily too.
Author: Bernd Helmle <mailings@oopsware.de>
Discussion: https://www.postgresql.org/message-id/64d324ce2a6d535d3f0f3baeeea7b25beff82ce4.camel@oopsware.de
Various places allocated shared memory by first allocating a small chunk
using ShmemInitStruct(), followed by ShmemAlloc() calls to allocate more
memory. Unfortunately, ShmemAlloc() does not update ShmemIndex, so this
affected pg_shmem_allocations - it only shown the initial chunk.
This commit modifies the following allocations, to allocate everything
as a single chunk, and then split it internally.
- PredXactList
- RWConflictPool
- PGPROC structures
- Fast-Path Lock Array
The fast-path lock array is allocated separately, not as a part of the
PGPROC structures allocation.
Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
pg_shmem_allocations tracks the memory allocated by ShmemInitStruct(),
but for shared hash tables that covered only the header and hash
directory. The remaining parts (segments and buckets) were allocated
later using ShmemAlloc(), which does not update the shmem accounting.
Thus, these allocations were not shown in pg_shmem_allocations.
This commit improves the situation by allocating all the hash table
parts at once, using a single ShmemInitStruct() call. This way the
ShmemIndex entries (and thus pg_shmem_allocations) better reflect the
proper size of the hash table.
This affects allocations for private (non-shared) hash tables too, as
the hash_create() code is shared. For non-shared tables this however
makes no practical difference.
This changes the alignment a bit. ShmemAlloc() aligns the chunks using
CACHELINEALIGN(), which means some parts (header, directory, segments)
were aligned this way. Allocating all parts as a single chunk removes
this (implicit) alignment. We've considered adding explicit alignment,
but we've decided not to - it seems to be merely a coincidence due to
using the ShmemAlloc() API, not due to necessity.
Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
Without this, an additional change to the same pg_attribute row
within the same command will fail. This is possible at least with
ALTER TABLE ADD COLUMN on a multiple-inheritance-pathway structure.
(Another potential hazard is that immediately-following operations
might not see the missingval.)
Introduced by 95f650674, which split the former coding that
used a single pg_attribute update to change both atthasdef and
atthasmissing/attmissingval into two updates, but missed that
this should entail two CommandCounterIncrements as well. Like
that fix, back-patch through v13.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/025a3ffa-5eff-4a88-97fb-8f583b015965@gmail.com
Backpatch-through: 13
Currently, the cancel request key is a 32-bit token, which isn't very
much entropy. If you want to cancel another session's query, you can
brute-force it. In most environments, an unauthorized cancellation of
a query isn't very serious, but it nevertheless would be nice to have
more protection from it. Hence make the key longer, to make it harder
to guess.
The longer cancellation keys are generated when using the new protocol
version 3.2. For connections using version 3.0, short 4-bytes keys are
still used.
The new longer key length is not hardcoded in the protocol anymore,
the client is expected to deal with variable length keys, up to 256
bytes. This flexibility allows e.g. a connection pooler to add more
information to the cancel key, which might be useful for finding the
connection.
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions)
Discussion: https://www.postgresql.org/message-id/508d0505-8b7a-4864-a681-e7e5edfe32aa@iki.fi
This expands the NOT ENFORCED constraint flag, previously only
supported for CHECK constraints (commit ca87c415e2), to foreign key
constraints.
Normally, when a foreign key constraint is created on a table, action
and check triggers are added to maintain data integrity. With this
patch, if a constraint is marked as NOT ENFORCED, integrity checks are
no longer required, making these triggers unnecessary. Consequently,
when creating a NOT ENFORCED foreign key constraint, triggers will not
be created, and the constraint will be marked as NOT VALID.
Similarly, if an existing foreign key constraint is changed to NOT
ENFORCED, the associated triggers will be dropped, and the constraint
will also be marked as NOT VALID. Conversely, if a NOT ENFORCED
foreign key constraint is changed to ENFORCED, the necessary triggers
will be created, and the will be changed to VALID by performing
necessary validation.
Since not-enforced foreign key constraints have no triggers, the
shortcut used for example in psql and pg_dump to skip looking for
foreign keys if the relation is known not to have triggers no longer
applies. (It already didn't work for partitioned tables.)
Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: Joel Jacobson <joel@compiler.org>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Isaac Morland <isaac.morland@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Tested-by: Triveni N <triveni.n@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com
Allow multiple backends to initialize WAL buffers concurrently. This way
`MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in parallel without
taking a single LWLock in exclusive mode.
The new algorithm works as follows:
* reserve a page for initialization using XLogCtl->InitializeReserved,
* ensure the page is written out,
* once the page is initialized, try to advance XLogCtl->InitializedUpTo and
signal to waiters using XLogCtl->InitializedUpToCondVar condition
variable,
* repeat previous steps until we reserve initialization up to the target
WAL position,
* wait until concurrent initialization finishes using a
XLogCtl->InitializedUpToCondVar.
Now, multiple backends can, in parallel, concurrently reserve pages,
initialize them, and advance XLogCtl->InitializedUpTo to point to the latest
initialized page.
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Tested-by: Michael Paquier <michael@paquier.xyz>
Even after reaching the minimum recovery point, if there are long-lived
write transactions with 64 subtransactions on the primary, the recovery
snapshot may not yet be ready for hot standby, delaying read-only
connections on the standby. Previously, when read-only connections were
not accepted due to this condition, the following error message was logged:
FATAL: the database system is not yet accepting connections
DETAIL: Consistent recovery state has not been yet reached.
This DETAIL message was misleading because the following message was
already logged in this case:
LOG: consistent recovery state reached
This contradiction, i.e., indicating that the recovery state was consistent
while also stating it wasn’t, caused confusion.
This commit improves the error message to better reflect the actual state:
FATAL: the database system is not yet accepting connections
DETAIL: Recovery snapshot is not yet ready for hot standby.
HINT: To enable hot standby, close write transactions with more than 64 subtransactions on the primary server.
To implement this, the commit introduces a new postmaster signal,
PMSIGNAL_RECOVERY_CONSISTENT. When the startup process reaches
a consistent recovery state, it sends this signal to the postmaster,
allowing it to correctly recognize that state.
Since this is not a clear bug, the change is applied only to the master
branch and is not back-patched.
Author: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp>
Discussion: https://postgr.es/m/02db8cd8e1f527a8b999b94a4bee3165@oss.nttdata.com
Push an ErrorContextCallback adding additional detail about the process
performing the I/O and the owner of the I/O when those are not the same.
For io_method worker, this adds context specifying which process owns
the I/O that the I/O worker is processing.
For io_method io_uring, this adds context only when a backend is
*completing* I/O for another backend. It specifies the pid of the owning
process.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/rdml3fpukrqnas7qc5uimtl2fyytrnu6ymc2vjf2zuflbsjuul%40hyizyjsexwmm
50e17ad28 (v14) and 29f45e299 (v15) made it so the planner could identify
IN and NOT IN clauses which have Const lists as right-hand arguments and
when an appropriate hash function is available for the data types, mark
the ScalarArrayOpExpr as hashable so the executor could execute it more
optimally by building and probing a hash table during expression
evaluation.
These commits both worked correctly when there was only a single
ScalarArrayOpExpr in the given expression being processed by the
planner, but when there were multiple, only the first was checked and any
subsequent ones were not identified, which resulted in less optimal
expression evaluation during query execution for all but the first found
ScalarArrayOpExpr.
Backpatch to 14, where 50e17ad28 was introduced.
Author: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/29a76f51-97b0-4c07-87b7-ec8e3b5345c9@gmail.com
Backpatch-through: 14
Create a function that will sort the elements of an array
according to the element type's sort order. If the array
has more than one dimension, the sub-arrays of the first
dimension are sorted per normal array-comparison rules,
leaving their contents alone.
In support of this, add pg_type.typarray to the set of fields
cached by the typcache.
Author: Junwang Zhao <zhjwpku@gmail.com>
Co-authored-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://postgr.es/m/CAEG8a3J41a4dpw_-F94fF-JPRXYxw-GfsgoGotKcjs9LVfEEvw@mail.gmail.com
mdreadv() has a codepath to zero out buffers when a read returns zero bytes,
guarded by a check for zero_damaged_pages || InRecovery.
The InRecovery codepath to zero out buffers in mdreadv() appears to be
unreachable. The only known paths to reach mdreadv()/mdstartreadv() in
recovery are XLogReadBufferExtended(), vm_readbuf(), and fsm_readbuf(), each
of which takes care to extend the relation if necessary. This looks to either
have been the case for a long time, or the code was never reachable.
The zero_damaged_pages path is incomplete, as missing segments are not
created.
Putting blocks into the buffer-pool that do not exist on disk is rather
problematic, as such blocks will, at least initially, not be found by scans
that rely on smgrnblocks(), as they are beyond EOF. It also can cause weird
problems with relation extension, as relation extension does not expect blocks
beyond EOF to exist.
Therefore we would like to remove that path.
mdstartreadv(), which I added in e5fe570b51c, does not implement this zeroing
logic. I had started a discussion about that a while ago (linked below), but
forgot to act on the conclusion of the discussion, namely to disable the
in-memory-zeroing behavior.
We could certainly implement equivalent zeroing logic in mdstartreadv(), but
it would have to be more complicated due to potential differences in the
zero_damaged_pages setting between the definer and completor of IO. Given that
we want to remove the logic, that does not seem worth implementing the
necessary logic.
For now, put an Assert(false) and comments documenting this choice into
mdreadv() and comments documenting the deprecation of the path in mdreadv()
and the non-implementation of it in mdstartreadv(). If we, during testing,
discover that we do need the path, we can implement it at that time.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/postgr.es/m/20250330024513.ac.nmisch@google.com
Discussion: https://postgr.es/m/postgr.es/m/3qxxsnciyffyf3wyguiz4besdp5t5uxvv3utg75cbcszojlz7p@uibfzmnukkbd
To make the tests possible, a few functions from bufmgr.c/localbuf.c had to be
exported, via buf_internals.h.
Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Due to splitting the block id into two 16 bit integers, BlockIdSet()
is more expensive than one might think. Doing it once per returned
tuple shows up as a small but reliably reproducible cost. It's simple
enough to set the block number just once per block in pagemode, so do
so.
Author: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/lxzj26ga6ippdeunz6kuncectr5gfuugmm2ry22qu6hcx6oid6@lzx3sjsqhmt6
Previously effective_io_concurrency and maintenance_io_concurrency could not
be set above 0 on machines without fadvise support. AIO enables IO concurrency
without such support, via io_method=worker.
Currently only subsystems using the read stream API will take advantage of
this. Other users of maintenance_io_concurrency (like recovery prefetching)
which leverage OS advice directly will not benefit from this change. In those
cases, maintenance_io_concurrency will have no effect on I/O behavior.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/CAAKRu_atGgZePo=_g6T3cNtfMf0QxpvoUh5OUqa_cnPdhLd=gw@mail.gmail.com
Submitting IO in larger batches can be more efficient than doing so
one-by-one, particularly for many small reads. It does, however, require
the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
a) block without first calling pgaio_submit_staged(), unless a
to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
never held while waiting for IO.
b) directly or indirectly start another batch pgaio_enter_batchmode()
As this requires care and is nontrivial in some cases, batching is only
used with explicit opt-in.
This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and
uses it where appropriate.
There are two cases where batching would likely be beneficial, but where we
aren't using it yet:
1) bitmap heap scans, because the callback reads the VM
This should soon be solved, because we are planning to remove the use of
the VM, due to that not being sound.
2) The first phase of heap vacuum
This could be made to support batchmode, but would require some care.
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Adapt the read stream logic for real AIO:
- If AIO is enabled, we shouldn't issue advice, but if it isn't, we should
continue issuing advice
- AIO benefits from reading ahead with direct IO
- If effective_io_concurrency=0, pass READ_BUFFERS_SYNCHRONOUSLY to
StartReadBuffers() to ensure synchronous IO execution
There are further improvements we should consider:
- While in read_stream_look_ahead(), we can use AIO batch submission mode for
increased efficiency. That however requires care to avoid deadlocks and thus
done separately.
- It can be beneficial to defer starting new IOs until we can issue multiple
IOs at once. That however requires non-trivial heuristics to decide when to
do so.
Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
This finally introduces the first actual use of AIO. StartReadBuffers() now
uses the AIO routines to issue IO.
As the implementation of StartReadBuffers() is also used by the functions for
reading individual blocks (StartReadBuffer() and through that
ReadBufferExtended()) this means all buffered read IO passes through the AIO
paths. However, as those are synchronous reads, actually performing the IO
asynchronously would be rarely beneficial. Instead such IOs are flagged to
always be executed synchronously. This way we don't have to duplicate a fair
bit of code.
When io_method=sync is used, the IO patterns generated after this change are
the same as before, i.e. actual reads are only issued in WaitReadBuffers() and
StartReadBuffers() may issue prefetch requests. This allows to bypass most of
the actual asynchronicity, which is important to make a change as big as this
less risky.
One thing worth calling out is that, if IO is actually executed
asynchronously, the precise meaning of what track_io_timing is measuring has
changed. Previously it tracked the time for each IO, but that does not make
sense when multiple IOs are executed concurrently. Now it only measures the
time actually spent waiting for IO. A subsequent commit will adjust the docs
for this.
While AIO is now actually used, the logic in read_stream.c will often prevent
using sufficiently many concurrent IOs. That will be addressed in the next
commit.
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
This commit implements the infrastructure to perform asynchronous reads into
the buffer pool.
To do so, it:
- Adds readv AIO callbacks for shared and local buffers
It may be worth calling out that shared buffer completions may be run in a
different backend than where the IO started.
- Adds an AIO wait reference to BufferDesc, to allow backends to wait for
in-progress asynchronous IOs
- Adapts StartBufferIO(), WaitIO(), TerminateBufferIO(), and their localbuf.c
equivalents, to be able to deal with AIO
- Moves the code to handle BM_PIN_COUNT_WAITER into a helper function, as it
now also needs to be called on IO completion
As of this commit, nothing issues AIO on shared/local buffers. A future commit
will update StartReadBuffers() to do so.
Buffer reads executed through this infrastructure will report invalid page /
checksum errors / warnings differently than before:
In the error case the error message will cover all the blocks that were
included in the read, rather than just the reporting the first invalid
block. If more than one block is invalid, the error will include information
about the range of the read, the first invalid block and the number of invalid
pages, with a HINT towards the server log for per-block details.
For the warning case (i.e. zero_damaged_buffers) we would previously emit one
warning message for each buffer in a multi-block read. Now there is only a
single warning message for the entire read, again referring to the server log
for more details in case of multiple checksum failures within a single larger
read.
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
If an IO succeeds, but issues a warning, e.g. due to a page verification
failure with zero_damaged_pages, we want to issue that warning in the context
of the issuer of the IO, not the process that executes the completion (always
the case for worker).
It's already possible for a completion callback to report a custom error
message, we just didn't have a result status that allowed a user of AIO to
know that a warning should be emitted even though the IO request succeeded.
All that's needed for that is a dedicated PGAIO_RS_ value.
Previously there were not enough bits in PgAioResult.id for the new
value. Increase. While at that, add defines for the amount of bits and static
asserts to check that the widths are appropriate.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250329212929.a6.nmisch@google.com
For AIO the completion of a read into shared buffers (i.e. verifying the page
including the checksum, updating the BufferDesc to reflect the IO) can happen
in a different backend than the backend that started the IO. As
ignore_checksum_failure can differ between backends, we need to allow the
caller of PageIsVerified() control whether to ignore checksum failures.
The commit leaves a gap in the PIV_* values, as an upcoming commit, which
depends on this commit, will add PIV_LOG_LOG, which better fits just after
PIV_LOG_WARNING.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250329212929.a6.nmisch@google.com
For AIO we execute completion callbacks in critical sections (to ensure that
AIO can in the future be used for WAL, which in turn requires that we can call
completion callbacks in critical sections, to get the resources for WAL
io). To report checksum errors a backend now has to call
pgstat_prepare_report_checksum_failure(), before entering a critical section,
which guarantees the relevant pgstats entry is in shared memory, the relevant
DSM segment is mapped into the backend's memory and the address is known via a
PgStat_EntryRef.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/wkjj4p2rmkevutkwc6tewoovdqznj6c6nvjmvii4oo5wmbh5sr@retq7d6uqs4j
We have errmsg_internal(), errdetail_internal(), but not errhint_internal().
Sometimes it is useful to output a hint with already translated format
string (e.g. because there different messages depending on the condition). For
message/detail we do that with the _internal() variants, but we can't do that
with hint today. It's possible to work around that that by using something
like
str = psprintf(translated_format, args);
ereport(...
errhint("%s", str);
but that's not exactly pretty and makes it harder to avoid memory leaks.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/ym3dqpa4xcvoeknewcw63x77vnqdosbqcetjinb2zfoh65k55m@m4ozmwhr6lk6
For AIO on temporary table buffers the AIO subsystem needs to be able to
ensure a pin on a buffer while AIO is going on, even if the IO issuing query
errors out. Tracking the buffer in LocalRefCount does not work, as it would
cause CheckForLocalBufferLeaks() to assert out.
Instead, also track the refcount in BufferDesc.state, not just
LocalRefCount. This also makes local buffers behave a bit more akin to shared
buffers.
Note that we still don't need locking, AIO completion callbacks for local
buffers are executed in the issuing session (i.e. nobody else has access to
the BufferDesc).
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Some of these comments have been wrong for a while (12f3867f55), some I
recently introduced (da7226993f, 55b454d0e1). This includes an update to a
comment in FlushBuffer(), which will be copied in a future commit.
These changes seem big enough to be worth doing in separate commits.
Suggested-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250319212530.80.nmisch@google.com
Checksum failure stats could be attributed to the wrong database in two cases:
- when a read of a shared relation encountered a checksum error , it would be
attributed to the current database, instead of the "database" representing
shared relations
- when using CREATE DATABASE ... STRATEGY WAL_LOG checksum errors in the
source database would be attributed to the current database
The checksum stats reporting via PageIsVerifiedExtended(PIV_REPORT_STAT) does
not have access to the information about what database a page belongs to.
This fixes the issue by removing PIV_REPORT_STAT and delegating the
responsibility to report stats to the caller, which now can learn about the
number of stats via a new optional argument.
As this changes the signature of PageIsVerifiedExtended() and all callers
should adapt to the new signature, use the occasion to rename the function to
PageIsVerified() and remove the compatibility macro.
We could instead have fixed this by adding information about the database to
the args of PageIsVerified(), but there are soon-to-be-applied patches that
need to separate the stats reporting from the PageIsVerified() call
anyway. Those patches also include testing for the failure paths, something we
inexplicably have not had.
As there is no caller of pgstat_report_checksum_failure() left, remove it.
It'd be possible, but awkward to fix this in the back branches. We considered
doing the work not quite worth it, as mis-attributed stats should still elicit
concern. The emitted error messages do allow to attribute the errors
correctly.
Discussion: https://postgr.es/m/5tyic6epvdlmd6eddgelv47syg2b5cpwffjam54axp25xyq2ga@ptwkinxqo3az
Discussion: https://postgr.es/m/mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg@e4gt7npsohuz
ExecInitPartitionInfo() duplicates much of the logic in
ExecInitMerge(), except that it failed to handle DO NOTHING
actions. This would cause an "unknown action in MERGE WHEN clause"
error if a MERGE with any DO NOTHING actions attempted to insert into
a partition not already initialised by ExecInitModifyTable().
Bug: #18871
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Gurjeet Singh <gurjeet@singh.im>
Discussion: https://postgr.es/m/18871-b44e3c96de3bd2e8%40postgresql.org
Backpatch-through: 15
Currently, group_similar_or_args() permutes original positions of clauses
independently on whether it manages to find any groups of similar clauses.
While we are not providing any strict warranties on saving the original order
of OR-clauses, it is preferred that the original order be modified as little
as possible.
This commit changes the reordering algorithm of group_similar_or_args() in
the following way. We reorder each group of similar clauses so that the
first item of the group stays in place, but all the other items are moved
after it. So, if there are no similar clauses, the order of clauses stays
the same. When there are some groups, only required reordering happens while
the rest of the clauses remain in their places.
Reported-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/3ac7c436-81e1-4191-9caf-b0dd70b51511%40gmail.com
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Alena Rybakina <a.rybakina@postgrespro.ru>
The uuidv7_interval() function previously converted a shifted
microsecond-precision timestamp (64-bit integer) to another 64-bit
integer representing a timestamp with nanosecond precision. This
conversion caused overflow for dates beyond the year 2262. The
millisecond and sub-millisecond parts were then extracted from this
nanosecond-precision timestamp and stored in UUIDv7 values.
With this commit, the millisecond and sub-millisecond parts are stored
directly into the UUIDv7 value without being converted back to a
nanosecond precision timestamp. Following RFC 9562, the timestamp is
stored as an unsigned integer, enabling support for dates up to the
year 10889.
Reported and fixed by Andrey Borodin, with cosmetic changes and
regression tests by me.
Reported-by: Andrey Borodin <x4mmm@yandex-team.ru>
Author: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/96DEC2D9-659A-40E8-B7BA-AF5D162A9E21@yandex-team.ru
This was left out of the original patch for virtual generated columns
(commit 83ea6c5402).
This just involves a bit of extra work in the executor to expand the
generation expressions and run a "IS NOT NULL" test against them.
There is also a bit of work to make sure that not-null constraints are
checked during a table rewrite.
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Navneet Kumar <thanit3111@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CACJufxHArQysbDkWFmvK+D1TPHQWWTxWN15cMuUaTYX3xhQXgg@mail.gmail.com
Modernize code in ExecRelCheck() and ExecConstraints() a bit,
preparing the way for some new code.
Co-authored-by: jian he <jian.universality@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Navneet Kumar <thanit3111@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CACJufxHArQysbDkWFmvK+D1TPHQWWTxWN15cMuUaTYX3xhQXgg@mail.gmail.com
This removes some setlocale() calls and a lot of commentary about how
dangerous that is. strftime_l() is from POSIX 2008, and on Windows we
use _wcsftime_l().
While here, adjust error message for strftime_l() failure: it does not
in practice set errno (even though POSIX says it could), so no %m.
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/CA%2BhUKGJqVe0%2BPv9dvC9dSums_PXxGo9SWcxYAMBguWJUGbWz-A%40mail.gmail.com