We long ago decided to design the shared PgBackendStatus data structure to
minimize the cost of writing status updates, which means that writers just
have to increment the st_changecount field twice. That isn't hooked into
any sort of resource management mechanism, which means that if something
were to throw an error between the two increments, the st_changecount field
would be left odd indefinitely. That would cause readers to lock up.
Now, since it's also a bad idea to leave the field odd for longer than
absolutely necessary (because readers will spin while we have it set),
the expectation was that we'd treat these segments like spinlock critical
sections, with only short, more or less straight-line, code in them.
That was fine as originally designed, but commit 9029f4b37 broke it
by inserting a significant amount of non-straight-line code into
pgstat_bestart(), code that is very capable of throwing errors, not to
mention taking a significant amount of time during which readers will spin.
We have a report from Neeraj Kumar of readers actually locking up, which
I suspect was due to an encoding conversion error in X509_NAME_to_cstring,
though conceivably it was just a garden-variety OOM failure.
Subsequent commits have loaded even more dubious code into pgstat_bestart's
critical section (and commit fc70a4b0d deserves some kind of booby prize
for managing to miss the critical section entirely, although the negative
consequences seem minimal given that the PgBackendStatus entry should be
seen by readers as inactive at that point).
The right way to fix this mess seems to be to compute all these values
into a local copy of the process' PgBackendStatus struct, and then just
copy the data back within the critical section proper. This plan can't
be implemented completely cleanly because of the struct's heavy reliance
on out-of-line strings, which we must initialize separately within the
critical section. But still, the critical section is far smaller and
safer than it was before.
In hopes of forestalling future errors of the same ilk, rename the
macros for st_changecount management to make it more apparent that
the writer-side macros create a critical section. And to prevent
the worst consequences if we nonetheless manage to mess it up anyway,
adjust those macros so that they really are a critical section, ie
they now bump CritSectionCount. That doesn't add much overhead, and
it guarantees that if we do somehow throw an error while the counter
is odd, it will lead to PANIC and a database restart to reset shared
memory.
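For illustration, here is roughly what the renamed writer-side macros look
like after this change (a sketch following HEAD's PGSTAT_BEGIN/END_WRITE_ACTIVITY;
exact definitions may differ across branches):
    #define PGSTAT_BEGIN_WRITE_ACTIVITY(beentry) \
        do { \
            START_CRIT_SECTION(); \
            (beentry)->st_changecount++; \
            pg_write_barrier(); \
        } while (0)

    #define PGSTAT_END_WRITE_ACTIVITY(beentry) \
        do { \
            pg_write_barrier(); \
            (beentry)->st_changecount++; \
            Assert(((beentry)->st_changecount & 1) == 0); \
            END_CRIT_SECTION(); \
        } while (0)
Readers retry until they observe the same even st_changecount value before
and after copying the struct.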
Back-patch to 9.5 where the problem was introduced.
In HEAD, also fix an oversight in commit b0b39f72b: it failed to teach
pgstat_read_current_status to copy st_gssstatus data from shared memory to
local memory. Hence, subsequent use of that data within the transaction
would potentially see changing data that it shouldn't see.
Discussion: https://postgr.es/m/CAPR3Wj5Z17=+eeyrn_ZDG3NQGYgMEOY6JV6Y-WRRhGgwc16U3Q@mail.gmail.com
This commit fixes an issue related to the way password verifiers hashed
with MD5 are detected, which could cause plain passwords to be
misidentified as legitimate MD5 hashes. An MD5-hashed entry was detected
by checking whether its header uses "md5" and whether the string length
matches what is expected. Unfortunately the code never checked whether
the hash uses only hexadecimal characters after the three-character prefix.
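A minimal sketch of the tightened test (the function name here is
illustrative, not necessarily what crypt.c uses; MD5_PASSWD_LEN is 35,
i.e. "md5" plus 32 hex digits):
    #include <string.h>
    #include <stdbool.h>

    #define MD5_PASSWD_LEN 35

    static bool
    looks_like_md5(const char *shadow_pass)
    {
        if (strncmp(shadow_pass, "md5", 3) != 0 ||
            strlen(shadow_pass) != MD5_PASSWD_LEN)
            return false;
        /* the new test: everything after the prefix must be hexadecimal */
        return strspn(shadow_pass + 3, "0123456789abcdef") == MD5_PASSWD_LEN - 3;
    }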
Fix 9.6 down to 9.4, where this code is present. This area of the code
has changed in 10 and upwards with the introduction of SCRAM, which led
to a different fix committed as of ccae190.
Reported-by: Tom Lane
Author: Michael Paquier
Reviewed-by: Jonathan Katz
Discussion: https://postgr.es/m/016deb6b-1f0a-8e9f-1833-a8675b170aa9@postgresql.org
Backpatch-through: 9.4
postmaster startup scrutinizes any shared memory segment recorded in
postmaster.pid, exiting if that segment matches the current data
directory and has an attached process. When the postmaster.pid file was
missing, a starting postmaster used weaker checks. Change to use the
same checks in both scenarios. This increases the chance of a startup
failure, in lieu of data corruption, if the DBA does "kill -9 `head -n1
postmaster.pid` && rm postmaster.pid && pg_ctl -w start". A postmaster
will no longer stop if shmat() of an old segment fails with EACCES. A
postmaster will no longer recycle segments pertaining to other data
directories. That's good for production, but it's bad for integration
tests that crash a postmaster and immediately delete its data directory.
Such a test now leaks a segment indefinitely. No "make check-world"
test does that. win32_shmem.c already avoided all these problems. In
9.6 and later, enhance PostgresNode to facilitate testing. Back-patch
to 9.4 (all supported versions).
Reviewed (in earlier versions) by Daniel Gustafsson and Kyotaro HORIGUCHI.
Discussion: https://postgr.es/m/20190408064141.GA2016666@rfd.leadboat.com
The transaction that is initiated by the parallel worker to cooperate
with the actual transaction started by the main backend to complete the
query execution should not be counted as a separate transaction. The
other internal transactions started and committed by the parallel worker
are still counted as separate transactions, as that is what we do in
other places like autovacuum.
This will partially fix the bloat in transaction stats due to additional
transactions performed by parallel workers. For a complete fix, we need to
decide how we want to show all the transactions that are started internally
for various operations, and that is a matter for a separate patch.
Reported-by: Haribabu Kommi
Author: Haribabu Kommi
Reviewed-by: Amit Kapila, Jamison Kirk and Rahila Syed
Backpatch-through: 9.6
Discussion: https://postgr.es/m/CAJrrPGc9=jKXuScvNyQ+VNhO0FZk7LLAShAJRyZjnedd2D61EQ@mail.gmail.com
We've long had reports of intermittent "could not reattach to shared
memory" errors on Windows. Buildfarm member dory fails that way when
PGSharedMemoryReAttach() execution overlaps with creation of a thread
for the process's "default thread pool". Fix that by providing a second
region to receive asynchronous allocations that would otherwise intrude
into UsedShmemSegAddr. In pgwin32_ReserveSharedMemoryRegion(), stop
trying to free reservations landing at incorrect addresses; the caller's
next step has been to terminate the affected process. Back-patch to 9.4
(all supported versions).
Reviewed by Tom Lane. He also did much of the prerequisite research;
see commit bcbf2346d6.
Discussion: https://postgr.es/m/20190402135442.GA1173872@rfd.leadboat.com
postmaster startup scrutinizes any shared memory segment recorded in
postmaster.pid, exiting if that segment matches the current data
directory and has an attached process. When the postmaster.pid file was
missing, a starting postmaster used weaker checks. Change to use the
same checks in both scenarios. This increases the chance of a startup
failure, in lieu of data corruption, if the DBA does "kill -9 `head -n1
postmaster.pid` && rm postmaster.pid && pg_ctl -w start". A postmaster
will no longer recycle segments pertaining to other data directories.
That's good for production, but it's bad for integration tests that
crash a postmaster and immediately delete its data directory. Such a
test now leaks a segment indefinitely. No "make check-world" test does
that. win32_shmem.c already avoided all these problems. In 9.6 and
later, enhance PostgresNode to facilitate testing. Back-patch to 9.4
(all supported versions).
Reviewed by Daniel Gustafsson and Kyotaro HORIGUCHI.
Discussion: https://postgr.es/m/20130911033341.GD225735@tornado.leadboat.com
Relations dropped in a single transaction are tracked in a list of
unowned relations. With a large number of dropped relations this resulted
in poor performance at the end of a transaction, when the relations are
removed from the singly linked list one by one.
Commit b4166911 attempted to address this issue (particularly when it
happens during recovery) by removing the relations in a reverse order,
resulting in O(1) lookups in the list of unowned relations. This did
not work reliably, though, and it was possible to trigger the O(N^2)
behavior in various ways.
Instead of trying to remove the relations in a specific order with
respect to the linked list, which seems rather fragile, switch to a
regular doubly linked list. That allows us to remove relations cheaply no
matter where in the list they are.
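A sketch of the resulting pattern, using lib/ilist.h (the list-node field
name in SMgrRelationData is illustrative):
    #include "lib/ilist.h"

    static dlist_head unowned_relns;    /* replaces the singly linked list */

    static void
    track_unowned(SMgrRelation reln)
    {
        dlist_push_tail(&unowned_relns, &reln->node);
    }

    static void
    untrack_unowned(SMgrRelation reln)
    {
        /* O(1) no matter where in the list the node sits */
        dlist_delete(&reln->node);
    }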
As b4166911 was a bugfix, backpatched to all supported versions, do the
same thing here.
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/80c27103-99e4-1d0c-642c-d9f3b94aaa0a%402ndquadrant.com
Backpatch-through: 9.4
52ac6cd2d0 added a new field to ginxlogDeletePage and was backpatched to 9.4.
That led to problems when a patched postgres instance applies WAL records
generated by a non-patched one. WAL records generated by a non-patched
instance don't contain the new field, which the patched one is expecting to see.
Thankfully, we can distinguish patched and non-patched WAL records by their data
size. If we see that a WAL record was generated by a non-patched instance, we
skip processing of the new field. This commit comes with some assertions. In
particular, if it appears that on some platform the struct data size didn't
change, then a static assertion will trigger.
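The redo-side discrimination looks roughly like this (a sketch; cf.
ginRedoDeletePage, with the actual code also carrying the static
assertions mentioned above):
    ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);
    TransactionId deleteXid = InvalidTransactionId;

    if (XLogRecGetDataLen(record) >= sizeof(ginxlogDeletePage))
        deleteXid = data->deleteXid;    /* record from a patched instance */
    /* else: old-format record, the field is absent; skip processing it */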
Reported-by: Simon Riggs
Discussion: https://postgr.es/m/CANP8%2Bj%2BK4whxf7ET7%2BgO%2BG-baC3-WxqqH%3DnV4X2CgfEPA3Yu3g%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Simon Riggs, Alvaro Herrera
Backpatch-through: 9.4
Commit 6f6a6d8b1 introduced a delay of up to 2 seconds if we're trying
to request a checkpoint but the checkpointer hasn't started yet (or,
much less likely, our kill() call fails). However buildfarm experience
shows that that's not quite enough for slow or heavily-loaded machines.
There's no good reason to assume that the checkpointer won't start
eventually, so we may as well make the timeout much longer, say 60 sec.
However, if the caller didn't say CHECKPOINT_WAIT, it seems like a bad
idea to be waiting at all, much less for as long as 60 sec. We can
remove the need for that, and make this whole thing more robust, by
adjusting the code so that the existence of a pending checkpoint
request is clear from the contents of shared memory, and making sure
that the checkpointer process will notice it at startup even if it did
not get a signal. In this way there's no need for a non-CHECKPOINT_WAIT
call to wait at all; if it can't send the signal, it can nonetheless
assume that the checkpointer will eventually service the request.
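In outline, the request now lives in shared memory rather than only in a
signal (a sketch; names follow checkpointer.c but are approximate):
    /* requester: record the request even if the signal cannot be sent */
    SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
    CheckpointerShmem->ckpt_flags |= flags;
    SpinLockRelease(&CheckpointerShmem->ckpt_lck);

    /* checkpointer, at startup: notice requests that arrived before we did */
    if (CheckpointerShmem->ckpt_flags)
        do_checkpoint = true;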
A potential downside of this change is that "kill -INT" on the checkpointer
process is no longer enough to trigger a checkpoint, should anyone be
relying on something so hacky. But there's no obvious reason to do it
like that rather than issuing a plain old CHECKPOINT command, so we'll
assume that nobody is. There doesn't seem to be a way to preserve this
undocumented quasi-feature without introducing race conditions.
Since a principal reason for messing with this is to prevent intermittent
buildfarm failures, back-patch to all supported branches.
Discussion: https://postgr.es/m/27830.1552752475@sss.pgh.pa.us
ecpglib attempts to force the LC_NUMERIC locale to "C" while reading
server output, to avoid problems with strtod() and related functions.
Historically it's just issued setlocale() calls to do that, but that
has major problems if we're in a threaded application. setlocale()
itself is not required by POSIX to be thread-safe (and indeed is not,
on recent OpenBSD). Moreover, its effects are process-wide, so that
we could cause unexpected results in other threads, or another thread
could change our setting.
On platforms having uselocale(), which is required by POSIX:2008,
we can avoid these problems by using uselocale() instead. Windows
goes its own way as usual, but we can make it safe by using
_configthreadlocale(). Platforms having neither continue to use the
old code, but that should be pretty much nobody among current systems.
(Subsequent buildfarm results show that recent NetBSD versions still
lack uselocale(), but it's not a big problem because they also do not
support non-"C" settings for LC_NUMERIC.)
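The uselocale() pattern, in outline (a simplified sketch; ecpglib actually
creates and caches the locale_t object once rather than per call):
    #include <locale.h>
    #include <stdlib.h>

    double
    parse_double_in_c_locale(const char *str)
    {
        locale_t clocale = newlocale(LC_NUMERIC_MASK, "C", (locale_t) 0);
        locale_t oldloc = uselocale(clocale);   /* affects this thread only */
        double result = strtod(str, NULL);
        uselocale(oldloc);                      /* restore caller's locale */
        freelocale(clocale);
        return result;
    }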
Back-patch of commits 8eb4a9312 and ee27584c4.
Michael Meskes and Tom Lane; thanks also to Takayuki Tsunakawa.
Discussion: https://postgr.es/m/31420.1547783697@sss.pgh.pa.us
Recent OpenBSD (at least 5.9 and up) has a version of getopt(3)
that will not cope with the "-:" spec we use to accept double-dash
options in postgres.c and postmaster.c. Admittedly, that's a hack
because POSIX only requires getopt() to allow alphanumeric option
characters. I have no desire to find another way, however, so
let's just do what we were already doing on Solaris: force use
of our own src/port/getopt.c implementation.
In passing, improve some of the comments around said implementation.
Per buildfarm and local testing. Back-patch to all supported branches.
Discussion: https://postgr.es/m/30197.1547835700@sss.pgh.pa.us
Up to now, createplan.c attempted to share PARAM_EXEC slots for
NestLoopParams across different plan levels, if the same underlying Var
was being fed down to different righthand-side subplan trees by different
NestLoops. This was, I think, more of an artifact of using subselect.c's
PlannerParamItem infrastructure than an explicit design goal, but anyway
that was the end result.
This works well enough as long as the plan tree is executing synchronously,
but the feature whereby Gather can execute the parallelized subplan locally
breaks it. An upper NestLoop node might execute for a row retrieved from
a parallel worker, and assign a value for a PARAM_EXEC slot from that row,
while the leader's copy of the parallelized subplan is suspended with a
different active value of the row the Var comes from. When control
eventually returns to the leader's subplan, it gets the wrong answers if
the same PARAM_EXEC slot is being used within the subplan, as reported
in bug #15577 from Bartosz Polnik.
This is pretty reminiscent of the problem fixed in commit 46c508fbc, and
the proper fix seems to be the same: don't try to share PARAM_EXEC slots
across different levels of controlling NestLoop nodes.
This requires decoupling NestLoopParam handling from PlannerParamItem
handling, although the logic remains somewhat similar. To avoid bizarre
division of labor between subselect.c and createplan.c, I decided to move
all the param-slot-assignment logic for both cases out of those files
and put it into a new file paramassign.c. Hopefully it's a bit better
documented now, too.
A regression test case for this might be nice, but we don't know a
test case that triggers the problem with a suitably small amount
of data.
Back-patch to 9.6 where we added Gather nodes. It's conceivable that
related problems exist in older branches; but without some evidence
for that, I'll leave the older branches alone.
Discussion: https://postgr.es/m/15577-ca61ab18904af852@postgresql.org
013ebc0a7b implements so-called GiST microvacuum. That is, gistgettuple() marks
index tuples as dead when kill_prior_tuple is set. Later, when a new tuple
insertion claims page space, those dead index tuples are physically deleted
from the page. When this deletion is replayed on a standby, it might conflict
with read-only queries. But 013ebc0a7b doesn't handle this. That may lead to
disappearance of some tuples from read-only snapshots on the standby.
This commit implements resolution of conflicts between replay of GiST
microvacuum and standby queries. On the master we introduce a new WAL record
type, XLOG_GIST_DELETE, which comprises the necessary information. On stable
releases we have to be tricky to keep WAL compatibility: the information
required for conflict processing is just appended to the data of the
XLOG_GIST_PAGE_UPDATE record. So a PostgreSQL version that doesn't know about
conflict processing will just ignore it.
Reported-by: Andres Freund
Diagnosed-by: Andres Freund
Discussion: https://postgr.es/m/20181212224524.scafnlyjindmrbe6%40alap3.anarazel.de
Author: Alexander Korotkov
Backpatch-through: 9.6
When GIN vacuum deletes a posting tree page, it assumes that no concurrent
searchers can access it, thanks to ginStepRight() locking two pages at once.
However, since 9.4 searches can skip parts of posting trees while descending
from the root. That leads to the risk that a page is deleted and reclaimed
before a concurrent search can access it.
This commit prevents that risk by waiting for every transaction that might
hold a reference to this page to finish. Due to binary compatibility we
can't change GinPageOpaqueData to store the corresponding transaction id.
Instead, we reuse the page header's pd_prune_xid field, which is unused in
index pages.
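Concretely, the reuse is done with macros along these lines (cf. ginblock.h):
    /* repurpose pd_prune_xid, which is unused in index pages */
    #define GinPageGetDeleteXid(page) \
        ( ((PageHeader) (page))->pd_prune_xid )
    #define GinPageSetDeleteXid(page, xid) \
        ( ((PageHeader) (page))->pd_prune_xid = (xid) )
A deleted page can then be recycled only once that xid is older than every
snapshot that could still hold a pointer to the page.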
Discussion: https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com
Author: Andrey Borodin, Alexander Korotkov
Reviewed-by: Alexander Korotkov
Backpatch-through: 9.4
postgres_fdw's postgresGetForeignPlan() assumes without checking that the
outer_plan it's given for a join relation must have a NestLoop, MergeJoin,
or HashJoin node at the top. That's been wrong at least since commit
4bbf6edfb (which could cause insertion of a Sort node on top) and it seems
like a pretty unsafe thing to Just Assume even without that.
Through blind good fortune, this doesn't seem to have any worse
consequences today than strange EXPLAIN output, but it's clearly trouble
waiting to happen.
To fix, test the node type explicitly before touching Join-specific
fields, and avoid jamming the new tlist into a node type that can't
do projection. Export a new support function from createplan.c
to avoid building low-level knowledge about the latter into FDWs.
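The fix amounts to checking the node tag before assuming Join-ness,
roughly (a sketch, not the exact patch):
    Join *join_plan = NULL;

    if (IsA(outer_plan, NestLoop) ||
        IsA(outer_plan, MergeJoin) ||
        IsA(outer_plan, HashJoin))
        join_plan = (Join *) outer_plan;    /* only now touch Join fields */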
Back-patch to 9.6 where the faulty coding was added. Note that the
associated regression test cases don't show any changes before v11,
apparently because the tests back-patched with 4bbf6edfb don't actually
exercise the problem case before then (there's no top-level Sort
in those plans).
Discussion: https://postgr.es/m/8946.1544644803@sss.pgh.pa.us
On some operating systems, it doesn't make sense to retry fsync(),
because dirty data cached by the kernel may have been dropped on
write-back failure. In that case the only remaining copy of the
data is in the WAL. A subsequent fsync() could appear to succeed,
but not have flushed the data. That means that a future checkpoint
could apparently complete successfully but have lost data.
Therefore, violently prevent any future checkpoint attempts by
panicking on the first fsync() failure. Note that we already
did the same for WAL data; this change extends that behavior to
non-temporary data files.
Provide a GUC data_sync_retry to control this new behavior, for
users of operating systems that don't eject dirty data, and possibly
forensic/testing uses. If it is set to on and the write-back error
was transient, a later checkpoint might genuinely succeed (on a
system that does not throw away buffers on failure); if the error is
permanent, later checkpoints will continue to fail. The GUC defaults
to off, meaning that we panic.
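The heart of the change is choosing the error level for data-file fsync
failures; in HEAD this takes the shape of a small helper (sketch):
    /* GUC: when false (the default), fsync() failure on data files panics */
    extern bool data_sync_retry;

    int
    data_sync_elevel(int elevel)
    {
        return data_sync_retry ? elevel : PANIC;
    }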
Back-patch to all supported releases.
There is still a narrow window for error-loss on some operating
systems: if the file is closed and later reopened and a write-back
error occurs in the intervening time, but the inode has the bad
luck to be evicted due to memory pressure before we reopen, we could
miss the error. A later patch will address that with a scheme
for keeping files with dirty data open at all times, but we judge
that to be too complicated to back-patch.
Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
This patch fixes several related cases in which pg_shdepend entries were
never made, or were lost, for references to roles appearing in the ACLs of
schemas and/or types. While that did no immediate harm, if a referenced
role were later dropped, the drop would be allowed and would leave a
dangling reference in the object's ACL. That still wasn't a big problem
for normal database usage, but it would cause obscure failures in
subsequent dump/reload or pg_upgrade attempts, taking the form of
attempts to grant privileges to all-numeric role names. (I think I've
seen field reports matching that symptom, but can't find any right now.)
Several cases are fixed here:
1. ALTER DOMAIN SET/DROP DEFAULT would lose the dependencies for any
existing ACL entries for the domain. This case is ancient, dating
back as far as we've had pg_shdepend tracking at all.
2. If a default type privilege applies, CREATE TYPE recorded the
ACL properly but forgot to install dependency entries for it.
This dates to the addition of default privileges for types in 9.2.
3. If a default schema privilege applies, CREATE SCHEMA recorded the
ACL properly but forgot to install dependency entries for it.
This dates to the addition of default privileges for schemas in v10
(commit ab89e465c).
Another somewhat-related problem is that when creating a relation
rowtype or implicit array type, TypeCreate would apply any available
default type privileges to that type, which we don't really want
since such an object isn't supposed to have privileges of its own.
(You can't, for example, drop such privileges once they've been added
to an array type.)
ab89e465c is also to blame for a race condition in the regression tests:
privileges.sql transiently installed globally-applicable default
privileges on schemas, which sometimes got absorbed into the ACLs of
schemas created by concurrent test scripts. This should have resulted
in failures when privileges.sql tried to drop the role holding such
privileges; but thanks to the bug fixed here, it instead led to dangling
ACLs in the final state of the regression database. We'd managed not to
notice that, but it became obvious in the wake of commit da906766c, which
allowed the race condition to occur in pg_upgrade tests.
To fix, add a function recordDependencyOnNewAcl to encapsulate what
callers of get_user_default_acl need to do; while the original call
sites got that right via ad-hoc code, none of the later-added ones
have. Also change GenerateTypeDependencies to generate these
dependencies, which requires adding the typacl to its parameter list.
(That might be annoying if there are any extensions calling that
function directly; but if there are, they're most likely buggy in the
same way as the core callers were, so they need work anyway.) While
I was at it, I changed GenerateTypeDependencies to accept most of its
parameters in the form of a Form_pg_type pointer, making its parameter
list a bit less unwieldy and mistake-prone.
The test race condition is fixed just by wrapping the addition and
removal of default privileges into a single transaction, so that that
state is never visible externally. We might eventually prefer to
separate out tests of default privileges into a script that runs by
itself, but that would be a bigger change and would make the tests
run slower overall.
Back-patch relevant parts to all supported branches.
Discussion: https://postgr.es/m/15719.1541725287@sss.pgh.pa.us
Repeatedly rewriting a mapped catalog table with VACUUM FULL or
CLUSTER could cause logical decoding to fail with:
ERROR, "could not map filenode \"%s\" to relation OID"
To trigger the problem the rewritten catalog had to have live tuples
with toasted columns.
The problem was triggered because, during catalog table rewrites, the
heap_insert() check that prevents logical decoding information from being
emitted for system catalogs failed to treat the new heap's toast table
as a system catalog (because the new heap is not recognized as a
catalog table via RelationIsLogicallyLogged()). The relmapper, in
contrast to the normal catalog contents, does not contain historical
information. After a single rewrite of a mapped table the new relation
is known to the relmapper, but if the table is rewritten twice before
logical decoding occurs, the relfilenode cannot be mapped to a
relation anymore, which then leads us to error out. This only
happens for toast tables, because the main table contents aren't
re-inserted with heap_insert().
The fix is simple, add a new heap_insert() flag that prevents logical
decoding information from being emitted, and accept during decoding
that there might not be tuple data for toast tables.
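In sketch form (the flag name follows HEAD; its numeric value here is
illustrative):
    /* heapam.h: suppress emitting logical-decoding data for this insertion */
    #define HEAP_INSERT_NO_LOGICAL  0x0010

    /* used when reinserting toast tuples during a catalog table rewrite */
    heap_insert(toastrel, toasttup, mycid, HEAP_INSERT_NO_LOGICAL, NULL);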
Unfortunately that does not fix pre-existing logical decoding
errors. Doing so would require not throwing an error when a filenode
cannot be mapped to a relation during decoding, and that seems too
likely to hide bugs. If it's crucial to fix decoding for an existing
slot, temporarily changing the ERROR in ReorderBufferCommit() to a
WARNING appears to be the best fix.
Author: Andres Freund
Discussion: https://postgr.es/m/20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de
Backpatch: 9.4-, where logical decoding was introduced
Index DDL cascading on partitioned tables introduced a way for ALTER
TABLE to be called reentrantly. This exposed an important deficiency
in event trigger support: on exiting the reentrant call,
the alter table state object was clobbered, causing a crash when the
outer alter table tries to finalize its processing. Fix the crash by
creating a stack of event trigger state objects. There are still ways
to cause things to misbehave (and probably other crashers) with more
elaborate tricks, but at least it now doesn't crash in the obvious
scenario.
Backpatch to 9.5, where DDL deparsing of event triggers was introduced.
Reported-by: Marco Slot
Authors: Michaël Paquier, Álvaro Herrera
Discussion: https://postgr.es/m/CANNhMLCpi+HQ7M36uPfGbJZEQLyTy7XvX=5EFkpR-b1bo0uJew@mail.gmail.com
Previously, a worker process would establish values for these based on
its own start time. In v10 and up, this can trivially be shown to cause
misbehavior of transaction_timestamp(), timestamp_in(), and related
functions which are (perhaps unwisely?) marked parallel-safe. It seems
likely that other behaviors might diverge from what happens in the parent
as well.
It's not as trivial to demonstrate problems in 9.6 or 9.5, but I'm sure
it's still possible, so back-patch to all branches containing parallel
worker infrastructure.
In HEAD only, mark now() and statement_timestamp() as parallel-safe
(other affected functions already were). While in theory we could
still squeeze that change into v11, it doesn't seem important enough
to force a last-minute catversion bump.
Konstantin Knizhnik, whacked around a bit by me
Discussion: https://postgr.es/m/6406dbd2-5d37-4cb6-6eb2-9c44172c7e7c@postgrespro.ru
Historically we forbade datatype-specific comparison functions from
returning INT_MIN, so that it would be safe to invert the sort order
just by negating the comparison result. However, this was never
really safe for comparison functions that directly return the result
of memcmp(), strcmp(), etc, as POSIX doesn't place any such restriction
on those library functions. Buildfarm results show that at least on
recent Linux on s390x, memcmp() actually does return INT_MIN sometimes,
causing sort failures.
The agreed-on answer is to remove this restriction and fix relevant
call sites to not make such an assumption; code such as "res = -res"
should be replaced by "INVERT_COMPARE_RESULT(res)". The same is needed
in a few places that just directly negated the result of memcmp or
strcmp.
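The macro is written so that inverting works even for INT_MIN (cf. c.h):
    /* invert the sign of a qsort-style comparison result */
    #define INVERT_COMPARE_RESULT(var) \
        ((var) = ((var) < 0) ? 1 : -((var) > 0))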
To help find places having this problem, I've also added a compile option
to nbtcompare.c that causes some of the commonly used comparators to
return INT_MIN/INT_MAX instead of their usual -1/+1. It'd likely be
a good idea to have at least one buildfarm member running with
"-DSTRESS_SORT_INT_MIN". That's far from a complete test of course,
but it should help to prevent fresh introductions of such bugs.
This is a longstanding portability hazard, so back-patch to all supported
branches.
Discussion: https://postgr.es/m/20180928185215.ffoq2xrq5d3pafna@alap3.anarazel.de
The method we've traditionally used, of redeclaring strerror_r() to
see if the compiler complains of inconsistent declarations, turns out
not to work reliably because some compilers only report a warning,
not an error. Amazingly, this has gone undetected for years, even
though it certainly breaks our detection of whether strerror_r
succeeded.
Let's instead test whether the compiler will take the result of
strerror_r() as a switch() argument. It's possible this won't
work universally either, but it's the best idea I could come up with
on the spur of the moment.
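The probe compiles only if strerror_r() returns int (the XSI variant),
since a char * result is not a legal switch argument; approximately:
    #include <string.h>

    int
    main(void)
    {
        char buf[100];

        switch (strerror_r(1, buf, sizeof(buf)))
        {
            case 0:
                break;      /* only the int-returning variant compiles */
        }
        return 0;
    }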
Back-patch of commit 751f532b9. Buildfarm results indicate that only
icc-on-Linux actually has an issue here; perhaps the lack of field
reports indicates that people don't build PG for production that way.
Discussion: https://postgr.es/m/10877.1537993279@sss.pgh.pa.us
The EvalPlanQual machinery assumes that any initplans (that is,
uncorrelated sub-selects) used during an EPQ recheck would have already
been evaluated during the main query; this is implicit in the fact that
execPlan pointers are not copied into the EPQ estate's es_param_exec_vals.
But it's possible for that assumption to fail, if the initplan is only
reached conditionally. For example, a sub-select inside a CASE expression
could be reached during a recheck when it had not been previously, if the
CASE test depends on a column that was just updated.
This bug is old, appearing to date back to my rewrite of EvalPlanQual in
commit 9f2ee8f28, but was not detected until Kyle Samson reported a case.
To fix, force all not-yet-evaluated initplans used within the EPQ plan
subtree to be evaluated at the start of the recheck, before entering the
EPQ environment. This could be inefficient, if such an initplan is
expensive and goes unused again during the recheck --- but that's piling
one layer of improbability atop another. It doesn't seem worth adding
more complexity to prevent that, at least not in the back branches.
It was convenient to use the new-in-v11 ExecEvalParamExecParams function
to implement this, but I didn't like either its name or the specifics of
its API, so revise that.
Back-patch all the way. Rather than rewrite the patch to avoid depending
on bms_next_member() in the oldest branches, I chose to back-patch that
function into 9.4 and 9.3. (This isn't the first time back-patches have
needed that, and it exhausted my patience.) I also chose to back-patch
some test cases added by commits 71404af2a and 342a1ffa2 into 9.4 and 9.3,
so that the 9.x versions of eval-plan-qual.spec are all the same.
Andrew Gierth diagnosed the problem and contributed the added test cases,
though the actual code changes are by me.
Discussion: https://postgr.es/m/A033A40A-B234-4324-BE37-272279F7B627@tripadvisor.com
Allowing a sub-select containing LIMIT/OFFSET in workers can lead to
inconsistent results at the top-level as there is no guarantee that the
row order will be fully deterministic. The fix is to prohibit pushing
LIMIT/OFFSET within sub-selects to workers.
Reported-by: Andrew Fletcher
Bug: 15324
Author: Amit Kapila
Reviewed-by: Dilip Kumar
Backpatch-through: 9.6
Discussion: https://postgr.es/m/153417684333.10284.11356259990921828616@wrigleys.postgresql.org
This patch removes two sources of interference between nominally
independent functions when one SPI-using function calls another,
perhaps without knowing that it does so.
Chapman Flack pointed out that xml.c's query_to_xml_internal() expects
SPI_tuptable and SPI_processed to stay valid across datatype output
function calls; but it's possible that such a call could involve
re-entrant use of SPI. It seems likely that there are similar hazards
elsewhere, if not in the core code then in third-party SPI users.
Previously SPI_finish() reset SPI's API globals to zeroes/nulls, which
would typically make for a crash in such a situation. Restoring them
to the values they had at SPI_connect() seems like a considerably more
useful behavior, and it still meets the design goal of not leaving any
dangling pointers to tuple tables of the function being exited.
Also, cause SPI_connect() to reset these variables to zeroes/nulls after
saving them. This prevents interference in the opposite direction: it's
possible that a SPI-using function that's only ever been tested standalone
contains assumptions that these variables start out as zeroes. That was
the case as long as you were the outermost SPI user, but not so much for
an inner user. Now it's consistent.
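In outline, the connect/finish pair now behaves like this (a sketch; the
saved-state field names are approximate):
    /* SPI_connect(): save caller's state, then present a clean slate */
    _SPI_current->outer_processed = SPI_processed;
    _SPI_current->outer_tuptable = SPI_tuptable;
    SPI_processed = 0;
    SPI_tuptable = NULL;

    /* SPI_finish(): restore what the caller saw at SPI_connect() time */
    SPI_processed = _SPI_current->outer_processed;
    SPI_tuptable = _SPI_current->outer_tuptable;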
Report and fix suggestion by Chapman Flack, actual patch by me.
Back-patch to all supported branches.
Discussion: https://postgr.es/m/9fa25bef-2e4f-1c32-22a4-3ad0723c4a17@anastigmatix.net
LockRelationOid and sibling routines supposed that, if our session already
holds the lock they were asked to acquire, they could skip calling
AcceptInvalidationMessages on the grounds that we must have already read
any remote sinval messages issued against the relation being locked.
This is normally true, but there's a critical special case where it's not:
processing inside AcceptInvalidationMessages might attempt to access system
relations, resulting in a recursive call to acquire a relation lock.
Hence, if the outer call had acquired that same system catalog lock, we'd
fall through, despite the possibility that there's an as-yet-unread sinval
message for that system catalog. This could, for example, result in
failure to access a system catalog or index that had just been processed
by VACUUM FULL. This is the explanation for buildfarm failures we've been
seeing intermittently for the past three months. The bug is far older
than that, but commits a54e1f158 et al added a new recursion case within
AcceptInvalidationMessages that is apparently easier to hit than any
previous case.
To fix this, we must not skip calling AcceptInvalidationMessages until
we have *finished* a call to it since acquiring a relation lock, not
merely acquired the lock. (There's already adequate logic inside
AcceptInvalidationMessages to deal with being called recursively.)
Fortunately, we can implement that at trivial cost, by adding a flag
to LOCALLOCK hashtable entries that tracks whether we know we have
completed such a call.
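The new tracking is just a boolean in each LOCALLOCK entry (sketch; cf.
lock.h):
    typedef struct LOCALLOCK
    {
        /* ... existing fields ... */
        bool        lockCleared;    /* we read all sinval messages received
                                     * since acquiring this lock */
    } LOCALLOCK;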
There is an API hazard added by this patch for external callers of
LockAcquire: if anything is testing for LOCKACQUIRE_ALREADY_HELD,
it might be fooled by the new return code LOCKACQUIRE_ALREADY_CLEAR
into thinking the lock wasn't already held. This should be a fail-soft
condition, though, unless something very bizarre is being done in
response to the test.
Also, I added an additional output argument to LockAcquireExtended,
assuming that that probably isn't called by any outside code given
the very limited usefulness of its additional functionality.
Back-patch to all supported branches.
Discussion: https://postgr.es/m/12259.1532117714@sss.pgh.pa.us
There's a project policy against using plain "char buf[BLCKSZ]" local
or static variables as page buffers; preferred style is to palloc or
malloc each buffer to ensure it is MAXALIGN'd. However, that policy's
been ignored in an increasing number of places. We've apparently got
away with it so far, probably because (a) relatively few people use
platforms on which misalignment causes core dumps and/or (b) the
variables chance to be sufficiently aligned anyway. But this is not
something to rely on. Moreover, even if we don't get a core dump,
we might be paying a lot of cycles for misaligned accesses.
To fix, invent new union types PGAlignedBlock and PGAlignedXLogBlock
that the compiler must allocate with sufficient alignment, and use
those in place of plain char arrays.
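The new types overlay the buffer with double/int64 members to force
alignment (as defined in c.h):
    typedef union PGAlignedBlock
    {
        char        data[BLCKSZ];
        double      force_align_d;
        int64       force_align_i64;
    } PGAlignedBlock;

    typedef union PGAlignedXLogBlock
    {
        char        data[XLOG_BLCKSZ];
        double      force_align_d;
        int64       force_align_i64;
    } PGAlignedXLogBlock;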
I used these types even for variables where there's no risk of a
misaligned access, since ensuring proper alignment should make
kernel data transfers faster. I also changed some places where
we had been palloc'ing short-lived buffers, for coding style
uniformity and to save palloc/pfree overhead.
Since this seems to be a live portability hazard (despite the lack
of field reports), back-patch to all supported versions.
Patch by me; thanks to Michael Paquier for review.
Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de
In general, Postgres requires -fno-strict-aliasing with compilers that
implement C99 strict aliasing rules. There's little hope of getting
rid of that overall. But it seems like it would be a good idea if
storage/checksum_impl.h in particular didn't depend on it, because
that header is explicitly intended to be included by external programs.
We don't have a lot of control over the compiler switches that an
external program might use, as shown by Michael Banck's report of
failure in a privately-modified version of pg_verify_checksums.
Hence, switch to using a union in place of willy-nilly pointer casting
inside this file. I think this makes the code a bit more readable
anyway.
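A sketch of the union approach (the real definition in checksum_impl.h is
shaped by N_SUMS, the number of parallel checksum lanes):
    /* view a page either as a header or as an array of 32-bit chunks */
    typedef union
    {
        PageHeaderData phdr;
        uint32      data[BLCKSZ / (sizeof(uint32) * N_SUMS)][N_SUMS];
    } PGChecksummablePage;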
checksum_impl.h hasn't changed since it was introduced in 9.3,
so back-patch all the way.
Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de
When expanding an updatable view that is an INSERT's target, the rewriter
failed to rewrite Vars in the ON CONFLICT UPDATE clause. This accidentally
worked if the view was just "SELECT * FROM ...", as the transformation
would be a no-op in that case. With more complicated view targetlists,
this omission would often lead to "attribute ... has the wrong type" errors
or even crashes, as reported by Mario De Frutos Dieguez.
Fix by adding code to rewriteTargetView to fix up the data structure
correctly. The easiest way to update the exclRelTlist list is to rebuild
it from scratch looking at the new target relation, so factor the code
for that out of transformOnConflictClause to make it sharable.
In passing, avoid duplicate permissions checks against the EXCLUDED
pseudo-relation, and prevent useless view expansion of that relation's
dummy RTE. The latter is only known to happen (after this patch) in cases
where the query would fail later due to not having any INSTEAD OF triggers
for the view. But by exactly that token, it would create an unintended
and very poorly tested state of the query data structure, so it seems like
a good idea to prevent it from happening at all.
This has been broken since ON CONFLICT was introduced, so back-patch
to 9.5.
Dean Rasheed, based on an earlier patch by Amit Langote;
comment-kibitzing and back-patching by me
Discussion: https://postgr.es/m/CAFYwGJ0xfzy8jaK80hVN2eUWr6huce0RU8AgU04MGD00igqkTg@mail.gmail.com
Commits 742869946 et al turn out to be a couple bricks shy of a load.
We were dumping the stored values of GUC_LIST_QUOTE variables as they
appear in proconfig or setconfig catalog columns. However, although that
quoting rule looks a lot like SQL-identifier double quotes, there are two
critical differences: empty strings ("") are legal, and depending on which
variable you're considering, values longer than NAMEDATALEN might be valid
too. So the current technique fails altogether on empty-string list
entries (as reported by Steven Winfield in bug #15248) and it also risks
truncating file pathnames during dump/reload of GUC values that are lists
of pathnames.
To fix, split the stored value without any downcasing or truncation,
and then emit each element as a SQL string literal.
This is a tad annoying, because we now have three copies of the
comma-separated-string splitting logic in varlena.c as well as a fourth
one in dumputils.c. (Not to mention the randomly-different-from-those
splitting logic in libpq...) I looked at unifying these, but it would
be rather a mess unless we're willing to tweak the API definitions of
SplitIdentifierString, SplitDirectoriesString, or both. That might be
worth doing in future; but it seems pretty unsafe for a back-patched
bug fix, so for now accept the duplication.
Back-patch to all supported branches, as the previous fix was.
Discussion: https://postgr.es/m/7585.1529435872@sss.pgh.pa.us
When multiple relations are deleted in the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which scans the whole shared_buffers only once. OTOH,
previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was
called for each file to delete, which scanned shared_buffers
multiple times. Obviously this could greatly increase the WAL replay
time, especially when shared_buffers was huge.
To alleviate this situation, this commit changes the recovery code so that
it also calls smgrdounlinkall() only once to delete multiple
relation files.
This is just a fix for an oversight in commit 279628a0a7, not a new feature.
So, per discussion on pgsql-hackers, we concluded that this should be
back-patched to all supported versions.
Author: Fujii Masao
Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL that allows
deleting a tuple (setting xmax) while some other not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change the schema acquire an ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (e.g. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
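A sketch of the new bookkeeping (names follow the HEAD patch, approximately):
    struct ReorderBuffer
    {
        /* ... */
        dlist_head  toplevel_by_lsn;            /* existing: by first LSN */
        dlist_head  txns_by_base_snapshot_lsn;  /* new: by base-snapshot LSN */
    };
The oldest xmin is then simply the base snapshot's xmin of the transaction
at the head of the new list.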
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
When vacuum processes a relation it uses the corresponding relcache
entry's relfrozenxid / relminmxid as a cutoff for when to remove
tuples etc. Unfortunately for nailed relations (i.e. critical system
catalogs) bugs could frequently lead to the corresponding relcache
entry being stale.
This set of bugs could cause actual data corruption as vacuum would
potentially not remove the correct row versions, potentially reviving
them at a later point. After 699bf7d05c some corruptions in this vein
were prevented, but the additional error checks could also trigger
spuriously. Examples of such errors are:
ERROR: found xmin ... from before relfrozenxid ...
and
ERROR: found multixact ... from before relminmxid ...
To be caused by this bug the errors have to occur on system catalog
tables.
The two bugs are:
1) Invalidations for nailed relations were ignored, based on the
theory that the relcache entry for such tables doesn't
change. Which is largely true, except for fields like relfrozenxid
etc. This means that changes to relations vacuumed in other
sessions weren't picked up by already existing sessions. Luckily
autovacuum doesn't have particularly long-running sessions.
2) For shared *and* nailed relations, the shared relcache init file
was never invalidated while running. That means that for such
tables (e.g. pg_authid, pg_database) it's not just already existing
sessions that are affected, but even new connections are as well.
That explains why the reports were usually about pg_authid et al.
To fix 1), revalidate the rd_rel portion of a relcache entry when
invalid. This implies a bit of extra complexity to deal with
bootstrapping, but it's not too bad. The fix for 2) is simpler,
simply always remove both the shared and local init files.
Author: Andres Freund
Reviewed-By: Alvaro Herrera
Discussion:
https://postgr.es/m/20180525203736.crkbg36muzxrjj5e@alap3.anarazel.de
https://postgr.es/m/CAMa1XUhKSJd98JW4o9StWPrfS=11bPgG+_GDMxe25TvUY4Sugg@mail.gmail.com
https://postgr.es/m/CAKMFJucqbuoDRfxPDX39WhA3vJyxweRg_zDVXzncr6+5wOguWA@mail.gmail.com
https://postgr.es/m/CAGewt-ujGpMLQ09gXcUFMZaZsGJC98VXHEFbF-tpPB0fB13K+A@mail.gmail.com
Backpatch: 9.3-
Ancient HPUX, for one, does this. We hadn't noticed due to the lack
of regression tests that required a working strtoll.
(I was slightly tempted to remove the other historical spelling,
strto[u]q, since it seems we have no buildfarm members testing that case.
But I refrained.)
Discussion: https://postgr.es/m/151935568942.1461.14623890240535309745@wrigleys.postgresql.org
Buildfarm member dromedary is still unhappy about the recently-added
ecpg "long long" tests. The reason turns out to be that it includes
"-ansi" in its CFLAGS, and in their infinite wisdom Apple have decided
to hide the declarations of strtoll/strtoull in C89-compliant builds.
(I find it pretty curious that they hide those function declarations
when you can nonetheless declare a "long long" variable, but anyway
that is their behavior, both on dromedary's obsolete macOS version and
the newest and shiniest.) As a result, gcc assumes these functions
return "int", leading naturally to wrong results.
(Looking at dromedary's past build results, it's evident that this
problem also breaks pg_strtouint64() on 32-bit platforms; but we
evidently have no regression tests that exercise that function with
values above 32 bits.)
To fix, supply declarations for these functions when the platform
provides the functions but not the declarations, using the same type
of mechanism as we use for some other similar cases.
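The added declarations take roughly this shape (cf. c.h; configure supplies
the HAVE_DECL_ symbols):
    /* Older systems have the functions but hide the declarations */
    #if defined(HAVE_STRTOLL) && !HAVE_DECL_STRTOLL
    extern long long strtoll(const char *str, char **endptr, int base);
    #endif
    #if defined(HAVE_STRTOULL) && !HAVE_DECL_STRTOULL
    extern unsigned long long strtoull(const char *str, char **endptr, int base);
    #endif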
Discussion: https://postgr.es/m/151935568942.1461.14623890240535309745@wrigleys.postgresql.org
This is needed for full support of "long long" variables in ecpg, but
the previous patch for bug #15080 (commits 51057feaa et al) missed it.
In MSVC versions where the functions don't exist under those names,
we can nonetheless use _strtoi64() and _strtoui64().
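A sketch of the mapping (the exact guards in the Windows port header are
approximate here):
    #if defined(_MSC_VER)
    #define strtoll  _strtoi64
    #define strtoull _strtoui64
    #endif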
Like the previous patch, back-patch all the way.
Dang Minh Huong
Discussion: https://postgr.es/m/151935568942.1461.14623890240535309745@wrigleys.postgresql.org
If a continuation record is split so that its first half has already been
removed from the master, and is only present in pg_wal, and there is a
recycled WAL segment in the standby server that looks like it would
contain the second half, recovery would get stuck. The code in
XLogPageRead() incorrectly started streaming at the beginning of the
WAL record, even if we had already read the first page.
Backpatch to 9.4. In principle, older versions have the same problem, but
without replication slots, there was no straightforward mechanism to
prevent the master from recycling old WAL that was still needed by standby.
Without such a mechanism, I think it's reasonable to assume that there's
enough slack in how many old segments are kept around to not run into this,
or you have a WAL archive.
Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with
some extra comments by me.
Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
On further reflection, commit e5d83995e didn't go far enough: pretty much
everywhere in the planner that examines a clause's is_pushed_down flag
ought to be changed to use the more complicated behavior where we also
check the clause's required_relids. Otherwise we could make incorrect
decisions about whether, say, a clause is safe to use as a hash clause.
Some (many?) of these places are safe as-is, either because they are
never reached while considering a parameterized path, or because there
are additional checks that would reject a pushed-down clause anyway.
However, it seems smarter to just code them all the same way rather
than rely on easily-broken reasoning of that sort.
In support of that, invent a new macro RINFO_IS_PUSHED_DOWN that should
be used in place of direct tests on the is_pushed_down flag.
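The macro encapsulates the two-part test:
    /* a clause is pushed down relative to a given join level if it is
     * marked as such, or references relations outside that level */
    #define RINFO_IS_PUSHED_DOWN(rinfo, joinrelids) \
        ((rinfo)->is_pushed_down || \
         !bms_is_subset((rinfo)->required_relids, joinrelids))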
Like the previous patch, back-patch to all supported branches.
Discussion: https://postgr.es/m/f8128b11-c5bf-3539-48cd-234178b2314d@proxel.se
In some cases a clause attached to an outer join can be pushed down into
the outer join's RHS even though the clause is not degenerate --- this
can happen if we choose to make a parameterized path for the RHS. If
the clause ends up attached to a lower outer join, we'd misclassify it
as being a "join filter" not a plain "filter" condition at that node,
leading to wrong query results.
To fix, teach extract_actual_join_clauses to examine each join clause's
required_relids, not just its is_pushed_down flag. (The latter now
seems vestigial, or at least in need of rethinking, but we won't do
anything so invasive as redefining it in a bug-fix patch.)
This has been wrong since we introduced parameterized paths in 9.2,
though it's evidently hard to hit given the lack of previous reports.
The test case used here involves a lateral function call, and I think
that a lateral reference may be required to get the planner to select
a broken plan; though I wouldn't swear to that. In any case, even if
LATERAL is needed to trigger the bug, it still affects all supported
branches, so back-patch to all.
Per report from Andreas Karlsson. Thanks to Andrew Gierth for
preliminary investigation.
Discussion: https://postgr.es/m/f8128b11-c5bf-3539-48cd-234178b2314d@proxel.se
The atomics code asserts proper alignment in various places. That's
mainly because the alignment of 64bit integers is not sufficient for
atomic operations on all platforms. Some ABIs only have four byte
alignment, but don't have atomic behavior when crossing page
boundaries.
The flags code isn't affected by that however, as the type alignment
always is sufficient for atomic operations. Nevertheless the code
asserted alignment requirements. Before 8c3debbb it was only broken on
hppa; after it, further platforms are probably affected.
Thus remove the assertions for pg_atomic_flag operators.
Per buildfarm animal pademelon.
Discussion: https://postgr.es/m/7223.1523124425@sss.pgh.pa.us
Backpatch: 9.5-
The atomics fallback implementation for pg_atomic_flag was broken,
returning the inverted value from pg_atomic_test_set_flag(). This was
unnoticed because a) atomic flags were unused until recently b) the
test code wasn't run when the fallback implementation was in
use (because it didn't allow testing for some edge cases).
Fix the bug, and improve the fallback so it has the same behaviour as
the non-fallback implementation in the problematic edge cases. That
breaks ABI compatibility in the back branches when fallbacks are in
use, but given they were broken until now...
Author: Andres Freund
Reported-by: Daniel Gustafsson
Discussion:
https://postgr.es/m/FB948276-7B32-4B77-83E6-D00167F8EEB4@yesql.se
https://postgr.es/m/20180406233854.uni2h3mbnveczl32@alap3.anarazel.de
Backpatch: 9.5-, where the atomics abstraction was introduced.
Richard Yen reported that pg_upgrade failed if the target cluster had
force_parallel_mode = on, because binary_upgrade_create_empty_extension()
is marked parallel restricted, allowing it to be executed in parallel
mode, which complains because it tries to acquire an XID.
In general, no function that might try to modify database data should
be considered parallel safe or restricted, since execution of it might
force XID acquisition. We found several other examples of this mistake.
Furthermore, functions that execute user-supplied SQL queries or query
fragments, or pull data from user-supplied cursors, had better be marked
both volatile and parallel unsafe, because we don't know what the supplied
query or cursor might try to do. There were several tsquery and XML
functions that had the wrong proparallel marking for this, and some of
them were even mislabeled as to volatility.
All these bugs are old, dating back to 9.6 for the proparallel mistakes
and much further for the provolatile mistakes. We can't force a
catversion bump in the back branches, but we can at least ensure that
installations initdb'd in future have the right values.
Thomas Munro and Tom Lane
Discussion: https://postgr.es/m/CAEepm=2sNDScSLTfyMYu32Q=ob98ZGW-vM_2oLxinzSABGQ6VA@mail.gmail.com
Code that prints out the contents of setconfig or proconfig arrays in
SQL format needs to handle GUC_LIST_QUOTE variables differently from
other ones, because for those variables, flatten_set_variable_args()
already applied a layer of quoting. The value can therefore safely
be printed as-is, and indeed must be, or flatten_set_variable_args()
will muck it up completely on reload. For all other GUC variables,
it's necessary and sufficient to quote the value as a SQL literal.
We'd recognized the need for this long ago, but mis-analyzed the
need slightly, thinking that all GUC_LIST_INPUT variables needed
the special treatment. That's actually wrong, since a valid value
of a LIST variable might include characters that need quoting,
although no existing variables accept such values.
More to the point, we hadn't made any particular effort to keep the
various places that deal with this up-to-date with the set of variables
that actually need special treatment, meaning that we'd do the wrong
thing with, for example, temp_tablespaces values. This affects dumping
of SET clauses attached to functions, as well as ALTER DATABASE/ROLE SET
commands.
In ruleutils.c we can fix it reasonably honestly by exporting a guc.c
function that allows discovering the flags for a given GUC variable.
But pg_dump doesn't have easy access to that, so continue the old method
of having a hard-wired list of affected variable names. At least we can
fix it to have just one list not two, and update the list to match
current reality.
A remaining problem with this is that it only works for built-in
GUC variables. pg_dump's list obviously knows nothing of third-party
extensions, and even the "ask guc.c" method isn't bulletproof since
the relevant extension might not be loaded. There's no obvious
solution to that, so for now, we'll just have to discourage extension
authors from inventing custom GUCs that need GUC_LIST_QUOTE.
This has been busted for a long time, so back-patch to all supported
branches.
Michael Paquier and Tom Lane, reviewed by Kyotaro Horiguchi and
Pavel Stehule
Discussion: https://postgr.es/m/20180111064900.GA51030@paquier.xyz
The original coding of the SP-GiST scan traversalValue feature (commit
ccd6eb49a) arranged for traversal values to be stored in the query's main
executor context. That's fine if there's only one index scan per query,
but if there are many, we have a memory leak as successive scans create
new traversal values. Fix it by creating a separate memory context for
traversal values, which we can reset during spgrescan(). Back-patch
to 9.6 where this code was introduced.
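In sketch form (the context name string is illustrative):
    /* at beginscan: a private context to hold traversal values */
    so->traversalCxt = AllocSetContextCreate(CurrentMemoryContext,
                                             "SP-GiST traversal-value context",
                                             ALLOCSET_DEFAULT_SIZES);

    /* at rescan: discard all traversal values from the previous scan */
    MemoryContextReset(so->traversalCxt);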
In principle, adding the traversalCxt field to SpGistScanOpaqueData
creates an ABI break in the back branches. But I (tgl) have little
sympathy for extensions including spgist_private.h, so I'm not very
worried about that. Alternatively we could stick the new field at the
end of the struct in back branches, but that has its own downsides.
Anton Dignös, reviewed by Alexander Kuzmenkov
Discussion: https://postgr.es/m/CALNdv1jb6y2Te-m8xHLxLX12RsBmZJ1f4hESX7J0HjgyOhA9eA@mail.gmail.com
refresh_by_match_merge() has some issues in the way it builds a SQL
query to construct the "diff" table:
1. It doesn't require the selected unique index(es) to be indimmediate.
2. It doesn't pay attention to the particular equality semantics enforced
by a given index, but just assumes that they must be those of the column
datatype's default btree opclass.
3. It doesn't check that the indexes are btrees.
4. It's insufficiently careful to ensure that the parser will pick the
intended operator when parsing the query. (This would have been a
security bug before CVE-2018-1058.)
5. It's not careful about indexes on system columns.
The way to fix #4 is to make use of the existing code in ri_triggers.c
for generating an arbitrary binary operator clause. I chose to move
that to ruleutils.c, since that seems a more reasonable place to be
exporting such functionality from than ri_triggers.c.
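The sort of construction involved looks roughly like this; the function
name and argument list are assumptions about the code moved from
ri_triggers.c, not a confirmed signature:

    /* Sketch: emit "newdata.col OPERATOR(pg_catalog.=) mv.col" so the
     * parser cannot be talked into resolving some other operator */
    generate_operator_clause(&querybuf,
                             "newdata.col", atttype,
                             operoid,
                             "mv.col", atttype);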
While #1, #3, and #5 are just latent given existing feature restrictions,
and #2 doesn't arise in the core system for lack of alternate opclasses
with different equality behaviors, #4 seems like an issue worth
back-patching. That's the bulk of the change anyway, so just back-patch
the whole thing to 9.4 where this code was introduced.
Discussion: https://postgr.es/m/13836.1521413227@sss.pgh.pa.us
"UPDATE/DELETE WHERE CURRENT OF cursor_name" failed, with an error message
like "cannot extract system attribute from virtual tuple", if the cursor
was using an index-only scan for the target table. Fix it by digging the
current TID out of the indexscan state.
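A rough sketch of the extraction, under the pre-v12 layout in which
IndexScanDesc embeds a HeapTupleData (the field names here are
assumptions, not necessarily what the patch uses):

    /* Sketch: WHERE CURRENT OF against an index-only scan */
    IndexOnlyScanState *ioss = (IndexOnlyScanState *) scanstate;

    if (ioss->ioss_ScanDesc != NULL)
        *current_tid = ioss->ioss_ScanDesc->xs_ctup.t_self;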
It seems likely that the same failure could occur for CustomScan plans
and perhaps some FDW plan types, so that leaving this to be treated as an
internal error with an obscure message isn't as good an idea as it first
seemed. Hence, add a bit of heaptuple.c infrastructure to let us deliver
a more on-topic message. I chose to make the message match what you get
for the case where execCurrentOf can't identify the target scan node at
all, "cursor "foo" is not a simply updatable scan of table "bar"".
Perhaps it should be different, but we can always adjust that later.
In the future, it might be nice to provide hooks that would let custom
scan providers and/or FDWs deal with this in other ways; but that's
not a suitable topic for a back-patchable bug fix.
It's been like this all along, so back-patch to all supported branches.
Yugo Nagata and Tom Lane
Discussion: https://postgr.es/m/20180201013349.937dfc5f.nagata@sraoss.co.jp
One of the things canonicalize_qual() does is to remove constant-NULL
subexpressions of top-level AND/OR clauses. It does that on the assumption
that what it's given is a top-level WHERE clause, so that NULL can be
treated like FALSE. Although this is documented down inside a subroutine
of canonicalize_qual(), it wasn't mentioned in the documentation of that
function itself, and some callers hadn't gotten that memo.
Notably, commit d007a9505 caused get_relation_constraints() to apply
canonicalize_qual() to CHECK constraints. That allowed constraint
exclusion to misoptimize situations in which a CHECK constraint had a
provably-NULL subclause, as seen in the regression test case added here,
in which a child table that should be scanned is not. (Although this
thinko is ancient, the test case doesn't fail before 9.2, for reasons
I've not bothered to track down in detail. There may be related cases
that do fail before that.)
More recently, commit f0e44751d added an independent bug by applying
canonicalize_qual() to index expressions, which is even sillier since
those might not even be boolean. If they are, though, I think this
could lead to making incorrect index entries for affected index
expressions in v10. I haven't attempted to prove that though.
To fix, add an "is_check" parameter to canonicalize_qual() to specify
whether it should assume WHERE or CHECK semantics, and make it perform
NULL-elimination accordingly. Adjust the callers to apply the right
semantics, or remove the call entirely in cases where it's not known
that the expression has one or the other semantics. I also removed
the call in some cases involving partition expressions, where it should
be a no-op because such expressions should be canonical already ...
and was a no-op, independently of whether it could in principle have
done something, because it was being handed the qual in implicit-AND
format which isn't what it expects. In HEAD, add an Assert to catch
that type of mistake in future.
This represents an API break for external callers of canonicalize_qual().
While that's intentional in HEAD to make such callers think about which
case applies to them, it seems like something we probably wouldn't be
thanked for in released branches. Hence, in released branches, the
extra parameter is added to a new function canonicalize_qual_ext(),
and canonicalize_qual() is a wrapper that retains its old behavior.
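In the back branches the two cases then read roughly like this sketch:

    /* WHERE semantics: a constant-NULL subclause acts like FALSE */
    whereQual = canonicalize_qual_ext(whereQual, false);

    /* CHECK semantics: NULL means the constraint passes, so NULL
     * subclauses must not be folded away as if they were FALSE */
    checkQual = canonicalize_qual_ext(checkQual, true);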
Patch by me with suggestions from Dean Rasheed. Back-patch to all
supported branches.
Discussion: https://postgr.es/m/24475.1520635069@sss.pgh.pa.us
If convert_to_scalar is passed a pair of datatypes it can't cope with,
its former behavior was just to elog(ERROR). While this is OK so far as
the core code is concerned, there's extension code that would like to use
scalarltsel/scalargtsel/etc as selectivity estimators for operators that
work on non-core datatypes, and this behavior is a show-stopper for that
use-case. If we simply allow convert_to_scalar to return FALSE instead of
outright failing, then the main logic of scalarltsel/scalargtsel will work
fine for any operator that behaves like a scalar inequality comparison.
The lack of conversion capability will mean that we can't estimate to
better than histogram-bin-width precision, since the code will effectively
assume that the comparison constant falls at the middle of its bin. But
that's still a lot better than nothing. (Someday we should provide a way
for extension code to supply a custom version of convert_to_scalar, but
today is not that day.)
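The caller-side logic in scalarltsel/scalargtsel's histogram code then
reduces to something like this sketch (variable names assumed):

    /* Sketch: fall back to histogram-bin-width precision */
    if (convert_to_scalar(constval, consttype, &val,
                          hist_lo, hist_hi, hist_type,
                          &low, &high) &&
        high > low)
        binfrac = (val - low) / (high - low);
    else
        binfrac = 0.5;  /* unconvertible: assume middle of the bin */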
While poking at this issue, we noted that the existing code for handling
type bytea in convert_to_scalar is several bricks shy of a load.
It assumes without checking that if the comparison value is type bytea,
the bounds values are too; in the worst case this could lead to a crash.
It also fails to detoast the input values, so that the comparison result is
complete garbage if any input is toasted out-of-line, compressed, or even
just short-header. I'm not sure how often such cases actually occur ---
the bounds values, at least, are probably safe since they are elements of
an array and hence can't be toasted. But that doesn't make this code OK.
Back-patch to all supported branches, partly because author requested that,
but mostly because of the bytea bugs. The change in API for the exposed
routine convert_network_to_scalar() is theoretically a back-patch hazard,
but it seems pretty unlikely that any third-party code is calling that
function directly.
Tomas Vondra, with some adjustments by me
Discussion: https://postgr.es/m/b68441b6-d18f-13ab-b43b-9a72188a4e02@2ndquadrant.com
Although configure-based builds correctly define HAVE_LONG_LONG_INT when
appropriate (in both pg_config.h and ecpg_config.h), builds using the MSVC
scripts failed to do so. This currently has no impact on the backend,
since it uses that symbol nowhere; but it does prevent ecpg from
supporting "long long int". Fix that.
Also, adjust Solution.pm so that in the constructed ecpg_config.h file,
the "#if (_MSC_VER > 1200)" covers only the LONG_LONG_INT-related
#defines, not the whole file. AFAICS this was a thinko on somebody's
part: ENABLE_THREAD_SAFETY should always be defined in Windows builds,
and in branches using USE_INTEGER_DATETIMES, the setting of that shouldn't
depend on the compiler version either. If I'm wrong, I imagine the
buildfarm will say so.
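The generated ecpg_config.h should therefore end up shaped roughly like
this (a sketch, not the verbatim output):

    #define ENABLE_THREAD_SAFETY 1
    #if (_MSC_VER > 1200)
    #define HAVE_LONG_LONG_INT 1
    #define HAVE_LONG_LONG_INT_64 1
    #endif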
Per bug #15080 from Jonathan Allen; issue diagnosed by Michael Meskes
and Andrew Gierth. Back-patch to all supported branches.
Discussion: https://postgr.es/m/151935568942.1461.14623890240535309745@wrigleys.postgresql.org
This makes the client programs behave as documented regardless of the
connect-time search_path and regardless of user-created objects. Today,
a malicious user with CREATE permission on a search_path schema can take
control of certain of these clients' queries and invoke arbitrary SQL
functions under the client identity, often a superuser. This is
exploitable in the default configuration, where all users have CREATE
privilege on schema "public".
This changes behavior of user-defined code stored in the database, like
pg_index.indexprs and pg_extension_config_dump(). If they reach code
bearing unqualified names, "does not exist" or "no schema has been
selected to create in" errors might appear. Users may fix such errors
by schema-qualifying affected names. After upgrading, consider watching
server logs for these errors.
The --table arguments of src/bin/scripts clients have been lax; for
example, "vacuumdb -Zt pg_am\;CHECKPOINT" performed a checkpoint. That
now fails, but for now, "vacuumdb -Zt 'pg_am(amname);CHECKPOINT'" still
performs a checkpoint.
Back-patch to 9.3 (all supported versions).
Reviewed by Tom Lane, though this fix strategy was not his first choice.
Reported by Arseniy Sharoglazov.
Security: CVE-2018-1058
Given overlapping or partially redundant join clauses, for example
t1 JOIN t2 ON t1.a = t2.x AND t1.b = t2.x
the planner's EquivalenceClass machinery will ordinarily refactor the
clauses as "t1.a = t1.b AND t1.a = t2.x", so that join processing doesn't
see multiple references to the same EquivalenceClass in a list of join
equality clauses. However, if the join is outer, it's incorrect to derive
a restriction clause on the outer side from the join conditions, so the
clause refactoring does not happen and we end up with overlapping join
conditions. The code that attempted to deal with such cases had several
subtle bugs, which could result in "left and right pathkeys do not match in
mergejoin" or "outer pathkeys do not match mergeclauses" planner errors,
if the selected join plan type was a mergejoin. (It does not appear that
any actually incorrect plan could have been emitted.)
The core of the problem really was failure to recognize that the outer and
inner relations' pathkeys have different relationships to the mergeclause
list. A join's mergeclause list is constructed by reference to the outer
pathkeys, so it will always be ordered the same as the outer pathkeys, but
this cannot be presumed true for the inner pathkeys. If the inner sides of
the mergeclauses contain multiple references to the same EquivalenceClass
({t2.x} in the above example) then a simplistic rendering of the required
inner sort order is like "ORDER BY t2.x, t2.x", but the pathkey machinery
recognizes that the second sort column is redundant and throws it away.
The mergejoin planning code failed to account for that behavior properly.
One error was to try to generate cut-down versions of the mergeclause list
from cut-down versions of the inner pathkeys in the same way as the initial
construction of the mergeclause list from the outer pathkeys was done; this
could lead to choosing a mergeclause list that fails to match the outer
pathkeys. The other problem was that the pathkey cross-checking code in
create_mergejoin_plan treated the inner and outer pathkey lists
identically, whereas actually the expectations for them must be different.
That led to false "pathkeys do not match" failures in some cases, and in
principle could have led to failure to detect bogus plans in other cases,
though there is no indication that such bogus plans could be generated.
Reported by Alexander Kuzmenkov, who also reviewed this patch. This has
been broken for years (back to around 8.3 according to my testing), so
back-patch to all supported branches.
Discussion: https://postgr.es/m/5dad9160-4632-0e47-e120-8e2082000c01@postgrespro.ru
This backpatches 935dee9ad5 to the branches requested by extension
authors.
Original-Author: Metin Doslu
Original-Committer: Robert Haas
Author: Brian Cloutier
Commit 28724fd90d fixed things so that
if a background worker fails to start due to fork() failure or because
it is terminated before startup succeeds, BGWH_STOPPED will be
reported. However, that only helps if the code that uses the
background worker machinery notices the change in status, and the code
in parallel.c did not.
To fix that, do two things. First, make sure that when a worker
exits, it triggers the leader to read from error queues. That way, if
a worker which has attached to an error queue exits uncleanly, the
leader is sure to throw some error, either the contents of the
ErrorResponse sent by the worker, or "lost connection to parallel
worker" if it exited without sending one. To cover the case where
the worker never starts up in the first place or exits before
attaching to the error queue, the ParallelContext now keeps track
of which workers have sent at least one message via the error
queue. A worker which sends no messages by the time the parallel
operation finishes will be checked to see whether it exited before
attaching to the error queue; if so, a new error message, "parallel
worker failed to initialize", will be reported. If not, we'll
continue to wait until it either starts up and exits cleanly, starts
up and exits uncleanly, or fails to start, and then take the
appropriate action.
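Condensed to a sketch, the end-of-operation check is something like the
following; the per-worker "has sent a message" flag is an assumption
about the patch's naming:

    /* Worker sent nothing at all -- did it die before attaching? */
    if (!pcxt->any_message_received[i])
    {
        pid_t pid;

        if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle,
                                   &pid) == BGWH_STOPPED)
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("parallel worker failed to initialize")));
    }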
Patch by me, reviewed by Amit Kapila.
Discussion: http://postgr.es/m/CA+TgmoYnBgXgdTu6wk5YPdWhmgabYc9nY_pFLq=tB=FSLYkD8Q@mail.gmail.com
Logical decoding's reorderbuffer.c may spill transaction files to disk
when transactions are large. These are supposed to be removed when they
become "too old" by xid; but file removal requires the boundary LSNs of
the transaction to be known. The final_lsn is only set when we see the
commit or abort record for the transaction, but nothing sets the value
for transactions that crash, so the removal code misbehaves -- in
assertion-enabled builds, it crashes by a failed assertion.
To fix, modify the final_lsn of transactions that don't have a value
set, to the LSN of the very latest change in the transaction. This
causes the spilled files to be removed appropriately.
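In sketch form, using reorderbuffer's usual field names (assumed here):

    /* Crashed xact: no commit/abort record ever set final_lsn */
    if (txn->final_lsn == InvalidXLogRecPtr &&
        !dlist_is_empty(&txn->changes))
    {
        ReorderBufferChange *last =
            dlist_tail_element(ReorderBufferChange, node, &txn->changes);

        txn->final_lsn = last->lsn; /* latest change stands in for the end */
    }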
Author: Atsushi Torikoshi
Reviewed-by: Kyotaro HORIGUCHI, Craig Ringer, Masahiko Sawada
Discussion: https://postgr.es/m/54e4e488-186b-a056-6628-50628e4e4ebc@lab.ntt.co.jp
The previous commit has shown that the sanity checks around freezing
aren't strong enough. Strengthening them seems especially important
because the existence of the bug has caused corruption that we don't
want to make even worse during future vacuum cycles.
The errors are emitted with ereport rather than elog, despite being
"should never happen" messages, so that a proper error code is emitted.
To avoid superfluous translations, mark the messages as internal.
Author: Andres Freund and Alvaro Herrera
Reviewed-By: Alvaro Herrera, Michael Paquier
Discussion: https://postgr.es/m/20171102112019.33wb7g5wp4zpjelu@alap3.anarazel.de
Backpatch: 9.3-
This patch fixes a couple of low-probability bugs that could lead to
reporting an irrelevant errno value (and hence possibly a wrong SQLSTATE)
concerning directory-open or file-open failures. It also fixes places
where we took shortcuts in reporting such errors, either by using elog
instead of ereport or by using ereport but forgetting to specify an
errcode. And it eliminates a lot of just plain redundant error-handling
code.
In service of all this, export fd.c's formerly-static function
ReadDirExtended, so that external callers can make use of the coding
pattern
dir = AllocateDir(path);
while ((de = ReadDirExtended(dir, path, LOG)) != NULL)
if they'd like to treat directory-open failures as mere LOG conditions
rather than errors. Also fix FreeDir to be a no-op if we reach it
with dir == NULL, as such a coding pattern would cause.
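Spelled out in full, the intended pattern is simply:

    dir = AllocateDir(path);
    while ((de = ReadDirExtended(dir, path, LOG)) != NULL)
    {
        /* process de->d_name; an open failure was already LOGged for us */
    }
    FreeDir(dir);   /* safe: no-op when AllocateDir failed and dir is NULL */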
Then, remove code at many call sites that was throwing an error or log
message for AllocateDir failure, as ReadDir or ReadDirExtended can handle
that job just fine. Aside from being a net code savings, this gets rid of
a lot of not-quite-up-to-snuff reports, as mentioned above. (In some
places these changes result in replacing a custom error message such as
"could not open tablespace directory" with more generic wording "could not
open directory", but it was agreed that the custom wording buys little as
long as we report the directory name.) In some other call sites where we
can't just remove code, change the error reports to be fully
project-style-compliant.
Also reorder code in restoreTwoPhaseData that was acquiring a lock
between AllocateDir and ReadDir; in the unlikely but surely not
impossible case that LWLockAcquire changes errno, AllocateDir failures
would be misreported. There is no great value in opening the directory
before acquiring TwoPhaseStateLock, so just do it in the other order.
Also fix CheckXLogRemoved to guarantee that it preserves errno,
as quite a number of call sites are implicitly assuming. (Again,
it's unlikely but I think not impossible that errno could change
during a SpinLockAcquire. If so, this function was broken for its
own purposes as well as breaking callers.)
And change a few places that were using not-per-project-style messages,
such as "could not read directory" when "could not open directory" is
more correct.
Back-patch the exporting of ReadDirExtended, in case we have occasion
to back-patch some fix that makes use of it; it's not needed right now
but surely making it global is pretty harmless. Also back-patch the
restoreTwoPhaseData and CheckXLogRemoved fixes. The rest of this is
essentially cosmetic and need not get back-patched.
Michael Paquier, with a bit of additional work by me
Discussion: https://postgr.es/m/CAB7nPqRpOCxjiirHmebEFhXVTK7V5Jvw4bz82p7Oimtsm3TyZA@mail.gmail.com
rewriteTargetListUD's processing is dependent on the relkind of the query's
target table. That was fine at the time it was made to act that way, even
for queries on inheritance trees, because all tables in an inheritance tree
would necessarily be plain tables. However, the 9.5 feature addition
allowing some members of an inheritance tree to be foreign tables broke the
assumption that rewriteTargetListUD's output tlist could be applied to all
child tables with nothing more than column-number mapping. This led to
visible failures if foreign child tables had row-level triggers, and would
also break in cases where child tables belonged to FDWs that used methods
other than CTID for row identification.
To fix, delay running rewriteTargetListUD until after the planner has
expanded inheritance, so that it is applied separately to the (already
mapped) tlist for each child table. We can conveniently call it from
preprocess_targetlist. Refactor associated code slightly to avoid the
need to heap_open the target relation multiple times during
preprocess_targetlist. (The APIs remain a bit ugly, particularly around
the point of which steps scribble on parse->targetList and which don't.
But avoiding such scribbling would require a change in FDW callback APIs,
which is more pain than it's worth.)
Also fix ExecModifyTable to ensure that "tupleid" is reset to NULL when
we transition from rows providing a CTID to rows that don't. (That's
really an independent bug, but it manifests in much the same cases.)
Add a regression test checking one manifestation of this problem, which
was that row-level triggers on a foreign child table did not work right.
Back-patch to 9.5 where the problem was introduced.
Etsuro Fujita, reviewed by Ildus Kurbangaliev and Ashutosh Bapat
Discussion: https://postgr.es/m/20170514150525.0346ba72@postgrespro.ru
Originally, we palloc'd this buffer just barely big enough to hold the
largest xlog record seen so far. It turns out that that can result in
valgrind complaints, because some compilers will emit code that assumes
it can safely fetch padding bytes at the end of a struct, and those
padding bytes were unallocated so far as aset.c was concerned. We can
fix that by MAXALIGN'ing the palloc request size, ensuring that it is big
enough to include any possible padding that might've been omitted from
the on-disk record.
An additional objection to the original coding is that it could result in
many repeated palloc cycles, in the worst case where we see a series of
gradually larger xlog records. We can ameliorate that cheaply by
imposing a minimum buffer size that's large enough for most xlog records.
BLCKSZ/2 was chosen after a bit of discussion.
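Both points together amount to a sizing rule along these lines (a
sketch; the variable names are assumptions):

    /* Round up for trailing padding, and avoid repeated tiny pallocs */
    newSize = Max(MAXALIGN(record_len), BLCKSZ / 2);
    readRecordBuf = (char *) palloc(newSize);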
In passing, remove an obsolete comment in struct xl_heap_new_cid that the
combocid field is free due to alignment considerations. Perhaps that was
true at some point, but it's not now.
Back-patch to 9.5 where this code came in.
Discussion: https://postgr.es/m/E1eHa4J-0006hI-Q8@gemulon.postgresql.org
The pending list must (for correctness) always be cleaned up by vacuum, and
should (for the avoidance of surprising behavior) always be cleaned up
by an explicit call to gin_clean_pending_list, but cleanup is optional
when inserting. The old logic got this backward: cleanup was forced
if (stats == NULL), but that's going to be *false* when vacuuming and
*true* for inserts.
Masahiko Sawada, reviewed by me.
Discussion: http://postgr.es/m/CAD21AoBLUSyiYKnTYtSAbC+F=XDjiaBrOUEGK+zUXdQ8owfPKw@mail.gmail.com
Our initial work with int128 neglected alignment considerations, an
oversight that came back to bite us in bug #14897 from Vincent Lachenal.
It is unsurprising that int128 might have a 16-byte alignment requirement;
what's slightly more surprising is that even notoriously lax Intel chips
sometimes enforce that.
Raising MAXALIGN seems out of the question: the costs in wasted disk and
memory space would be significant, and there would also be an on-disk
compatibility break. Nor does it seem very practical to try to allow some
data structures to have more-than-MAXALIGN alignment requirement, as we'd
have to push knowledge of that throughout various code that copies data
structures around.
The only way out of the box is to make type int128 conform to the system's
alignment assumptions. Fortunately, gcc supports that via its
__attribute__(aligned()) pragma; and since we don't currently support
int128 on non-gcc-workalike compilers, we shouldn't be losing any platform
support this way.
Although we could have just done pg_attribute_aligned(MAXIMUM_ALIGNOF) and
called it a day, I did a little bit of extra work to make the code more
portable than that: it will also support int128 on compilers without
__attribute__(aligned()), if the native alignment of their 128-bit-int
type is no more than that of int64.
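The resulting declaration in c.h looks approximately like this:

    typedef PG_INT128_TYPE int128
    #if defined(pg_attribute_aligned)
                pg_attribute_aligned(MAXIMUM_ALIGNOF)
    #endif
                ;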
Add a regression test case that exercises the one known instance of the
problem, in parallel aggregation over a bigint column.
Back-patch of commit 751804998. The code known to be affected only exists
in 9.6 and later, but we do have some stuff using int128 in 9.5, so patch
back to 9.5.
Discussion: https://postgr.es/m/20171110185747.31519.28038@wrigleys.postgresql.org
Generalize section 1 to handle stuff that is principally about the
compiler (not libraries), such as attributes, and collect stuff there
that had been dropped into various other parts of c.h. Also, push
all the gettext macros into section 8, so that section 0 is really
just inclusions rather than inclusions and random other stuff.
The primary goal here is to get pg_attribute_aligned() defined before
section 3, so that we can use it with int128. But this seems like good
cleanup anyway.
This patch just moves macro definitions around, and shouldn't result
in any changes in generated code.
Back-patch of commit 91aec93e6.
Discussion: https://postgr.es/m/20171110185747.31519.28038@wrigleys.postgresql.org
The update path of an INSERT ... ON CONFLICT DO UPDATE requires SELECT
permission on the columns of the arbiter index, but it failed to check
for that in the case of an arbiter specified by constraint name.
In addition, for a table with row level security enabled, it failed to
check updated rows against the table's SELECT policies when the update
path was taken (regardless of how the arbiter index was specified).
Backpatch to 9.5 where ON CONFLICT DO UPDATE and RLS were introduced.
Security: CVE-2017-15099
It turns out we misdiagnosed what the real problem was. Revert the
previous changes, because they may have worse consequences going
forward. A better fix is forthcoming.
The simplistic test case is kept, though disabled.
Discussion: https://postgr.es/m/20171102112019.33wb7g5wp4zpjelu@alap3.anarazel.de
Without this fix, dropping a role can sometimes result in parallel
query failures in sessions that have used "SET ROLE" to assume the
dropped role, even if that setting isn't active any more.
Report by Pavan Deolasee. Patch by Amit Kapila, reviewed by me.
Discussion: http://postgr.es/m/CABOikdOomRcZsLsLK+Z+qENM1zxyaWnAvFh3MJZzZnnKiF+REg@mail.gmail.com
If we try to run a parallel plan in serial mode because, for example,
it's going to be scanned via a cursor, but for some reason we're
already in parallel mode (for example because an outer query is
running in parallel), we'd incorrectly try to launch workers.
Fix by adding a flag to the EState, so that we can be certain that
ExecutePlan() and ExecGather()/ExecGatherMerge() will have the same
idea about whether we are executing serially or in parallel.
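As a sketch, with the flag name being an assumption about the patch:

    /* Executor startup: record the decision once, up front */
    estate->es_use_parallel_mode = use_parallel_mode;

    /* ExecGather(): only launch workers if that decision allows it */
    if (gather->num_workers > 0 && estate->es_use_parallel_mode)
        LaunchParallelWorkers(pcxt);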
Report and fix by Amit Kapila with help from Kuntal Ghosh. A few
tweaks by me.
Discussion: http://postgr.es/m/CAA4eK1+_BuZrmVCeua5Eqnm4Co9DAXdM5HPAOE2J19ePbR912Q@mail.gmail.com
This is especially useful in the case of "VARIADIC ANY" functions. The
caller can get the arguments and types regardless of whether or not an
explicit VARIADIC array argument has been used. The function also
provides an option to convert arguments of type "unknown" to "text".
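A usage sketch for a VARIADIC ANY function; the function's exact name
and signature here are assumptions based on the description above:

    Datum      *args;
    Oid        *types;
    bool       *nulls;
    int         nargs;

    /* true: coerce arguments of type "unknown" to "text" */
    nargs = extract_variadic_args(fcinfo, 0, true,
                                  &args, &types, &nulls);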
Michael Paquier and me, reviewed by Tom Lane.
Backpatch to 9.4 in order to support the following json bug fix.
If we merge the transition calculations for two different aggregates,
it's reasonable to assume that the transition function should not care
which of those Aggref structs it gets from AggGetAggref(). It is not
reasonable to make the same assumption about an aggregate final function,
however. Commit 804163bc2 broke this, as it will pass whichever Aggref
was first associated with the transition state in both cases.
This doesn't create an observable bug so far as the core system is
concerned, because the only existing uses of AggGetAggref() are in
ordered-set aggregates that happen to not pay attention to anything
but the input properties of the Aggref; and besides that, we disabled
sharing of transition calculations for OSAs yesterday. Nonetheless,
if some third-party code were using AggGetAggref() in a normal aggregate,
they would be entitled to call this a bug. Hence, back-patch the fix
to 9.6 where the problem was introduced.
In passing, improve some of the comments about transition state sharing.
Discussion: https://postgr.es/m/CAB4ELO5RZhOamuT9Xsf72ozbenDLLXZKSk07FiSVsuJNZB861A@mail.gmail.com
Up to now async.c has used TransactionIdIsInProgress() to detect whether
a notify message's source transaction is still running. However, that
function has a quick-exit path that reports that XIDs before RecentXmin
are no longer running. If a listening backend is doing nothing but
listening, and not running any queries, there is nothing that will advance
its value of RecentXmin. Once 2 billion transactions elapse, the
RecentXmin check causes active transactions to be reported as not running.
If they aren't committed yet according to CLOG, async.c decides they
aborted and discards their messages. The timing for that is a bit tight
but it can happen when multiple backends are sending notifies concurrently.
The net symptom therefore is that a sufficiently-long-surviving
listen-only backend starts to miss some fraction of NOTIFY traffic,
but only under heavy load.
The only function that updates RecentXmin is GetSnapshotData().
A brute-force fix would therefore be to take a snapshot before
processing incoming notify messages. But that would add cycles,
as well as contention for the ProcArrayLock. We can be smarter:
having taken the snapshot, let's use that to check for running
XIDs, and not call TransactionIdIsInProgress() at all. In this
way we reduce the number of ProcArrayLock acquisitions from one
per message to one per notify interrupt; that's the same under
light load but should be a benefit under heavy load. Light testing
says that this change is a wash performance-wise for normal loads.
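Per message, the test then becomes roughly this sketch (assuming the
snapshot-membership check used is XidInMVCCSnapshot):

    /* One snapshot per notify interrupt, reused for every message */
    Snapshot    snapshot = RegisterSnapshot(GetLatestSnapshot());

    /* ... then, for each queue entry's source xid: */
    if (XidInMVCCSnapshot(xid, snapshot))
    {
        /* still in progress: stop here and re-examine the entry later */
    }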
I looked around for other callers of TransactionIdIsInProgress()
that might be at similar risk, and didn't find any; all of them
are inside transactions that presumably have already taken a
snapshot.
Problem report and diagnosis by Marko Tiikkaja, patch by me.
Back-patch to all supported branches, since it's been like this
since 9.0.
Discussion: https://postgr.es/m/20170926182935.14128.65278@wrigleys.postgresql.org
The logical decoding functions do BeginInternalSubTransaction and
RollbackAndReleaseCurrentSubTransaction to clean up after themselves.
It turns out that AtEOSubXact_SPI has an undocumented assumption that
we always need to cancel the active SPI operation in the SPI context
that surrounds the subtransaction (if there is one). That's true
when the RollbackAndReleaseCurrentSubTransaction call is coming from
the SPI-using function itself, but not when it's happening inside
some unrelated function invoked by a SPI query. In practice the
affected callers are the various PLs.
To fix, record the current subtransaction ID when we begin a SPI
operation, and clean up only if that ID is the subtransaction being
canceled.
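In sketch form (the field name is an assumption):

    /* When a SPI operation begins: remember which subxact started it */
    _SPI_current->execSubid = GetCurrentSubTransactionId();

    /* AtEOSubXact_SPI(): cancel only if the aborting subxact (or a
     * child of it) owns the active operation */
    if (_SPI_current->execSubid >= mySubid)
    {
        _SPI_current->execSubid = InvalidSubTransactionId;
        /* ... free the active tuptable, etc ... */
    }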
Also, remove AtEOSubXact_SPI's assertion that it must have cleaned
up the surrounding SPI context's active tuptable. That's proven
wrong by the same test case.
Also clarify (or, if you prefer, reinterpret) the calling conventions
for _SPI_begin_call and _SPI_end_call. The memory context cleanup
in the latter means that these have always had the flavor of a matched
resource-management pair, but they weren't documented that way before.
Per report from Ben Chobot.
Back-patch to 9.4 where logical decoding came in. In principle,
the SPI changes should go all the way back, since the problem dates
back to commit 7ec1c5a86. But given the lack of field complaints
it seems few people are using internal subtransactions in this way.
So I don't feel a need to take any risks in 9.2/9.3.
Discussion: https://postgr.es/m/73FBA179-C68C-4540-9473-71E865408B15@silentmedia.com
When some tuple versions in an update chain are frozen due to them being
older than freeze_min_age, the xmax/xmin trail can become broken. This
breaks HOT (and probably other things). A subsequent VACUUM can break
things in more serious ways, such as leaving orphan heap-only tuples
whose root HOT redirect items were removed. This can be seen because
index creation (or REINDEX) complains with an error like
ERROR: XX000: failed to find parent tuple for heap-only tuple at (0,7) in table "t"
Because of relfrozenxid constraints, we cannot avoid the freezing of the
early tuples, so we must cope with the results: whenever we see an Xmin
of FrozenTransactionId, consider it a match for whatever the previous
Xmax value was.
This problem seems to have appeared in 9.3 with multixact changes,
though strictly speaking it seems unrelated.
Since 9.4 we have commit 37484ad2a "Change the way we mark tuples as
frozen", so the fix is simple: just compare the raw Xmin (still stored
in the tuple header, since freezing merely set an infomask bit) to the
Xmax. But in 9.3 we rewrite the Xmin value to FrozenTransactionId, so
the original value is lost and we have nothing to compare the Xmax with.
To cope with that case we need to compare the Xmin with FrozenXid,
assume it's a match, and hope for the best. Sadly, since you can
pg_upgrade a 9.3 instance containing half-frozen pages to newer
releases, we need to keep the old check in newer versions too, which
seems a bit brittle; I hope we can somehow get rid of that.
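A condensed sketch of the resulting check, in its 9.4-and-up form (the
function name here is invented for illustration):

    static bool
    xmax_matches_xmin(TransactionId xmax, HeapTupleHeader htup)
    {
        if (TransactionIdEquals(xmax, HeapTupleHeaderGetXmin(htup)))
            return true;
        /* frozen tuple: the raw xmin survives, so compare that too */
        if (HeapTupleHeaderXminFrozen(htup) &&
            TransactionIdEquals(HeapTupleHeaderGetRawXmin(htup), xmax))
            return true;
        /* pg_upgrade'd 9.3 data: xmin was overwritten; assume a match */
        if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(htup),
                                FrozenTransactionId))
            return true;
        return false;
    }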
I didn't optimize the new function for performance. The new coding is
probably a bit slower than before, since there is a function call rather
than a straight comparison, but I'd rather have it work correctly than
be fast but wrong.
This is a followup after 20b6552242 fixed a few related problems.
Apparently, in 9.6 and up there are more ways to get into trouble, but
in 9.3 - 9.5 I cannot reproduce a problem anymore with this patch, so
there must be a separate bug.
Reported-by: Peter Geoghegan
Diagnosed-by: Peter Geoghegan, Michael Paquier, Daniel Wood,
Yi Wen Wong, Álvaro
Discussion: https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
On Linux, shared memory segments created with shm_open() are backed by
swap files created in tmpfs. If the swap file needs to be extended,
but there's no tmpfs space left, you get a very unfriendly SIGBUS trap.
To avoid this, force allocation of the full request size when we create
the segment. This adds a few cycles, but none that we wouldn't expend
later anyway, assuming the request isn't hugely bigger than the actual
need.
Make this code #ifdef __linux__, because (a) there's not currently a
reason to think the same problem exists on other platforms, and (b)
applying posix_fallocate() to an FD created by shm_open() isn't very
portable anyway.
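In outline (the surrounding error-handling contract is assumed):

    #ifdef __linux__
    /* Force allocation now, so we fail cleanly instead of SIGBUS later */
    int     rc;

    do
    {
        rc = posix_fallocate(fd, 0, request_size);
    } while (rc == EINTR);

    if (rc != 0)
    {
        errno = rc;     /* posix_fallocate returns the error directly */
        return false;   /* let the caller report it, as for ftruncate */
    }
    #endif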
Back-patch to 9.4 where the DSM code came in.
Thomas Munro, per a bug report from Amul Sul
Discussion: https://postgr.es/m/1002664500.12301802.1471008223422.JavaMail.yahoo@mail.yahoo.com
The order in which GRANTs are output is important as GRANTs which have
been GRANT'd by individuals via WITH GRANT OPTION GRANTs have to come
after the GRANT which included the WITH GRANT OPTION. This happens
naturally in the backend during normal operation as we only change
existing ACLs in-place, only add new ACLs to the end, and when removing
an ACL we remove any which depend on it also.
Also, adjust the comments in acl.h to make this clear.
Unfortunately, the updates to pg_dump to handle initial privileges
involved pulling apart ACLs and then combining them back together and
could end up putting them back together in an invalid order, leading to
dumps which wouldn't restore.
Fix this by adjusting the queries used by pg_dump to ensure that the
ACLs are rebuilt in the same order in which they originally appeared.
Back-patch to 9.6 where the changes for initial privileges were done.
Instead of using a cast to force the constant to be the right width,
assume we can plaster on an L, UL, LL, or ULL suffix as appropriate.
The old approach to this is very hoary, dating from before we were
willing to require compilers to have working int64 types.
This fix makes the PG_INT64_MIN, PG_INT64_MAX, and PG_UINT64_MAX
constants safe to use in preprocessor conditions, where a cast
doesn't work. Other symbolic constants that might be defined using
[U]INT64CONST are likewise safer than before.
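On a platform where int64 is "long long int", the sketch is simply:

    #define INT64CONST(x)  (x##LL)
    #define UINT64CONST(x) (x##ULL)

    /* now usable in #if tests, unlike the old cast-based definition */
    #define PG_INT64_MAX   INT64CONST(0x7FFFFFFFFFFFFFFF)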
Also fix the SIZE_MAX macro to be similarly safe, if we are forced
to provide a definition for that. The test added in commit 2e70d6b5e
happens to do what we want even with the hack "(size_t) -1" definition,
but we could easily get burnt on other tests in future.
Back-patch to all supported branches, like the previous commits.
Discussion: https://postgr.es/m/15883.1504278595@sss.pgh.pa.us
Pre-C99 platforms may lack <stdint.h> and thereby SIZE_MAX. We have
a couple of places using the hack "(size_t) -1" as a fallback, but
it wasn't universally available; which means the code added in commit
2e70d6b5e fails to compile everywhere. Move that hack to c.h so that
we can rely on having SIZE_MAX everywhere.
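The c.h fallback is presumably just the old hack, now in one place:

    #ifndef SIZE_MAX
    #define SIZE_MAX ((size_t) -1)  /* pre-C99: no <stdint.h> to supply it */
    #endif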
Per discussion, it'd be a good idea to make the macro's value safe
for use in #if-tests, but that will take a bit more work. This is
just a quick expedient to get the buildfarm green again.
Back-patch to all supported branches, like the previous commit.
Discussion: https://postgr.es/m/15883.1504278595@sss.pgh.pa.us
2cd7084524 / c6293249d changed the way individual attributes in a
TupleDesc are stored / accessed. To reduce the effort of making
extensions compatible with PostgreSQL 11, and to ease future
backpatching, backpatch introduction of TupleDescAttr() to all
releases. Do not backpatch change in storage, as that'd be a breaking
change for existing and working extensions.
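In the back branches, where a TupleDesc still holds an array of
attribute pointers, the backpatched accessor is presumably just:

    #define TupleDescAttr(tupdesc, i) ((tupdesc)->attrs[(i)])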
Author: Andres Freund
Discussion: https://postgr.es/m/20170820181723.tdswdinzptbcwhrr@alap3.anarazel.de
Backpatch: 9.2-
The sole useful effect of this function, to check that no catcache
entries have positive refcounts at transaction end, has really been
obsolete since we introduced ResourceOwners in PG 8.1. We reduced the
checks to assertions years ago, so that the function was a complete
no-op in production builds. There have been previous discussions about
removing it entirely, but consensus up to now was that it had some small
value as a cross-check for bugs in the ResourceOwner logic.
However, it now emerges that it's possible to trigger these assertions
if you hit an assert-enabled backend with SIGTERM during a call to
SearchCatCacheList, because that function temporarily increases the
refcounts of entries it's intending to add to a catcache list construct.
In a normal ERROR scenario, the extra refcounts are cleaned up by
SearchCatCacheList's PG_CATCH block; but in a FATAL exit we do a
transaction abort and exit without ever executing PG_CATCH handlers.
There's a case to be made that this is a generic hazard and we should
consider restructuring elog(FATAL) handling so that pending PG_CATCH
handlers do get run. That's pretty scary though: it could easily create
more problems than it solves. Preliminary stress testing by Andreas
Seltenreich suggests that there are not many live problems of this ilk,
so we rejected that idea.
There are more-localized ways to fix the problem; the most principled
one would be to use PG_ENSURE_ERROR_CLEANUP instead of plain PG_TRY.
But adding cycles to SearchCatCacheList isn't very appealing. We could
also weaken the assertions in AtEOXact_CatCache in some more or less
ad-hoc way, but that just makes its raison d'etre even less compelling.
In the end, the most reasonable solution seems to be to just remove
AtEOXact_CatCache altogether, on the grounds that it's not worth trying
to fix it. It hasn't found any bugs for us in many years.
Per report from Jeevan Chalke. Back-patch to all supported branches.
Discussion: https://postgr.es/m/CAM2+6=VEE30YtRQCZX7_sCFsEpoUkFBV1gZazL70fqLn8rcvBA@mail.gmail.com
XLByteToSeg and XLByteToPrevSeg calculate only a segment number. The
definitions of these macros were modified by commit
dfda6ebaec, but the comments remained
unchanged.
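For reference, the post-dfda6ebaec shape of the macro is roughly:

    /* compute a segment number only -- no byte offset */
    #define XLByteToSeg(xlrp, logSegNo) \
        logSegNo = (xlrp) / XLogSegSize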
Patch by Yugo Nagata. Back-patched to 9.3 and beyond.
When, during logical decoding, a transaction gets too big, its
contents get spilled to disk. Not just the top-transaction gets
spilled, but *also* all of its subtransactions, even if they're not
that large themselves. Unfortunately we didn't clean up
such small spilled subtransactions from disk.
Fix that, by keeping better track of whether a transaction has been
spilled to disk.
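Sketch of the bookkeeping, with the flag name being an assumption:

    txn->serialized = true;     /* set whenever this xact (or subxact)
                                 * spills changes to disk */

    /* at cleanup, for the top-level xact and every subxact: */
    if (txn->serialized)
        ReorderBufferRestoreCleanup(rb, txn);   /* unlink spill files */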
Author: Andres Freund
Reported-By: Dmitriy Sarafannikov, Fabrízio de Royes Mello
Discussion:
https://postgr.es/m/1457621358.355011041@f382.i.mail.ru
https://postgr.es/m/CAFcNs+qNMhNYii4nxpO6gqsndiyxNDYV0S=JNq0v_sEE+9PHXg@mail.gmail.com
Backpatch: 9.4-, where logical decoding was introduced
Because walsender and normal backends share the same main loop it's
problematic to have two different flag variables, set in signal
handlers, indicating a pending configuration reload. Only certain
walsender commands reach code paths checking for the
variable (START_[LOGICAL_]REPLICATION, CREATE_REPLICATION_SLOT
... LOGICAL, notably not base backups).
This is a bug present since the introduction of walsender, but it has
gotten worse in releases since then, which allow walsender to do more.
A later patch, not slated for v10, will similarly unify SIGHUP
handling in other types of processes as well.
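The unified shape is roughly this sketch, with ConfigReloadPending
standing in for the shared flag this patch introduces:

    /* one SIGHUP handler, used by walsender and regular backends alike */
    ConfigReloadPending = true;

    /* checked from the shared main loop: */
    if (ConfigReloadPending)
    {
        ConfigReloadPending = false;
        ProcessConfigFile(PGC_SIGHUP);
    }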
Author: Petr Jelinek, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170423235941.qosiuoyqprq4nu7v@alap3.anarazel.de
Backpatch: 9.2-, bug is present since 9.0
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so. At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits). So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.
To fix this, divide the walsender shutdown into two phases. First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend
is idle or runs an SQL query, this causes the backend to shut down; if
logical replication is in progress, all existing WAL records are
processed, followed by a shutdown. Otherwise this causes the walsender
to switch to the "stopping" state. In this state, the walsender will
reject any further replication commands. The checkpointer begins the
shutdown checkpoint once all walsenders are confirmed as
stopping. When the shutdown checkpoint finishes, the postmaster sends
us SIGUSR2. This instructs walsender to send any outstanding WAL,
including the shutdown checkpoint record, wait for it to be replicated
to the standby, and then exit.
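The command gate might look roughly like this (a sketch; the error
wording is illustrative):

    /* exec_replication_command(), before starting anything new */
    if (MyWalSnd->state == WALSNDSTATE_STOPPING)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("cannot execute new commands while WAL sender is in stopping mode")));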
Author: Andres Freund, based on an earlier patch by Michael Paquier
Reported-By: Fujii Masao, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
Backpatch: 9.4, where logical decoding was introduced
The snapshot assembly during the creation of logical slots relied on
waiting for transactions in xl_running_xacts to end, by checking for
their commit/abort records. Unfortunately, despite locking, it is
possible to see an xl_running_xacts record listing transactions as
running that have already WAL-logged a commit/abort record, as the
locking just prevents the ProcArray from being adjusted, and the commit
record has to be logged first.
That led to either delayed or hanging snapshot creation, because
snapbuild.c would wait "forever" to see commit/abort records for some
transactions. The hang resolved only if an xl_running_xacts record
without any running transactions happened to be logged, far from
certain on a busy server.
It's impractical to prevent that via more heavyweight locking; the
likelihood of deadlocks and significantly increased contention would
be too big.
Instead change the initial snapshot creation to be solely based on
tracking the oldest running transaction via
xl_running_xacts->oldestRunningXid - that actually ends up
significantly simplifying the code. That has two disadvantages:
1) Because we cannot fully "trust" the contents of xl_running_xacts,
we cannot use it to build the initial snapshot. Instead we have to
wait twice for all running transactions to finish.
2) Previously, unless the race occurred, a slot could be created as
soon as all transactions perceived as running had been seen to commit
or abort; now we have to wait for the next xl_running_xacts record.
To address that, trigger logging of a new xl_running_xacts record from
within snapbuild.c exactly when necessary.
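Which presumably boils down to a call such as:

    /* Sketch: don't wait for the next periodic standby snapshot */
    lsn = LogStandbySnapshot();     /* emits a fresh xl_running_xacts */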
Unfortunately snapbuild.c's SnapBuild is stored on disk, one of the
stupider ideas of a certain Mr Freund, so we can't change it in a
minor release. As this is going to be backpatched, we have to hack
around a bit to keep on-disk compatibility. A later commit will
rejigger that on master.
Author: Andres Freund, based on a quite different patch from Petr Jelinek
Analyzed-By: Petr Jelinek
Reviewed-By: Petr Jelinek
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4-, where logical decoding has been introduced
We have now grown enough registerable syscache-invalidation callback
functions that the original assumption that there would be few of them
is causing performance problems. In particular, let's fix things so that
CallSyscacheCallbacks doesn't have to search the whole array to find
which callback(s) to invoke for a given cache ID. Preserve the original
behavior that callbacks are called in order of registration, just in
case there's someplace that depends on that (which I doubt).
In support of this, export the number of syscaches from syscache.h.
People could have found that out anyway from the enum, but adding a
#define makes that much safer.
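One way to picture the arrangement (all names here are assumptions):

    /* head of the callback chain for each cache ID, or -1 */
    static int16 cache_callback_head[SysCacheSize];
    /* link to the next callback registered for the same cache ID */
    static int16 callback_link[MAX_SYSCACHE_CALLBACKS];

    for (i = cache_callback_head[cacheid]; i >= 0; i = callback_link[i])
        syscache_callback_list[i].function(syscache_callback_list[i].arg,
                                           cacheid, hashvalue);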
This provides a useful additional speedup in Mathieu Fenniak's
logical-decoding test case, although we're reaching the point of
diminishing returns there. I think any further improvement will have
to come from reducing the number of cache invalidations that are
triggered in the first place. Still, we can hope that this change
gives some incremental benefit for all invalidation scenarios.
Back-patch to 9.4 where logical decoding was introduced.
Discussion: https://postgr.es/m/CAHoiPjzea6N0zuCi=+f9v_j94nfsy6y8SU7-=bp4=7qw6_i=Rg@mail.gmail.com