This prevents a possible longjmp out of the signal handler if a timeout
or SIGINT occurs while something within the handler has transiently set
ImmediateInterruptOK. For safety we must hold off the timeout or cancel
error until we're back in mainline, or at least till we reach the end of
the signal handler when ImmediateInterruptOK was true at entry. This
syncs these functions with the logic now present in handle_sig_alarm.
AFAICT there is no live bug here in 9.0 and up, because I don't think we
currently can wait for any heavyweight lock inside these functions, and
there is no other code (except read-from-client) that will turn on
ImmediateInterruptOK. However, that was not true pre-9.0: in older
branches ProcessIncomingNotify might block trying to lock pg_listener, and
then a SIGINT could lead to undesirable control flow. It might be all
right anyway given the relatively narrow code ranges in which NOTIFY
interrupts are enabled, but for safety's sake I'm back-patching this.
Vacuum recognizes that it can update relfrozenxid by checking whether it has
processed all pages of a relation. Unfortunately it performed that check
after truncating the dead pages at the end of the relation, and used the new
number of pages to decide whether all pages have been scanned. If the new
number of pages happened to be smaller or equal to the number of pages
scanned, it incorrectly decided that all pages were scanned.
This can lead to relfrozenxid being updated, even though some pages were
skipped that still contain old XIDs. That can lead to data loss due to xid
wraparounds with some rows suddenly missing. This likely has escaped notice
so far because it takes a large number (~2^31) of xids being used to see the
effect, while a full-table vacuum before that would fix the issue.
The incorrect logic was introduced by commit
b4b6923e03. Backpatch this fix down to 8.4,
like that commit.
Andres Freund, with some modifications by me.
Previously, if VACUUM skipped vacuuming a page because it's pinned, it
didn't count that page as scanned. However, that meant that relfrozenxid
was not bumped up either, which prevented anti-wraparound vacuum from
doing its job.
Report by Миша Тюрин, analysis and patch by Sergey Burladyn and Jeff Janes.
Backpatch to 9.2, where the skip-locked-pages behavior was introduced.
contain_volatile_functions() is best applied to the output of
expression_planner(), not its input, so that insertion of function
default arguments and constant-folding have been done. (See comments
at CheckMutability, for instance.) It's perhaps unlikely that anyone
will notice a difference in practice, but still we should do it properly.
In passing, change variable type from Node* to Expr* to reduce the net
number of casts needed.
Noted while perusing uses of contain_volatile_functions().
Formerly, when using a SQL-spec timezone setting with a fixed GMT offset
(called a "brute force" timezone in the code), the session_timezone
variable was not updated to match the nominal timezone; rather, all code
was expected to ignore session_timezone if HasCTZSet was true. This is
of course obviously fragile, though a search of the code finds only
timeofday() failing to honor the rule. A bigger problem was that
DetermineTimeZoneOffset() supposed that if its pg_tz parameter was
pointer-equal to session_timezone, then HasCTZSet should override the
parameter. This would cause datetime input containing an explicit zone
name to be treated as referencing the brute-force zone instead, if the
zone name happened to match the session timezone that had prevailed
before installing the brute-force zone setting (as reported in bug #8572).
The same malady could affect AT TIME ZONE operators.
To fix, set up session_timezone so that it matches the brute-force zone
specification, which we can do using the POSIX timezone definition syntax
"<abbrev>offset", and get rid of the bogus lookaside check in
DetermineTimeZoneOffset(). Aside from fixing the erroneous behavior in
datetime parsing and AT TIME ZONE, this will cause the timeofday() function
to print its result in the user-requested time zone rather than some
previously-set zone. It might also affect results in third-party
extensions, if there are any that make use of session_timezone without
considering HasCTZSet, but in all cases the new behavior should be saner
than before.
Back-patch to all supported branches.
Use a critical section when setting the all-visible flag on an empty page,
and WAL-logging it. log_newpage_buffer() contains an assertion that it
must be called inside a critical section, and it's the right thing to do
when modifying a buffer anyway.
Also, the page should be marked dirty before calling log_newpage_buffer(),
per the comment in log_newpage_buffer() and src/backend/access/transam/README.
Patch by Andres Freund, in response to my report. Backpatch to 9.2, like
the patch that introduced these bugs (a6370fd9).
There is a rare race condition, when a transaction that inserted a tuple
aborts while vacuum is processing the page containing the inserted tuple.
Vacuum prunes the page first, which normally removes any dead tuples, but
if the inserting transaction aborts right after that, the loop after
pruning will see a dead tuple and remove it instead. That's OK, but if the
page is on a table with no indexes, and the page becomes completely empty
after removing the dead tuple (or tuples) on it, it will be immediately
marked as all-visible. That's OK, but the sanity check in vacuum would
throw a warning because it thinks that the page contains dead tuples and
was nevertheless marked as all-visible, even though it just vacuumed away
the dead tuples and so it doesn't actually contain any.
Spotted this while reading the code. It's difficult to hit the race
condition otherwise, but can be done by putting a breakpoint after the
heap_page_prune() call.
Backpatch all the way to 8.4, where this code first appeared.
Refactoring as part of commit 8ceb245680
had the unintended effect of making REINDEX TABLE and REINDEX DATABASE
no longer validate constraints enforced by the indexes in question;
REINDEX INDEX still did so. Indexes marked invalid remained so, and
constraint violations arising from data corruption went undetected.
Back-patch to 9.0, like the causative commit.
In most scenarios a portal without a ResourceOwner is dead and not subject
to any further execution, but a portal for a cursor WITH HOLD remains in
existence with no ResourceOwner after the creating transaction is over.
In this situation, if we attempt to "execute" the portal directly to fetch
data from it, we were setting CurrentResourceOwner to NULL, leading to a
segfault if the datatype output code did anything that required a resource
owner (such as trying to fetch system catalog entries that weren't already
cached). The case appears to be impossible to provoke with stock libpq,
but psqlODBC at least is able to cause it when working with held cursors.
Simplest fix is to just skip the assignment to CurrentResourceOwner, so
that any resources used by the data output operations will be managed by
the transaction-level resource owner instead. For consistency I changed
all the places that install a portal's resowner as current, even though
some of them are probably not reachable with a held cursor's portal.
Per report from Joshua Berry (with thanks to Hiroshi Inoue for developing
a self-contained test case). Back-patch to all supported versions.
The new message (and SQLSTATE) matches the corresponding error cases in
namespace.c.
This was thought to be a "can't happen" case when extension.c was written,
so we didn't think hard about how to report it. But it definitely can
happen in 9.2 and later, since we no longer require search_path to contain
any valid schema names. It's probably also possible in 9.1 if search_path
came from a noninteractive source. So, back-patch to all releases
containing this code.
Per report from Sean Chittenden, though this isn't exactly his patch.
When COPY uses the multi-insert method to insert a batch of tuples into the
heap at a time, incorrect line number was printed if something went wrong in
inserting the index tuples (primary key failure, for exampl), or processing
after row triggers.
Fixes bug #8173 reported by Lloyd Albin. Backpatch to 9.2, where the multi-
insert code was added.
Patch b19e4250b4 attempted to
preserve existing behavior regarding statistics generation in the
case that a truncation attempt was canceled due to lock conflicts.
It failed to do this accurately in two regards: (1) autovacuum had
previously generated statistics if the truncate attempt failed to
initially get the lock rather than having started the attempt, and
(2) the VACUUM ANALYZE command had always generated statistics.
Both of these changes were unintended, and are reverted by this
patch. On review, there seems to be consensus that the previous
failure to generate statistics when the truncate was terminated
was more an unfortunate consequence of how that effort was
previously terminated than a feature we want to keep; so this
patch generates statistics even when an autovacuum truncation
attempt terminates early. Another unintended change which is kept
on the basis that it is an improvement is that when a VACUUM
command is truncating, it will the new heuristic for avoiding
blocking other processes, rather than keeping an
AccessExclusiveLock on the table for however long the truncation
takes.
Per multiple reports, with some renaming per patch by Jeff Janes.
Backpatch to 9.0, where problem was created.
There was a high probability of two or more concurrent C.I.C. commands
deadlocking just before completion, because each would wait for the others
to release their reference snapshots. Fix by releasing the snapshot
before waiting for other snapshots to go away.
Per report from Paul Hinze. Back-patch to all active branches.
Since a backend adds itself to the global listener array during
Exec_ListenPreCommit, it's inappropriate for it to remove itself during
Exec_UnlistenCommit or Exec_UnlistenAllCommit --- that leads to failure
when committing a transaction that did UNLISTEN then LISTEN, since we end
up not registered though we should be. (This leads to missing later
notifications, or to Assert failures in assert-enabled builds.) Instead
deal with deregistering at the bottom of AtCommit_Notify, when we know the
final state of the listenChannels list.
Also, simplify the representation of registration status by replacing the
transient backendHasExecutedInitialListen flag with an amRegisteredListener
flag.
Per report from Greg Sabino Mullane. Back-patch to 9.0, where the problem
was introduced during the LISTEN/NOTIFY rewrite.
The original code used freeze_min_age instead of freeze_table_age. The
main consequence of this mistake is that lowering freeze_min_age would
cause full-table scans to occur much more frequently, which causes
serious issues because the number of writes required is much larger.
That feature (freeze_min_age) is supposed to affect only how soon tuples
are frozen; some pages should still be skipped due to the visibility
map.
Backpatch to 8.4, where the freeze_table_age feature was introduced.
Report and patch from Andres Freund
In situations where there are over 8MB of empty pages at the end of
a table, the truncation work for trailing empty pages takes longer
than deadlock_timeout, and there is frequent access to the table by
processes other than autovacuum, there was a problem with the
autovacuum worker process being canceled by the deadlock checking
code. The truncation work done by autovacuum up that point was
lost, and the attempt tried again by a later autovacuum worker. The
attempts could continue indefinitely without making progress,
consuming resources and blocking other processes for up to
deadlock_timeout each time.
This patch has the autovacuum worker checking whether it is
blocking any other thread at 20ms intervals. If such a condition
develops, the autovacuum worker will persist the work it has done
so far, release its lock on the table, and sleep in 50ms intervals
for up to 5 seconds, hoping to be able to re-acquire the lock and
try again. If it is unable to get the lock in that time, it moves
on and a worker will try to continue later from the point this one
left off.
While this patch doesn't change the rules about when and what to
truncate, it does cause the truncation to occur sooner, with less
blocking, and with the consumption of fewer resources when there is
contention for the table's lock.
The only user-visible change other than improved performance is
that the table size during truncation may change incrementally
instead of just once.
Backpatched to 9.0 from initial master commit at
b19e4250b4 -- before that the
differences are too large to be clearly safe.
Jan Wieck
Use of SnapshotNow is known to expose us to race conditions if the tuple(s)
being sought could be updated by concurrently-committing transactions.
CREATE DATABASE and DROP DATABASE are particularly exposed because they do
heavyweight filesystem operations during their scans of pg_tablespace,
so that the scans run for a very long time compared to most. Furthermore,
the potential consequences of a missed or twice-visited row are nastier
than average:
* createdb() could fail with a bogus "file already exists" error, or
silently fail to copy one or more tablespace's worth of files into the
new database.
* remove_dbtablespaces() could miss one or more tablespaces, thus failing
to free filesystem space for the dropped database.
* check_db_file_conflict() could likewise miss a tablespace, leading to an
OID conflict that could result in data loss either immediately or in
future operations. (This seems of very low probability, though, since a
duplicate database OID would be unlikely to start with.)
Hence, it seems worth fixing these three places to use MVCC snapshots, even
though this will someday be superseded by a generic solution to SnapshotNow
race conditions.
Back-patch to all active branches.
Stephen Frost and Tom Lane
If pg_extension_config_dump() is executed again for a table already listed
in the extension's extconfig, the code was blindly making a new array entry.
This does not seem useful. Fix it to replace the existing array entry
instead, so that it's possible for extension update scripts to alter the
filter conditions for configuration tables.
In addition, teach ALTER EXTENSION DROP TABLE to check for an extconfig
entry for the target table, and remove it if present. This is not a 100%
solution because it's allowed for an extension update script to just
summarily DROP a member table, and that code path doesn't go through
ExecAlterExtensionContentsStmt. We could probably make that case clean
things up if we had to, but it would involve sticking a very ugly wart
somewhere in the guts of dependency.c. Since on the whole it seems quite
unlikely that extension updates would want to remove pre-existing
configuration tables, making the case possible with an explicit command
seems sufficient.
Per bug #7756 from Regina Obe. Back-patch to 9.1 where extensions were
introduced.
During crash recovery, we remove disk files belonging to temporary tables,
but the system catalog entries for such tables are intentionally not
cleaned up right away. Instead, the first backend that uses a temp schema
is expected to clean out any leftover objects therein. This approach
requires that we be careful to ignore leftover temp tables (since any
actual access attempt would fail), *even if their BackendId matches our
session*, if we have not yet established use of the session's corresponding
temp schema. That worked fine in the past, but was broken by commit
debcec7dc3 which incorrectly removed the
rd_islocaltemp relcache flag. Put it back, and undo various changes
that substituted tests like "rel->rd_backend == MyBackendId" for use
of a state-aware flag. Per trouble report from Heikki Linnakangas.
Back-patch to 9.1 where the erroneous change was made. In the back
branches, be careful to add rd_islocaltemp in a spot in the struct that
was alignment padding before, so as not to break existing add-on code.
During VACUUM if we pause to perform a cycle
of index cleanup we drop the vmbuffer pin,
so we should do the same thing when heap
scan completes. This avoids holding vmbuffer
pin across the main index cleanup in VACUUM,
which could be minutes or hours longer than
necessary for correctness.
Bug report and suggested fix from Pavan Deolasee
If we had not been holding buffer pin continuously since the tuple was
initially fetched by the UPDATE or DELETE query, it would be possible for
VACUUM or a page-prune operation to move the tuple while we're trying to
copy it. This would result in a garbage "old" tuple value being passed to
an AFTER ROW UPDATE or AFTER ROW DELETE trigger. The preconditions for
this are somewhat improbable, and the timing constraints are very tight;
so it's not so surprising that this hasn't been reported from the field,
even though the bug has been there a long time.
Problem found by Andres Freund. Back-patch to all active branches.
Commit 8cb53654db, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, make the final state be indisvalid = true and indisready = false,
which is otherwise nonsensical. This is pretty ugly but we can't add
another column without forcing initdb, and it's too late for that in 9.2.
(There's a cleaner fix in HEAD.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
Portions of this need to be back-patched even further, but I'll work
on that separately.
Problem reported by Amit Kapila, diagnosis by Pavan Deolasee,
fix by Tom Lane and Andres Freund.
This reverts commit d573e239f0, "Take fewer
snapshots". While that seemed like a good idea at the time, it caused
execution to use a snapshot that had been acquired before locking any of
the tables mentioned in the query. This created user-visible anomalies
that were not present in any prior release of Postgres, as reported by
Tomas Vondra. While this whole area could do with a redesign (since there
are related cases that have anomalies anyway), it doesn't seem likely that
any future patch would be reasonably back-patchable; and we don't want 9.2
to exhibit a behavior that's subtly unlike either past or future releases.
Hence, revert to prior code while we rethink the problem.
The trigger and rule cases need to split up the input name list, but
they mustn't corrupt the passed-in data structure, since it could be part
of a cached utility-statement parsetree. Per bug #7641.
This case got broken in 8.4 by the addition of an error check that
complains if ALTER TABLE ONLY is used on a table that has children.
We do use ONLY for this situation, but it's okay because the necessary
recursion occurs at a higher level. So we need to have a separate
flag to suppress recursion without making the error check.
Reported and patched by Pavan Deolasee, with some editorial adjustments by
me. Back-patch to 8.4, since this is a regression of functionality that
worked in earlier branches.
In its original conception, it was leaving some objects into the old
schema, but without their proper pg_depend entries; this meant that the
old schema could be dropped, causing future pg_dump calls to fail on the
affected database. This was originally reported by Jeff Frost as #6704;
there have been other complaints elsewhere that can probably be traced
to this bug.
To fix, be more consistent about altering a table's subsidiary objects
along the table itself; this requires some restructuring in how tables
are relocated when altering an extension -- hence the new
AlterTableNamespaceInternal routine which encapsulates it for both the
ALTER TABLE and the ALTER EXTENSION cases.
There was another bug lurking here, which was unmasked after fixing the
previous one: certain objects would be reached twice via the dependency
graph, and the second attempt to move them would cause the entire
operation to fail. Per discussion, it seems the best fix for this is to
do more careful tracking of objects already moved: we now maintain a
list of moved objects, to avoid attempting to do it twice for the same
object.
Authors: Alvaro Herrera, Dimitri Fontaine
Reviewed by Tom Lane
The GUC check hooks for transaction_read_only and transaction_isolation
tried to check RecoveryInProgress(), so as to disallow setting read/write
mode or serializable isolation level (respectively) in hot standby
sessions. However, GUC check hooks can be called in many situations where
we're not connected to shared memory at all, resulting in a crash in
RecoveryInProgress(). Among other cases, this results in EXEC_BACKEND
builds crashing during child process start if default_transaction_isolation
is serializable, as reported by Heikki Linnakangas. Protect those calls
by silently allowing any setting when not inside a transaction; which is
okay anyway since these GUCs are always reset at start of transaction.
Also, add a check to GetSerializableTransactionSnapshot() to complain
if we are in hot standby. We need that check despite the one in
check_XactIsoLevel() because default_transaction_isolation could be
serializable. We don't want to complain any sooner than this in such
cases, since that would prevent running transactions at all in such a
state; but a transaction can be run, if SET TRANSACTION ISOLATION is done
before setting a snapshot. Per report some months ago from Robert Haas.
Back-patch to 9.1, since these problems were introduced by the SSI patch.
Kevin Grittner and Tom Lane, with ideas from Heikki Linnakangas
This situation creates a dependency loop that confuses pg_dump and probably
other things. Moreover, since the mental model is that the extension
"contains" schemas it owns, but "is contained in" its extschema (even
though neither is strictly true), having both true at once is confusing for
people too. So prevent the situation from being set up.
Reported and patched by Thom Brown. Back-patch to 9.1 where extensions
were added.
This command generated new pg_depend entries linking the index to the
constraint and the constraint to the table, which match the entries made
when a unique or primary key constraint is built de novo. However, it did
not bother to get rid of the entries linking the index directly to the
table. We had considered the issue when the ADD CONSTRAINT USING INDEX
patch was written, and concluded that we didn't need to get rid of the
extra entries. But this is wrong: ALTER COLUMN TYPE wasn't expecting such
redundant dependencies to exist, as reported by Hubert Depesz Lubaczewski.
On reflection it seems rather likely to break other things as well, since
there are many bits of code that crawl pg_depend for one purpose or
another, and most of them are pretty naive about what relationships they're
expecting to find. Fortunately it's not that hard to get rid of the extra
dependency entries, so let's do that.
Back-patch to 9.1, where ALTER TABLE ADD CONSTRAINT USING INDEX was added.
If a crash occurred immediately after the first nextval() call for a serial
column, WAL replay would restore the sequence to a state in which it
appeared that no nextval() had been done, thus allowing the first sequence
value to be returned again by the next nextval() call; as reported in
bug #6748 from Xiangming Mei.
More generally, the problem would occur if an ALTER SEQUENCE was executed
on a freshly created or reset sequence. (The manifestation with serial
columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step
to serial column creation.) The cause is that sequence creation attempted
to save one WAL entry by writing out a WAL record that made it appear that
the first nextval() had already happened (viz, with is_called = true),
while marking the sequence's in-database state with log_cnt = 1 to show
that the first nextval() need not emit a WAL record. However, ALTER
SEQUENCE would emit a new WAL entry reflecting the actual in-database state
(with is_called = false). Then, nextval would allocate the first sequence
value and set is_called = true, but it would trust the log_cnt value and
not emit any WAL record. A crash at this point would thus restore the
sequence to its post-ALTER state, causing the next nextval() call to return
the first sequence value again.
To fix, get rid of the idea of logging an is_called status different from
reality. This means that the first nextval-driven WAL record will happen
at the first nextval call not the second, but the marginal cost of that is
pretty negligible. In addition, make sure that ALTER SEQUENCE resets
log_cnt to zero in any case where it touches sequence parameters that
affect future nextval results. This will result in some user-visible
changes in the contents of a sequence's log_cnt column, as reflected in the
patch's regression test changes; but no application should be depending on
that anyway, since it was already true that log_cnt changes rather
unpredictably depending on checkpoint timing.
In addition, make some basically-cosmetic improvements to get rid of
sequence.c's undesirable intimacy with page layout details. It was always
really trying to WAL-log the contents of the sequence tuple, so we should
have it do that directly using a HeapTuple's t_data and t_len, rather than
backing into it with some magic assumptions about where the tuple would be
on the sequence's page.
Back-patch to all supported branches.
The initially implemented syntax, "CHECK NO INHERIT (expr)" was not
deemed very good, so switch to "CHECK (expr) NO INHERIT" instead. This
way it looks similar to SQL-standards compliant constraint attribute.
Backport to 9.2 where the new syntax and feature was introduced.
Per discussion.
The code was setting it true for other constraints, which is
bogus. Doing so caused bogus catalog entries for such constraints, and
in particular caused an error to be raised when trying to drop a
constraint of types other than CHECK from a table that has children,
such as reported in bug #6712.
In 9.2, additionally ignore connoinherit=true for other constraint
types, to avoid having to force initdb; existing databases might already
contain bogus catalog entries.
Includes a catversion bump (in HEAD only).
Bug report from Miroslav Šulc
Analysis from Amit Kapila and Noah Misch; Amit also contributed the patch.
Formerly, when trying to copy both indexes and comments, CREATE TABLE LIKE
had to pre-assign names to indexes that had comments, because it made up an
explicit CommentStmt command to apply the comment and so it had to know the
name for the index. This creates bad interactions with other indexes, as
shown in bug #6734 from Daniele Varrazzo: the preassignment logic couldn't
take any other indexes into account so it could choose a conflicting name.
To fix, add a field to IndexStmt that allows it to carry a comment to be
assigned to the new index. (This isn't a user-exposed feature of CREATE
INDEX, only an internal option.) Now we don't need preassignment of index
names in any situation.
I also took the opportunity to refactor DefineIndex to accept the IndexStmt
as such, rather than passing all its fields individually in a mile-long
parameter list.
Back-patch to 9.2, but no further, because it seems too dangerous to change
IndexStmt or DefineIndex's API in released branches. The bug exists back
to 9.0 where CREATE TABLE LIKE grew the ability to copy comments, but given
the lack of prior complaints we'll just let it go unfixed before 9.2.
Per bug #6593, REASSIGN OWNED fails when the affected role has created
an extension. Even though the user related to the extension is not
nominally the owner, its OID appears on pg_shdepend and thus causes
problems when the user is to be dropped.
This commit adds code to change the "ownership" of the extension itself,
not of the contained objects. This is fine because it's currently only
called from REASSIGN OWNED, which would also modify the ownership of the
contained objects. However, this is not sufficient for a working ALTER
OWNER implementation extension.
Back-patch to 9.1, where extensions were introduced.
Bug #6593 reported by Emiliano Leporati.
If a CHECK constraint or index definition contained a whole-row Var (that
is, "table.*"), an attempt to copy that definition via CREATE TABLE LIKE or
table inheritance produced incorrect results: the copied Var still claimed
to have the rowtype of the source table, rather than the created table.
For the LIKE case, it seems reasonable to just throw error for this
situation, since the point of LIKE is that the new table is not permanently
coupled to the old, so there's no reason to assume its rowtype will stay
compatible. In the inheritance case, we should ideally allow such
constraints, but doing so will require nontrivial refactoring of CREATE
TABLE processing (because we'd need to know the OID of the new table's
rowtype before we adjust inherited CHECK constraints). In view of the lack
of previous complaints, that doesn't seem worth the risk in a back-patched
bug fix, so just make it throw error for the inheritance case as well.
Along the way, replace change_varattnos_of_a_node() with a more robust
function map_variable_attnos(), which is capable of being extended to
handle insertion of ConvertRowtypeExpr whenever we get around to fixing
the inheritance case nicely, and in the meantime it returns a failure
indication to the caller so that a helpful message with some context can be
thrown. Also, this code will do the right thing with subselects (if we
ever allow them in CHECK or indexes), and it range-checks varattnos before
using them to index into the map array.
Per report from Sergey Konoplev. Back-patch to all supported branches.
The LISTEN/NOTIFY subsystem got confused if SimpleLruZeroPage failed,
which would typically happen as a result of a write() failure while
attempting to dump a dirty pg_notify page out of memory. Subsequently,
all attempts to send more NOTIFY messages would fail with messages like
"Could not read from file "pg_notify/nnnn" at offset nnnnn: Success".
Only restarting the server would clear this condition. Per reports from
Kevin Grittner and Christoph Berg.
Back-patch to 9.0, where the problem was introduced during the
LISTEN/NOTIFY rewrite.
Because permissions are assigned to element types, not array types,
complaining about permission denied on an array type would be
misleading to users. So adjust the reporting to refer to the element
type instead.
In order not to duplicate the required logic in two dozen places,
refactor the permission denied reporting for types a bit.
pointed out by Yeb Havinga during the review of the type privilege
feature
In lazy_scan_heap, we could issue bogus warnings about incorrect
information in the visibility map, because we checked the visibility
map bit before locking the heap page, creating a race condition. Fix
by rechecking the visibility map bit before we complain. Rejigger
some related logic so that we rely on the possibly-outdated
all_visible_according_to_vm value as little as possible.
In heap_multi_insert, it's not safe to clear the visibility map bit
before beginning the critical section. The visibility map is not
crash-safe unless we treat clearing the bit as a critical operation.
Specifically, if the transaction were to error out after we set the
bit and before entering the critical section, we could end up writing
the heap page to disk (with the bit cleared) and crashing before the
visibility map page made it to disk. That would be bad. heap_insert
has this correct, but somehow the order of operations got rearranged
when heap_multi_insert was added.
Also, add some more comments to visibilitymap_test, lazy_scan_heap,
and IndexOnlyNext, expounding on concurrency issues.
Per extensive code review by Andres Freund, and further review by Tom
Lane, who also made the original report about the bogus warnings.
We allow non-superusers to create procedural languages (with restrictions)
and range datatypes. Previously, the automatically-created support
functions for these objects ended up owned by the creating user. This
represents a rather considerable security hazard, because the owning user
might be able to alter a support function's definition in such a way as to
crash the server, inject trojan-horse SQL code, or even execute arbitrary
C code directly. It appears that right now the only actually exploitable
problem is the infinite-recursion bug fixed in the previous patch for
CVE-2012-2655. However, it's not hard to imagine that future additions of
more ALTER FUNCTION capability might unintentionally open up new hazards.
To forestall future problems, cause these support functions to be owned by
the bootstrap superuser, not the user creating the parent object.
Per recent discussion, the error message for this was actually a trifle
inaccurate, since it said "cannot be cast" which might be incorrect.
Adjust that wording, and add a HINT suggesting that a USING clause might
be needed.
When the "hot" members of PGPROC were split off to separate PGXACT structs,
many PGPROC fields referred to in comments were moved to PGXACT, but the
comments were neglected in the commit. Mostly this is just a search/replace
of PGPROC with PGXACT, but the way the dummy PGPROC entries are created for
prepared transactions changed more, making some of the comments totally
bogus.
Noah Misch
If the tablespace directory is missing entirely, we allow DROP TABLESPACE
to go through, on the grounds that it should be possible to clean up the
catalog entry in such a situation. However, we forgot that the pg_tblspc
symlink might still be there. We should try to remove the symlink too
(but not fail if it's no longer there), since not doing so can lead to
weird behavior subsequently, as per report from Michael Nolan.
There was some discussion of adding dependency links to prevent DROP
TABLESPACE when the catalogs still contain references to the tablespace.
That might be worth doing too, but it's an orthogonal question, and in
any case wouldn't be back-patchable.
Back-patch to 9.0, which is as far back as the logic looks like this.
We could possibly do something similar in 8.x, but given the lack of
reports I'm not sure it's worth the trouble, and anyway the case could
not arise in the form the logic is meant to cover (namely, a post-DROP
transaction rollback having resurrected the pg_tablespace entry after
some or all of the filesystem infrastructure is gone).
"Unexpected EOF on client connection" without an open transaction
is mostly noise, so turn it into DEBUG1. With an open transaction it's
still indicating a problem, so keep those as ERROR, and change the message
to indicate that it happened in a transaction.