Add counters for number and size of temporary files used
for spill-to-disk queries for each database to the
pg_stat_database view.
Tomas Vondra, review by Magnus Hagander
Base backup follows recommended procedure, plus goes to great
lengths to ensure that partial page writes are avoided.
Jun Ishizuka and Fujii Masao, with minor modifications
This reports the depth level of triggers currently in execution, or zero
if not called from inside a trigger.
No catversion bump in this patch, but you have to initdb if you want
access to the new function.
Author: Kevin Grittner
Replication occurs only to memory on standby, not to disk,
so provides additional performance if user wishes to
reduce durability level slightly. Adds concept of multiple
independent sync rep queues.
Fujii Masao and Simon Riggs
We log AccessExclusiveLocks for replay onto standby nodes,
but because of timing issues on ProcArray it is possible to
log a lock that is still held by a just committed transaction
that is very soon to be removed. To avoid any timing issue we
avoid applying locks made by transactions with InvalidXid.
Simon Riggs, bug report Tom Lane, diagnosis Pavan Deolasee
This separates the state (running/idle/idleintransaction etc) into
it's own field ("state"), and leaves the query field containing just
query text.
The query text will now mean "current query" when a query is running
and "last query" in other states. Accordingly,the field has been
renamed from current_query to query.
Since backwards compatibility was broken anyway to make that, the procpid
field has also been renamed to pid - along with the same field in
pg_stat_replication for consistency.
Scott Mead and Magnus Hagander, review work from Greg Smith
In the previous coding, it was possible for a relation to be created
via CREATE TABLE, CREATE VIEW, CREATE SEQUENCE, CREATE FOREIGN TABLE,
etc. in a schema while that schema was meanwhile being concurrently
dropped. This led to a pg_class entry with an invalid relnamespace
value. The same problem could occur if a relation was moved using
ALTER .. SET SCHEMA while the target schema was being concurrently
dropped. This patch prevents both of those scenarios by locking the
schema to which the relation is being added using AccessShareLock,
which conflicts with the AccessExclusiveLock taken by DROP.
As a desirable side effect, this also prevents the use of CREATE OR
REPLACE VIEW to queue for an AccessExclusiveLock on a relation on which
you have no rights: that will now fail immediately with a permissions
error, before trying to obtain a lock.
We need similar protection for all other object types, but as everything
other than relations uses a slightly different set of code paths, I'm
leaving that for a separate commit.
Original complaint (as far as I could find) about CREATE by Nikhil
Sontakke; risk for ALTER .. SET SCHEMA pointed out by Tom Lane;
further details by Dan Farina; patch by me; review by Hitoshi Harada.
In commit 7b0d0e9356, I made CLUSTER and
VACUUM FULL try to preserve toast value OIDs from the original toast table
to the new one. However, if we have to copy both live and recently-dead
versions of a row that has a toasted column, those versions may well
reference the same toast value with the same OID. The patch then led to
duplicate-key failures as we tried to insert the toast value twice with the
same OID. (The previous behavior was not very desirable either, since it
would have silently inserted the same value twice with different OIDs.
That wastes space, but what's worse is that the toast values inserted for
already-dead heap rows would not be reclaimed by subsequent ordinary
VACUUMs, since they go into the new toast table marked live not deleted.)
To fix, check if the copied OID already exists in the new toast table, and
if so, assume that it stores the desired value. This is reasonably safe
since the only case where we will copy an OID from a previous toast pointer
is when toast_insert_or_update was given that toast pointer and so we just
pulled the data from the old table; if we got two different values that way
then we have big problems anyway. We do have to assume that no other
backend is inserting items into the new toast table concurrently, but
that's surely safe for CLUSTER and VACUUM FULL.
Per bug #6393 from Maxim Boguk. Back-patch to 9.0, same as the previous
patch.
The original implementation of this interpreted it as a kind of
"inheritance" facility and named all the internal structures
accordingly. This turned out to be very confusing, because it has
nothing to do with the INHERITS feature. So rename all the internal
parser infrastructure, update the comments, adjust the error messages,
and split up the regression tests.
Historically we've used the SWPB instruction for TAS() on ARM, but this
is deprecated and not available on ARMv6 and later. Instead, make use
of a GCC builtin if available. We'll still fall back to SWPB if not,
so as not to break existing ports using older GCC versions.
Eventually we might want to try using __sync_lock_test_and_set() on some
other architectures too, but for now that seems to present only risk and
not reward.
Back-patch to all supported versions, since people might want to use any
of them on more recent ARM chips.
Martin Pitt
This squeezes out a bunch of alignment padding, reducing the size
from 72 to 56 bytes on my machine. At least in my testing, this
didn't produce any measurable performance improvement, but the space
savings seem like enough justification.
Andres Freund
ALTER TABLE (and ALTER VIEW, ALTER SEQUENCE, etc.) now use a
RangeVarGetRelid callback to check permissions before acquiring a table
lock. We also now use the same callback for all forms of ALTER TABLE,
rather than having separate, almost-identical callbacks for ALTER TABLE
.. SET SCHEMA and ALTER TABLE .. RENAME, and no callback at all for
everything else.
I went ahead and changed the code so that no form of ALTER TABLE works
on foreign tables; you must use ALTER FOREIGN TABLE instead. In 9.1,
it was possible to use ALTER TABLE .. SET SCHEMA or ALTER TABLE ..
RENAME on a foreign table, but not any other form of ALTER TABLE, which
did not seem terribly useful or consistent.
Patch by me; review by Noah Misch.
Previously, this was hardcoded: we always had 8. Performance testing
shows that isn't enough, especially on big SMP systems, so we allow it
to scale up as high as 32 when there's adequate memory. On the flip
side, when shared_buffers is very small, drop the number of CLOG buffers
down to as little as 4, so that we can start the postmaster even
when very little shared memory is available.
Per extensive discussion with Simon Riggs, Tom Lane, and others on
pgsql-hackers.
ALTER DOMAIN / DROP CONSTRAINT on a nonexistent constraint name did
not report any error. Now it reports an error. The IF EXISTS option
was added to get the usual behavior of ignoring nonexistent objects to
drop.
Further testing convinces me that this is helpful at sufficiently high
contention levels, though it's still worrisome that it loses slightly
at lower contention levels.
Per Manabu Ori.
This is allegedly a win, at least on some PPC implementations, according
to the PPC ISA documents. However, as with LWARX hints, some PPC
platforms give an illegal-instruction failure. Use the same trick as
before of assuming that PPC64 platforms will accept it; we might need to
refine that based on experience, but there are other projects doing
likewise according to google.
I did not add an assembler compatibility test because LWSYNC has been
around much longer than hint bits, and it seems unlikely that any
toolchains currently in use don't recognize it.
Previously we defined slock_t as 8 bytes on PPC64, but the TAS assembly
code uses word-wide operations regardless, so that the second word was
just wasted space. There doesn't appear to be any performance benefit
in adding the second word, so get rid of it to simplify the code.
The hint bit makes for a small but measurable performance improvement
in access to contended spinlocks.
On the other hand, some PPC chips give an illegal-instruction failure.
There doesn't seem to be a completely bulletproof way to tell whether the
hint bit will cause an illegal-instruction failure other than by trying
it; but most if not all 64-bit PPC machines should accept it, so follow
the Linux kernel's lead and assume it's okay to use it in 64-bit builds.
Of course we must also check whether the assembler accepts the command,
since even with a recent CPU the toolchain could be old.
Patch by Manabu Ori, significantly modified by me.
All supported platforms support the C89 standard function atexit()
(SunOS 4 probably being the last one not to), and supporting both
makes the code clumsy.
In commit e2c2c2e8b1 I made use of nested
list structures to show which clauses went with which index columns, but
on reflection that's a data structure that only an old-line Lisp hacker
could love. Worse, it adds unnecessary complication to the many places
that don't much care which clauses go with which index columns. Revert
to the previous arrangement of flat lists of clauses, and instead add a
parallel integer list of column numbers. The places that care about the
pairing can chase both lists with forboth(), while the places that don't
care just examine one list the same as before.
The only real downside to this is that there are now two more lists that
need to be passed to amcostestimate functions in case they care about
column matching (which btcostestimate does, so not passing the info is not
an option). Rather than deal with 11-argument amcostestimate functions,
pass just the IndexPath and expect the functions to extract fields from it.
That gets us down to 7 arguments which is better than 11, and it seems
more future-proof against likely additions to the information we keep
about an index path.
It's potentially useful for an index to repeat the same indexable column
or expression in multiple index columns, if the columns have different
opclasses. (If they share opclasses too, the duplicate column is pretty
useless, but nonetheless we've allowed such cases since 9.0.) However,
the planner failed to cope with this, because createplan.c was relying on
simple equal() matching to figure out which index column each index qual
is intended for. We do have that information available upstream in
indxpath.c, though, so the fix is to not flatten the multi-level indexquals
list when putting it into an IndexPath. Then we can rely on the sublist
structure to identify target index columns in createplan.c. There's a
similar issue for index ORDER BYs (the KNNGIST feature), so introduce a
multi-level-list representation for that too. This adds a bit more
representational overhead, but we might more or less buy that back by not
having to search for matching index columns anymore in createplan.c;
likewise btcostestimate saves some cycles.
Per bug #6351 from Christian Rudolph. Likely symptoms include the "btree
index keys must be ordered by attribute" failure shown there, as well as
"operator MMMM is not a member of opfamily NNNN".
Although this is a pre-existing problem that can be demonstrated in 9.0 and
9.1, I'm not going to back-patch it, because the API changes in the planner
seem likely to break things such as index plugins. The corner cases where
this matters seem too narrow to justify possibly breaking things in a minor
release.
When a view is marked as a security barrier, it will not be pulled up
into the containing query, and no quals will be pushed down into it,
so that no function or operator chosen by the user can be applied to
rows not exposed by the view. Views not configured with this
option cannot provide robust row-level security, but will perform far
better.
Patch by KaiGai Kohei; original problem report by Heikki Linnakangas
(in October 2009!). Review (in earlier versions) by Noah Misch and
others. Design advice by Tom Lane and myself. Further review and
cleanup by me.
You could already rename domains using ALTER TYPE, but with this new
command it is more consistent with how other commands treat domains as
a subcategory of types.
In the previous coding, a user could queue up for an AccessExclusiveLock
on a table they did not have permission to cluster, thus potentially
interfering with access by authorized users who got stuck waiting behind
the AccessExclusiveLock. This approach avoids that. cluster() has the
same permissions-checking requirements as REINDEX TABLE, so this commit
moves the now-shared callback to tablecmds.c and renames it, per
discussion with Noah Misch.
When a PORTAL_ONE_SELECT query is executed, we can opportunistically
reuse the parse/plan shot for the execution phase. This cuts down the
number of snapshots per simple query from 2 to 1 for the simple
protocol, and 3 to 2 for the extended protocol. Since we are only
reusing a snapshot taken early in the processing of the same protocol
message, the change shouldn't be user-visible, except that the remote
possibility of the planning and execution snapshots being different is
eliminated.
Note that this change does not make it safe to assume that the parse/plan
snapshot will certainly be reused; that will currently only happen if
PortalStart() decides to use the PORTAL_ONE_SELECT strategy. It might
be worth trying to provide some stronger guarantees here in the future,
but for now we don't.
Patch by me; review by Dimitri Fontaine.
This adds support for the more or less SQL-conforming USAGE privilege
on types and domains. The intent is to be able restrict which users
can create dependencies on types, which restricts the way in which
owners can alter types.
reviewed by Yeb Havinga
On reflection, the original name seems way too generic for a global
symbol. A quick check shows this is the only exported function name
in SP-GiST that doesn't begin with "spg" or contain "SpGist", so the
rest of them seem all right.
This makes them enforceable only on the parent table, not on children
tables. This is useful in various situations, per discussion involving
people bitten by the restrictive behavior introduced in 8.4.
Message-Id:
8762mp93iw.fsf@comcast.netCAFaPBrSMMpubkGf4zcRL_YL-AERUbYF_-ZNNYfb3CVwwEqc9TQ@mail.gmail.com
Authors: Nikhil Sontakke, Alex Hunsaker
Reviewed by Robert Haas and myself
Operator classes can specify whether or not they support this; this
preserves the flexibility to use lossy representations within an index.
In passing, move constant data about a given index into the rd_amcache
cache area, instead of doing fresh lookups each time we start an index
operation. This is mainly to try to make sure that spgcanreturn() has
insignificant cost; I still don't have any proof that it matters for
actual index accesses. Also, get rid of useless copying of FmgrInfo
pointers; we can perfectly well use the relcache's versions in-place.
The need for this was debated when we put in the index-only-scan feature,
but at the time we had no near-term expectation of having AMs that could
support such scans for only some indexes; so we kept it simple. However,
the SP-GiST AM forces the issue, so let's fix it.
This patch only installs the new API; no behavior actually changes.
These entries could never be matched to an index clause because they don't
have the index datatype on the left-hand side of the operator. (Their
commutators are in the opclass, which is sensible, but that doesn't mean
these operators should be.) Spotted by a test that I recently added to
opr_sanity to catch exactly this type of thinko. AFAICT there is no code
in gistproc.c that is specifically meant to cover these cases, so nothing
to remove at that level.
SP-GiST is comparable to GiST in flexibility, but supports non-balanced
partitioned search structures rather than balanced trees. As described at
PGCon 2011, this new indexing structure can beat GiST in both index build
time and query speed for search problems that it is well matched to.
There are a number of areas that could still use improvement, but at this
point the code seems committable.
Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane
Heikki Linnakangas had the idea of rearranging GetSnapshotData to
avoid checking for sub-XIDs when no top-level XID is present. This
patch does that plus further a bit of further, related rearrangement.
Benchmarking show a significant improvement on unlogged tables at
higher concurrency levels, and mostly indifferent result on permanent
tables (which are presumably bottlenecked elsewhere). Most of the
benefit seems to come from using the new NormalTransactionIdPrecedes()
macro rather than the function call TransactionIdPrecedes().
This works the same as include, except that an error is not thrown
if the file is missing. Instead the fact that it's missing is
logged.
Greg Smith, reviewed by Euler Taveira de Oliveira.
Previously, renaming a table, sequence, view, index, foreign table,
column, or trigger checked permissions before locking the object, which
meant that if permissions were revoked during the lock wait, we would
still allow the operation. Similarly, if the original object is dropped
and a new one with the same name is created, the operation will be allowed
if we had permissions on the old object; the permissions on the new
object don't matter. All this is now fixed.
Along the way, attempting to rename a trigger on a foreign table now gives
the same error message as trying to create one there in the first place
(i.e. that it's not a table or view) rather than simply stating that no
trigger by that name exists.
Patch by me; review by Noah Misch.
Removing this bit from xl_info allows us to restore the old limit of four
(not three) separate pages touched by a WAL record, which is needed for the
upcoming SP-GiST feature, and will likely be useful elsewhere in future.
When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that
because no special WAL-visible action was taken when starting a backup.
However, now we force a segment switch when starting a backup, so a
compressing WAL archiver (such as pglesslog) that uses the state shown in
the current page header will not be fooled as to removability of backup
blocks. The only downside is that the archiver will not return to
compressing mode for up to one WAL page after the backup is over, which is
a small price to pay for getting back the extra xl_info bit. In any case
the archiver could look for XLOG_BACKUP_END records if it thought it was
worth the trouble to do so.
Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.
I forgot to change the functions to use the PG_GETARG_INET_PP() macro,
when I changed DatumGetInetP() to unpack the datum, like Datum*P macros
usually do. Also, I screwed up the definition of the PG_GETARG_INET_PP()
macro, and didn't notice because it wasn't used.
This fixes the memory leak when sorting inet values, as reported
by Jochen Erwied and debugged by Andres Freund. Backpatch to 8.3, like
the previous patch that broke it.
Original patch by Lars Kanis, reviewed by Nishiyama Tomoaki and tweaked some by me.
This compiler, or at least the latest version of it, is currently broken, and
only passes the regression tests if built with -O0.
we don't reach consistency before replaying all of the WAL. Rename the
variable to reachedConsistency, to make its intention clearer.
In master, that was an active bug because of the recent patch to
immediately PANIC if a reference to a missing page is found in WAL after
reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0,
the only consequence was a misleading "consistent recovery state reached at
%X/%X" message in the log at the beginning of crash recovery (the database
is not consistent at that point yet). In 8.4, the log message was not
printed in crash recovery, even though there was a similar
reachedMinRecoveryPoint local variable that was also set early. So,
backpatch to 9.1 and 9.0.
lost. The only way we detect that at the moment is when write() fails when
we try to write to the socket.
Florian Pflug with small changes by me, reviewed by Greg Jaskiewicz.
Instead, add a function pg_tablespace_location(oid) used to return
the same information, and do this by reading the symbolic link.
Doing it this way makes it possible to relocate a tablespace when the
database is down by simply changing the symbolic link.
This patch creates an API whereby a btree index opclass can optionally
provide non-SQL-callable support functions for sorting. In the initial
patch, we only use this to provide a directly-callable comparator function,
which can be invoked with a bit less overhead than the traditional
SQL-callable comparator. While that should be of value in itself, the real
reason for doing this is to provide a datatype-extensible framework for
more aggressive optimizations, as in Peter Geoghegan's recent work.
Robert Haas and Tom Lane
invalid-page hash table, PANIC immediately. Immediate PANIC is much better
than waiting for end-of-recovery, which is what we did before, because the
end-of-recovery might not come until months later if this is a standby
server.
Also refrain from creating a restartpoint if there are invalid-page entries
in the hash table. Restarting recovery from such a restartpoint would not
see the invalid references, and wouldn't be able to cross-check them when
consistency is reached. That wouldn't matter when things are going smoothly,
but the more sanity checks you have the better.
Fujii Masao
In the previous coding, callers were faced with an awkward choice:
look up the name, do permissions checks, and then lock the table; or
look up the name, lock the table, and then do permissions checks.
The first choice was wrong because the results of the name lookup
and permissions checks might be out-of-date by the time the table
lock was acquired, while the second allowed a user with no privileges
to interfere with access to a table by users who do have privileges
(e.g. if a malicious backend queues up for an AccessExclusiveLock on
a table on which AccessShareLock is already held, further attempts
to access the table will be blocked until the AccessExclusiveLock
is obtained and the malicious backend's transaction rolls back).
To fix, allow callers of RangeVarGetRelid() to pass a callback which
gets executed after performing the name lookup but before acquiring
the relation lock. If the name lookup is retried (because
invalidation messages are received), the callback will be re-executed
as well, so we get the best of both worlds. RangeVarGetRelid() is
renamed to RangeVarGetRelidExtended(); callers not wishing to supply
a callback can continue to invoke it as RangeVarGetRelid(), which is
now a macro. Since the only one caller that uses nowait = true now
passes a callback anyway, the RangeVarGetRelid() macro defaults nowait
as well. The callback can also be used for supplemental locking - for
example, REINDEX INDEX needs to acquire the table lock before the index
lock to reduce deadlock possibilities.
There's a lot more work to be done here to fix all the cases where this
can be a problem, but this commit provides the general infrastructure
and fixes the following specific cases: REINDEX INDEX, REINDEX TABLE,
LOCK TABLE, and and DROP TABLE/INDEX/SEQUENCE/VIEW/FOREIGN TABLE.
Per discussion with Noah Misch and Alvaro Herrera.
The EvalPlanQual machinery assumes that whole-row Vars generated for the
outputs of non-table RTEs will be of composite types. However, for the
case where the RTE is a function call returning a scalar type, we were
doing the wrong thing, as a result of sharing code with a parser case
where the function's scalar output is wanted. (Or at least, that's what
that case has done historically; it does seem a bit inconsistent.)
To fix, extend makeWholeRowVar's API so that it can support both use-cases.
This fixes Belinda Cussen's report of crashes during concurrent execution
of UPDATEs involving joins to the result of UNNEST() --- in READ COMMITTED
mode, we'd run the EvalPlanQual machinery after a conflicting row update
commits, and it was expecting to get a HeapTuple not a scalar datum from
the "wholerowN" variable referencing the function RTE.
Back-patch to 9.0 where the current EvalPlanQual implementation appeared.
In 9.1 and up, this patch also fixes failure to attach the correct
collation to the Var generated for a scalar-result case. An example:
regression=# select upper(x.*) from textcat('ab', 'cd') x;
ERROR: could not determine which collation to use for upper() function
In the original implementation, a range-contained-by search had to scan
the entire index because an empty range could be lurking anywhere.
Improve that by adding a flag to upper GiST entries that says whether the
represented subtree contains any empty ranges.
Also, make a simple mod to the penalty function to discourage empty ranges
from getting pushed into subtrees without any. This needs more work, and
the picksplit function should be taught about it too, but that code can be
improved without causing an on-disk compatibility break; so we'll leave it
for another day.
Since we're breaking on-disk compatibility of range values anyway, I took
the opportunity to reorganize the range flags bits; the unused
RANGE_xB_NULL bits are now adjacent, which might open the door for using
them in some other way later.
In passing, remove the GiST range opclass entry for <>, which doesn't seem
like it can really be indexed usefully.
Alexander Korotkov, with some editorializing by Tom
This adds some I/O stats to the logging of autovacuum (when the
operation takes long enough that log_autovacuum_min_duration causes it
to be logged), so that it is easier to tune. Notably, it adds buffer
I/O counts (hits, misses, dirtied) and read and write rate.
Authors: Greg Smith and Noah Misch
This speeds up snapshot-taking and reduces ProcArrayLock contention.
Also, the PGPROC (and PGXACT) structures used by two-phase commit are
now allocated as part of the main array, rather than in a separate
array, and we keep ProcArray sorted in pointer order. These changes
are intended to minimize the number of cache lines that must be pulled
in to take a snapshot, and testing shows a substantial increase in
performance on both read and write workloads at high concurrencies.
Pavan Deolasee, Heikki Linnakangas, Robert Haas
The WITH [NO] DATA option was not supported, nor the ability to specify
replacement column names; the former limitation wasn't even documented, as
per recent complaint from Naoya Anzai. Fix by moving the responsibility
for supporting these options into the executor. It actually takes less
code this way ...
catversion bump due to change in representation of IntoClause, which might
affect stored rules.
It's not clear that a per-datatype typanalyze function would be any more
useful than a generic typanalyze for ranges. What *is* clear is that
letting unprivileged users select typanalyze functions is a crash risk or
worse. So remove the option from CREATE TYPE AS RANGE, and instead put in
a generic typanalyze function for ranges. The generic function does
nothing as yet, but hopefully we'll improve that before 9.2 release.
Per discussion, the zero-argument forms aren't really worth the catalog
space (just write 'empty' instead). The one-argument forms have some use,
but they also have a serious problem with looking too much like functional
cast notation; to the point where in many real use-cases, the parser would
misinterpret what was wanted.
Committing this as a separate patch, with the thought that we might want
to revert part or all of it if we can think of some way around the cast
ambiguity.
Implement these tests directly instead of constructing a singleton range
and then applying range-contains. This saves a range serialize/deserialize
cycle as well as a couple of redundant bound-comparison steps, and adds
very little code on net.
Remove elem_contained_by_range from the GiST opclass: it doesn't belong
there because there is no way to use it in an index clause (where the
indexed column would have to be on the left). Its commutator is in the
opclass, and that's what counts.
Per discussion, relax the range input/construction rules so that the
only hard error is lower bound > upper bound. Cases where the lower
bound is <= upper bound, but the range nonetheless normalizes to empty,
are now permitted.
Fix core dump in range_adjacent when bounds are infinite. Marginal
cleanup of regression test cases, some more code commenting.
Fix up some infelicitous coding in DefineRange, and add some missing error
checks. Rearrange operator strategy number assignments for GiST anyrange
opclass so that they don't make such a mess of opr_sanity's table of
operator names associated with different strategy numbers. Assign
hopefully-temporary selectivity estimators to range operators that didn't
have one --- poor as the estimates are, they're still a lot better than the
default 0.5 estimate, and they'll shut up the opr_sanity test that wants to
see selectivity estimators on all built-in operators.
This gets rid of an impressive amount of duplicative code, with only
minimal behavior changes. DROP FOREIGN DATA WRAPPER now requires object
ownership rather than superuser privileges, matching the documentation
we already have. We also eliminate the historical warning about dropping
a built-in function as unuseful. All operations are now performed in the
same order for all object types handled by dropcmds.c.
KaiGai Kohei, with minor revisions by me
Use of anynonarray was a crude hack to get around ambiguity versus the
array inclusion operators of the same names. My previous patch to extend
the parser's type resolution heuristics makes that unnecessary, so use
the more general declaration instead. This eliminates a wart that these
operators couldn't be used with ranges over arrays, which are otherwise
supported just fine.
Also, mark range_before and range_after as commutator operators,
per discussion with Jeff Davis.
A very long time ago, language names were specified as literals rather
than identifiers, so this code was added to do case-folding. But that
style has ben deprecated for many years so this isn't needed any more.
Language names will still be downcased when specified as unquoted
identifiers, but quoted identifiers or the old style using string
literals will be left as-is.
Fix assorted infelicities, such as dependency on OIDs that aren't
hardwired, as well as outright misdeclaration of daterange_canonical(),
which resulted in crashes if you invoked it directly. Add some more
regression tests to try to catch similar mistakes in future.
Move the responsibility for caching specialized information about range
types into the type cache, so that the catalog lookups only have to occur
once per session. Rearrange APIs a bit so that fn_extra caching is
actually effective in the GiST support code. (Use of OidFunctionCallN is
bad enough for performance in itself, but it also prevents the function
from exploiting fn_extra caching.)
The range I/O functions are still not very bright about caching repeated
lookups, but that seems like material for a separate patch.
Also, avoid unnecessary use of memcpy to fetch/store the range type OID and
flags, and don't use the full range_deserialize machinery when all we need
to see is the flags value.
Also fix API error in range_gist_penalty --- it was failing to set *penalty
for any case involving an empty range.
A range type whose element type has 'd' alignment must have 'd' alignment
itself, else there is no guarantee that the element value can be used
in-place. (Because range_deserialize uses att_align_pointer which forcibly
aligns the given pointer, violations of this rule did not lead to SIGBUS
but rather to garbage data being extracted, as in one of the added
regression test cases.)
Also, you can't put a toast pointer inside a range datum, since the
referenced value could disappear with the range datum still present.
For consistency with the handling of arrays and records, I also forced
decompression of in-line-compressed bound values. It would work to store
them as-is, but our policy is to avoid situations that might result in
double compression.
Add assorted regression tests for this, and bump catversion because of
fixes to built-in pg_type entries.
Also some marginal cleanup of inconsistent/unnecessary error checks.
No functional changes in this commit (except I could not resist the
temptation to re-word a couple of error messages). This is just manual
cleanup after pgindent to make the code look reasonably like other PG
code, in preparation for more detailed code review to come.
Previously we waited for wal_writer_delay before flushing WAL. Now
we also wake WALWriter as soon as a WAL buffer page has filled.
Significant effect observed on performance of asynchronous commits
by Robert Haas, attributed to the ability to set hint bits on tuples
earlier and so reducing contention caused by clog lookups.
This reverts commit 0180bd6180.
contrib/userlock is gone, but user-level locking still exists,
and is exposed via the pg_advisory* family of functions.
This greatly reduces the WAL volume, especially when the table is narrow.
The overhead of locking the heap page is also reduced. Reduced WAL traffic
also makes it scale a lot better, if you run multiple COPY processes at
the same time.
Add PlaceHolderVar wrappers as needed to make UNION ALL sub-select output
expressions appear non-constant and distinct from each other. This makes
the world safe for add_child_rel_equivalences to do what it does. Before,
it was possible for that function to add identical expressions to different
EquivalenceClasses, which logically should imply merging such ECs, which
would be wrong; or to improperly add a constant to an EquivalenceClass,
drastically changing its behavior. Per report from Teodor Sigaev.
The only currently known consequence of this bug is "MergeAppend child's
targetlist doesn't match MergeAppend" planner failures in 9.1 and later.
I am suspicious that there may be other failure modes that could affect
older release branches; but in the absence of any hard evidence, I'll
refrain from back-patching further than 9.1.
a new macro, DatumGetInetPP(), that does not. This brings these macros
in line with other DatumGet*P() macros.
Backpatch to 8.3, where 1-byte header varlenas were introduced.
In a regular VACUUM, it's OK to skip pages for which a cleanup lock
isn't immediately available; the next VACUUM will deal with them. If
we're scanning the entire relation to advance relfrozenxid, we might
need to wait, but only if there are tuples on the page that actually
require freezing. These changes should greatly reduce the incidence
of of vacuum processes getting "stuck".
Simon Riggs and Robert Haas
It was inadvertently changed to 201111111, which is a wrong date. Change it
to current date, and remove the comment that was supposed to remind me to
fix it before committing.
If we use a PlaceHolderVar from the outer relation in an inner indexscan,
we need to reference the PlaceHolderVar as such as the value to be passed
in from the outer relation. The previous code effectively tried to
reconstruct the PHV from its component expression, which doesn't work since
(a) the Vars therein aren't necessarily bubbled up far enough, and (b) it
would be the wrong semantics anyway because of the possibility that the PHV
is supposed to have gone to null at some point before the current join.
Point (a) led to "variable not found in subplan target list" planner
errors, but point (b) would have led to silently wrong answers.
Per report from Roger Niederland.
There was a timing window between when oldestActiveXid was derived
and when it should have been derived that only shows itself under
heavy load. Move code around to ensure correct timing of derivation.
No change to StartupSUBTRANS() code, which is where this failed.
Bug report by Chris Redekop
If a tuple in a syscache contains an out-of-line toasted field, and we
try to fetch that field shortly after some other transaction has committed
an update or deletion of the tuple, there is a race condition: vacuum
could come along and remove the toast tuples before we can fetch them.
This leads to transient failures like "missing chunk number 0 for toast
value NNNNN in pg_toast_2619", as seen in recent reports from Andrew
Hammond and Tim Uckun.
The design idea of syscache is that access to stale syscache entries
should be prevented by relation-level locks, but that fails for at least
two cases where toasted fields are possible: ANALYZE updates pg_statistic
rows without locking out sessions that might want to plan queries on the
same table, and CREATE OR REPLACE FUNCTION updates pg_proc rows without
any meaningful lock at all.
The least risky fix seems to be an idea that Heikki suggested when we
were dealing with a related problem back in August: forcibly detoast any
out-of-line fields before putting a tuple into syscache in the first place.
This avoids the problem because at the time we fetch the parent tuple from
the catalog, we should be holding an MVCC snapshot that will prevent
removal of the toast tuples, even if the parent tuple is outdated
immediately after we fetch it. (Note: I'm not convinced that this
statement holds true at every instant where we could be fetching a syscache
entry at all, but it does appear to hold true at the times where we could
fetch an entry that could have a toasted field. We will need to be a bit
wary of adding toast tables to low-level catalogs that don't have them
already.) An additional benefit is that subsequent uses of the syscache
entry should be faster, since they won't have to detoast the field.
Back-patch to all supported versions. The problem is significantly harder
to reproduce in pre-9.0 releases, because of their willingness to flush
every entry in a syscache whenever the underlying catalog is vacuumed
(cf CatalogCacheFlushRelation); but there is still a window for trouble.
bgwriter is now a much less important process, responsible for page
cleaning duties only. checkpointer is now responsible for checkpoints
and so has a key role in shutdown. Later patches will correct doc
references to the now old idea that bgwriter performs checkpoints.
Has beneficial effect on performance at high write rates, but mainly
refactoring to more easily allow changes for power reduction by
simplifying previously tortuous code around required to allow page
cleaning and checkpointing to time slice in the same process.
Patch by me, Review by Dickson Guedes
This infrastructure doesn't in any way guarantee that the character
we produce will sort before the one we incremented; but it does at least
make it much more likely that we'll end up with something that is a valid
character, which improves our chances.
Kyotaro Horiguchi, with various adjustments by me.
We need not wait until the commit record is durably on disk, because
in the event of a crash the page we're updating with hint bits will
be gone anyway. Per off-list report from Heikki Linnakangas, this
can significantly degrade the performance of unlogged tables; I was
able to show a 2x speedup from this patch on a pgbench run with scale
factor 15. In practice, this will mostly help small, heavily updated
tables, because on larger tables you're unlikely to run into the same
row again before the commit record makes it out to disk.
If the right-hand side of a semijoin is unique, then we can treat it like a
normal join (or another way to say that is: we don't need to explicitly
unique-ify the data before doing it as a normal join). We were recognizing
such cases when the RHS was a sub-query with appropriate DISTINCT or GROUP
BY decoration, but there's another way: if the RHS is a plain relation with
unique indexes, we can check if any of the indexes prove the output is
unique. Most of the infrastructure for that was there already in the join
removal code, though I had to rearrange it a bit. Per reflection about a
recent example in pgsql-performance.
The uniqueness condition might fail to hold intra-transaction, and assuming
it does can give incorrect query results. Per report from Marti Raudsepp,
though this is not his proposed patch.
Back-patch to 9.0, where both these features were introduced. In the
released branches, add the new IndexOptInfo field to the end of the struct,
to try to minimize ABI breakage for third-party code that may be examining
that struct.
A transaction can export a snapshot with pg_export_snapshot(), and then
others can import it with SET TRANSACTION SNAPSHOT. The data does not
leave the server so there are not security issues. A snapshot can only
be imported while the exporting transaction is still running, and there
are some other restrictions.
I'm not totally convinced that we've covered all the bases for SSI (true
serializable) mode, but it works fine for lesser isolation modes.
Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified
by Tom Lane
Avoid possibly dumping core when pgstat_track_activity_query_size has a
less-than-default value; avoid uselessly searching for the query string
of a successfully-exited backend; don't bother putting out an ERRDETAIL if
we don't have a query to show; some other minor stylistic improvements.
To avoid minimize risk inside the postmaster, we subject this feature
to a number of significant limitations. We very much wish to avoid
doing any complex processing inside the postmaster, due to the
posssibility that the crashed backend has completely corrupted shared
memory. To that end, no encoding conversion is done; instead, we just
replace anything that doesn't look like an ASCII character with a
question mark. We limit the amount of data copied to 1024 characters,
and carefully sanity check the source of that data. While these
restrictions would doubtless be unacceptable in a general-purpose
logging facility, even this limited facility seems like an improvement
over the status quo ante.
Marti Raudsepp, reviewed by PDXPUG and myself
This gets rid of a significant amount of duplicative code.
KaiGai Kohei, reviewed in earlier versions by Dimitri Fontaine, with
further review and cleanup by me.
There's no particular value in doing AssertMacro((tup) != NULL) in front
of code that's certain to crash anyway if tup is NULL. And if "tup" is
actually the address of a local variable, gcc 4.6 whinges about it. That's
arguably pretty broken on gcc's part, but we might as well remove the
useless test to silence the warnings. This gets rid of all the -Waddress
warnings in the backend; there are some in libpq and psql that are a bit
harder to avoid.
In general the data returned by an index-only scan should have the
datatypes originally computed by FormIndexDatum. If the index opclasses
use "storage" datatypes different from their input datatypes, the scan
tuple will not have the same rowtype attributed to the index; but we had
a hard-wired assumption that that was true in nodeIndexonlyscan.c. We'd
already hacked around the issue for the one case where the types are
different in btree indexes (btree name_ops), but this would definitely
come back to bite us if we ever implement index-only scans in GiST.
To fix, require the index AM to explicitly provide the tupdesc for the
tuple it is returning. btree can just pass back the index's tupdesc, but
GiST will have to work harder when and if it supports index-only scans.
I had previously proposed fixing this by allowing the index AM to fill the
scan tuple slot directly; but on reflection that seemed like a module
layering violation, since TupleTableSlots are creatures of the executor.
At least in the btree case, it would also be less efficient, since the
tuple deconstruction work would occur even for rows later found to be
invisible to the scan's snapshot.
Add a column pg_class.relallvisible to remember the number of pages that
were all-visible according to the visibility map as of the last VACUUM
(or ANALYZE, or some other operations that update pg_class.relpages).
Use relallvisible/relpages, instead of an arbitrary constant, to estimate
how many heap page fetches can be avoided during an index-only scan.
This is pretty primitive and will no doubt see refinements once we've
acquired more field experience with the index-only scan mechanism, but
it's way better than using a constant.
Note: I had to adjust an underspecified query in the window.sql regression
test, because it was changing answers when the plan changed to use an
index-only scan. Some of the adjacent tests perhaps should be adjusted
as well, but I didn't do that here.
This commit changes index-only scans so that data is read directly from the
index tuple without first generating a faux heap tuple. The only immediate
benefit is that indexes on system columns (such as OID) can be used in
index-only scans, but this is necessary infrastructure if we are ever to
support index-only scans on expression indexes. The executor is now ready
for that, though the planner still needs substantial work to recognize
the possibility.
To do this, Vars in index-only plan nodes have to refer to index columns
not heap columns. I introduced a new special varno, INDEX_VAR, to mark
such Vars to avoid confusion. (In passing, this commit renames the two
existing special varnos to OUTER_VAR and INNER_VAR.) This allows
ruleutils.c to handle them with logic similar to what we use for subplan
reference Vars.
Since index-only scans are now fundamentally different from regular
indexscans so far as their expression subtrees are concerned, I also chose
to change them to have their own plan node type (and hence, their own
executor source file).
This was broken in commit 53dbc27c62, which
introduced unlogged tables. Fortunately, as debugging tools go, this one
is pretty cheap, which is probably why it took nine months for someone to
notice, but it's not intended to be enabled by default, so revert.
Noted by Fujii Masao.
We copy all the matched tuples off the page during _bt_readpage, instead of
expensively re-locking the page during each subsequent tuple fetch. This
costs a bit more local storage, but not more than 2*BLCKSZ worth, and the
reduction in LWLock traffic is certainly worth that. What's more, this
lets us get rid of the API wart in the original patch that said an index AM
could randomly decline to supply an index tuple despite having asserted
pg_am.amcanreturn. That will be important for future improvements in the
index-only-scan feature, since the executor will now be able to rely on
having the index data available.
When a btree index contains all columns required by the query, and the
visibility map shows that all tuples on a target heap page are
visible-to-all, we don't need to fetch that heap page. This patch depends
on the previous patches that made the visibility map reliable.
There's a fair amount left to do here, notably trying to figure out a less
chintzy way of estimating the cost of an index-only scan, but the core
functionality seems ready to commit.
Robert Haas and Ibrar Ahmed, with some previous work by Heikki Linnakangas.
CREATE EXTENSION needs to transiently set search_path, as well as
client_min_messages and log_min_messages. We were doing this by the
expedient of saving the current string value of each variable, doing a
SET LOCAL, and then doing another SET LOCAL with the previous value at
the end of the command. This is a bit expensive though, and it also fails
badly if there is anything funny about the existing search_path value,
as seen in a recent report from Roger Niederland. Fortunately, there's a
much better way, which is to piggyback on the GUC infrastructure previously
developed for functions with SET options. We just open a new GUC nesting
level, do our assignments with GUC_ACTION_SAVE, and then close the nesting
level when done. This automatically restores the prior settings without a
re-parsing pass, so (in principle anyway) there can't be an error. And
guc.c still takes care of cleanup in event of an error abort.
The CREATE EXTENSION code for this was modeled on some much older code in
ri_triggers.c, which I also changed to use the better method, even though
there wasn't really much risk of failure there. Also improve the comments
in guc.c to reflect this additional usage.
Arrange for any problems with pre-existing settings to be reported as
WARNING not ERROR, so that we don't undesirably abort the loading of the
incoming add-on module. The bad setting is just discarded, as though it
had never been applied at all. (This requires a change in the API of
set_config_option. After some thought I decided the most potentially
useful addition was to allow callers to just pass in a desired elevel.)
Arrange to restore the complete stacked state of the variable, rather than
cheesily reinstalling only the active value. This ensures that custom GUCs
will behave unsurprisingly even when the module loading operation occurs
within nested subtransactions that have changed the active value. Since a
module load could occur as a result of, eg, a PL function call, this is not
an unlikely scenario.
We used to just remember the GucSource, but saving GucContext too provides
a little more information --- notably, whether a SET was done by a
superuser or regular user. This allows us to rip out the fairly dodgy code
that define_custom_variable used to use to try to infer the context to
re-install a pre-existing setting with. In particular, it now works for
a superuser to SET a extension's SUSET custom variable before loading the
associated extension, because GUC can remember whether the SET was done as
a superuser or not. The plperl regression tests contain an example where
this is useful.
Previously, the code assumed that the only possible action to take was
to delete files behind a certain cutoff point. The async notify code
was already a crock: it used a different "pagePrecedes" function for
truncation than for regular operation. By allowing it to pass a
callback to SlruScanDirectory it can do cleanly exactly what it needs to
do.
The clog.c code also had its own use for SlruScanDirectory, which is
made a bit simpler with this.
This patch has two distinct purposes: to report multiple problems in
postgresql.conf rather than always bailing out after the first one,
and to change the policy for whether changes are applied when there are
unrelated errors in postgresql.conf.
Formerly the policy was to apply no changes if any errors could be
detected, but that had a significant consistency problem, because in some
cases specific values might be seen as valid by some processes but invalid
by others. This meant that the latter processes would fail to adopt
changes in other parameters even though the former processes had done so.
The new policy is that during SIGHUP, the file is rejected as a whole
if there are any errors in the "name = value" syntax, or if any lines
attempt to set nonexistent built-in parameters, or if any lines attempt
to set custom parameters whose prefix is not listed in (the new value of)
custom_variable_classes. These tests should always give the same results
in all processes, and provide what seems a reasonably robust defense
against loading values from badly corrupted config files. If these tests
pass, all processes will apply all settings that they individually see as
good, ignoring (but logging) any they don't.
In addition, the postmaster does not abandon reading a configuration file
after the first syntax error, but continues to read the file and report
syntax errors (up to a maximum of 100 syntax errors per file).
The postmaster will still refuse to start up if the configuration file
contains any errors at startup time, but these changes allow multiple
errors to be detected and reported before quitting.
Alexey Klyukin, reviewed by Andy Colson and av (Alexander ?)
with some additional hacking by Tom Lane
pg_trgm was already doing this unofficially, but the implementation hadn't
been thought through very well and leaked memory. Restructure the core
GiST code so that it actually works, and document it. Ordinarily this
would have required an extra memory context creation/destruction for each
GiST index search, but I was able to avoid that in the normal case of a
non-rescanned search by finessing the handling of the RBTree. It used to
have its own context always, but now shares a context with the
scan-lifespan data structures, unless there is more than one rescan call.
This should make the added overhead unnoticeable in typical cases.
In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do
GetTransactionSnapshot() between AbortTransaction and CleanupTransaction
failed, because GetTransactionSnapshot would recompute the transaction
snapshot (which is already wrong, given the isolation mode) and then
re-register it in the TopTransactionResourceOwner, leading to an Assert
because the TopTransactionResourceOwner should be empty of resources after
AbortTransaction. This is the root cause of bug #6218 from Yamamoto
Takashi. While changing plancache.c to avoid requesting a snapshot when
handling a ROLLBACK masks the problem, I think this is really a snapmgr.c
bug: it's lower-level than the resource manager mechanism and should not be
shutting itself down before we unwind resource manager resources. However,
just postponing the release of the transaction snapshot until cleanup time
didn't work because of the circular dependency with
TopTransactionResourceOwner. Fix by managing the internal reference to
that snapshot manually instead of depending on TopTransactionResourceOwner.
This saves a few cycles as well as making the module layering more
straightforward. predicate.c's dependencies on TopTransactionResourceOwner
go away too.
I think this is a longstanding bug, but there's no evidence that it's more
than a latent bug, so it doesn't seem worth any risk of back-patching.
The constraint exclusion feature checks for contradictions among scan
restriction clauses, as well as contradictions between those clauses and a
table's CHECK constraints. The first aspect of this testing can be useful
for non-table relations (such as subqueries or functions-in-FROM), but the
feature was coded with only the CHECK case in mind so we were applying it
only to plain-table RTEs. Move the relation_excluded_by_constraints call
so that it is applied to all RTEs not just plain tables. With the default
setting of constraint_exclusion this results in no extra work, but with
constraint_exclusion = ON we will detect optimizations that we missed
before (at the cost of more planner cycles than we expended before).
Per a gripe from Gunnlaugur Þór Briem. Experimentation with
his example also showed we were not being very bright about the case where
constraint exclusion is proven within a subquery within UNION ALL, so tweak
the code to allow set_append_rel_pathlist to recognize such cases.
This is not actually used anywhere yet, but it gets the basic
infrastructure in place. It is fairly likely that there are bugs, and
support for some important platforms may be missing, so we'll need to
refine this as we go along.
This provides information about the numbers of tuples that were visited
but not returned by table scans, as well as the numbers of join tuples
that were considered and discarded within a join plan node.
There is still some discussion going on about the best way to report counts
for outer-join situations, but I think most of what's in the patch would
not change if we revise that, so I'm going to go ahead and commit it as-is.
Documentation changes to follow (they weren't in the submitted patch
either).
Marko Tiikkaja, reviewed by Marc Cousin, somewhat revised by Tom
Rewrite plancache.c so that a "cached plan" (which is rather a misnomer
at this point) can support generation of custom, parameter-value-dependent
plans, and can make an intelligent choice between using custom plans and
the traditional generic-plan approach. The specific choice algorithm
implemented here can probably be improved in future, but this commit is
all about getting the mechanism in place, not the policy.
In addition, restructure the API to greatly reduce the amount of extraneous
data copying needed. The main compromise needed to make that possible was
to split the initial creation of a CachedPlanSource into two steps. It's
worth noting in particular that SPI_saveplan is now deprecated in favor of
SPI_keepplan, which accomplishes the same end result with zero data
copying, and no need to then spend even more cycles throwing away the
original SPIPlan. The risk of long-term memory leaks while manipulating
SPIPlans has also been greatly reduced. Most of this improvement is based
on use of the recently-added MemoryContextSetParent primitive.
This function will be useful for altering the lifespan of a context after
creation (for example, by creating it under a transient context and later
reparenting it to belong to a long-lived context). It costs almost no new
code, since we can refactor what was there. Per my proposal of yesterday.
This addresses only those cases that are easy to fix by adding or
moving a const qualifier or removing an unnecessary cast. There are
many more complicated cases remaining.
Add __attribute__ decorations for printf format checking to the places that
were missing them. Fix the resulting warnings. Add
-Wmissing-format-attribute to the standard set of warnings for GCC, so these
don't happen again.
The warning fixes here are relatively harmless. The one serious problem
discovered by this was already committed earlier in
cf15fb5cab.
We were doing some amazingly complicated things in order to avoid running
the very expensive identify_system_timezone() procedure during GUC
initialization. But there is an obvious fix for that, which is to do it
once during initdb and have initdb install the system-specific default into
postgresql.conf, as it already does for most other GUC variables that need
system-environment-dependent defaults. This means that the timezone (and
log_timezone) settings no longer have any magic behavior in the server.
Per discussion.
As per my recent proposal, this refactors things so that these typedefs and
macros are available in a header that can be included in frontend-ish code.
I also changed various headers that were undesirably including
utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly,
this showed that half the system was getting utils/timestamp.h by way of
xlog.h.
No actual code changes here, just header refactoring.
When building a GiST index that doesn't fit in cache, buffers are attached
to some internal nodes in the index. This speeds up the build by avoiding
random I/O that would otherwise be needed to traverse all the way down the
tree to the find right leaf page for tuple.
Alexander Korotkov
walsender.h should depend on xlog.h, not vice versa. (Actually, the
inclusion was circular until a couple hours ago, which was even sillier;
but Bruce broke it in the expedient rather than logically correct
direction.) Because of that poor decision, plus blind application of
pgrminclude, we had a situation where half the system was depending on
xlog.h to include such unrelated stuff as array.h and guc.h. Clean up
the header inclusion, and manually revert a lot of what pgrminclude had
done so things build again.
This episode reinforces my feeling that pgrminclude should not be run
without adult supervision. Inclusion changes in header files in particular
need to be reviewed with great care. More generally, it'd be good if we
had a clearer notion of module layering to dictate which headers can sanely
include which others ... but that's a big task for another day.
storage/proc.h should not include replication/syncrep.h, especially not
when the latter includes storage/proc.h; but in any case this was a pretty
poor thing from a modular layering standpoint.
Formerly, set_subquery_pathlist and other creators of plans for subqueries
saved only the rangetable and rowMarks lists from the lower-level
PlannerInfo. But there's no reason not to remember the whole PlannerInfo,
and indeed this turns out to simplify matters in a number of places.
The immediate reason for doing this was so that the subroot will still be
accessible when we're trying to extract column statistics out of an
already-planned subquery. But now that I've done it, it seems like a good
code-beautification effort in its own right.
I also chose to get rid of the transient subrtable and subrowmark fields in
SubqueryScan nodes, in favor of having setrefs.c look up the subquery's
RelOptInfo. That required changing all the APIs in setrefs.c to pass
PlannerInfo not PlannerGlobal, which was a large but quite mechanical
transformation.
One side-effect not foreseen at the beginning is that this finally broke
inheritance_planner's assumption that replanning the same subquery RTE N
times would necessarily give interchangeable results each time. That
assumption was always pretty risky, but now we really have to make a
separate RTE for each instance so that there's a place to carry the
separate subroots.
In the past, relhassubclass always remained true if a relation had ever had
child relations, even if the last subclass was long gone. While this had
only marginal performance implications in most cases, it was annoying, and
I'm now considering some planner changes that would raise the cost of a
false positive. It was previously impractical to fix this because of race
condition concerns. However, given the recent change that made tablecmds.c
take ShareExclusiveLock on relations that are gaining a child (commit
fbcf4b92aa), we can now allow ANALYZE to
clear the flag when it's no longer relevant. There is no additional
locking cost to do so, since ANALYZE takes ShareExclusiveLock anyway.
dots. I previously worked around this in initdb, mapping the known
problematic locale names to aliases that work, but Hiroshi Inoue pointed
out that that's not enough because even if you use one of the aliases, like
"Chinese_HKG", setlocale(LC_CTYPE, NULL) returns back the long form, ie.
"Chinese_Hong Kong S.A.R.". When we try to restore an old locale value by
passing that value back to setlocale(), it fails. Note that you are affected
by this bug also if you use one of those short-form names manually, so just
reverting the hack in initdb won't fix it.
To work around that, move the locale name mapping from initdb to a wrapper
around setlocale(), so that the mapping is invoked on every setlocale() call.
Also, add a few checks for failed setlocale() calls in the backend. These
calls shouldn't fail, and if they do there isn't much we can do about it,
but at least you'll get a warning.
Backpatch to 9.1, where the initdb hack was introduced. The Windows bug
affects older versions too if you set locale manually to one of the aliases,
but given the lack of complaints from the field, I'm hesitent to backpatch.
ifdef block. It has nothing to do with whether the replacement snprintf
function is used. It caused no live bug, because the replacement snprintf
function is always used on Win32, but it was nevertheless misplaced.
Per my testing, this works just as well with gcc as it does with HP's
compiler; and there is no reason to think that the effect doesn't occur
with icc, either.
Also, rewrite the header comment about enforcing sequencing around spinlock
operations, per Robert's gripe that it was misleading.
At least on this architecture, it's very important to spin on a
non-atomic instruction and only retry the atomic once it appears
that it will succeed. To fix this, split TAS() into two macros:
TAS(), for trying to grab the lock the first time, and TAS_SPIN(),
for spinning until we get it. TAS_SPIN() defaults to same as TAS(),
but we can override it when we know there's a better way.
It's likely that some of the other cases in s_lock.h require
similar treatment, but this is the only one we've got conclusive
evidence for at present.
Due to tuple-slot mismanagement, evaluation of WHEN conditions for AFTER
ROW UPDATE triggers could crash if there had been a BEFORE ROW trigger
fired for the same update. Fix by not trying to overload the use of
estate->es_trig_tuple_slot. Per report from Yoran Heling.
Back-patch to 9.0, when trigger WHEN conditions were introduced.
This requires adjusting the API for syscache callback functions: they now
get a hash value, not a TID, to identify the target tuple. Most of them
weren't paying any attention to that argument anyway, but plancache did
require a small amount of fixing.
Also, improve performance a trifle by avoiding sending duplicate inval
messages when a heap_update isn't changing the catcache lookup columns.
The previous code tried to synchronize by unlinking the init file twice,
but that doesn't actually work: it leaves a window wherein a third process
could read the already-stale init file but miss the SI messages that would
tell it the data is stale. The result would be bizarre failures in catalog
accesses, typically "could not read block 0 in file ..." later during
startup.
Instead, hold RelCacheInitLock across both the unlink and the sending of
the SI messages. This is more straightforward, and might even be a bit
faster since only one unlink call is needed.
This has been wrong since it was put in (in 2002!), so back-patch to all
supported releases.
The latch infrastructure is now capable of detecting all cases where the
walsender loop needs to wake up, so there is no reason to have an arbitrary
timeout.
Also, modify the walsender loop logic to follow the standard pattern of
ResetLatch, test for work to do, WaitLatch. The previous coding was both
hard to follow and buggy: it would sometimes busy-loop despite having
nothing available to do, eg between receipt of a signal and the next time
it was caught up with new WAL, and it also had interesting choices like
deciding to update to WALSNDSTATE_STREAMING on the strength of information
known to be obsolete.
In pursuit of this (and with the expectation that WaitLatch will be needed
in more places), convert the latch field that was already added to PGPROC
for sync rep into a generic latch that is activated for all PGPROC-owning
processes, and change many of the standard backend signal handlers to set
that latch when a signal happens. This will allow WaitLatch callers to be
wakened properly by these signals.
In passing, fix a whole bunch of signal handlers that had been hacked to do
things that might change errno, without adding the necessary save/restore
logic for errno. Also make some minor fixes in unix_latch.c, and clean
up bizarre and unsafe scheme for disowning the process's latch. Much of
this has to be back-patched into 9.1.
Peter Geoghegan, with additional work by Tom
streamed backup, throw an error and refuse to start up. The restore has not
finished correctly in that case and the data directory is possibly corrupt.
We already errored out in case of archive recovery, but could not during
crash recovery because we couldn't distinguish between the case that
pg_start_backup() was called and the database then crashed (must not error,
data is OK), and the case that we're restoring from a backup and not all
the needed WAL was replayed (data can be corrupt).
To distinguish those cases, add a line to backup_label to indicate
whether the backup was taken with pg_start/stop_backup(), or by streaming
(ie. pg_basebackup).
This requires re-initdb, because of a new field added to the control file.
Improve the documentation around weak-memory-ordering risks, and do a pass
of general editorialization on the comments in the latch code. Make the
Windows latch code more like the Unix latch code where feasible; in
particular provide the same Assert checks in both implementations.
Fix poorly-placed WaitLatch call in syncrep.c.
This patch resolves, for the moment, concerns around weak-memory-ordering
bugs in latch-related code: we have documented the restrictions and checked
that existing calls meet them. In 9.2 I hope that we will install suitable
memory barrier instructions in SetLatch/ResetLatch, so that their callers
don't need to be quite so careful.
Don't try to allocate the default value for a string relopt in the same
palloc chunk as the relopt_string struct. That didn't work too well if you
added a built-in string relopt in the stringRelOpts array, as it's not
possible to have an initializer for a variable length struct in C. This
makes the code slightly simpler too.
While we're at it, move the call to validator function in
add_string_reloption to before the allocation, so that if someone does pass
a bogus default value, we don't leak memory.
A PlaceHolderVar's expression might contain another, lower-level
PlaceHolderVar. If the outer PlaceHolderVar is used, the inner one
certainly will be also, and so we have to make sure that both of them get
into the placeholder_list with correct ph_may_need values during the
initial pre-scan of the query (before deconstruct_jointree starts).
We did this correctly for PlaceHolderVars appearing in the query quals,
but overlooked the issue for those appearing in the top-level targetlist;
with the result that nested placeholders referenced only in the targetlist
did not work correctly, as illustrated in bug #6154.
While at it, add some error checking to find_placeholder_info to ensure
that we don't try to create new placeholders after it's too late to do so;
they have to all be created before deconstruct_jointree starts.
Back-patch to 8.4 where the PlaceHolderVar mechanism was introduced.
This lie has been harmless until now, but has been exposed by the
change to include postgres.h before the python headers, which
in some versions include inttypes.h if HAVE_INTTYPES_H is set.
Somebody thought it'd be cute to invent a set of Node tag numbers that were
defined independently of, and indeed conflicting with, the main tag-number
list. While this accidentally failed to fail so far, it would certainly
lead to trouble as soon as anyone wanted to, say, apply copyObject to these
node types. Clang was already complaining about the use of makeNode on
these tags, and I think quite rightly so. Fix by pushing these node
definitions into the mainstream, including putting replnodes.h where it
belongs.
Instead of entering them on transaction startup, we materialize them
only when someone wants to wait, which will occur only during CREATE
INDEX CONCURRENTLY. In Hot Standby mode, the startup process must also
be able to probe for conflicting VXID locks, but the lock need never be
fully materialized, because the startup process does not use the normal
lock wait mechanism. Since most VXID locks never need to touch the
lock manager partition locks, this can significantly reduce blocking
contention on read-heavy workloads.
Patch by me. Review by Jeff Davis.
glibc renders random() thread-safe by wrapping a futex lock around it;
testing reveals that this limits the performance of pgbench on machines
with many CPU cores. Rather than switching to random_r(), which is
only available on GNU systems and crashes unless you use undocumented
alchemy to initialize the random state properly, switch to our built-in
implementation of erand48(), which is both thread-safe and concurrent.
Since the list of reasons not to use the operating system's erand48()
is getting rather long, rename ours to pg_erand48() (and similarly
for our implementations of lrand48() and srand48()) and just always
use those. We were already doing this on Cygwin anyway, and the
glibc implementation is not quite thread-safe, so pgbench wouldn't
be able to use that either.
Per discussion with Tom Lane.
This kluge was inserted in a spot apparently chosen at random: the lock
manager's state is not yet fully set up for the wait, and in particular
LockWaitCancel hasn't been armed by setting lockAwaited, so the ProcLock
will not get cleaned up if the ereport is thrown. This seems to not cause
any observable problem in trivial test cases, because LockReleaseAll will
silently clean up the debris; but I was able to cause failures with tests
involving subtransactions.
Fixes breakage induced by commit c85c941470.
Back-patch to all affected branches.
It was initialized in the wrong place and to the wrong value. With bad
luck this could result in incorrect query-cancellation failures in hot
standby sessions, should a HS backend be holding pin on buffer number 1
while trying to acquire a lock.
The original implementation simply did nothing when replacing an existing
object during CREATE EXTENSION. The folly of this was exposed by a report
from Marc Munro: if the existing object belongs to another extension, we
are left in an inconsistent state. We should insist that the object does
not belong to another extension, and then add it to the current extension
if not already a member.
This requires a new shared catalog, pg_shseclabel.
Along the way, fix the security_label regression tests so that they
don't monkey with the labels of any pre-existing objects. This is
unlikely to matter in practice, since only the label for the "dummy"
provider was being manipulated. But this way still seems cleaner.
KaiGai Kohei, with fairly extensive hacking by me.
libxml reports some errors (like invalid xmlns attributes) via the error
handler hook, but still returns a success indicator to the library caller.
This causes us to miss some errors that are important to report. Since the
"generic" error handler hook doesn't know whether the message it's getting
is for an error, warning, or notice, stop using that and instead start
using the "structured" error handler hook, which gets enough information
to be useful.
While at it, arrange to save and restore the error handler hook setting in
each libxml-using function, rather than assuming we can set and forget the
hook. This should improve the odds of working nicely with third-party
libraries that also use libxml.
In passing, volatile-ize some local variables that get modified within
PG_TRY blocks. I noticed this while testing with an older gcc version
than I'd previously tried to compile xml.c with.
Florian Pflug and Tom Lane, with extensive review/testing by Noah Misch
Standby servers can now have WALSender processes, which can work with
either WALReceiver or archive_commands to pass data. Fully updated
docs, including new conceptual terms of sending server, upstream and
downstream servers. WALSenders terminated when promote to master.
Fujii Masao, review, rework and doc rewrite by Simon Riggs
When an AccessShareLock, RowShareLock, or RowExclusiveLock is requested
on an unshared database relation, and we can verify that no conflicting
locks can possibly be present, record the lock in a per-backend queue,
stored within the PGPROC, rather than in the primary lock table. This
eliminates a great deal of contention on the lock manager LWLocks.
This patch also refactors the interface between GetLockStatusData() and
pg_lock_status() to be a bit more abstract, so that we don't rely so
heavily on the lock manager's internal representation details. The new
fast path lock structures don't have a LOCK or PROCLOCK structure to
return, so we mustn't depend on that for purposes of listing outstanding
locks.
Review by Jeff Davis.
We already have similar functions for many other object types, including
operator classes, so it seems like we should have this one, too.
Extracted from a larger patch by Josh Kupershmidt
This function supports untranslated detail messages, in the same way that
errmsg_internal supports untranslated primary messages. We've needed this
for some time IMO, but discussion of some cases in the SSI code provided
the impetus to actually add it.
Kevin Grittner, with minor adjustments by me
GISTInsertStack.childoffnum used to mean "offset of the downlink in this
node, pointing to the child node in the stack". It's now replaced with
downlinkoffnum, which means "offset of the downlink in the parent of this
node". gistFindPath() already used childoffnum with this new meaning, and
had an extra step at the end to pull all the childoffnum values down one
node in the stack, to adjust the stack for the meaning that childoffnum had
elsewhere. That's no longer required.
The reason to do this now is this new representation is more convenient for
the GiST fast build patch that Alexander Korotkov is working on.
While we're at it, replace the linked list used in gistFindPath with a
standard List, and make gistFindPath() static.
Alexander Korotkov, with some changes by me.
Regular aggregate functions in combination with, or within the arguments
of, window functions are OK per spec; they have the semantics that the
aggregate output rows are computed and then we run the window functions
over that row set. (Thus, this combination is not really useful unless
there's a GROUP BY so that more than one aggregate output row is possible.)
The case without GROUP BY could fail, as recently reported by Jeff Davis,
because sloppy construction of the Agg node's targetlist resulted in extra
references to possibly-ungrouped Vars appearing outside the aggregate
function calls themselves. See the added regression test case for an
example.
Fixing this requires modifying the API of flatten_tlist and its underlying
function pull_var_clause. I chose to make pull_var_clause's API for
aggregates identical to what it was already doing for placeholders, since
the useful behaviors turn out to be the same (error, report node as-is, or
recurse into it). I also tightened the error checking in this area a bit:
if it was ever valid to see an uplevel Var, Aggref, or PlaceHolderVar here,
that was a long time ago, so complain instead of ignoring them.
Backpatch into 9.1. The failure exists in 8.4 and 9.0 as well, but seeing
that it only occurs in a basically-useless corner case, it doesn't seem
worth the risks of changing a function API in a minor release. There might
be third-party code using pull_var_clause.
In the previous coding, we would look up a relation in RangeVarGetRelid,
lock the resulting OID, and then AcceptInvalidationMessages(). While
this was sufficient to ensure that we noticed any changes to the
relation definition before building the relcache entry, it didn't
handle the possibility that the name we looked up no longer referenced
the same OID. This was particularly problematic in the case where a
table had been dropped and recreated: we'd latch on to the entry for
the old relation and fail later on. Now, we acquire the relation lock
inside RangeVarGetRelid, and retry the name lookup if we notice that
invalidation messages have been processed meanwhile. Many operations
that would previously have failed with an error in the presence of
concurrent DDL will now succeed.
There is a good deal of work remaining to be done here: many callers
of RangeVarGetRelid still pass NoLock for one reason or another. In
addition, nothing in this patch guards against the possibility that
the meaning of an unqualified name might change due to the creation
of a relation in a schema earlier in the user's search path than the
one where it was previously found. Furthermore, there's nothing at
all here to guard against similar race conditions for non-relations.
For all that, it's a start.
Noah Misch and Robert Haas
We were using GetConfigOption to collect the old value of each setting,
overlooking the possibility that it didn't exist yet. This does happen
in the case of adding a new entry within a custom variable class, as
exhibited in bug #6097 from Maxim Boguk.
To fix, add a missing_ok parameter to GetConfigOption, but only in 9.1
and HEAD --- it seems possible that some third-party code is using that
function, so changing its API in a minor release would cause problems.
In 9.0, create a near-duplicate function instead.
detect postmaster death. Postmaster keeps the write-end of the pipe open,
so when it dies, children get EOF in the read-end. That can conveniently
be waited for in select(), which allows eliminating some of the polling
loops that check for postmaster death. This patch doesn't yet change all
the loops to use the new mechanism, expect a follow-on patch to do that.
This changes the interface to WaitLatch, so that it takes as argument a
bitmask of events that it waits for. Possible events are latch set, timeout,
postmaster death, and socket becoming readable or writeable.
The pipe method behaves slightly differently from the kill() method
previously used in PostmasterIsAlive() in the case that postmaster has died,
but its parent has not yet read its exit code with waitpid(). The pipe
returns EOF as soon as the process dies, but kill() continues to return
true until waitpid() has been called (IOW while the process is a zombie).
Because of that, change PostmasterIsAlive() to use the pipe too, otherwise
WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while
PostmasterIsAlive() would claim it's still alive. That could easily lead to
busy-waiting while postmaster is in zombie state.
Peter Geoghegan with further changes by me, reviewed by Fujii Masao and
Florian Pflug.
transactions might not match the order the work done in those transactions
become visible to others. The logic in SSI, however, assumed that it does.
Fix that by having two sequence numbers for each serializable transaction,
one taken before a transaction becomes visible to others, and one after it.
This is easier than trying to make the the transition totally atomic, which
would require holding ProcArrayLock and SerializableXactHashLock at the same
time. By using prepareSeqNo instead of commitSeqNo in a few places where
commit sequence numbers are compared, we can make those comparisons err on
the safe side when we don't know for sure which committed first.
Per analysis by Kevin Grittner and Dan Ports, but this approach to fix it
is different from the original patch.
Per discussion, this structure seems more understandable than what was
there before. Make config.sgml and postgresql.conf.sample agree.
In passing do a bit of editorial work on the variable descriptions.
get_op_btree_interpretation assumed this in order to save some duplication
of code, but it's not true in general anymore because we added <> support
to btree_gist. (We still assume it for btree opclasses, though.)
Also, essentially the same logic was baked into predtest.c. Get rid of
that duplication by generalizing get_op_btree_interpretation so that it
can be used by predtest.c.
Per bug report from Denis de Bernardy and investigation by Jeff Davis,
though I didn't use Jeff's patch exactly as-is.
Back-patch to 9.1; we do not support this usage before that.
\ir is short for "include relative"; when used from a script, the
supplied pathname will be interpreted relative to the input file,
rather than to the current working directory.
Gurjeet Singh, reviewed by Josh Kupershmidt, with substantial further
cleanup by me.
Unlike the relistemp field which it replaced, relpersistence must be
set correctly quite early during the table creation process, as we
rely on it quite early on for a number of purposes, including security
checks. Normally, this is set based on whether the user enters CREATE
TABLE, CREATE UNLOGGED TABLE, or CREATE TEMPORARY TABLE, but a
relation may also be made implicitly temporary by creating it in
pg_temp. This patch fixes the handling of that case, and also
disables creation of unlogged tables in temporary tablespace (such
table indeed skip WAL-logging, but we reject an explicit
specification) and creation of relations in the temporary schemas of
other sessions (which is not very sensible, and didn't work right
anyway).
Report by Amit Khandekar.
This is the proper fix for bug #6082 about
pg_stat_reset_shared(NULL) causing a crash, and it reverts
commit 79aa44536f on head.
The workaround of throwing an error from inside the function is
left on backbranches (including 9.1) since this change requires
a new initdb.
This means that they can initially be added to a large existing table
without checking its initial contents, but new tuples must comply to
them; a separate pass invoked by ALTER TABLE / VALIDATE can verify
existing data and ensure it complies with the constraint, at which point
it is marked validated and becomes a normal part of the table ecosystem.
An non-validated CHECK constraint is ignored in the planner for
constraint_exclusion purposes; when validated, cached plans are
recomputed so that partitioning starts working right away.
This patch also enables domains to have unvalidated CHECK constraints
attached to them as well by way of ALTER DOMAIN / ADD CONSTRAINT / NOT
VALID, which can later be validated with ALTER DOMAIN / VALIDATE
CONSTRAINT.
Thanks to Thom Brown, Dean Rasheed and Jaime Casanova for the various
reviews, and Robert Hass for documentation wording improvement
suggestions.
This patch was sponsored by Enova Financial.
more consistent that way, since all the other PredicateLock* calls are
made in various heapam.c and index AM functions. The call in nodeSeqscan.c
was unnecessarily aggressive anyway, there's no need to try to lock the
relation every time a tuple is fetched, it's enough to do it once.
This has the user-visible effect that if a seq scan is initialized in the
executor, but never executed, we now acquire the predicate lock on the heap
relation anyway. We could avoid that by taking the lock on the first
heap_getnext() call instead, but it doesn't seem worth the trouble given
that it feels more natural to do it in heap_beginscan().
Also, remove the retail PredicateLockTuple() calls from heap_getnext(). In
a seqscan, started with heap_begin(), we're holding a whole-relation
predicate lock on the heap so there's no need to lock the tuples
individually.
Kevin Grittner and me
XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes,
saving considerable space for the vast majority of transaction commits.
XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier.
Leonardo Francalanci and Simon Riggs