If a tuple in a syscache contains an out-of-line toasted field, and we
try to fetch that field shortly after some other transaction has committed
an update or deletion of the tuple, there is a race condition: vacuum
could come along and remove the toast tuples before we can fetch them.
This leads to transient failures like "missing chunk number 0 for toast
value NNNNN in pg_toast_2619", as seen in recent reports from Andrew
Hammond and Tim Uckun.
The design idea of syscache is that access to stale syscache entries
should be prevented by relation-level locks, but that fails for at least
two cases where toasted fields are possible: ANALYZE updates pg_statistic
rows without locking out sessions that might want to plan queries on the
same table, and CREATE OR REPLACE FUNCTION updates pg_proc rows without
any meaningful lock at all.
The least risky fix seems to be an idea that Heikki suggested when we
were dealing with a related problem back in August: forcibly detoast any
out-of-line fields before putting a tuple into syscache in the first place.
This avoids the problem because at the time we fetch the parent tuple from
the catalog, we should be holding an MVCC snapshot that will prevent
removal of the toast tuples, even if the parent tuple is outdated
immediately after we fetch it. (Note: I'm not convinced that this
statement holds true at every instant where we could be fetching a syscache
entry at all, but it does appear to hold true at the times where we could
fetch an entry that could have a toasted field. We will need to be a bit
wary of adding toast tables to low-level catalogs that don't have them
already.) An additional benefit is that subsequent uses of the syscache
entry should be faster, since they won't have to detoast the field.
Back-patch to all supported versions. The problem is significantly harder
to reproduce in pre-9.0 releases, because of their willingness to flush
every entry in a syscache whenever the underlying catalog is vacuumed
(cf CatalogCacheFlushRelation); but there is still a window for trouble.
bgwriter is now a much less important process, responsible for page
cleaning duties only. checkpointer is now responsible for checkpoints
and so has a key role in shutdown. Later patches will correct doc
references to the now old idea that bgwriter performs checkpoints.
Has beneficial effect on performance at high write rates, but mainly
refactoring to more easily allow changes for power reduction by
simplifying previously tortuous code around required to allow page
cleaning and checkpointing to time slice in the same process.
Patch by me, Review by Dickson Guedes
There's no particular value in doing AssertMacro((tup) != NULL) in front
of code that's certain to crash anyway if tup is NULL. And if "tup" is
actually the address of a local variable, gcc 4.6 whinges about it. That's
arguably pretty broken on gcc's part, but we might as well remove the
useless test to silence the warnings. This gets rid of all the -Waddress
warnings in the backend; there are some in libpq and psql that are a bit
harder to avoid.
In general the data returned by an index-only scan should have the
datatypes originally computed by FormIndexDatum. If the index opclasses
use "storage" datatypes different from their input datatypes, the scan
tuple will not have the same rowtype attributed to the index; but we had
a hard-wired assumption that that was true in nodeIndexonlyscan.c. We'd
already hacked around the issue for the one case where the types are
different in btree indexes (btree name_ops), but this would definitely
come back to bite us if we ever implement index-only scans in GiST.
To fix, require the index AM to explicitly provide the tupdesc for the
tuple it is returning. btree can just pass back the index's tupdesc, but
GiST will have to work harder when and if it supports index-only scans.
I had previously proposed fixing this by allowing the index AM to fill the
scan tuple slot directly; but on reflection that seemed like a module
layering violation, since TupleTableSlots are creatures of the executor.
At least in the btree case, it would also be less efficient, since the
tuple deconstruction work would occur even for rows later found to be
invisible to the scan's snapshot.
Add a column pg_class.relallvisible to remember the number of pages that
were all-visible according to the visibility map as of the last VACUUM
(or ANALYZE, or some other operations that update pg_class.relpages).
Use relallvisible/relpages, instead of an arbitrary constant, to estimate
how many heap page fetches can be avoided during an index-only scan.
This is pretty primitive and will no doubt see refinements once we've
acquired more field experience with the index-only scan mechanism, but
it's way better than using a constant.
Note: I had to adjust an underspecified query in the window.sql regression
test, because it was changing answers when the plan changed to use an
index-only scan. Some of the adjacent tests perhaps should be adjusted
as well, but I didn't do that here.
We copy all the matched tuples off the page during _bt_readpage, instead of
expensively re-locking the page during each subsequent tuple fetch. This
costs a bit more local storage, but not more than 2*BLCKSZ worth, and the
reduction in LWLock traffic is certainly worth that. What's more, this
lets us get rid of the API wart in the original patch that said an index AM
could randomly decline to supply an index tuple despite having asserted
pg_am.amcanreturn. That will be important for future improvements in the
index-only-scan feature, since the executor will now be able to rely on
having the index data available.
When a btree index contains all columns required by the query, and the
visibility map shows that all tuples on a target heap page are
visible-to-all, we don't need to fetch that heap page. This patch depends
on the previous patches that made the visibility map reliable.
There's a fair amount left to do here, notably trying to figure out a less
chintzy way of estimating the cost of an index-only scan, but the core
functionality seems ready to commit.
Robert Haas and Ibrar Ahmed, with some previous work by Heikki Linnakangas.
Previously, the code assumed that the only possible action to take was
to delete files behind a certain cutoff point. The async notify code
was already a crock: it used a different "pagePrecedes" function for
truncation than for regular operation. By allowing it to pass a
callback to SlruScanDirectory it can do cleanly exactly what it needs to
do.
The clog.c code also had its own use for SlruScanDirectory, which is
made a bit simpler with this.
pg_trgm was already doing this unofficially, but the implementation hadn't
been thought through very well and leaked memory. Restructure the core
GiST code so that it actually works, and document it. Ordinarily this
would have required an extra memory context creation/destruction for each
GiST index search, but I was able to avoid that in the normal case of a
non-rescanned search by finessing the handling of the RBTree. It used to
have its own context always, but now shares a context with the
scan-lifespan data structures, unless there is more than one rescan call.
This should make the added overhead unnoticeable in typical cases.
As per my recent proposal, this refactors things so that these typedefs and
macros are available in a header that can be included in frontend-ish code.
I also changed various headers that were undesirably including
utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly,
this showed that half the system was getting utils/timestamp.h by way of
xlog.h.
No actual code changes here, just header refactoring.
When building a GiST index that doesn't fit in cache, buffers are attached
to some internal nodes in the index. This speeds up the build by avoiding
random I/O that would otherwise be needed to traverse all the way down the
tree to the find right leaf page for tuple.
Alexander Korotkov
walsender.h should depend on xlog.h, not vice versa. (Actually, the
inclusion was circular until a couple hours ago, which was even sillier;
but Bruce broke it in the expedient rather than logically correct
direction.) Because of that poor decision, plus blind application of
pgrminclude, we had a situation where half the system was depending on
xlog.h to include such unrelated stuff as array.h and guc.h. Clean up
the header inclusion, and manually revert a lot of what pgrminclude had
done so things build again.
This episode reinforces my feeling that pgrminclude should not be run
without adult supervision. Inclusion changes in header files in particular
need to be reviewed with great care. More generally, it'd be good if we
had a clearer notion of module layering to dictate which headers can sanely
include which others ... but that's a big task for another day.
Don't try to allocate the default value for a string relopt in the same
palloc chunk as the relopt_string struct. That didn't work too well if you
added a built-in string relopt in the stringRelOpts array, as it's not
possible to have an initializer for a variable length struct in C. This
makes the code slightly simpler too.
While we're at it, move the call to validator function in
add_string_reloption to before the allocation, so that if someone does pass
a bogus default value, we don't leak memory.
Standby servers can now have WALSender processes, which can work with
either WALReceiver or archive_commands to pass data. Fully updated
docs, including new conceptual terms of sending server, upstream and
downstream servers. WALSenders terminated when promote to master.
Fujii Masao, review, rework and doc rewrite by Simon Riggs
GISTInsertStack.childoffnum used to mean "offset of the downlink in this
node, pointing to the child node in the stack". It's now replaced with
downlinkoffnum, which means "offset of the downlink in the parent of this
node". gistFindPath() already used childoffnum with this new meaning, and
had an extra step at the end to pull all the childoffnum values down one
node in the stack, to adjust the stack for the meaning that childoffnum had
elsewhere. That's no longer required.
The reason to do this now is this new representation is more convenient for
the GiST fast build patch that Alexander Korotkov is working on.
While we're at it, replace the linked list used in gistFindPath with a
standard List, and make gistFindPath() static.
Alexander Korotkov, with some changes by me.
This means that they can initially be added to a large existing table
without checking its initial contents, but new tuples must comply to
them; a separate pass invoked by ALTER TABLE / VALIDATE can verify
existing data and ensure it complies with the constraint, at which point
it is marked validated and becomes a normal part of the table ecosystem.
An non-validated CHECK constraint is ignored in the planner for
constraint_exclusion purposes; when validated, cached plans are
recomputed so that partitioning starts working right away.
This patch also enables domains to have unvalidated CHECK constraints
attached to them as well by way of ALTER DOMAIN / ADD CONSTRAINT / NOT
VALID, which can later be validated with ALTER DOMAIN / VALIDATE
CONSTRAINT.
Thanks to Thom Brown, Dean Rasheed and Jaime Casanova for the various
reviews, and Robert Hass for documentation wording improvement
suggestions.
This patch was sponsored by Enova Financial.
more consistent that way, since all the other PredicateLock* calls are
made in various heapam.c and index AM functions. The call in nodeSeqscan.c
was unnecessarily aggressive anyway, there's no need to try to lock the
relation every time a tuple is fetched, it's enough to do it once.
This has the user-visible effect that if a seq scan is initialized in the
executor, but never executed, we now acquire the predicate lock on the heap
relation anyway. We could avoid that by taking the lock on the first
heap_getnext() call instead, but it doesn't seem worth the trouble given
that it feels more natural to do it in heap_beginscan().
Also, remove the retail PredicateLockTuple() calls from heap_getnext(). In
a seqscan, started with heap_begin(), we're holding a whole-relation
predicate lock on the heap so there's no need to lock the tuples
individually.
Kevin Grittner and me
XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes,
saving considerable space for the vast majority of transaction commits.
XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier.
Leonardo Francalanci and Simon Riggs
Since the names try_relation_openrv() and try_heap_openrv() don't seem
quite appropriate, rename the functions to relation_openrv_extended()
and heap_openrv_extended(). This is also more general, if we have a
future need for additional parameters that are of interest to only a
few callers.
This is infrastructure for a forthcoming patch to allow
get_object_address() to take a missing_ok argument as well.
Patch by me, review by Noah Misch.
It's been like this since HOT was originally introduced, but the logic
is complex enough that this is a recipe for bugs, as we've already
found out with SSI. So refactor heap_hot_search_buffer() so that it
can satisfy the needs of index_getnext(), and make index_getnext() use
that rather than duplicating the logic.
This change was originally proposed by Heikki Linnakangas as part of a
larger refactoring oriented towards allowing index-only scans. I
extracted and adjusted this part, since it seems to have independent
merit. Review by Jeff Davis.
This involves two main changes from the previous behavior. First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit. Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear. Making this work requires a bit of interface
refactoring.
In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur. Also, remove duplicate definitions of
InvalidXLogRecPtr.
Patch by me, review by Noah Misch.
Flexible array members are a C99 feature that avoids "cheating" in the
declaration of variable-length arrays at the end of structs. With
Autoconf support, this should be transparent for older compilers.
We start with one use in gist.h because gcc 4.6 started to raise a
warning there. Over time, it can be expanded to other places in the
source, but they will likely need some review of sizeof and offsetof
usage. The current change in gist.h appears to be safe in this
regard.
is added to the end, and existing resource managers keep their old ids.
We're not going to guarantee on-disk compatibility for 2PC state files over
major releases, but it seems better to avoid changing the ids them anyway.
It will help anyone who might want to write external tools to inspect the
state files to work with files from different versions, if nothing else.
Per complaint from Tom Lane.
ReadRecord's habit of using both direct references to tmpRecPtr and
references to *RecPtr (which is pointing at tmpRecPtr) triggers an
optimization bug in gcc 4.6.0, which apparently has forgotten about
aliasing rules. Avoid the compiler bug, and make the code more readable
to boot, by getting rid of the direct references. Improve the comments
while at it.
Back-patch to all supported versions, in case they get built with 4.6.0.
Tom Lane, with some cosmetic suggestions from Alex Hunsaker
Experimentation with contrib/btree_gist shows that the majority of the GIST
support functions potentially need collation information. Safest policy
seems to be to pass it to all of them, instead of making assumptions about
which ones could possibly need it.
Since collation is effectively an argument, not a property of the function,
FmgrInfo is really the wrong place for it; and this becomes critical in
cases where a cached FmgrInfo is used for varying purposes that might need
different collation settings. Fix by passing it in FunctionCallInfoData
instead. In particular this allows a clean fix for bug #5970 (record_cmp
not working). This requires touching a bit more code than the original
method, but nobody ever thought that collations would not be an invasive
patch...
Also avoid hardcoding the current default state by giving it the name
"on" and replace with a meaningful name that reflects its behaviour.
Coding only, no change in behaviour.
This means one less thing to configure when setting up synchronous
replication, and also avoids some ambiguity around what the behavior
should be when the settings of these variables conflict.
Fujii Masao, with additional hacking by me.
Standby optionally sends back information about oldestXmin of queries
which is then checked and applied to the WALSender's proc->xmin.
GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
so that all backends on primary include WALSender within their snapshots.
Note this does nothing to change the snapshot xmin on either master or
standby. Feedback piggybacks on the standby reply message.
vacuum_defer_cleanup_age is no longer used on standby, though parameter
still exists on primary, since some use cases still exist.
Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas
the standby has written, flushed, and applied the WAL. At the moment, this
is for informational purposes only, the values are only shown in
pg_stat_replication system view, but in the future they will also be needed
for synchronous replication.
Extracted from Simon riggs' synchronous replication patch by Robert Haas, with
some tweaking by me.
Specifying this option makes the server not wait for the
xlog to be archived, or emit a warning that it can't,
instead leaving the responsibility with the client.
This is useful when the log is being streamed using
the streaming protocol in parallel with the backup,
without having log archiving enabled.
This adds collation support for columns and domains, a COLLATE clause
to override it per expression, and B-tree index support.
Peter Eisentraut
reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch
new recovery.conf parameter recovery_target_name allows PITR to
specify named points as recovery targets.
Jaime Casanova, reviewed by Euler Taveira de Oliveira, plus minor edits
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
With this patch, pg_basebackup doesn't write a backup_label file in the
data directory, so it doesn't interfere with a pg_start/stop_backup() based
backup anymore. backup_label is still included in the backup, but it is
injected directly into the tar stream.
Heikki Linnakangas, reviewed by Fujii Masao and Magnus Hagander.
Move the actual functionality into a separate function that's
easier to call internally, and change the SQL-callable function
to be a wrapper calling this.
Also create a pg_abort_backup() function, only callable internally,
that does only the most vital parts of pg_stop_backup(), making it
safe(r) to call from error handlers.
The original coding could combine duplicate entries only when they
originated from the same qual condition. In particular it could not
combine cases where multiple qual conditions all give rise to full-index
scan requests, which is an expensive case well worth optimizing. Refactor
so that duplicates are recognized across all the quals.