invalid-page hash table, PANIC immediately. Immediate PANIC is much better
than waiting for end-of-recovery, which is what we did before, because the
end-of-recovery might not come until months later if this is a standby
server.
Also refrain from creating a restartpoint if there are invalid-page entries
in the hash table. Restarting recovery from such a restartpoint would not
see the invalid references, and wouldn't be able to cross-check them when
consistency is reached. That wouldn't matter when things are going smoothly,
but the more sanity checks you have the better.
Fujii Masao
Previously we waited for wal_writer_delay before flushing WAL. Now
we also wake WALWriter as soon as a WAL buffer page has filled.
Significant effect observed on performance of asynchronous commits
by Robert Haas, attributed to the ability to set hint bits on tuples
earlier and so reducing contention caused by clog lookups.
This greatly reduces the WAL volume, especially when the table is narrow.
The overhead of locking the heap page is also reduced. Reduced WAL traffic
also makes it scale a lot better, if you run multiple COPY processes at
the same time.
In a regular VACUUM, it's OK to skip pages for which a cleanup lock
isn't immediately available; the next VACUUM will deal with them. If
we're scanning the entire relation to advance relfrozenxid, we might
need to wait, but only if there are tuples on the page that actually
require freezing. These changes should greatly reduce the incidence
of of vacuum processes getting "stuck".
Simon Riggs and Robert Haas
If a tuple in a syscache contains an out-of-line toasted field, and we
try to fetch that field shortly after some other transaction has committed
an update or deletion of the tuple, there is a race condition: vacuum
could come along and remove the toast tuples before we can fetch them.
This leads to transient failures like "missing chunk number 0 for toast
value NNNNN in pg_toast_2619", as seen in recent reports from Andrew
Hammond and Tim Uckun.
The design idea of syscache is that access to stale syscache entries
should be prevented by relation-level locks, but that fails for at least
two cases where toasted fields are possible: ANALYZE updates pg_statistic
rows without locking out sessions that might want to plan queries on the
same table, and CREATE OR REPLACE FUNCTION updates pg_proc rows without
any meaningful lock at all.
The least risky fix seems to be an idea that Heikki suggested when we
were dealing with a related problem back in August: forcibly detoast any
out-of-line fields before putting a tuple into syscache in the first place.
This avoids the problem because at the time we fetch the parent tuple from
the catalog, we should be holding an MVCC snapshot that will prevent
removal of the toast tuples, even if the parent tuple is outdated
immediately after we fetch it. (Note: I'm not convinced that this
statement holds true at every instant where we could be fetching a syscache
entry at all, but it does appear to hold true at the times where we could
fetch an entry that could have a toasted field. We will need to be a bit
wary of adding toast tables to low-level catalogs that don't have them
already.) An additional benefit is that subsequent uses of the syscache
entry should be faster, since they won't have to detoast the field.
Back-patch to all supported versions. The problem is significantly harder
to reproduce in pre-9.0 releases, because of their willingness to flush
every entry in a syscache whenever the underlying catalog is vacuumed
(cf CatalogCacheFlushRelation); but there is still a window for trouble.
bgwriter is now a much less important process, responsible for page
cleaning duties only. checkpointer is now responsible for checkpoints
and so has a key role in shutdown. Later patches will correct doc
references to the now old idea that bgwriter performs checkpoints.
Has beneficial effect on performance at high write rates, but mainly
refactoring to more easily allow changes for power reduction by
simplifying previously tortuous code around required to allow page
cleaning and checkpointing to time slice in the same process.
Patch by me, Review by Dickson Guedes
There's no particular value in doing AssertMacro((tup) != NULL) in front
of code that's certain to crash anyway if tup is NULL. And if "tup" is
actually the address of a local variable, gcc 4.6 whinges about it. That's
arguably pretty broken on gcc's part, but we might as well remove the
useless test to silence the warnings. This gets rid of all the -Waddress
warnings in the backend; there are some in libpq and psql that are a bit
harder to avoid.
In general the data returned by an index-only scan should have the
datatypes originally computed by FormIndexDatum. If the index opclasses
use "storage" datatypes different from their input datatypes, the scan
tuple will not have the same rowtype attributed to the index; but we had
a hard-wired assumption that that was true in nodeIndexonlyscan.c. We'd
already hacked around the issue for the one case where the types are
different in btree indexes (btree name_ops), but this would definitely
come back to bite us if we ever implement index-only scans in GiST.
To fix, require the index AM to explicitly provide the tupdesc for the
tuple it is returning. btree can just pass back the index's tupdesc, but
GiST will have to work harder when and if it supports index-only scans.
I had previously proposed fixing this by allowing the index AM to fill the
scan tuple slot directly; but on reflection that seemed like a module
layering violation, since TupleTableSlots are creatures of the executor.
At least in the btree case, it would also be less efficient, since the
tuple deconstruction work would occur even for rows later found to be
invisible to the scan's snapshot.
Add a column pg_class.relallvisible to remember the number of pages that
were all-visible according to the visibility map as of the last VACUUM
(or ANALYZE, or some other operations that update pg_class.relpages).
Use relallvisible/relpages, instead of an arbitrary constant, to estimate
how many heap page fetches can be avoided during an index-only scan.
This is pretty primitive and will no doubt see refinements once we've
acquired more field experience with the index-only scan mechanism, but
it's way better than using a constant.
Note: I had to adjust an underspecified query in the window.sql regression
test, because it was changing answers when the plan changed to use an
index-only scan. Some of the adjacent tests perhaps should be adjusted
as well, but I didn't do that here.
We copy all the matched tuples off the page during _bt_readpage, instead of
expensively re-locking the page during each subsequent tuple fetch. This
costs a bit more local storage, but not more than 2*BLCKSZ worth, and the
reduction in LWLock traffic is certainly worth that. What's more, this
lets us get rid of the API wart in the original patch that said an index AM
could randomly decline to supply an index tuple despite having asserted
pg_am.amcanreturn. That will be important for future improvements in the
index-only-scan feature, since the executor will now be able to rely on
having the index data available.
When a btree index contains all columns required by the query, and the
visibility map shows that all tuples on a target heap page are
visible-to-all, we don't need to fetch that heap page. This patch depends
on the previous patches that made the visibility map reliable.
There's a fair amount left to do here, notably trying to figure out a less
chintzy way of estimating the cost of an index-only scan, but the core
functionality seems ready to commit.
Robert Haas and Ibrar Ahmed, with some previous work by Heikki Linnakangas.
Previously, the code assumed that the only possible action to take was
to delete files behind a certain cutoff point. The async notify code
was already a crock: it used a different "pagePrecedes" function for
truncation than for regular operation. By allowing it to pass a
callback to SlruScanDirectory it can do cleanly exactly what it needs to
do.
The clog.c code also had its own use for SlruScanDirectory, which is
made a bit simpler with this.
pg_trgm was already doing this unofficially, but the implementation hadn't
been thought through very well and leaked memory. Restructure the core
GiST code so that it actually works, and document it. Ordinarily this
would have required an extra memory context creation/destruction for each
GiST index search, but I was able to avoid that in the normal case of a
non-rescanned search by finessing the handling of the RBTree. It used to
have its own context always, but now shares a context with the
scan-lifespan data structures, unless there is more than one rescan call.
This should make the added overhead unnoticeable in typical cases.
As per my recent proposal, this refactors things so that these typedefs and
macros are available in a header that can be included in frontend-ish code.
I also changed various headers that were undesirably including
utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly,
this showed that half the system was getting utils/timestamp.h by way of
xlog.h.
No actual code changes here, just header refactoring.
When building a GiST index that doesn't fit in cache, buffers are attached
to some internal nodes in the index. This speeds up the build by avoiding
random I/O that would otherwise be needed to traverse all the way down the
tree to the find right leaf page for tuple.
Alexander Korotkov
walsender.h should depend on xlog.h, not vice versa. (Actually, the
inclusion was circular until a couple hours ago, which was even sillier;
but Bruce broke it in the expedient rather than logically correct
direction.) Because of that poor decision, plus blind application of
pgrminclude, we had a situation where half the system was depending on
xlog.h to include such unrelated stuff as array.h and guc.h. Clean up
the header inclusion, and manually revert a lot of what pgrminclude had
done so things build again.
This episode reinforces my feeling that pgrminclude should not be run
without adult supervision. Inclusion changes in header files in particular
need to be reviewed with great care. More generally, it'd be good if we
had a clearer notion of module layering to dictate which headers can sanely
include which others ... but that's a big task for another day.
Don't try to allocate the default value for a string relopt in the same
palloc chunk as the relopt_string struct. That didn't work too well if you
added a built-in string relopt in the stringRelOpts array, as it's not
possible to have an initializer for a variable length struct in C. This
makes the code slightly simpler too.
While we're at it, move the call to validator function in
add_string_reloption to before the allocation, so that if someone does pass
a bogus default value, we don't leak memory.
Standby servers can now have WALSender processes, which can work with
either WALReceiver or archive_commands to pass data. Fully updated
docs, including new conceptual terms of sending server, upstream and
downstream servers. WALSenders terminated when promote to master.
Fujii Masao, review, rework and doc rewrite by Simon Riggs
GISTInsertStack.childoffnum used to mean "offset of the downlink in this
node, pointing to the child node in the stack". It's now replaced with
downlinkoffnum, which means "offset of the downlink in the parent of this
node". gistFindPath() already used childoffnum with this new meaning, and
had an extra step at the end to pull all the childoffnum values down one
node in the stack, to adjust the stack for the meaning that childoffnum had
elsewhere. That's no longer required.
The reason to do this now is this new representation is more convenient for
the GiST fast build patch that Alexander Korotkov is working on.
While we're at it, replace the linked list used in gistFindPath with a
standard List, and make gistFindPath() static.
Alexander Korotkov, with some changes by me.
This means that they can initially be added to a large existing table
without checking its initial contents, but new tuples must comply to
them; a separate pass invoked by ALTER TABLE / VALIDATE can verify
existing data and ensure it complies with the constraint, at which point
it is marked validated and becomes a normal part of the table ecosystem.
An non-validated CHECK constraint is ignored in the planner for
constraint_exclusion purposes; when validated, cached plans are
recomputed so that partitioning starts working right away.
This patch also enables domains to have unvalidated CHECK constraints
attached to them as well by way of ALTER DOMAIN / ADD CONSTRAINT / NOT
VALID, which can later be validated with ALTER DOMAIN / VALIDATE
CONSTRAINT.
Thanks to Thom Brown, Dean Rasheed and Jaime Casanova for the various
reviews, and Robert Hass for documentation wording improvement
suggestions.
This patch was sponsored by Enova Financial.
more consistent that way, since all the other PredicateLock* calls are
made in various heapam.c and index AM functions. The call in nodeSeqscan.c
was unnecessarily aggressive anyway, there's no need to try to lock the
relation every time a tuple is fetched, it's enough to do it once.
This has the user-visible effect that if a seq scan is initialized in the
executor, but never executed, we now acquire the predicate lock on the heap
relation anyway. We could avoid that by taking the lock on the first
heap_getnext() call instead, but it doesn't seem worth the trouble given
that it feels more natural to do it in heap_beginscan().
Also, remove the retail PredicateLockTuple() calls from heap_getnext(). In
a seqscan, started with heap_begin(), we're holding a whole-relation
predicate lock on the heap so there's no need to lock the tuples
individually.
Kevin Grittner and me
XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes,
saving considerable space for the vast majority of transaction commits.
XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier.
Leonardo Francalanci and Simon Riggs
Since the names try_relation_openrv() and try_heap_openrv() don't seem
quite appropriate, rename the functions to relation_openrv_extended()
and heap_openrv_extended(). This is also more general, if we have a
future need for additional parameters that are of interest to only a
few callers.
This is infrastructure for a forthcoming patch to allow
get_object_address() to take a missing_ok argument as well.
Patch by me, review by Noah Misch.
It's been like this since HOT was originally introduced, but the logic
is complex enough that this is a recipe for bugs, as we've already
found out with SSI. So refactor heap_hot_search_buffer() so that it
can satisfy the needs of index_getnext(), and make index_getnext() use
that rather than duplicating the logic.
This change was originally proposed by Heikki Linnakangas as part of a
larger refactoring oriented towards allowing index-only scans. I
extracted and adjusted this part, since it seems to have independent
merit. Review by Jeff Davis.
This involves two main changes from the previous behavior. First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit. Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear. Making this work requires a bit of interface
refactoring.
In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur. Also, remove duplicate definitions of
InvalidXLogRecPtr.
Patch by me, review by Noah Misch.
Flexible array members are a C99 feature that avoids "cheating" in the
declaration of variable-length arrays at the end of structs. With
Autoconf support, this should be transparent for older compilers.
We start with one use in gist.h because gcc 4.6 started to raise a
warning there. Over time, it can be expanded to other places in the
source, but they will likely need some review of sizeof and offsetof
usage. The current change in gist.h appears to be safe in this
regard.
is added to the end, and existing resource managers keep their old ids.
We're not going to guarantee on-disk compatibility for 2PC state files over
major releases, but it seems better to avoid changing the ids them anyway.
It will help anyone who might want to write external tools to inspect the
state files to work with files from different versions, if nothing else.
Per complaint from Tom Lane.
ReadRecord's habit of using both direct references to tmpRecPtr and
references to *RecPtr (which is pointing at tmpRecPtr) triggers an
optimization bug in gcc 4.6.0, which apparently has forgotten about
aliasing rules. Avoid the compiler bug, and make the code more readable
to boot, by getting rid of the direct references. Improve the comments
while at it.
Back-patch to all supported versions, in case they get built with 4.6.0.
Tom Lane, with some cosmetic suggestions from Alex Hunsaker
Experimentation with contrib/btree_gist shows that the majority of the GIST
support functions potentially need collation information. Safest policy
seems to be to pass it to all of them, instead of making assumptions about
which ones could possibly need it.
Since collation is effectively an argument, not a property of the function,
FmgrInfo is really the wrong place for it; and this becomes critical in
cases where a cached FmgrInfo is used for varying purposes that might need
different collation settings. Fix by passing it in FunctionCallInfoData
instead. In particular this allows a clean fix for bug #5970 (record_cmp
not working). This requires touching a bit more code than the original
method, but nobody ever thought that collations would not be an invasive
patch...
Also avoid hardcoding the current default state by giving it the name
"on" and replace with a meaningful name that reflects its behaviour.
Coding only, no change in behaviour.
This means one less thing to configure when setting up synchronous
replication, and also avoids some ambiguity around what the behavior
should be when the settings of these variables conflict.
Fujii Masao, with additional hacking by me.
Standby optionally sends back information about oldestXmin of queries
which is then checked and applied to the WALSender's proc->xmin.
GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
so that all backends on primary include WALSender within their snapshots.
Note this does nothing to change the snapshot xmin on either master or
standby. Feedback piggybacks on the standby reply message.
vacuum_defer_cleanup_age is no longer used on standby, though parameter
still exists on primary, since some use cases still exist.
Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas
the standby has written, flushed, and applied the WAL. At the moment, this
is for informational purposes only, the values are only shown in
pg_stat_replication system view, but in the future they will also be needed
for synchronous replication.
Extracted from Simon riggs' synchronous replication patch by Robert Haas, with
some tweaking by me.
Specifying this option makes the server not wait for the
xlog to be archived, or emit a warning that it can't,
instead leaving the responsibility with the client.
This is useful when the log is being streamed using
the streaming protocol in parallel with the backup,
without having log archiving enabled.
This adds collation support for columns and domains, a COLLATE clause
to override it per expression, and B-tree index support.
Peter Eisentraut
reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch
new recovery.conf parameter recovery_target_name allows PITR to
specify named points as recovery targets.
Jaime Casanova, reviewed by Euler Taveira de Oliveira, plus minor edits
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
With this patch, pg_basebackup doesn't write a backup_label file in the
data directory, so it doesn't interfere with a pg_start/stop_backup() based
backup anymore. backup_label is still included in the backup, but it is
injected directly into the tar stream.
Heikki Linnakangas, reviewed by Fujii Masao and Magnus Hagander.
Move the actual functionality into a separate function that's
easier to call internally, and change the SQL-callable function
to be a wrapper calling this.
Also create a pg_abort_backup() function, only callable internally,
that does only the most vital parts of pg_stop_backup(), making it
safe(r) to call from error handlers.
The original coding could combine duplicate entries only when they
originated from the same qual condition. In particular it could not
combine cases where multiple qual conditions all give rise to full-index
scan requests, which is an expensive case well worth optimizing. Refactor
so that duplicates are recognized across all the quals.
Per my recent proposal(s). Null key datums can now be returned by
extractValue and extractQuery functions, and will be stored in the index.
Also, placeholder entries are made for indexable items that are NULL or
contain no keys according to extractValue. This means that the index is
now always complete, having at least one entry for every indexed heap TID,
and so we can get rid of the prohibition on full-index scans. A full-index
scan is implemented much the same way as partial-match scans were already:
we build a bitmap representing all the TIDs found in the index, and then
drive the results off that.
Also, introduce a concept of a "search mode" that can be requested by
extractQuery when the operator requires matching to empty items (this is
just as cheap as matching to a single key) or requires a full index scan
(which is not so cheap, but it sure beats failing or giving wrong answers).
The behavior remains backward compatible for opclasses that don't return
any null keys or request a non-default search mode.
Using these features, we can now make the GIN index opclass for anyarray
behave in a way that matches the actual anyarray operators for &&, <@, @>,
and = ... which it failed to do before in assorted corner cases.
This commit fixes the core GIN code and ginarrayprocs.c, updates the
documentation, and adds some simple regression test cases for the new
behaviors using the array operators. The tsearch and contrib GIN opclass
support functions still need to be looked over and probably fixed.
Another thing I intend to fix separately is that this is pretty inefficient
for cases where more than one scan condition needs a full-index search:
we'll run duplicate GinScanEntrys, each one of which builds a large bitmap.
There is some existing logic to merge duplicate GinScanEntrys but it needs
refactoring to make it work for entries belonging to different scan keys.
Note that most of gin.h has been split out into a new file gin_private.h,
so that gin.h doesn't export anything that's not supposed to be used by GIN
opclasses or the rest of the backend. I did quite a bit of other code
beautification work as well, mostly fixing comments and choosing more
appropriate names for things.
This is advantageous first because it allows us to hash the smaller table
regardless of the outer-join type, and second because hash join can be more
flexible than merge join in dealing with arbitrary join quals in a FULL
join. For merge join all the join quals have to be mergejoinable, but hash
join will work so long as there's at least one hashjoinable qual --- the
others can be any condition. (This is true essentially because we don't
keep per-inner-tuple match flags in merge join, while hash join can do so.)
To do this, we need a has-it-been-matched flag for each tuple in the
hashtable, not just one for the current outer tuple. The key idea that
makes this practical is that we can store the match flag in the tuple's
infomask, since there are lots of bits there that are of no interest for a
MinimalTuple. So we aren't increasing the size of the hashtable at all for
the feature.
To write this without turning the hash code into even more of a pile of
spaghetti than it already was, I rewrote ExecHashJoin in a state-machine
style, similar to ExecMergeJoin. Other than that decision, it was pretty
straightforward.
Instead, declare a public wrapper of the sole function using it for
external callers, so that they don't have to always pass a NULL
argument.
Author: Kevin Grittner
The contents of an unlogged table are WAL-logged; thus, they are not
available on standby servers and are truncated whenever the database
system enters recovery. Indexes on unlogged tables are also unlogged.
Unlogged GiST indexes are not currently supported.
cleanup stage to finish incomplete inserts or splits anymore. There was two
reasons for the cleanup step:
1. When a new tuple was inserted to a leaf page, the downlink in the parent
needed to be updated to contain (ie. to be consistent with) the new key.
Updating the parent in turn might require recursively updating the parent of
the parent. We now handle that by updating the parent while traversing down
the tree, so that when we insert the leaf tuple, all the parents are already
consistent with the new key, and the tree is consistent at every step.
2. When a page is split, we need to insert the downlink for the new right
page(s), and update the downlink for the original page to not include keys
that moved to the right page(s). We now handle that by setting a new flag,
F_FOLLOW_RIGHT, on the non-rightmost pages in the split. When that flag is
set, scans always follow the rightlink, regardless of the NSN mechanism used
to detect concurrent page splits. That way the tree is consistent right after
split, even though the downlink is still missing. This is very similar to the
way B-tree splits are handled. When the downlink is inserted in the parent,
the flag is cleared. To keep the insertion algorithm simple, when an
insertion sees an incomplete split, indicated by the F_FOLLOW_RIGHT flag, it
finishes the split before doing anything else.
These changes allow removing the whole "invalid tuple" mechanism, but I
retained the scan code to still follow invalid tuples correctly. While we
don't create any such tuples anymore, we want to handle them gracefully in
case you pg_upgrade a GiST index that has them. If we encounter any on an
insert, though, we just throw an error saying that you need to REINDEX.
The issue that got me into doing this is that if you did a checkpoint while
an insert or split was in progress, and the checkpoint finishes quickly so
that there is no WAL record related to the insert between RedoRecPtr and the
checkpoint record, recovery from that checkpoint would not know to finish
the incomplete insert. IOW, we have the same issue we solved with the
rm_safe_restartpoint mechanism during normal operation too. It's highly
unlikely to happen in practice, and this fix is far too large to backpatch,
so we're just going to live with in previous versions, but this refactoring
fixes it going forward.
With this patch, you don't get the annoying
'index "FOO" needs VACUUM or REINDEX to finish crash recovery' notices
anymore if you crash at an unfortunate moment.
Recent versions of the Linux system header files cause xlogdefs.h to
believe that open_datasync should be the default sync method, whereas
formerly fdatasync was the default on Linux. open_datasync is a bad
choice, first because it doesn't actually outperform fdatasync (in fact
the reverse), and second because we try to use O_DIRECT with it, causing
failures on certain filesystems (e.g., ext4 with data=journal option).
This part of the patch is largely per a proposal from Marti Raudsepp.
More extensive changes are likely to follow in HEAD, but this is as much
change as we want to back-patch.
Also clean up confusing code and incorrect documentation surrounding the
fsync_writethrough option. Those changes shouldn't result in any actual
behavioral change, but I chose to back-patch them anyway to keep the
branches looking similar in this area.
In 9.0 and HEAD, also do some copy-editing on the WAL Reliability
documentation section.
Back-patch to all supported branches, since any of them might get used
on modern Linux versions.
This commit represents a rather heavily editorialized version of
Teodor's builtin_knngist_itself-0.8.2 and builtin_knngist_proc-0.8.1
patches. I redid the opclass API to add a separate Distance method
instead of turning the Consistent method into an illogical mess,
fixed some bit-rot in the rbtree interfaces, and generally worked over
the code style and comments.
There's still no non-code documentation to speak of, but I'll work on
that separately. Some contrib-module changes are also yet to come
(right now, point <-> point is the only KNN-ified operator).
Teodor Sigaev and Tom Lane
This is a heavily revised version of builtin_knngist_core-0.9. The
ordering operators are no longer mixed in with actual quals, which would
have confused not only humans but significant parts of the planner.
Instead, ordering operators are carried separately throughout planning and
execution.
Since the API for ambeginscan and amrescan functions had to be changed
anyway, this commit takes the opportunity to rationalize that a bit.
RelationGetIndexScan no longer forces a premature index_rescan call;
instead, callers of index_beginscan must call index_rescan too. Aside from
making the AM-side initialization logic a bit less peculiar, this has the
advantage that we do not make a useless extra am_rescan call when there are
runtime key values. AMs formerly could not assume that the key values
passed to amrescan were actually valid; now they can.
Teodor Sigaev and Tom Lane
temporary indexes are not WAL-logged. We used a constant LSN for temporary
indexes, on the assumption that we don't need to worry about concurrent page
splits in temporary indexes because they're only visible to the current
session. But that assumption is wrong, it's possible to insert rows and
split pages in the same session, while a scan is in progress. For example,
by opening a cursor and fetching some rows, and INSERTing new rows before
fetching some more.
Fix by generating fake increasing LSNs, used in place of real LSNs in
temporary GiST indexes.
The GIN code has absolutely no business exporting GIN-specific functions
with names as generic as compareItemPointers() or newScanKey(); that's
just trouble waiting to happen. I got annoyed about this again just now
and decided to fix it. This commit ensures that all global symbols
defined in access/gin/ have names including "gin" or "Gin". There were a
couple of cases, like names involving "PostingItem", where arguably the
names were already sufficiently nongeneric; but I figured as long as I was
risking creating merge problems for unapplied GIN patches I might as well
impose a uniform policy.
I didn't touch any static symbol names. There might be some places
where it'd be appropriate to rename some static functions to match
siblings that are exported, but I'll leave that for another time.
The better estimate requires more statistics than we previously stored:
in particular, counts of "entry" versus "data" pages within the index,
as well as knowledge of the number of distinct key values. We collect
this information during initial index build and update it during VACUUM,
storing the info in new fields on the index metapage. No initdb is
required because these fields will read as zeroes in a pre-existing
index, and the new gincostestimate code is coded to behave (reasonably)
sanely if they are zeroes.
Teodor Sigaev, reviewed by Jan Urbanski, Tom Lane, and Itagaki Takahiro.
This patch resurrects some of the information that could be logged by the
old, now-dead implementation of VACUUM FULL, in particular counts of live
and dead tuples and the time taken for the table rebuild proper. There's
still no logging about the ensuing index rebuilds, though.
Itagaki Takahiro
new WAL arrives via streaming replication. This reduces the latency, and
also allows us to use a longer polling interval, which is good for energy
efficiency.
We still need to poll to check for the appearance of a trigger file, but
the interval is now 5 seconds (instead of 100ms), like when waiting for
a new WAL segment to appear in WAL archive.
transaction snapshots, i.e. a snapshot registered at the beginning of
a transaction. Change variable naming and comments to reflect this reality
in preparation for a future, truly serializable mode, e.g.
Serializable Snapshot Isolation (SSI).
For the moment transaction snapshots are still used to implement
SERIALIZABLE, but hopefully not for too much longer. Patch by Kevin
Grittner and Dan Ports with review and some minor wording changes by me.
This allows us to reliably remove all leftover temporary relation
files on cluster startup without reference to system catalogs or WAL;
therefore, we no longer include temporary relations in XLOG_XACT_COMMIT
and XLOG_XACT_ABORT WAL records.
Since these changes require including a backend ID in each
SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id
field has been reduced from two bytes to one, and the maximum number
of connections has been reduced from INT_MAX / 4 to 2^23-1. It would
be possible to remove these restrictions by increasing the size of
SharedInvalidationMessage by 4 bytes, but right now that doesn't seem
like a good trade-off.
Review by Jaime Casanova and Tom Lane.
Although the key-combining code claimed to work correctly if its input
contained both lossy and exact pointers for a single page in a single TID
stream, in fact this did not work, and could not work without pretty
fundamental redesign. Modify keyGetItem so that it will not return such a
stream, by handling lossy-pointer cases a bit more explicitly than we did
before.
Per followup investigation of a gripe from Artur Dabrowski.
An example of a query that failed given his data set is
select count(*) from search_tab where
(to_tsvector('german', keywords ) @@ to_tsquery('german', 'ee:* | dd:*')) and
(to_tsvector('german', keywords ) @@ to_tsquery('german', 'aa:*'));
Back-patch to 8.4 where the lossy pointer code was introduced.
struct representing a tree entry, rather than being a separately allocated
piece of storage. This API is at least as clean as the old one (if not
more so --- there were some bizarre choices in there) and it permits a
very substantial memory savings, on the order of 2X in ginbulk.c's usage.
Also, fix minor memory leaks in code called by ginEntryInsert, in
particular in ginInsertValue and entryFillRoot, as well as ginEntryInsert
itself. These leaks resulted in the GIN index build context continuing
to bloat even after we'd filled it to maintenance_work_mem and started
to dump data out to the index.
In combination these fixes restore the GIN index build code to honoring
the maintenance_work_mem limit about as well as it did in 8.4. Speed
seems on par with 8.4 too, maybe even a bit faster, for a non-pathological
case in which HEAD was formerly slower.
Back-patch to 9.0 so we don't have a performance regression from 8.4.
routines to make them behave better in the presence of "lossy" index pointers.
The previous coding was outright incorrect for some cases, as recently
reported by Artur Dabrowski: scanGetItem would fail to return index entries in
cases where one index key had multiple exact pointers on the same page as
another key had a lossy pointer. Also, keyGetItem was extremely inefficient
for cases where a single index key generates multiple "entry" streams, such as
an @@ operator with a multiple-clause tsquery. The presence of a lossy page
pointer in any one stream defeated its ability to use the opclass
consistentFn, resulting in probing many heap pages that didn't really need to
be visited. In Artur's example case, a query like
WHERE tsvector @@ to_tsquery('a & b')
was about 50X slower than the theoretically equivalent
WHERE tsvector @@ to_tsquery('a') AND tsvector @@ to_tsquery('b')
The way that I chose to fix this was to have GIN call the consistentFn
twice with both TRUE and FALSE values for the in-doubt entry stream,
returning a hit if either call produces TRUE, but not if they both return
FALSE. The code handles this for the case of a single in-doubt entry stream,
but punts (falling back to the stupid behavior) if there's more than one lossy
reference to the same page. The idea could be scaled up to deal with multiple
lossy references, but I think that would probably be wasted complexity. At
least to judge by Artur's example, such cases don't occur often enough to be
worth trying to optimize.
Back-patch to 8.4. 8.3 did not have lossy GIN index pointers, so not
subject to these problems.
Transaction aborts now record their LSN to avoid corner case
behaviour in SR/HS, hence change of name of variables and functions.
As pointed out by Fujii Masao. Cosmetic changes only.
max_standby_streaming_delay, and revise the implementation to avoid assuming
that timestamps found in WAL records can meaningfully be compared to clock
time on the standby server. Instead, the delay limits are compared to the
elapsed time since we last obtained a new WAL segment from archive or since
we were last "caught up" to WAL data arriving via streaming replication.
This avoids problems with clock skew between primary and standby, as well
as other corner cases that the original coding would misbehave in, such
as the primary server having significant idle time between transactions.
Per my complaint some time ago and considerable ensuing discussion.
Do some desultory editing on the hot standby documentation, too.
master. Otherwise a subsequent crash could cause the master to lose WAL that
has already been applied on the slave, resulting in the slave being out of
sync and soon corrupt. Per recent discussion and an example from Robert Haas.
Fujii Masao
confusion with streaming-replication settings. Also, change its default
value to "off", because of concern about executing new and poorly-tested
code during ordinary non-replicating operation. Per discussion.
In passing do some minor editing of related documentation.
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.
Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.
Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.
vacuum_log_cleanup_info() now generates log records with a valid
latestRemovedXid set in all cases. Also be careful not to zero the
value when we do a round of vacuuming part-way through lazy_scan_heap().
Incidentally, this reduces frequency of conflicts in Hot Standby.
Also, make the name of the GUC and the name of the backing variable match.
Alnong the way, clean up a couple of slight typographical errors in the
related docs.
through normal backends. Makes code clearer also, since we
avoid various Assert()s. Performance of snapshots taken
during recovery no longer depends upon number of read-only
backends.
after actually removing one, so that if we can't remove segments because
WAL archiving is lagging behind, we don't unnecessarily forbid streaming
the old not-yet-archived segments that are still perfectly valid. Per
suggestion from Fujii Masao.
doesn't take into account how far the WAL senders are. This way a hung
WAL sender doesn't prevent old WAL segments from being recycled/removed
in the primary, ultimately causing the disk to fill up. Instead add
standby_keep_segments setting to control how many old WAL segments are
kept in the primary. This also makes it more reliable to use streaming
replication without WAL archiving, assuming that you set
standby_keep_segments high enough.
The error message now makes explicit reference to the GUC that must be changed
to fix the problem, using wording suggested by Tom Lane. Along the way,
rename the GUC from MaxWalSenders to max_wal_senders for consistency and
grep-ability.
WAL record for btree delete contains a list of tids, even when backup
blocks are present. We follow the tids to their heap tuples, taking
care to follow LP_REDIRECT tuples. We ignore LP_DEAD tuples on the
understanding that they will always have xmin/xmax earlier than any
LP_NORMAL tuples referred to by killed index tuples. Iff all tuples
are LP_DEAD we return InvalidTransactionId. The heap relfilenode is
added to the WAL record, requiring API changes to pass down the heap
Relation. XLOG_PAGE_MAGIC updated.
present since 8.0 was never fully meaningful, since two recovery targets
cannot be specified. Refactor recovery target type to make this change
and associated code easier to understand. No change in function.
Bug report arising from internal support question.
field into WAL record and reset it from there, rather than using
FrozenTransactionId which can lead to some corner case bugs.
Problem report and suggested route to a fix from Heikki, details by me.
enabled. Bypassing the kernel cache is counter-productive in that case,
because the archiver/walsender process will read from the WAL file
soon after it's written, and if it's not cached the read will cause
a physical read, eating I/O bandwidth available on the WAL drive.
Also, walreceiver process does unaligned writes, so disable O_DIRECT
in walreceiver process for that reason too.
In addition, add support for a "payload" string to be passed along with
each notify event.
This implementation should be significantly more efficient than the old one,
and is also more compatible with Hot Standby usage. There is not yet any
facility for HS slaves to receive notifications generated on the master,
although such a thing is possible in future.
Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.
where a database has a non-default tablespaceid. Pass thru MyDatabaseId
and MyDatabaseTableSpace to allow file path to be re-created in
standby and correct invalidation to take place in all cases.
Update and rework xact_commit_desc() debug messages.
Bug report from Tom by code inspection. Fix by me.
several places, but for now only GIN uses it during index creation.
Using self-balanced tree greatly speeds up index creation in corner cases
with preordered data.
VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
Per discussion, the use case for this method of vacuuming is no longer large
enough to justify maintaining it; not to mention that we don't wish to invest
the work that would be needed to make it play nicely with Hot Standby.
Aside from the code directly related to old-style VACUUM FULL, this commit
removes support for certain WAL record types that could only be generated
within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
nontransactional generation of cache invalidation sinval messages (the last
being the sticking point for Hot Standby).
We still have to retain all code that copes with finding HEAP_MOVED_OFF and
HEAP_MOVED_IN flag bits on existing tuples. This can't be removed as long
as we want to support in-place update from pre-9.0 databases.
of shared or nailed system catalogs. This has two key benefits:
* The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs.
* We no longer have to use an unsafe reindex-in-place approach for reindexing
shared catalogs.
CLUSTER on nailed catalogs now works too, although I left it disabled on
shared catalogs because the resulting pg_index.indisclustered update would
only be visible in one database.
Since reindexing shared system catalogs is now fully transactional and
crash-safe, the former special cases in REINDEX behavior have been removed;
shared catalogs are treated the same as non-shared.
This commit does not do anything about the recently-discussed problem of
deadlocks between VACUUM FULL/CLUSTER on a system catalog and other
concurrent queries; will address that in a separate patch. As a stopgap,
parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid
such failures during the regression tests.
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.
Attributes can now have options, just as relations and tablespaces do, and
the reloptions code is used to parse, validate, and store them. For
simplicity and because these options are not performance critical, we store
them in a separate cache rather than the main relcache.
Thanks to Alex Hunsaker for the review.
that would've been WAL-logged if archiving was enabled. If we encounter
such records in archive recovery anyway, we know that some data is
missing from the log. A WARNING is emitted in that case.
Original patch by Fujii Masao, with changes by me.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.
Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.
Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.
Fujii Masao, with additional hacking by me
Previously, fastgetattr() and heap_getattr() tested their fourth argument
against a null pointer, but any attempt to use them with a literal-NULL
fourth argument evaluated to *(void *)0, resulting in a compiler error.
Remove these NULL tests to avoid leading future readers of this code to
believe that this has a chance of working. Also clean up related legacy
code in nocachegetattr(), heap_getsysattr(), and nocache_index_getattr().
The new coding standard is that any code which calls a getattr-type
function or macro which takes an isnull argument MUST pass a valid
boolean pointer. Per discussion with Bruce Momjian, Tom Lane, Alvaro
Herrera.
This patch only supports seq_page_cost and random_page_cost as parameters,
but it provides the infrastructure to scalably support many more.
In particular, we may want to add support for effective_io_concurrency,
but I'm leaving that as future work for now.
Thanks to Tom Lane for design help and Alvaro Herrera for the review.
to be just a minor extension of the previous patch that made "x IS NULL"
indexable, because we can treat the IS NOT NULL condition as if it were
"x < NULL" or "x > NULL" (depending on the index's NULLS FIRST/LAST option),
just like IS NULL is treated like "x = NULL". Aside from any possible
usefulness in its own right, this is an important improvement for
index-optimized MAX/MIN aggregates: it is now reliably possible to get
a column's min or max value cheaply, even when there are a lot of nulls
cluttering the interesting end of the index.
This is more in keeping with modern practice, and is a first step towards
porting to Win64 (which has sizeof(pointer) > sizeof(long)).
Tsutomu Yamada, Magnus Hagander, Tom Lane
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
be part of multixacts, so allocate a slot for each prepared transaction in
the "oldest member" array in multixact.c. On PREPARE TRANSACTION, transfer
the oldest member value from the current backends slot to the prepared xact
slot. Also save and recover the value from the 2pc state file.
The symptom of the bug was that after a transaction prepared, a shared lock
still held by the prepared transaction was sometimes ignored by other
transactions.
Fix back to 8.1, where both 2PC and multixact were introduced.
that it can scribble on scan->xs_ctup.t_self while following HOT chains,
so we can't rely on that to stay valid between hashgettuple() calls.
Introduce a private variable in HashScanOpaque, instead.
hash indexes keep entries sorted by hash value. First, the original plans for
concurrency assumed that insertions would happen only at the end of a page,
which is no longer true; this could cause scans to transiently fail to find
index entries in the presence of concurrent insertions. We can compensate
by teaching scans to re-find their position after re-acquiring read locks.
Second, neither the bucket split nor the bucket compaction logic had been
fixed to preserve hashvalue ordering, so application of either of those
processes could lead to permanent corruption of an index, in the sense
that searches might fail to find entries that are present.
This patch fixes the split and compaction logic to preserve hashvalue
ordering, but it cannot do anything about pre-existing corruption. We will
need to recommend reindexing all hash indexes in the 8.4.2 release notes.
To buy back the performance loss hereby induced in split and compaction,
fix them to use PageIndexMultiDelete instead of retail PageIndexDelete
operations. We might later want to do something with qsort'ing the
page contents rather than doing a binary search for each insertion,
but that seemed more invasive than I cared to risk in a back-patch.
Per bug #5157 from Jeff Janes and subsequent investigation.
tuple size limit. Improve the error message for index-tuple-too-large
so that it includes the actual size, the limit, and the index name.
Sync with the btree occurrences of the same error.
Back-patch to 8.4 because it appears that the out-of-sync problem
is occurring in the field.
Teodor and Tom
own database's datfrozenxid, if the current value is old enough to be
forcing autovacuums or warning messages. This ensures that a bogus
value is replaced as soon as possible. Per a comment from Heikki.
Recent commits have removed the various uses it was supporting. It was a
performance bottleneck, according to bug report #4919 by Lauris Ulmanis; seems
it slowed down user creation after a billion users.
XID) in checkpoint records. This eliminates the need to recompute the value
from scratch during database startup, which is one of the two remaining
reasons for the flatfile code to exist. It should also simplify life for
hot-standby operation.
To avoid bloating the checkpoint records unreasonably, I switched from
tracking the oldest database by name to tracking it by OID. This turns
out to save cycles in general (everywhere but the warning-generating
paths, which we hardly care about) and also helps us deal with the case
that the oldest database got dropped instead of being vacuumed. The prior
coding might go for a long time without updating the wrap limit in that case,
which is bad because it might result in a lot of useless autovacuum activity.
"all tuples visible" flag in heap page headers. The flag update *must*
be applied before calling XLogInsert, but heap_update and the tuple
moving routines in VACUUM FULL were ignoring this rule. A crash and
replay could therefore leave the flag incorrectly set, causing rows
to appear visible in seqscans when they should not be. This might explain
recent reports of data corruption from Jeff Ross and others.
In passing, do a bit of editorialization on comments in visibilitymap.c.
by supporting conversions in places that used to demand exact rowtype match.
Since this issue is certain to come up elsewhere (in fact, already has,
in ExecEvalConvertRowtype), factor out the support code into new core
functions for tuple conversion. I chose to put these in a new source
file since heaptuple.c is already overly long.
Heavily revised version of a patch by Pavel Stehule.
values being complained of.
In passing, also remove the arbitrary length limitation in the similar
error detail message for foreign key violations.
Itagaki Takahiro
The current implementation fires an AFTER ROW trigger for each tuple that
looks like it might be non-unique according to the index contents at the
time of insertion. This works well as long as there aren't many conflicts,
but won't scale to massive unique-key reassignments. Improving that case
is a TODO item.
Dean Rasheed
not forced out-of-line unless that is necessary to make the row fit on a
page. Previously, they were forced out-of-line if needed to get the row
down to the default target size (1/4th page).
Kevin Grittner
archive recovery. Invent a separate state variable and inquiry function
for XLogInsertAllowed() to clarify some tests and make the management of
writing the end-of-recovery checkpoint less klugy. Fix several places
that were incorrectly testing InRecovery when they should be looking at
RecoveryInProgress or XLogInsertAllowed (because they will now be executed
in the bgwriter not startup process). Clarify handling of bad LSNs passed
to XLogFlush during recovery. Use a spinlock for setting/testing
SharedRecoveryInProgress. Improve quite a lot of comments.
Heikki and Tom
during it:
When bgwriter is active, the startup process can't perform mdsync() correctly
because it won't see the fsync requests accumulated in bgwriter's private
pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery
checkpoint as well, when it's active.
When bgwriter is active (= archive recovery), the startup process must not
accumulate fsync requests to its own pendingOpsTable, since bgwriter won't
see them there when it performs restartpoints. Make startup process drop its
pendingOpsTable when bgwriter is launched to avoid that.
Update minimum recovery point one last time when leaving archive recovery.
It won't be updated by the end-of-recovery checkpoint because XLogFlush()
sees us as out of recovery already.
This fixes bug #4879 reported by Fujii Masao.
behavior in cases where we don't know the heap tuple count accurately; in
particular partial vacuum, but this also makes the API a bit more useful
for ANALYZE. This patch adds "estimated_count" flags to both structs so
that an approximate count can be flagged as such, and adjusts the logic
so that approximate counts are not used for updating pg_class.reltuples.
This fixes my previous complaint that VACUUM was putting ridiculous values
into pg_class.reltuples for indexes. The actual impact of that bug is
limited, because the planner only pays attention to reltuples for an index
if the index is partial; which probably explains why beta testers hadn't
noticed a degradation in plan quality from it. But it needs to be fixed.
The whole thing is a bit messy and should be redesigned in future, because
reltuples now has the potential to drift quite far away from reality when
a long period elapses with no non-partial vacuums. But this is as good as
it's going to get for 8.4.
is supposed to remove duplicate heap TIDs, we have to be sure to reduce the
tuple size and posting-item count accordingly in addItemPointersToTuple().
Failing to do so resulted in the effective injection of garbage TIDs into the
index contents, ie, whatever happened to be in the memory palloc'd for the
new tuple. I'm not sure that this fully explains the index corruption
reported by Tatsuo Ishii, but the test case I'm using no longer fails.
should use GinItemPointerGetBlockNumber/GinItemPointerGetOffsetNumber,
not ItemPointerGetBlockNumber/ItemPointerGetOffsetNumber, because the latter
will Assert() on ip_posid == 0, ie a "Min" pointer. (Thus, ItemPointerIsMin
has never worked at all, but it seems unused at present.) I'm not certain
that the case can occur in normal functioning, but it's blowing up on me
while investigating Tatsuo-san's data corruption problem. In any case it
seems like a problem waiting to bite someone.
Back-patch just in case this really is a problem for somebody in the field.
errors when tables are concurrently dropped. To do this we must take lock
on each relation before we check its privileges. The old code was trying
to do that the other way around, which is a bit pointless when there are lots
of other commands that lock relations before checking privileges. I did keep
it checking each relation's privilege before locking the next relation, which
is a detail that ALTER TABLE isn't too picky about.
To implement this without almost duplicating the reloption table, treat
relopt_kind as a bitmask instead of an integer value. This decreases the
range of allowed values, but it's not clear that there's need for that much
values anyway.
This patch also makes heap_reloptions explicitly a no-op for relation kinds
other than heap and TOAST tables.
Patch by ITAGAKI Takahiro with minor edits from me. (In particular I removed
the bit about adding relation kind to an error message, which I intend to
commit separately.)
method to pass extra data to the consistent() and comparePartial() methods.
This is the core infrastructure needed to support the soon-to-appear
contrib/btree_gin module. The APIs are still upward compatible with the
definitions used in 8.3 and before, although *not* with the previous 8.4devel
function definitions.
catversion bump for changes in pg_proc entries (although these are just
cosmetic, since GIN doesn't actually look at the function signature before
calling it...)
Teodor Sigaev and Oleg Bartunov
them from degrading badly when the input is sorted or nearly so. In this
scenario the tree is unbalanced to the point of becoming a mere linked list,
so insertions become O(N^2). The easiest and most safely back-patchable
solution is to stop growing the tree sooner, ie limit the growth of N. We
might later consider a rebalancing tree algorithm, but it's not clear that
the benefit would be worth the cost and complexity. Per report from Sergey
Burladyan and an earlier complaint from Heikki.
Back-patch to 8.2; older versions didn't have GIN indexes.
multiple index entries in a holding area before adding them to the main index
structure. This helps because bulk insert is (usually) significantly faster
than retail insert for GIN.
This patch also removes GIN support for amgettuple-style index scans. The
API defined for amgettuple is difficult to support with fastupdate, and
the previously committed partial-match feature didn't really work with
it either. We might eventually figure a way to put back amgettuple
support, but it won't happen for 8.4.
catversion bumped because of change in GIN's pg_am entry, and because
the format of GIN indexes changed on-disk (there's a metapage now,
and possibly a pending list).
Teodor Sigaev
its usual buffer cleaning duties during archive recovery, and it's responsible
for performing restartpoints.
This requires some changes in postmaster. When the startup process has done
all the initialization and is ready to start WAL redo, it signals the
postmaster to launch the background writer. The postmaster is signaled again
when the point in recovery is reached where we know that the database is in
consistent state. Postmaster isn't interested in that at the moment, but
that's the point where we could let other backends in to perform read-only
queries. The postmaster is signaled third time when the recovery has ended,
so that postmaster knows that it's safe to start accepting connections.
The startup process now traps SIGTERM, and performs a "clean" shutdown. If
you do a fast shutdown during recovery, a shutdown restartpoint is performed,
like a shutdown checkpoint, and postmaster kills the processes cleanly. You
still have to continue the recovery at next startup, though.
Currently, the background writer is only launched during archive recovery.
We could launch it during crash recovery as well, but it seems better to keep
that codepath as simple as possible, for the sake of robustness. And it
couldn't do any restartpoints during crash recovery anyway, so it wouldn't be
that useful.
log_restartpoints is gone. Use log_checkpoints instead. This is yet to be
documented.
This whole operation is a pre-requisite for Hot Standby, but has some value of
its own whether the hot standby patch makes 8.4 or not.
Simon Riggs, with lots of modifications by me.
qualifier, and add support for this in pg_dump.
This allows TOAST tables to have user-defined fillfactor, and will also
enable us to move the autovacuum parameters to reloptions without taking
away the possibility of setting values for TOAST tables.
refactor the relcache code that used to do that. This allows other callers
(particularly autovacuum) to do the same without necessarily having to open
and lock a table.
be used instead of the normal exclusive lock, and make WAL redo functions
responsible for calling RestoreBkpBlocks(). They know better what kind of a
lock they need.
At the moment, this just moves things around with no functional change, but
makes the hot standby patch that's under review cleaner.
fillRelOptions routine that stores the parsed values in the struct using a
table-based approach. Per Tom suggestion. Also remove the "continue"
in HANDLE_*_RELOPTION macros, which were useless and in spirit they were
assuming too much of how the macros were going to be used. (Note that these
macros are now unused, but the intention is to introduce some usage in a
future autovacuum patch, which is why they weren't completely removed.)
Also, do not call the string validation routine when not validating. It seems
less error-prone this way, per commentary on the amoptions SGML docs.
bitmap. This is extracted from Greg Stark's posix_fadvise patch; it seems
worth committing separately, since it's potentially useful independently of
posix_fadvise.
a more complete framework for writing custom option processing routines
by user-defined access methods.
Catalog version bumped due to the general API changes, which are going to
affect user-defined "amoptions" routines.
heap page, where a set bit indicates that all tuples on the page are
visible to all transactions, and the page therefore doesn't need
vacuuming. It is stored in a new relation fork.
Lazy vacuum uses the visibility map to skip pages that don't need
vacuuming. Vacuum is also responsible for setting the bits in the map.
In the future, this can hopefully be used to implement index-only-scans,
but we can't currently guarantee that the visibility map is always 100%
up-to-date.
In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on
each heap page, also indicating that all tuples on the page are visible to
all transactions. It's important that this flag is kept up-to-date. It
is also used to skip visibility tests in sequential scans, which gives a
small performance gain on seqscans.
that a Portal is a useful and sufficient additional argument for
CreateDestReceiver --- it just isn't, in most cases. Instead formalize
the approach of passing any needed parameters to the receiver separately.
One unexpected benefit of this change is that we can declare typedef Portal
in a less surprising location.
This patch is just code rearrangement and doesn't change any functionality.
I'll tackle the HOLD-cursor-vs-toast problem in a follow-on patch.
truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To
make that cleaner from modularity point of view, move the WAL-logging one
level up to RelationTruncate, and move RelationTruncate and all the
related WAL-logging to new src/backend/catalog/storage.c file. Introduce
new RelationCreateStorage and RelationDropStorage functions that are used
instead of calling smgrcreate/smgrscheduleunlink directly. Move the
pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new
functions. This leaves smgr.c as a thin wrapper around md.c; all the
transactional stuff is now in storage.c.
This will make it easier to add new forks with similar truncation logic,
like the visibility map.
heap_form_tuple. Since this removes the last remaining caller of
heap_addheader, remove it.
Extracted from the column privileges patch from Stephen Frost, with further
code cleanups by me.
(but not locked, as that would risk deadlocks). Also, make it work in a small
ring of buffers to avoid having bulk inserts trash the whole buffer arena.
Robert Haas, after an idea of Simon Riggs'.
and heap_deformtuple in favor of the newer functions heap_form_tuple et al
(which do the same things but use bool control flags instead of arbitrary
char values). Eliminate the former duplicate coding of these functions,
reducing the deprecated functions to mere wrappers around the newer ones.
We can't get rid of them entirely because add-on modules probably still
contain many instances of the old coding style.
Kris Jurka
functions into one ReadBufferExtended function, that takes the strategy
and mode as argument. There's three modes, RBM_NORMAL which is the default
used by plain ReadBuffer(), RBM_ZERO, which replaces ZeroOrReadBuffer, and
a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages
without throwing an error. The FSM needs the new mode to recover from
corrupt pages, which could happend if we crash after extending an FSM file,
and the new page is "torn".
Add fork number to some error messages in bufmgr.c, that still lacked it.
written to temp files by tuplesort.c and tuplestore.c. This saves 2 bytes per
row for 32-bit machines, and 6 bytes per row for 64-bit machines, which seems
worth the slight additional uglification of the tuple read/write routines.
correctly set. As result, killtuple() marks as dead
wrong tuple on page. Bug was introduced by me while fixing
possible duplicates during GiST index scan.
This patch eliminates the marking of subtransactions as SUBCOMMITTED in pg_clog
during their commit; instead they remain in-progress until main transaction
commit. At main transaction commit, the commit protocol is atomic-by-page
instead of one transaction at a time. To avoid a race condition with some
subtransactions appearing committed before others in the case where they span
more than one pg_clog page, we conserve the logic that marks them subcommitted
before marking the parent committed.
Simon Riggs with minor help from me
is NULL but SK_SEARCHNULL is not set. Add checking IS NULL of keys
to set during key initialization. If key is NULL and SK_SEARCHNULL is not
set then nothnig can be satisfied.
With assert-enabled compilation that causes coredump.
Bug was introduced in 8.3 by support of IS NULL index scan.
of referencing a WITH item that's not yet in scope according to the SQL
spec's semantics. This seems to be an easy error to make, and the bare
"relation doesn't exist" message doesn't lead one's mind in the correct
direction to fix it.
free space information is stored in a dedicated FSM relation fork, with each
relation (except for hash indexes; they don't use FSM).
This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any
trace of them from the backend, initdb, and documentation.
Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also
introduce a new variant of the get_raw_page(regclass, int4, int4) function in
contrib/pageinspect that let's you to return pages from any relation fork, and
a new fsm_page_contents() function to inspect the new FSM pages.
value. This means that hash index lookups are always lossy and have to be
rechecked when the heap is visited; however, the gain in index compactness
outweighs this when the indexed values are wide. Also, we only need to
perform datatype comparisons when the hash codes match exactly, rather than
for every entry in the hash bucket; so it could also win for datatypes that
have expensive comparison functions. A small additional win is gained by
keeping hash index pages sorted by hash code and using binary search to reduce
the number of index tuples we have to look at.
Xiao Meng
This commit also incorporates Zdenek Kotala's patch to isolate hash metapages
and hash bitmaps a bit better from the page header datastructures.
at once and ItemPointers are collected in memory.
Remove tuple's killing by killtuple() if tuple was moved to another
page - it could produce unaceptable overhead.
Backpatch up to 8.1 because the bug was introduced by GiST's concurrency support.
of multiple forks, and each fork can be created and grown separately.
The bulk of this patch is about changing the smgr API to include an extra
ForkNumber argument in every smgr function. Also, smgrscheduleunlink and
smgrdounlink no longer implicitly call smgrclose, because other forks might
still exist after unlinking one. The callers of those functions have been
modified to call smgrclose instead.
This patch in itself doesn't have any user-visible effect, but provides the
infrastructure needed for upcoming patches. The additional forks envisioned
are a rewritten FSM implementation that doesn't rely on a fixed-size shared
memory block, and a visibility map to allow skipping portions of a table in
VACUUM that have no dead tuples.
thereby forestalling any problems with alignment of the data structure placed
there. Since SizeOfPageHeaderData is maxalign'd anyway in 8.3 and HEAD, this
does not actually change anything right now, but it is foreseeable that the
header size will change again someday. I had to fix a couple of places that
were assuming that the content offset is just SizeOfPageHeaderData rather than
MAXALIGN(SizeOfPageHeaderData). Per discussion of Zdenek's page-macros patch.
SizeOfPageHeaderData instead of sizeof(PageHeaderData) in places where that
makes the code clearer, and avoid casting between Page and PageHeader where
possible. Zdenek Kotala, with some additional cleanup by Heikki Linnakangas.
I did not apply the parts of the proposed patch that would have resulted in
slightly changing the on-disk format of hash indexes; it seems to me that's
not a win as long as there's any chance of having in-place upgrade for 8.4.
corresponding struct definitions. This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
forks. XLogOpenRelation() and the associated light-weight relation cache in
xlogutils.c is gone, and XLogReadBuffer() now takes a RelFileNode as argument,
instead of Relation.
For functions that still need a Relation struct during WAL replay, there's a
new function called CreateFakeRelcacheEntry() that returns a fake entry like
XLogOpenRelation() used to.
algorithm, replacing the original intention of a one-pass search, which
had been hacked up over time to be partially two-pass in hopes of handling
various corner cases better. It still wasn't quite there, especially as
regards emitting unwanted NOTICE messages. More importantly, this approach
lets us fix a number of open bugs concerning concurrent DROP scenarios,
because we can take locks during the first pass and avoid traversing to
dependent objects that were just deleted by someone else.
There is more that can be done here, but I'll go ahead and commit the
base patch before working on the options.
Formerly, the default value of wal_sync_method was determined inside xlog.c,
but now it is determined inside guc.c. guc.c was reading xlogdefs.h
without having read <fcntl.h>, leading to wrong determination of
DEFAULT_SYNC_METHOD. Obviously xlogdefs.h needs to include <fcntl.h>
for itself to ensure stable results.
unnecessary #include lines in it. Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.
For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.
While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
<craig@postnewspapers.com.au>.
It was my mistake, I missed limitation of number of held locks, now GIN doesn't
use continiuous locks, but still hold buffers pinned to prevent interference
with vacuum's deletion algorithm.
Backpatch is needed.
corrupted. (Neither is very important if SIGTERM is used to shut down the
whole database cluster together, but there's a problem if someone tries to
SIGTERM individual backends.) To do this, introduce new infrastructure
macros PG_ENSURE_ERROR_CLEANUP/PG_END_ENSURE_ERROR_CLEANUP that take care
of transiently pushing an on_shmem_exit cleanup hook. Also use this method
for createdb cleanup --- that wasn't a shared-memory-corruption problem,
but SIGTERM abort of createdb could leave orphaned files lying around.
Backpatch as far as 8.2. The shmem corruption cases don't exist in 8.1,
and the createdb usage doesn't seem important enough to risk backpatching
further.
instead of plan time. Extend the amgettuple API so that the index AM returns
a boolean indicating whether the indexquals need to be rechecked, and make
that rechecking happen in nodeIndexscan.c (currently the only place where
it's expected to be needed; other callers of index_getnext are just erroring
out for now). For the moment, GIN and GIST have stub logic that just always
sets the recheck flag to TRUE --- I'm hoping to get Teodor to handle pushing
that control down to the opclass consistent() functions. The planner no
longer pays any attention to amopreqcheck, and that catalog column will go
away in due course.
systable_endscan_ordered that have API similar to systable_beginscan etc
(in particular, the passed-in scankeys have heap not index attnums),
but guarantee ordered output, unlike the existing functions. For the moment
these are just very thin wrappers around index_beginscan/index_getnext/etc.
Someday they might need to get smarter; but for now this is just a code
refactoring exercise to reduce the number of direct callers of index_getnext,
in preparation for changing that function's API.
In passing, remove index_getnext_indexitem, which has been dead code for
quite some time, and will have even less use than that in the presence
of run-time-lossy indexes.
indexscan always occurs in one call, and the results are returned in a
TIDBitmap instead of a limited-size array of TIDs. This should improve
speed a little by reducing AM entry/exit overhead, and it is necessary
infrastructure if we are ever to support bitmap indexes.
In an only slightly related change, add support for TIDBitmaps to preserve
(somewhat lossily) the knowledge that particular TIDs reported by an index
need to have their quals rechecked when the heap is visited. This facility
is not really used yet; we'll need to extend the forced-recheck feature to
plain indexscans before it's useful, and that hasn't been coded yet.
The intent is to use it to clean up 8.3's horrid @@@ kluge for text search
with weighted queries. There might be other uses in future, but that one
alone is sufficient reason.
Heikki Linnakangas, with some adjustments by me.
currently support this because we must be able to build Vars referencing
join columns, and varattno is only 16 bits wide. Perhaps this should be
improved in future, but considering that it never came up before, I'm not
sure the problem is worth much effort. Per bug #4070 from Marcello
Ceschia.
The problem seems largely academic in 8.0 and 7.4, because they have
(different) O(N^2) performance issues with such wide joins, but
back-patch all the way anyway.
snapmgmt.c file for the former. The header files have also been reorganized
in three parts: the most basic snapshot definitions are now in a new file
snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c.
tqual.h has been reduced to the bare minimum.
This patch is just a first step towards managing live snapshots within a
transaction; there is no functionality change.
Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and
subsequent discussion.
bucket number, so as to ensure locality of access to the index during the
insertion step. Without this, building an index significantly larger than
available RAM takes a very long time because of thrashing. On the other
hand, sorting is just useless overhead when the index does fit in RAM.
We choose to sort when the initial index size exceeds effective_cache_size.
This is a revised version of work by Tom Raney and Shreya Bhargava.
two buckets at the start, we create a number of buckets appropriate for the
estimated size of the table. This avoids a lot of expensive bucket-split
actions during initial index build on an already-populated table.
This is one of the two core ideas of Tom Raney and Shreya Bhargava's patch
to reduce hash index build time. I'm committing it separately to make it
easier for people to test the effects of this separately from the effects
of their other core idea (pre-sorting the index entries by bucket number).
before it goes groveling through the ProcArray. In situations where the same
recently-committed transaction ID is checked repeatedly by tqual.c, this saves
a lot of shared-memory searches. And it's cheap enough that it shouldn't
hurt noticeably when it doesn't help.
Concept and patch by Simon, some minor tweaking and comment-cleanup by Tom.
it accumulates the set of changes to be made and then applies them. It had
to accumulate the set of changes anyway to prepare a WAL record for the
pruning action, so this isn't an enormous change; the only new complexity is
to not doubly mark tuples that are visited twice in the scan. The main
advantage is that we can substantially reduce the scope of the critical
section in which the changes are applied, thus avoiding PANIC in foreseeable
cases like running out of memory in inval.c. A nice secondary advantage is
that it is now far clearer that WAL replay will actually do the same thing
that the original pruning did.
This commit doesn't do anything about the open problem that
CacheInvalidateHeapTuple doesn't have the right semantics for a CTID change
caused by collapsing out a redirect pointer. But whatever we do about that,
it'll be a good idea to not do it inside a critical section.
temporary table; we can't support that because there's no way to clean up the
source backend's internal state if the eventual COMMIT PREPARED is done by
another backend. This was checked correctly in 8.1 but I broke it in 8.2 :-(.
Patch by Heikki Linnakangas, original trouble report by John Smith.
data structures and backend internal APIs. This solves problems we've seen
recently with inconsistent layout of pg_control between machines that have
32-bit time_t and those that have already migrated to 64-bit time_t. Also,
we can get out from under the problem that Windows' Unix-API emulation is not
consistent about the width of time_t.
There are a few remaining places where local time_t variables are used to hold
the current or recent result of time(NULL). I didn't bother changing these
since they do not affect any cross-module APIs and surely all platforms will
have 64-bit time_t before overflow becomes an actual risk. time_t should
be avoided for anything visible to extension modules, however.
its second pass over the table. It has to start at block zero, else the
"merge join" logic for detecting which TIDs are already in the index
doesn't work. Hence, extend heapam.c's API so that callers can enable or
disable syncscan. (I put in an option to disable buffer access strategy,
too, just in case somebody needs it.) Per report from Hannes Dorbath.
constraint status of copied indexes (bug #3774), as well as various other
small bugs such as failure to pstrdup when needed. Allow INCLUDING INDEXES
indexes to be merged with identical declared indexes (perhaps not real useful,
but the code is there and having it not apply to LIKE indexes seems pretty
unorthogonal). Avoid useless work in generateClonedIndexStmt(). Undo some
poorly chosen API changes, and put a couple of routines in modules that seem
to be better places for them.
but no database changes have been made since the last CommandCounterIncrement.
This should result in a significant improvement in the number of "commands"
that can typically be performed within a transaction before hitting the 2^32
CommandId size limit. In particular this buys back (and more) the possible
adverse consequences of my previous patch to fix plan caching behavior.
The implementation requires tracking whether the current CommandCounter
value has been "used" to mark any tuples. CommandCounter values stored into
snapshots are presumed not to be used for this purpose. This requires some
small executor changes, since the executor used to conflate the curcid of
the snapshot it was using with the command ID to mark output tuples with.
Separating these concepts allows some small simplifications in executor APIs.
Something for the TODO list: look into having CommandCounterIncrement not do
AcceptInvalidationMessages. It seems fairly bogus to be doing it there,
but exactly where to do it instead isn't clear, and I'm disinclined to mess
with asynchronous behavior during late beta.
Else, in a 64-bit machine with maintenance_work_mem set to above 4Gb,
the counter overflows and we never recognize having reached the
maintenance_work_mem limit. I believe this explains out-of-memory
failure recently reported by Sean Davis.
This is a bug, so backpatch to 8.2.
it failed for splits of non-leaf pages because in such pages the first
data key on a page is suppressed, and so we can't just copy the first
key from the right page to reconstitute the left page's high key.
Problem found by Koichi Suzuki, patch by Heikki.
- create a separate archive_mode GUC, on which archive_command is dependent
- %r option in recovery.conf sends last restartpoint to recovery command
- %r used in pg_standby, updated README
- minor other code cleanup in pg_standby
- doc on Warm Standby now mentions pg_standby and %r
- log_restartpoints recovery option emits LOG message at each restartpoint
- end of recovery now displays last transaction end time, as requested
by Warren Little; also shown at each restartpoint
- restart archiver if needed to carry away WAL files at shutdown
Simon Riggs
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version. Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.
In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.
Pavan Deolasee, with help from a bunch of other people.
ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid"
variable that is updated during transaction commit or abort. Since
latestCompletedXid is written only in places that had to lock ProcArrayLock
exclusively anyway, and is read only in places that had to lock ProcArrayLock
shared anyway, it adds no new locking requirements to the system despite being
cluster-wide. Moreover, removing ReadNewTransactionId from snapshot
acquisition eliminates the need to take both XidGenLock and ProcArrayLock at
the same time. Since XidGenLock is sometimes held across I/O this can be a
significant win. Some preliminary benchmarking suggested that this patch has
no effect on average throughput but can significantly improve the worst-case
transaction times seen in pgbench. Concept by Florian Pflug, implementation
by Tom Lane.
rows will normally never obtain an XID at all. We already did things this way
for subtransactions, but this patch extends the concept to top-level
transactions. In applications where there are lots of short read-only
transactions, this should improve performance noticeably; not so much from
removal of the actual XID-assignments, as from reduction of overhead that's
driven by the rate of XID consumption. We add a concept of a "virtual
transaction ID" so that active transactions can be uniquely identified even
if they don't have a regular XID. This is a much lighter-weight concept:
uniqueness of VXIDs is only guaranteed over the short term, and no on-disk
record is made about them.
Florian Pflug, with some editorialization by Tom.
Oleg Bartunov and Teodor Sigaev, but I did a lot of editorializing,
so anything that's broken is probably my fault.
Documentation is nonexistent as yet, but let's land the patch so we can
get some portability testing done.
before reporting a transaction committed. Data consistency is still
guaranteed (unlike setting fsync = off), but a crash may lose the effects
of the last few transactions. Patch by Simon, some editorialization by Tom.
and fsync WAL at convenient intervals. For the moment it just tries to
offload this work from backends, but soon it will be responsible for
guaranteeing a maximum delay before asynchronously-committed transactions
will be flushed to disk.
This is a portion of Simon Riggs' async-commit patch, committed to CVS
separately because a background WAL writer seems like it might be a good idea
independently of the async-commit feature. I rebased walwriter.c on
bgwriter.c because it seemed like a more appropriate way of handling signals;
while the startup/shutdown logic in postmaster.c is more like autovac because
we want walwriter to quit before we start the shutdown checkpoint.
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.
Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.
Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
pseudo HeapScanDesc created for a bitmap heap scan. This avoids some useless
overhead during a bitmap scan startup, in particular invoking the syncscan
code. (We might someday want to do that, but right now it's merely useless
contention for shared memory, to say nothing of possibly pushing useful
entries out of syncscan's small LRU list.) This also allows elimination of
ugly pgstat_discount_heap_scan() kluge.
delivering a well-randomized hash value. I got religion on this after
observing that performance of multi-batch hash join degrades terribly if the
higher-order bits of hash values aren't random, as indeed was true for say
hashes of small integer values. It's now expected and documented that hash
functions should use hash_any or some comparable method to ensure that all
bits of their output are about equally random.
initdb forced because this change invalidates existing hash indexes. For the
same reason, this isn't back-patchable; the hash join performance problem
will get a band-aid fix in the back branches.
buffers, rather than blowing out the whole shared-buffer arena. Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer. Those flushes will now occur only once per
ring-ful. The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done. The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.
This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it. To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.
Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.
Original patch by Simon, reworked by Heikki and again by Tom.
and aborted transactions have different effects; also teach it not to assume
that prepared transactions are always committed.
Along the way, simplify the pgstats API by tying counting directly to
Relations; I cannot detect any redeeming social value in having stats
pointers in HeapScanDesc and IndexScanDesc structures. And fix a few
corner cases in which counts might be missed because the relation's
pgstat_info pointer hadn't been set.
WAL records that shows whether it is safe to remove full-page images
(ie, whether or not an on-line backup was in progress when the WAL entry
was made). Also make provision for an XLOG_NOOP record type that can be
used to fill in the extra space when decompressing the data for restore.
This is the portion of Koichi Suzuki's "full page writes" patch that
has to go into the core database. The remainder of that work is two
external compression and decompression programs, which for the time being
will undergo separate development on pgfoundry. Per discussion.
Also, twiddle the handling of BTREE_SPLIT records to ensure it'll be
possible to compress them (the previous coding caused essential info
to be omitted). The other commonly-used record types seem OK already,
with the possible exception of GIN and GIST WAL records, which I don't
understand well enough to opine on.
to avoid losing useful Xid information in not-so-old tuples. This makes
CLUSTER behave the same as VACUUM as far a tuple-freezing behavior goes
(though CLUSTER does not yet advance the table's relfrozenxid).
While at it, move the actual freezing operation in rewriteheap.c to a more
appropriate place, and document it thoroughly. This part of the patch from
Tom Lane.
pages it intends to zero immediately. Just to show there is some use for that
function besides WAL recovery :-).
Along the way, fold _hash_checkpage and _hash_pageinit calls into _hash_getbuf
and friends, instead of expecting callers to do that separately.
from time_t to TimestampTz representation. This provides full gettimeofday()
resolution of the timestamps, which might be useful when attempting to
do point-in-time recovery --- previously it was not possible to specify
the stop point with sub-second resolution. But mostly this is to get
rid of TimestampTz-to-time_t conversion overhead during commit. Per my
proposal of a day or two back.
messages to the stats collector. This avoids the problem that enabling
stats_row_level for autovacuum has a significant overhead for short
read-only transactions, as noted by Arjen van der Meijden. We can avoid
an extra gettimeofday call by piggybacking on the one done for WAL-logging
xact commit or abort (although that doesn't help read-only transactions,
since they don't WAL-log anything).
In my proposal for this, I noted that we could change the WAL log entries
for commit/abort to record full TimestampTz precision, instead of only
time_t as at present. That's not done in this patch, but will be committed
separately.
failed (due to lock conflicts or out-of-space). We might have already
extended the index's filesystem EOF before failing, causing the EOF to be
beyond what the metapage says is the last used page. Hence the invariant
maintained by the code needs to be "EOF is at or beyond last used page",
not "EOF is exactly the last used page". Problem was created by my patch
of 2006-11-19 that attempted to repair bug #2737. Since that was
back-patched to 7.4, this needs to be as well. Per report and test case
from Vlastimil Krejcir.
(original code *always* created a full-page image for the left page, thus
leaving the intended savings unrealized), avoid risk of not having enough room
on the page during xlog restore, squeeze out another couple bytes in the xlog
record, clean up neglected comments.
index types can be reliably distinguished by examining the special space
on an index page. Per my earlier proposal, plus the realization that
there's no need for btree's vacuum cycle ID to cycle through every possible
16-bit value. Restricting its range a little costs nearly nothing and
eliminates the possibility of collisions.
Memo to self: remember to make bitmap indexes play along with this scheme,
assuming that patch ever gets accepted.
This commit breaks any code that assumes that the mere act of forming a tuple
(without writing it to disk) does not "toast" any fields. While all available
regression tests pass, I'm not totally sure that we've fixed every nook and
cranny, especially in contrib.
Greg Stark with some help from Tom Lane
Add the latter to the values checked in pg_control, since it can't be changed
without invalidating toast table content. This commit in itself shouldn't
change any behavior, but it lays some necessary groundwork for experimentation
with these toast-control numbers.
Note: while TOAST_TUPLE_THRESHOLD can now be changed without initdb, some
thought still needs to be given to needs_toast_table() in toasting.c before
unleashing random changes.
module and teach PREPARE and protocol-level prepared statements to use it.
In service of this, rearrange utility-statement processing so that parse
analysis does not assume table schemas can't change before execution for
utility statements (necessary because we don't attempt to re-acquire locks
for utility statements when reusing a stored plan). This requires some
refactoring of the ProcessUtility API, but it ends up cleaner anyway,
for instance we can get rid of the QueryContext global.
Still to do: fix up SPI and related code to use the plan cache; I'm tempted to
try to make SQL functions use it too. Also, there are at least some aspects
of system state that we want to ensure remain the same during a replan as in
the original processing; search_path certainly ought to behave that way for
instance, and perhaps there are others.
Get rid of VARATT_SIZE and VARATT_DATA, which were simply redundant with
VARSIZE and VARDATA, and as a consequence almost no code was using the
longer names. Rename the length fields of struct varlena and various
derived structures to catch anyplace that was accessing them directly;
and clean up various places so caught. In itself this patch doesn't
change any behavior at all, but it is necessary infrastructure if we hope
to play any games with the representation of varlena headers.
Greg Stark and Tom Lane