streamed backup, throw an error and refuse to start up. The restore has not
finished correctly in that case and the data directory is possibly corrupt.
We already errored out in case of archive recovery, but could not during
crash recovery because we couldn't distinguish between the case that
pg_start_backup() was called and the database then crashed (must not error,
data is OK), and the case that we're restoring from a backup and not all
the needed WAL was replayed (data can be corrupt).
To distinguish those cases, add a line to backup_label to indicate
whether the backup was taken with pg_start/stop_backup(), or by streaming
(ie. pg_basebackup).
This requires re-initdb, because of a new field added to the control file.
The original definition had the problem that timeouts exceeding about 2100
seconds couldn't be specified on 32-bit machines. Milliseconds seem like
sufficient resolution, and finer grain than that would be fantasy anyway
on many platforms.
Back-patch to 9.1 so that this aspect of the latch API won't change between
9.1 and later releases.
Peter Geoghegan
Improve the documentation around weak-memory-ordering risks, and do a pass
of general editorialization on the comments in the latch code. Make the
Windows latch code more like the Unix latch code where feasible; in
particular provide the same Assert checks in both implementations.
Fix poorly-placed WaitLatch call in syncrep.c.
This patch resolves, for the moment, concerns around weak-memory-ordering
bugs in latch-related code: we have documented the restrictions and checked
that existing calls meet them. In 9.2 I hope that we will install suitable
memory barrier instructions in SetLatch/ResetLatch, so that their callers
don't need to be quite so careful.
Such a construction is useless since the lower PlaceHolderVar is already
nullable; no need to make it more so. Noted while pursuing bug #6154.
This is just a minor planner efficiency improvement, since the final plan
will come out the same anyway after PHVs are flattened. So not worth the
risk of back-patching.
Don't try to allocate the default value for a string relopt in the same
palloc chunk as the relopt_string struct. That didn't work too well if you
added a built-in string relopt in the stringRelOpts array, as it's not
possible to have an initializer for a variable length struct in C. This
makes the code slightly simpler too.
While we're at it, move the call to validator function in
add_string_reloption to before the allocation, so that if someone does pass
a bogus default value, we don't leak memory.
A PlaceHolderVar's expression might contain another, lower-level
PlaceHolderVar. If the outer PlaceHolderVar is used, the inner one
certainly will be also, and so we have to make sure that both of them get
into the placeholder_list with correct ph_may_need values during the
initial pre-scan of the query (before deconstruct_jointree starts).
We did this correctly for PlaceHolderVars appearing in the query quals,
but overlooked the issue for those appearing in the top-level targetlist;
with the result that nested placeholders referenced only in the targetlist
did not work correctly, as illustrated in bug #6154.
While at it, add some error checking to find_placeholder_info to ensure
that we don't try to create new placeholders after it's too late to do so;
they have to all be created before deconstruct_jointree starts.
Back-patch to 8.4 where the PlaceHolderVar mechanism was introduced.
Somebody thought it'd be cute to invent a set of Node tag numbers that were
defined independently of, and indeed conflicting with, the main tag-number
list. While this accidentally failed to fail so far, it would certainly
lead to trouble as soon as anyone wanted to, say, apply copyObject to these
node types. Clang was already complaining about the use of makeNode on
these tags, and I think quite rightly so. Fix by pushing these node
definitions into the mainstream, including putting replnodes.h where it
belongs.
The previous limit of 1024 was set on the assumption that all modern syslog
implementations have line length limits of 2KB or so. However, this is
false, as at least Solaris and sysklogd truncate at only 1KB. 900 seems
to leave enough room for the max likely length of the tacked-on prefixes,
so let's go with that.
As with the previous change, it doesn't seem wise to back-patch this into
already-released branches; but it should be OK to sneak it into 9.1.
Noah Misch
Instead of entering them on transaction startup, we materialize them
only when someone wants to wait, which will occur only during CREATE
INDEX CONCURRENTLY. In Hot Standby mode, the startup process must also
be able to probe for conflicting VXID locks, but the lock need never be
fully materialized, because the startup process does not use the normal
lock wait mechanism. Since most VXID locks never need to touch the
lock manager partition locks, this can significantly reduce blocking
contention on read-heavy workloads.
Patch by me. Review by Jeff Davis.
glibc renders random() thread-safe by wrapping a futex lock around it;
testing reveals that this limits the performance of pgbench on machines
with many CPU cores. Rather than switching to random_r(), which is
only available on GNU systems and crashes unless you use undocumented
alchemy to initialize the random state properly, switch to our built-in
implementation of erand48(), which is both thread-safe and concurrent.
Since the list of reasons not to use the operating system's erand48()
is getting rather long, rename ours to pg_erand48() (and similarly
for our implementations of lrand48() and srand48()) and just always
use those. We were already doing this on Cygwin anyway, and the
glibc implementation is not quite thread-safe, so pgbench wouldn't
be able to use that either.
Per discussion with Tom Lane.
This kluge was inserted in a spot apparently chosen at random: the lock
manager's state is not yet fully set up for the wait, and in particular
LockWaitCancel hasn't been armed by setting lockAwaited, so the ProcLock
will not get cleaned up if the ereport is thrown. This seems to not cause
any observable problem in trivial test cases, because LockReleaseAll will
silently clean up the debris; but I was able to cause failures with tests
involving subtransactions.
Fixes breakage induced by commit c85c941470.
Back-patch to all affected branches.
It was initialized in the wrong place and to the wrong value. With bad
luck this could result in incorrect query-cancellation failures in hot
standby sessions, should a HS backend be holding pin on buffer number 1
while trying to acquire a lock.
Testing shows that the overhead of acquiring and releasing
SInvalReadLock and msgNumLock on high-core count boxes can waste a lot
of CPU time and hurt performance. This patch adds a per-backend flag
that allows us to skip all that locking in most cases. Further
testing shows that this improves performance even when sinval traffic
is very high.
Patch by me. Review and testing by Noah Misch.
It turns out to be possible to link against a libxml2.so that does this
differently than the version we configured and built against, so we need
a runtime check to avoid bizarre behavior. Per report from Bernd Helmle.
Patch by Florian Pflug.
This disables an entirely unnecessary "sanity check" that causes failures
in nonblocking mode, because OpenSSL complains if we move or compact the
write buffer. The only actual requirement is that we not modify pending
data once we've attempted to send it, which we don't. Per testing and
research by Martin Pihlak, though this fix is a lot simpler than his patch.
I put the same change into the backend, although it's less clear whether
it's necessary there. We do use nonblock mode in some situations in
streaming replication, so seems best to keep the same behavior in the
backend as in libpq.
Back-patch to all supported releases.
The original implementation simply did nothing when replacing an existing
object during CREATE EXTENSION. The folly of this was exposed by a report
from Marc Munro: if the existing object belongs to another extension, we
are left in an inconsistent state. We should insist that the object does
not belong to another extension, and then add it to the current extension
if not already a member.
I broke this in commit 5da79169d3, which
was obviously insufficiently well tested. Add some regression tests
in the hope of making future slip-ups more likely to be noticed.
Previously, xpath() simply returned an empty array if the expression did
not yield a node set. This is useless for expressions that return scalars,
such as one with name() at the top level. Arrange to return the scalar
value as a single-element xml array, instead. (String values will be
suitably escaped.)
This change will also cause xpath_exists() to return true, not false,
for such expressions.
Florian Pflug, reviewed by Radoslaw Smogura
Without this it's possible for the output to not be legal XML, as
illustrated by the added regression test cases.
NB: this change will need to be called out as an incompatibility in the
9.2 release notes, since it's possible somebody was relying on the old
behavior, even though it's clearly wrong.
Florian Pflug, reviewed by Radoslaw Smogura
This requires a new shared catalog, pg_shseclabel.
Along the way, fix the security_label regression tests so that they
don't monkey with the labels of any pre-existing objects. This is
unlikely to matter in practice, since only the label for the "dummy"
provider was being manipulated. But this way still seems cleaner.
KaiGai Kohei, with fairly extensive hacking by me.
libxml reports some errors (like invalid xmlns attributes) via the error
handler hook, but still returns a success indicator to the library caller.
This causes us to miss some errors that are important to report. Since the
"generic" error handler hook doesn't know whether the message it's getting
is for an error, warning, or notice, stop using that and instead start
using the "structured" error handler hook, which gets enough information
to be useful.
While at it, arrange to save and restore the error handler hook setting in
each libxml-using function, rather than assuming we can set and forget the
hook. This should improve the odds of working nicely with third-party
libraries that also use libxml.
In passing, volatile-ize some local variables that get modified within
PG_TRY blocks. I noticed this while testing with an older gcc version
than I'd previously tried to compile xml.c with.
Florian Pflug and Tom Lane, with extensive review/testing by Noah Misch
Subtransaction locks now released en masse at main commit, rather than
repeatedly re-scanning for locks as we ascend the nested transaction tree.
Split transaction state TBLOCK_SUBEND into two states, TBLOCK_SUBCOMMIT
and TBLOCK_SUBRELEASE to allow the commit path to be optimised using
the existing code in ResourceOwnerRelease() which appears to have been
intended for this usage, judging from comments therein.
1. In GetLockStatusData, avoid initializing instance before we've ensured
that the array is large enough. Otherwise, if repalloc moves the block
around, we're hosed.
2. Add the word "Relation" to the name of some identifiers, to avoid
assuming that the fast-path mechanism will only ever apply to relations
(though these particular parts certainly will). Some of the macros
could possibly use similar treatment, but the names are getting awfully
long already.
3. Add a missing word to comment in AtPrepare_Locks().
Standby servers can now have WALSender processes, which can work with
either WALReceiver or archive_commands to pass data. Fully updated
docs, including new conceptual terms of sending server, upstream and
downstream servers. WALSenders terminated when promote to master.
Fujii Masao, review, rework and doc rewrite by Simon Riggs
This is more SQL-spec-compliant, more easily extensible, and better
performing than the old method of inventing special variables.
Pavel Stehule, reviewed by Shigeru Hanada and David Wheeler
When an AccessShareLock, RowShareLock, or RowExclusiveLock is requested
on an unshared database relation, and we can verify that no conflicting
locks can possibly be present, record the lock in a per-backend queue,
stored within the PGPROC, rather than in the primary lock table. This
eliminates a great deal of contention on the lock manager LWLocks.
This patch also refactors the interface between GetLockStatusData() and
pg_lock_status() to be a bit more abstract, so that we don't rely so
heavily on the lock manager's internal representation details. The new
fast path lock structures don't have a LOCK or PROCLOCK structure to
return, so we mustn't depend on that for purposes of listing outstanding
locks.
Review by Jeff Davis.
We already have similar functions for many other object types, including
operator classes, so it seems like we should have this one, too.
Extracted from a larger patch by Josh Kupershmidt
Move FileClose's decrement of temporary_files_size up, so that it will be
executed even if elog() throws an error. This is reasonable since if the
unlink() fails, the fact the file is still there is not our fault, and we
are going to forget about it anyhow. So we won't count it against
temp_file_limit anymore.
Update fileSize and temporary_files_size correctly in FileTruncate.
We probably don't have any places that truncate temp files, but fd.c
surely should not assume that.
If a Var was used only in a GROUP BY expression, the previous
implementation would include the Var by itself (as well as the expression)
in the generated targetlist. This wouldn't affect the efficiency of the
scan/join part of the plan at all, but it could result in passing
unnecessarily-wide rows through sorting and grouping steps. It turns out
to take only a little more code, and not noticeably more time, to generate
a tlist without such redundancy, so let's do that. Per a recent gripe from
HarmeekSingh Bedi.
There may be some other places where we should use errdetail_internal,
but they'll have to be evaluated case-by-case. This commit just hits
a bunch of places where invoking gettext is obviously a waste of cycles.
This function supports untranslated detail messages, in the same way that
errmsg_internal supports untranslated primary messages. We've needed this
for some time IMO, but discussion of some cases in the SSI code provided
the impetus to actually add it.
Kevin Grittner, with minor adjustments by me
This fixes SSPI login failures showing "The function
requested is not supported", often showing up when connecting
to localhost. The reason was not properly updating the SSPI
handle when multiple roundtrips were required to complete the
authentication sequence.
Report and analysis by Ahmed Shinwari, patch by Magnus Hagander
The commit action of temporary tables is currently not cataloged, so
we can't easily show it. The previous value was outdated from before
we had different commit actions.
GISTInsertStack.childoffnum used to mean "offset of the downlink in this
node, pointing to the child node in the stack". It's now replaced with
downlinkoffnum, which means "offset of the downlink in the parent of this
node". gistFindPath() already used childoffnum with this new meaning, and
had an extra step at the end to pull all the childoffnum values down one
node in the stack, to adjust the stack for the meaning that childoffnum had
elsewhere. That's no longer required.
The reason to do this now is this new representation is more convenient for
the GiST fast build patch that Alexander Korotkov is working on.
While we're at it, replace the linked list used in gistFindPath with a
standard List, and make gistFindPath() static.
Alexander Korotkov, with some changes by me.
First, when following a right-link, we incorrectly marked the current page
as the parent of the right sibling. In reality, the parent of the right page
is the same as the parent of the current page (or some page to the right of
it, gistFindCorrectParent() will sort that out).
Secondly, when we follow a right-link, we must prepend, not append, the right
page to our list of pages to visit. That's because we assume that once we
hit a leaf page in the list, all the rest are leaf pages too, and give up.
To hit these bugs, you need concurrent actions and several unlucky accidents.
Another backend must split the root page, while you're in process of
splitting a lower-level page. Furthermore, while you scan the internal nodes
to re-find the parent, another backend needs to again split some more internal
pages. Even then, the bugs don't necessarily manifest as user-visible errors
or index corruption.
While we're at it, make the error reporting a bit better if gistFindPath()
fails to re-find the parent. It used to be an assertion, but an elog() seems
more appropriate.
Backpatch to all supported branches.