The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
This bug appears to have been introduced late in the development of
48354581a4 ("Allow Pin/UnpinBuffer to operate in a lockfree
manner.").
Found while debugging a bug which turned out to be independent of the
commit mentioned above.
Backpatch: -
BRIN was relying on the ability to remove a tuple from an index page,
then putting another tuple in the same line pointer. But PageAddItem
refuses to add a tuple beyond the first free item past the last used
item, and in particular, it rejects an attempt to add an item to an
empty page anywhere other than the first line pointer. PageAddItem
issues a WARNING and indicates to the caller that it failed, which in
turn causes the BRIN calling code to issue a PANIC, so the whole
sequence looks like this:
WARNING: specified item offset is too large
PANIC: failed to add BRIN tuple
To fix, create a new function PageAddItemExtended which is like
PageAddItem except that the two boolean arguments become a flags bitmap;
the "overwrite" and "is_heap" boolean flags in PageAddItem become
PAI_OVERWITE and PAI_IS_HEAP flags in the new function, and a new flag
PAI_ALLOW_FAR_OFFSET enables the behavior required by BRIN.
PageAddItem() retains its original signature, for compatibility with
third-party modules (other callers in core code are not modified,
either).
Also, in the belt-and-suspenders spirit, I added a new sanity check in
brinGetTupleForHeapBlock to raise an error if an TID found in the revmap
is not marked as live by the page header. This causes it to react with
"ERROR: corrupted BRIN index" to the bug at hand, rather than a hard
crash.
Backpatch to 9.5.
Bug reported by Andreas Seltenreich as detected by his handy sqlsmith
fuzzer.
Discussion: https://www.postgresql.org/message-id/87mvni77jh.fsf@elite.ansel.ydns.eu
If both timeout indicators are set when we arrive at ProcessInterrupts,
we've historically just reported "lock timeout". However, some buildfarm
members have been observed to fail isolationtester's timeouts test by
reporting "lock timeout" when the statement timeout was expected to fire
first. The cause seems to be that the process is allowed to sleep longer
than expected (probably due to heavy machine load) so that the lock
timeout happens before we reach the point of reporting the error, and
then this arbitrary tiebreak rule does the wrong thing. We can improve
matters by comparing the scheduled timeout times to decide which error
to report.
I had originally proposed greatly reducing the 1-second window between
the two timeouts in the test cases. On reflection that is a bad idea,
at least for the case where the lock timeout is expected to fire first,
because that would assume that it takes negligible time to get from
statement start to the beginning of the lock wait. Thus, this patch
doesn't completely remove the risk of test failures on slow machines.
Empirically, however, the case this handles is the one we are seeing
in the buildfarm. The explanation may be that the other case requires
the scheduler to take the CPU away from a busy process, whereas the
case fixed here only requires the scheduler to not give the CPU back
right away to a process that has been woken from a multi-second sleep
(and, perhaps, has been swapped out meanwhile).
Back-patch to 9.3 where the isolationtester timeouts test was added.
Discussion: <8693.1464314819@sss.pgh.pa.us>
If RAW_EXPRESSION_COVERAGE_TEST is defined, do a no-op tree walk over
every basic DML statement submitted to parse analysis. If we'd had this
in place earlier, bug #14153 would have been caught by buildfarm testing.
The difficulty is that raw_expression_tree_walker() is only used in
limited cases involving CTEs (particularly recursive ones), so it's
very easy for an oversight in it to not be noticed during testing of a
seemingly-unrelated feature.
The type of error we can expect to catch with this is complete omission
of a node type from raw_expression_tree_walker(), and perhaps also
recursion into a field that doesn't contain a node tree, though that
would be an unlikely mistake. It won't catch failure to add new fields
that need to be recursed into, unfortunately.
I'll go enable this on one or two of my own buildfarm animals once
bug #14153 is dealt with.
Discussion: <27861.1464040417@sss.pgh.pa.us>
do_text_output_multiline() would fail (typically with a null pointer
dereference crash) if its input string did not end with a newline. Such
cases do not arise in our current sources; but it certainly could happen
in future, or in extension code's usage of the function, so we should fix
it. To fix, replace "eol += len" with "eol = text + len".
While at it, make two cosmetic improvements: mark the input string const,
and rename the argument from "text" to "txt" to dodge pgindent strangeness
(since "text" is a typedef name).
Even though this problem is only latent at present, it seems like a good
idea to back-patch the fix, since it's a very simple/safe patch and it's
not out of the realm of possibility that we might in future back-patch
something that expects sane behavior from do_text_output_multiline().
Per report from Hao Lee.
Report: <CAGoxFiFPAGyPAJLcFxTB5cGhTW2yOVBDYeqDugYwV4dEd1L_Ag@mail.gmail.com>
This was overlooked in commit 473b93287, which introduced DROP ACCESS
METHOD. Although that command is restricted to superusers, we don't want
even superusers dropping the built-in methods; "DROP ACCESS METHOD btree"
in particular is unrecoverable from. Pin these objects in the same way
that other initdb-created objects are pinned.
I chose to bump catversion for this fix. That's not absolutely necessary
perhaps, but it will ensure that no 9.6 production systems are missing
the pin entries.
This patch essentially reverts commit 4c6780fd17, in favor of a much
simpler solution for the case where the new cluster would choose to create
a TOAST table but the old cluster doesn't have one: just don't create a
TOAST table.
The existing code failed in at least two different ways if the situation
arose: (1) ALTER TABLE RESET didn't grab an exclusive lock, so that the
lock sanity check in create_toast_table failed; (2) pg_upgrade did not
provide a pg_type OID for the new toast table, so that the crosscheck in
TypeCreate failed. While both these problems were introduced by later
patches, they show that the hack being used to cause TOAST table creation
is overwhelmingly fragile (and untested). I also note that before the
TypeCreate crosscheck was added, the code would have resulted in assigning
an indeterminate pg_type OID to the toast table, possibly causing a later
OID conflict in that catalog; so that it didn't really work even when
committed.
If we simply don't create a TOAST table, there will only be a problem if
the code tries to store a tuple that's wider than a page, and field
compression isn't sufficient to get it under a page. Given that the TOAST
creation threshold is intended to be about a quarter of a page, it's very
hard to believe that cross-version differences in the do-we-need-a-toast-
table heuristic could result in an observable problem. So let's just
follow the old version's conclusion about whether a TOAST table is needed.
(If we ever do change needs_toast_table() so much that this conclusion
doesn't apply, we can devise a solution at that time, and hopefully do
it in a less klugy way than 4c6780fd17 did.)
Back-patch to 9.3, like the previous patch.
Discussion: <8110.1462291671@sss.pgh.pa.us>
Hash indexes are not WAL-logged, and so do not maintain the LSN of
index pages. Since the "snapshot too old" feature counts on
detecting error conditions using the LSN of a table and all indexes
on it, this makes it impossible to safely do early vacuuming on any
table with a hash index, so add this to the tests for whether the
xid used to vacuum a table can be adjusted based on
old_snapshot_threshold.
While at it, add a paragraph to the docs for old_snapshot_threshold
which specifically mentions this and other aspects of the feature
which may otherwise surprise users.
Problem reported and patch reviewed by Amit Kapila
Move fmtReloptionsArray() from pg_dump.c to string_utils.c so that it
is available to other frontend code. In particular psql's \ev and \sv
commands need it to handle view reloptions. Also rename the function
to appendReloptionsArray(), which is a more accurate description of
what it does.
Author: Dean Rasheed
Reviewed-by: Peter Eisentraut
Discussion: http://www.postgresql.org/message-id/CAEZATCWZjCgKRyM-agE0p8ax15j9uyQoF=qew7D2xB6cF76T8A@mail.gmail.com
The similarity of the original names to SQL keywords seems like a bad
idea. Rename them before we're stuck with 'em forever.
In passing, minor code and docs cleanup.
Discussion: <4875.1462210058@sss.pgh.pa.us>
This reverts commits f07d18b6e9, 82c83b3372, 3a3b309041, and
24c5f1a103.
This feature has shown enough immaturity that it was deemed better to
rip it out before rushing some more fixes at the last minute. There are
discussions on larger changes in this area for the next release.
Conversion functions were previously marked as parallel-unsafe, since
that is the default, but in fact they are safe. Parallel-safe
functions defined in pg_proc.h and redefined in system_views.sql were
ending up as parallel-unsafe because the redeclarations were not
marked PARALLEL SAFE. While editing system_views.sql, mark ts_debug()
parallel safe also.
Andreas Karlsson
Back in 3b02ea4f07 I added some comments in various places to explain
how logical decoding and other things worked. Not all of the changes
were welcome, because they were misleading or wrong. This changes them
a little bit to make them more accurate.
Some other comments are also changed to be more accurate. Also, fix a
bunch of typos.
Author: Álvaro Herrera, Craig Ringer
Andres Freund reviewed some parts of this.
Commit 7117685461 made pg_start_backup
parallel-restricted rather than parallel-safe, because it now relies
on backend-private state that won't be synchronized with the parallel
worker. However, it didn't update pg_proc.h. Separately, Andreas
Karlsson observed that system_views.sql neglected to reiterate the
parallel-safety markings whe redefining various functions, including
this one; so add a PARALLEL RESTRICTED declaration there to match
the new value in pg_proc.h.
Given a three-or-more-way equivalence class, such as X.Y = Y.Y = Z.Z,
it was possible for the planner to omit one of the quals needed to
enforce that all members of the equivalence class are actually equal.
This only happened in the case of a parameterized join node for two
of the relations, that is a plan tree like
Nested Loop
-> Scan X
-> Nested Loop
-> Scan Y
-> Scan Z
Filter: Z.Z = X.X
The eclass machinery normally expects to apply X.X = Y.Y when those
two relations are joined, but in this shape of plan tree they aren't
joined until the top node --- and, if the lower nested loop is marked
as parameterized by X, the top node will assume that the relevant eclass
condition(s) got pushed down into the lower node. On the other hand,
the scan of Z assumes that it's only responsible for constraining Z.Z
to match any one of the other eclass members. So one or another of
the required quals sometimes fell between the cracks, depending on
whether consideration of the eclass in get_joinrel_parampathinfo()
for the lower nested loop chanced to generate X.X = Y.Y or X.X = Z.Z
as the appropriate constraint there. If it generated the latter,
it'd erroneously suppose that the Z scan would take care of matters.
To fix, force X.X = Y.Y to be generated and applied at that join node
when this case occurs.
This is *extremely* hard to hit in practice, because various planner
behaviors conspire to mask the problem; starting with the fact that the
planner doesn't really like to generate a parameterized plan of the
above shape. (It might have been impossible to hit it before we
tweaked things to allow this plan shape for star-schema cases.) Many
thanks to Alexander Kirkouski for submitting a reproducible test case.
The bug can be demonstrated in all branches back to 9.2 where parameterized
paths were introduced, so back-patch that far.
Without a few entries beyond old_snapshot_threshold, the lookup
would often fail, resulting in the more aggressive pruning or
vacuum being skipped often enough to matter. This was very clearly
shown by a python test script posted by Ants Aasma, and was likely
a factor in an earlier but somewhat less clear-cut test case posted
by Jeff Janes.
This patch makes no change to the logic, per se -- it just makes
the array of mapping entries big enough to make lookup misses based
on timing much less likely. An occasional miss is still possible
if a thread stalls for more than 10 minutes, but that does not
create any problem with correctness of behavior. Besides, if
things are so busy that a thread is stalling for more than 10
minutes, it is probably OK to skip the more aggressive cleanup at
that particular point in time.
Adjust the way we detect the locale. As a result the minumum Windows
version supported by VS2015 and later is Windows Vista. Add some tweaks
to remove new compiler warnings. Remove documentation references to the
now obsolete msysGit.
Michael Paquier, somewhat edited by me, reviewed by Christian Ullrich.
Backpatch to 9.5
Commit 23a41573c attempted to fix the DatumGetBool macro to ignore bits
in a Datum that are to the left of the actual bool value. But it did that
by casting the Datum to bool; and on compilers that use C99 semantics for
bool, that ends up being a whole-word test, not a 1-byte test. This seems
to be the true explanation for contrib/seg failing in VS2015. To fix, use
GET_1_BYTE() explicitly. I think in the previous patch, I'd had some idea
of not having to commit to bool being exactly 1 byte wide, but regardless
of what the compiler's bool is, boolean columns and Datums are certainly
1 byte wide.
The previous fix was (eventually) back-patched into all active versions,
so do likewise with this one.
Previously, ginInsertCleanup could exit early if it detects that someone else
is cleaning up the pending list, without waiting for that someone else to
finish the job. But in this case vacuum could miss tuples to be deleted.
Cleanup process now locks metapage with a help of heavyweight
LockPage(ExclusiveLock), and it guarantees that there is no another cleanup
process at the same time. Lock is taken differently depending on caller of
cleanup process: any vacuums and gin_clean_pending_list() will be blocked
until lock becomes available, ordinary insert uses conditional lock to
prevent indefinite waiting on lock.
Insert into pending list doesn't use this lock, so insertion isn't blocked.
Also, patch adds stopping of cleanup process when at-start-cleanup-tail is
reached in order to prevent infinite cleanup in case of massive insertion. But
it will stop only for automatic maintenance tasks like autovacuum.
Patch introduces choice of limit of memory to use: autovacuum_work_mem,
maintenance_work_mem or work_mem depending on call path.
Patch for previous releases should be reworked due to changes between 9.6 and
previous ones in this area.
Discover and diagnostics by Jeff Janes and Tomas Vondra
Patch by me with some ideas of Jeff Janes
Commit 989be0810d added a flex/bison lexer/parser to interpret
synchronous_standby_names. It was done in a pretty crufty way, though,
making assorted end-use sites responsible for calling the parser at the
right times. That was not only vulnerable to errors of omission, but made
it possible that lexer/parser errors occur at very undesirable times,
and created memory leakages even if there was no error.
Instead, perform the parsing once during check_synchronous_standby_names
and let guc.c manage the resulting data. To do that, we have to flatten
the parsed representation into a single hunk of malloc'd memory, but that
is not very hard.
While at it, work a little harder on making useful error reports for
parsing problems; the previous code felt that "synchronous_standby_names
parser returned 1" was an appropriate user-facing error message. (To
be fair, it did also log a syntax error message, but separately from the
GUC problem report, which is at best confusing.) It had some outright
bugs in the face of invalid input, too.
I (tgl) also concluded that we need to restrict unquoted names in
synchronous_standby_names to be just SQL identifiers. The previous coding
would accept darn near anything, which (1) makes the quoting convention
both nearly-unnecessary and formally ambiguous, (2) makes it very hard to
understand what is a syntax error and what is a creative interpretation of
the input as a standby name, and (3) makes it impossible to further extend
the syntax in future without a compatibility break. I presume that we're
intending future extensions of the syntax, else this parsing infrastructure
is massive overkill, so (3) is an important objection. Since we've taken
a compatibility hit for non-identifier names with this change anyway, we
might as well lock things down now and insist that users use double quotes
for standby names that aren't identifiers.
Kyotaro Horiguchi and Tom Lane
The way that PartialAggregate and FinalizeAggregate plan nodes were
displaying output columns before was bogus. Now, FinalizeAggregate
produces the same outputs as an Aggregate would have produced, while
PartialAggregate produces each of those outputs prefixed by the word
PARTIAL.
Discussion: 12585.1460737650@sss.pgh.pa.us
Patch by me, reviewed by David Rowley.
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.
Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.
The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.
Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.
XXX:
Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.comCAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
pg_atomic_compare_exchange_*_impl() were providing only the semantics of
an acquire barrier. Buildfarm members hornet and mandrill revealed this
deficit beginning with commit 008608b9d5.
While we have no report of symptoms in 9.5, we can't rule out the
possibility of certain compilers, hardware, or extension code relying on
these functions' specified barrier semantics. Back-patch to 9.5, where
commit b64d92f1a5 introduced atomics.
Reviewed by Andres Freund.
The coverage was rather lean for cases that bind() or listen() might
return. Add entries for everything that there's a direct equivalent
for in the set of Unix errnos that elog.c has heard of.
There's no longer a need for the MSVC-version-specific code stanza that
forcibly redefines errno code symbols, because since commit 73838b52 we're
unconditionally redefining them in the stanza before this one anyway.
Now it's merely confusing and ugly, so get rid of it; and improve the
comment that explains what's going on here.
Although this is just cosmetic, back-patch anyway since I'm intending
to back-patch some less-cosmetic changes in this same hunk of code.
Revert commit 7cb1db1d95, which represented
a misunderstanding of the problem (if snapmgr.h weren't already included
in bufmgr.h, things wouldn't compile anywhere). Instead install what
I think is the real fix.
Even with old_snapshot_threshold = -1 (which disables the "snapshot
too old" feature), performance regressions were seen at moderate to
high concurrency. For example, a one-socket, four-core system
running 200 connections at saturation could see up to a 2.3%
regression, with larger regressions possible on NUMA machines.
By inlining the early (smaller, faster) tests in the
TestForOldSnapshot() function, the i7 case dropped to a 0.2%
regression, which could easily just be noise, and is clearly an
improvement. Further testing will show whether more is needed.
Commit 36a35c550a turned the interface between ginPlaceToPage and
its subroutines in gindatapage.c and ginentrypage.c into a royal mess:
page-update critical sections were started in one place and finished in
another place not even in the same file, and the very same subroutine
might return having started a critical section or not. Subsequent patches
band-aided over some of the problems with this design by making things
even messier.
One user-visible resulting problem is memory leaks caused by the need for
the subroutines to allocate storage that would survive until ginPlaceToPage
calls XLogInsert (as reported by Julien Rouhaud). This would not typically
be noticeable during retail index updates. It could be visible in a GIN
index build, in the form of memory consumption swelling to several times
the commanded maintenance_work_mem.
Another rather nasty problem is that in the internal-page-splitting code
path, we would clear the child page's GIN_INCOMPLETE_SPLIT flag well before
entering the critical section that it's supposed to be cleared in; a
failure in between would leave the index in a corrupt state. There were
also assorted coding-rule violations with little immediate consequence but
possible long-term hazards, such as beginning an XLogInsert sequence before
entering a critical section, or calling elog(DEBUG) inside a critical
section.
To fix, redefine the API between ginPlaceToPage() and its subroutines
by splitting the subroutines into two parts. The "beginPlaceToPage"
subroutine does what can be done outside a critical section, including
full computation of the result pages into temporary storage when we're
going to split the target page. The "execPlaceToPage" subroutine is called
within a critical section established by ginPlaceToPage(), and it handles
the actual page update in the non-split code path. The critical section,
as well as the XLOG insertion call sequence, are both now always started
and finished in ginPlaceToPage(). Also, make ginPlaceToPage() create and
work in a short-lived memory context to eliminate the leakage problem.
(Since a short-lived memory context had been getting created in the most
common code path in the subroutines, this shouldn't cause any noticeable
performance penalty; we're just moving the overhead up one call level.)
In passing, fix a bunch of comments that had gone unmaintained throughout
all this klugery.
Report: <571276DD.5050303@dalibo.com>
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature. Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming). The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.
This change should have little or no effect on generated executable
code.
Motivated by the back-patching pain of Tom Lane and Robert Haas
Per discussion, there doesn't seem to be much value in having
NUM_SPINLOCK_SEMAPHORES set to 1024: under any scenario where you are
running more than a few backends concurrently, you really had better have a
real spinlock implementation if you want tolerable performance. And 1024
semaphores is a sizable fraction of the system-wide SysV semaphore limit
on many platforms. Therefore, reduce this setting's default value to 128
to make it less likely to cause out-of-semaphores problems.
The previous display was sort of confusing, because it didn't
distinguish between the number of workers that we planned to launch
and the number that actually got launched. This has already confused
several people, so display both numbers and label them clearly.
Julien Rouhaud, reviewed by me.
pg_xlogdump includes bufmgr.h. With a compiler that emits code for
static inline functions even when they're unreferenced, that leads
to unresolved external references in the new static-inline version
of BufferGetPage(). So hide it with #ifndef FRONTEND, as we've done
for similar issues elsewhere. Per buildfarm member pademelon.
My previous attempt at doing so, in 80abbeba23, was not sufficient. While that
fixed the problem for bufmgr.c and lwlock.c , s_lock.c still has non-constant
expressions in the struct initializer, because the file/line/function
information comes from the caller of s_lock().
Give up on using a macro, and use a static inline instead.
Discussion: 4369.1460435533@sss.pgh.pa.us
Commit 314cbfc5da redefined the signature of this hook as
typedef int (*walrcv_receive_type) (char **buffer, int *wait_fd);
But in fact the type of the "wait_fd" variable ought to be pgsocket,
which is what WaitLatchOrSocket expects, and which is necessary if
we want to be able to assign PGINVALID_SOCKET to it on Windows.
So fix that.
Logical messages, added in 3fe3511d05, during decoding failed to filter
messages emitted in other databases and messages emitted "under" a
replication origin the output plugin isn't interested in.
Add tests to verify that both types of filtering actually work. While
touching message.sql remove hunk obsoleted by d25379e.
Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because
3fe3511d05 had omitted doing so. 3fe3511d05 additionally didn't bump
catversion, but 7a542700d has done so since.
Author: Petr Jelinek
Reported-By: Andres Freund
Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de
The current definition of init_spin_delay (introduced recently in
48354581a) wasn't C89 compliant. It's not legal to refer to refer to
non-constant expressions, and the ptr argument was one. This, as
reported by Tom, lead to a failure on buildfarm animal pademelon.
The pointer, especially on system systems with ASLR, isn't super helpful
anyway, though. So instead of making init_spin_delay into an inline
function, make s_lock_stuck() report the function name in addition to
file:line and change init_spin_delay() accordingly. While not a direct
replacement, the function name is likely more useful anyway (line
numbers are often hard to interpret in third party reports).
This also fixes what file/line number is reported for waits via
s_lock().
As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h.
Reported-By: Tom Lane
Discussion: 4369.1460435533@sss.pgh.pa.us
The recent patch to make Pin/UnpinBuffer lockfree in the hot
path (48354581a), accidentally used pg_atomic_fetch_or_u32() in
MarkLocalBufferDirty(). Other code operating on local buffers was
careful to only use pg_atomic_read/write_u32 which just read/write from
memory; to avoid unnecessary overhead.
On its own that'd just make MarkLocalBufferDirty() slightly less
efficient, but in addition InitLocalBuffers() doesn't call
pg_atomic_init_u32() - thus the spinlock fallback for the atomic
operations isn't initialized. That in turn caused, as reported by Tom,
buildfarm animal gaur to fail. As those errors are actually useful
against this type of error, continue to omit - intentionally this time -
initialization of the atomic variable.
In addition, add an explicit note about only using pg_atomic_read/write
on local buffers's state to BufferDesc's description.
Reported-By: Tom Lane
Discussion: 1881.1460431476@sss.pgh.pa.us
It's silly to define these counts as narrower than they might someday
need to be. Also, I believe that the BLCKSZ * nflush calculation in
mdwriteback was capable of overflowing an int.
I've seen one too many "could not bind IPv4 socket: No error" log entries
from the Windows buildfarm members. Per previous discussion, this is
likely caused by the fact that we're doing nothing to translate
WSAGetLastError() to errno. Put in a wrapper layer to do that.
If this works as expected, it should get back-patched, but let's see what
happens in the buildfarm first.
Discussion: <4065.1452450340@sss.pgh.pa.us>
The original patch kind of ignored the fact that we were doing something
different from a costing point of view, but nobody noticed. This patch
fixes that oversight.
David Rowley
Per discussion, this gives potential users of the hook more flexibility,
because they can build custom Paths that implement only one stage of
upper processing atop core-provided Paths for earlier stages.
Rename this function to GenericXLogRegisterBuffer() to make it clearer
what it does, and leave room for other sorts of "register" actions in
future. Also, replace its "bool isNew" argument with an integer flags
argument, so as to allow adding more flags in future without an API
break.
Alexander Korotkov, adjusted slightly by me
I was initially concerned that the some of the hundreds of
references to BufferGetPage() where the literal
BGP_NO_SNAPSHOT_TEST were passed might not optimize as well as a
macro, leading to some hard-to-find performance regressions in
corner cases. Inspection of disassembled code has shown identical
code at all inspected locations, and the size difference doesn't
amount to even one byte per such call. So make it readable.
Per gripes from Álvaro Herrera and Tom Lane
Previously we used a spinlock, in adition to the atomically manipulated
->state field, to protect the wait queue. But it's pretty simple to
instead perform the locking using a flag in state.
Due to 6150a1b0 BufferDescs, on platforms (like PPC) with > 1 byte
spinlocks, increased their size above 64byte. As 64 bytes are the size
we pad allocated BufferDescs to, this can increase false sharing;
causing performance problems in turn. Together with the previous commit
this reduces the size to <= 64 bytes on all common platforms.
Author: Andres Freund
Discussion: CAA4eK1+ZeB8PMwwktf+3bRS0Pt4Ux6Rs6Aom0uip8c6shJWmyg@mail.gmail.com20160327121858.zrmrjegmji2ymnvr@alap3.anarazel.de
Pinning/Unpinning a buffer is a very frequent operation; especially in
read-mostly cache resident workloads. Benchmarking shows that in various
scenarios the spinlock protecting a buffer header's state becomes a
significant bottleneck. The problem can be reproduced with pgbench -S on
larger machines, but can be considerably worse for queries which touch
the same buffers over and over at a high frequency (e.g. nested loops
over a small inner table).
To allow atomic operations to be used, cram BufferDesc's flags,
usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable;
that allows to manipulate them together using 32bit compare-and-swap
operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could
be lifted by using a 64bit field, but it's not a realistic configuration
atm).
As not all operations can easily implemented in a lockfree manner,
implement the previous buf_hdr_lock via a flag bit in the atomic
variable. That way we can continue to lock the header in places where
it's needed, but can get away without acquiring it in the more frequent
hot-paths. There's some additional operations which can be done without
the lock, but aren't in this patch; but the most important places are
covered.
As bufmgr.c now essentially re-implements spinlocks, abstract the delay
logic from s_lock.c into something more generic. It now has already two
users, and more are coming up; there's a follupw patch for lwlock.c at
least.
This patch is based on a proof-of-concept written by me, which Alexander
Korotkov made into a fully working patch; the committed version is again
revised by me. Benchmarking and testing has, amongst others, been
provided by Dilip Kumar, Alexander Korotkov, Robert Haas.
On a large x86 system improvements for readonly pgbench, with a high
client count, of a factor of 8 have been observed.
Author: Alexander Korotkov and Andres Freund
Discussion: 2400449.GjM57CE0Yg@dinodell
This routine is unsafe as implemented, because it invalidates the page
image pointers returned by previous GenericXLogRegister() calls.
Rather than complicate the API or the implementation to avoid that,
let's just get rid of it; the use-case for having it seems much
too thin to justify a lot of work here.
While at it, do some wordsmithing on the SGML docs for generic WAL.
\crosstabview is a completely different way to display results from a
query: instead of a vertical display of rows, the data values are placed
in a grid where the column and row headers come from the data itself,
similar to a spreadsheet.
The sort order of the horizontal header can be specified by using
another column in the query, and the vertical header determines its
ordering from the order in which they appear in the query.
This only allows displaying a single value in each cell. If more than
one value correspond to the same cell, an error is thrown. Merging of
values can be done in the query itself, if necessary. This may be
revisited in the future.
Author: Daniel Verité
Reviewed-by: Pavel Stehule, Dean Rasheed
Previously bcac23d exposed a subset of support functions, namely the
ones Kaigai found useful. In
20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de I mentioned that
there's some functions missing to use the facility in an external
project.
To avoid having to add functions piecemeal, add all the functions which
are used to define READ_* and WRITE_* macros; users of the extensible
node functionality are likely to need these. Additionally expose
outDatum(), which doesn't have it's own WRITE_ macro, as it needs
information from the embedding struct.
Discussion: 20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de
This creates an initial set of default roles which administrators may
use to grant access to, historically, superuser-only functions. Using
these roles instead of granting superuser access reduces the number of
superuser roles required for a system. Documention for each of the
default roles has been added to user-manag.sgml.
Bump catversion to 201604082, as we had a commit that bumped it to
201604081 and another that set it back to 201604071...
Reviews by José Luis Tallón and Robert Haas
This will prevent users from creating roles which begin with "pg_" and
will check for those roles before allowing an upgrade using pg_upgrade.
This will allow for default roles to be provided at initdb time.
Reviews by José Luis Tallón and Robert Haas
This feature is controlled by a new old_snapshot_threshold GUC. A
value of -1 disables the feature, and that is the default. The
value of 0 is just intended for testing. Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect. The xmin associated with a transaction ID does
still protect dead tuples. A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.
This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in. It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point. This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third. The follow-on patch will change the places where the test
needs to be made.
These parameters are available for SSPI authentication only, to make
it possible to make it behave more like "normal gssapi", while
making it possible to maintain compatibility.
compat_realm is on by default, but can be turned off to make the
authentication use the full Kerberos realm instead of the NetBIOS name.
upn_username is off by default, and can be turned on to return the users
Kerberos UPN rather than the SAM-compatible name (a user in Active
Directory can have both a legacy SAM-compatible username and a new
Kerberos one. Normally they are the same, but not always)
Author: Christian Ullrich
Reviewed by: Robbie Harwood, Alvaro Herrera, me
Create a "bsd" auth method that works the same as "password" so far as
clients are concerned, but calls the BSD Authentication service to
check the password. This is currently only available on OpenBSD.
Marisa Emerson, reviewed by Thomas Munro
This allows parallel aggregation to use them. It may seem surprising
that we use float8_combine for both float4_accum and float8_accum
transition functions, but that's because those functions differ only
in the type of the non-transition-state argument.
Haribabu Kommi, reviewed by David Rowley and Tomas Vondra
As noticed by Tom Lane changing operation's number in commit
bb140506df causes on-disk format incompatibility.
Revert to previous numbering, that is reason to add special array to store
priorities of operation. Also it reverts order of tsquery to previous.
Author: Dmitry Ivanov
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.
Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
The code that estimates what parallel degree should be uesd for the
scan of a relation is currently rather stupid, so add a parallel_degree
reloption that can be used to override the planner's rather limited
judgement.
Julien Rouhaud, reviewed by David Rowley, James Sewell, Amit Kapila,
and me. Some further hacking by me.
The PAM_RHOST item is set to the remote IP address or host name and can
be used by PAM modules. A pg_hba.conf option is provided to choose
between IP address and resolved host name.
From: Grzegorz Sampolski <grzsmp@gmail.com>
Reviewed-by: Haribabu Kommi <kommi.haribabu@gmail.com>
We still use replacement selection for the first run of the sort only
and only when the number of tuples is relatively small. Otherwise,
the first run, and subsequent runs in all cases, are produced using
quicksort. This tends to be faster except perhaps for very small
amounts of working memory.
Peter Geoghegan, reviewed by Tomas Vondra, Jeff Janes, Mithun Cy,
Greg Stark, and me.
Contention on the relation extension lock can become quite fierce when
multiple processes are inserting data into the same relation at the same
time at a high rate. Experimentation shows the extending the relation
multiple blocks at a time improves scalability.
Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.
In cases where joins use multiple columns we currently assess each join
separately causing gross mis-estimates for join cardinality.
This patch adds use of FK information for the first time into the
planner. When FKs are present and we have multi-column join information,
plan estimates will be drastically improved. Cases with multiple FKs
are handled, though partial matches are ignored currently.
Net effect is substantial performance improvements for joins in many
common cases. Additional planning time is isolated to cases that are
currently performing poorly, measured at 0.08 - 0.15 ms.
Please watch for planner performance regressions; circumstances seem
unlikely but the law of unintended consequences may apply somewhen.
Additional complex tests welcome to prove this before release.
Tests can be performed using SET enable_fkey_estimates = on | off
using scripts provided during Hackers discussions, message id:
552335D9.3090707@2ndquadrant.com
Authors: Tomas Vondra and David Rowley
Reviewed and tested by Simon Riggs, adding comments only
Patch introduces new text search operator (<-> or <DISTANCE>) into tsquery.
On-disk and binary in/out format of tsquery are backward compatible.
It has two side effect:
- change order for tsquery, so, users, who has a btree index over tsquery,
should reindex it
- less number of parenthesis in tsquery output, and tsquery becomes more
readable
Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov
Reviewers: Alexander Korotkov, Artur Zakirov
Now that all of the infrastructure exists, add in the ability to
dump out the ACLs of the objects inside of pg_catalog or the ACLs
for objects which are members of extensions, but only if they have
been changed from their original values.
The original values are tracked in pg_init_privs. When pg_dump'ing
9.6-and-above databases, we will dump out the ACLs for all objects
in pg_catalog and the ACLs for all extension members, where the ACL
has been changed from the original value which was set during either
initdb or CREATE EXTENSION.
This should not change dumps against pre-9.6 databases.
Reviews by Alexander Korotkov, Jose Luis Tallon
This new catalog holds the privileges which the system was
initialized with at initdb time, along with any permissions set
by extensions at CREATE EXTENSION time. This allows pg_dump
(and any other similar use-cases) to detect when the privileges
set on initdb-created or extension-created objects have been
changed from what they were set to at initdb/extension-creation
time and handle those changes appropriately.
Reviews by Alexander Korotkov, Jose Luis Tallon
It inserts a new value into an jsonb array at arbitrary position or
a new key to jsonb object.
Author: Dmitry Dolgov
Reviewers: Petr Jelinek, Vitaly Burovoy, Andrew Dunstan
API and mechanism to allow generic messages to be inserted into WAL that are
intended to be read by logical decoding plugins. This commit adds an optional
new callback to the logical decoding API.
Messages are either text or bytea. Messages can be transactional, or not, and
are identified by a prefix to allow multiple concurrent decoding plugins.
(Not to be confused with Generic WAL records, which are intended to allow crash
recovery of extensible objects.)
Author: Petr Jelinek and Andres Freund
Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs
Discussion: 5685F999.6010202@2ndquadrant.com
Previously synchronous replication offered only the ability to confirm
that all changes made by a transaction had been transferred to at most
one synchronous standby server.
This commit extends synchronous replication so that it supports multiple
synchronous standby servers. It enables users to consider one or more
standby servers as synchronous, and increase the level of transaction
durability by ensuring that transaction commits wait for replies from
all of those synchronous standbys.
Multiple synchronous standby servers are configured in
synchronous_standby_names which is extended to support new syntax of
'num_sync ( standby_name [ , ... ] )', where num_sync specifies
the number of synchronous standbys that transaction commits need to
wait for replies from and standby_name is the name of a standby
server.
The syntax of 'standby_name [ , ... ]' which was used in 9.5 or before
is also still supported. It's the same as new syntax with num_sync=1.
This commit doesn't include "quorum commit" feature which was discussed
in pgsql-hackers. Synchronous standbys are chosen based on their priorities.
synchronous_standby_names determines the priority of each standby for
being chosen as a synchronous standby. The standbys whose names appear
earlier in the list are given higher priority and will be considered as
synchronous. Other standby servers appearing later in this list
represent potential synchronous standbys.
The regression test for multiple synchronous standbys is not included
in this commit. It should come later.
Authors: Sawada Masahiko, Beena Emerson, Michael Paquier, Fujii Masao
Reviewed-By: Kyotaro Horiguchi, Amit Kapila, Robert Haas, Simon Riggs,
Amit Langote, Thomas Munro, Sameer Thakur, Suraj Kharage, Abhijit Menon-Sen,
Rajeev Rastogi
Many thanks to the various individuals who were involved in
discussing and developing this feature.
This introduces a new dependency type which marks an object as depending
on an extension, such that if the extension is dropped, the object
automatically goes away; and also, if the database is dumped, the object
is included in the dump output. Currently the grammar supports this for
indexes, triggers, materialized views and functions only, although the
utility code is generic so adding support for more object types is a
matter of touching the parser rules only.
Author: Abhijit Menon-Sen
Reviewed-by: Alexander Korotkov, Álvaro Herrera
Discussion: http://www.postgresql.org/message-id/20160115062649.GA5068@toroid.org
has_parallel_hazard() was ignoring the proparallel markings for
aggregates, which is no good. Fix that. There was no way to mark
an aggregate as actually being parallel-safe, either, so add a
PARALLEL option to CREATE AGGREGATE.
Patch by me, reviewed by David Rowley.
This lets us use parallel aggregate for a variety of useful cases
that didn't work before, like sum(int8), sum(numeric), several
versions of avg(), and various other functions.
Add some regression tests, as well, testing the general sanity of
these and future catalog entries.
David Rowley, reviewed by Tomas Vondra, with a few further changes
by me.
Previously non-exclusive backups had to be done using the replication protocol
and pg_basebackup. With this commit it's now possible to make them using
pg_start_backup/pg_stop_backup as well, as long as the backup program can
maintain a persistent connection to the database.
Doing this, backup_label and tablespace_map are returned as results from
pg_stop_backup() instead of being written to the data directory. This makes
the server safe from a crash during an ongoing backup, which can be a problem
with exclusive backups.
The old syntax of the functions remain and work exactly as before, but since the
new syntax is safer this should eventually be deprecated and removed.
Only reference documentation is included. The main section on backup still needs
to be rewritten to cover this, but since that is already scheduled for a separate
large rewrite, it's not included in this patch.
Reviewed by David Steele and Amit Kapila
This elevel is useful for logging audit messages and similar information
that should not be passed to the client. It's equivalent to LOG in terms
of decisions about logging priority in the postmaster log, but messages
with this elevel will never be sent to the client.
In the current implementation, it's just an alias for the longstanding
COMMERROR elevel (or more accurately, we've made COMMERROR an alias for
this). At some point it might be interesting to allow a LOG_ONLY flag to
be attached to any elevel, but that would be considerably more complicated,
and it's not clear there's enough use-cases to justify the extra work.
For now, let's just take the easy 90% solution.
David Steele, reviewed by Fabien Coelho, Petr Jelínek, and myself
This interface is designed to give an access to WAL for extensions which
could implement new access method, for example. Previously it was
impossible because restoring from custom WAL would need to access system
catalog to find a redo custom function. This patch suggests generic way
to describe changes on page with standart layout.
Bump XLOG_PAGE_MAGIC because of new record type.
Author: Alexander Korotkov with a help of Petr Jelinek, Markus Nullmeier and
minor editorization by my
Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby,
Michael Paquier
Previously, the planner would reject an index-only scan if any restriction
clause for its table used a column not available from the index, even
if that restriction clause would later be dropped from the plan entirely
because it's implied by the index's predicate. This is a fairly common
situation for partial indexes because predicates using columns not included
in the index are often the most useful kind of predicate, and we have to
duplicate (or at least imply) the predicate in the WHERE clause in order
to get the index to be considered at all. So index-only scans were
essentially unavailable with such partial indexes.
To fix, we have to do detection of implied-by-predicate clauses much
earlier in the planner. This patch puts it in check_index_predicates
(nee check_partial_indexes), meaning it gets done for every partial index,
whereas we previously only considered this issue at createplan time,
so that the work was only done for an index actually selected for use.
That could result in a noticeable planning slowdown for queries against
tables with many partial indexes. However, testing suggested that there
isn't really a significant cost, especially not with reasonable numbers
of partial indexes. We do get a small additional benefit, which is that
cost_index is more accurate since it correctly discounts the evaluation
cost of clauses that will be removed. We can also avoid considering such
clauses as potential indexquals, which saves useless matching cycles in
the case where the predicate columns aren't in the index, and prevents
generating bogus plans that double-count the clause's selectivity when
the columns are in the index.
Tomas Vondra and Kyotaro Horiguchi, reviewed by Kevin Grittner and
Konstantin Knizhnik, and whacked around a little by me
When decoding from a logical slot, it's necessary for xlog reading to be
able to read xlog from historical (i.e. not current) timelines;
otherwise, decoding fails after failover, because the archives are in
the historical timeline. This is required to make "failover logical
slots" possible; it currently has no other use, although theoretically
it could be used by an extension that creates a slot on a standby and
continues to replay from the slot when the standby is promoted.
This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.
Author: Craig Ringer
Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
Some minor tweaks and comment additions, for cleanliness sake and to
avoid having the upcoming timeline-following patch be polluted with
unrelated cleanup.
Extracted from a larger patch by Craig Ringer, reviewed by Andres
Freund, with some additions by myself.
Formerly, the geometric I/O routines such as box_in and point_out relied
directly on strtod() and sprintf() for conversion of the float8 component
values of their data types. However, the behavior of those functions is
pretty platform-dependent, especially for edge-case values such as
infinities and NaNs. This was exposed by commit acdf2a8b37, which
added test cases involving boxes with infinity endpoints, and immediately
failed on Windows and AIX buildfarm members. We solved these problems
years ago in the main float8in and float8out functions, so let's fix it
by making the geometric types use that code instead of depending directly
on the platform-supplied functions.
To do this, refactor the float8in code so that it can be used to parse
just part of a string, and as a convenience make the guts of float8out
usable without going through DirectFunctionCall.
While at it, get rid of geo_ops.c's fairly shaky assumptions about the
maximum output string length for a double, by having it build results in
StringInfo buffers instead of fixed-length strings.
In passing, convert all the "invalid input syntax for type foo" messages
in this area of the code into "invalid input syntax for type %s" to reduce
the number of distinct translatable strings, per recent discussion.
We would have needed a fair number of the latter anyway for code-sharing
reasons, so we might as well just go whole hog.
Note: this patch is by no means intended to guarantee that the geometric
types uniformly behave sanely for infinity or NaN component values.
But any bugs we have in that line were there all along, they were just
harder to reach in a platform-independent way.
Patch implements quad-tree over boxes, naive approach of 2D quad tree will not
work for any non-point objects because splitting space on node is not
efficient. The idea of pathc is treating 2D boxes as 4D points, so,
object will not overlap (in 4D space).
The performance tests reveal that this technique especially beneficial
with too much overlapping objects, so called "spaghetti data".
Author: Alexander Lebedev with editorization by Emre Hasegeli and me
During scan sometimes it would be very helpful to know some information about
parent node or all ancestor nodes. Right now reconstructedValue could be used
but it's not a right usage of it (range opclass uses that).
traversalValue is arbitrary piece of memory in separate MemoryContext while
reconstructedVale should have the same type as indexed column.
Subsequent patches for range opclass and quad4d tree will use it.
Author: Alexander Lebedev, Teodor Sigaev
In this mode, the master waits for the transaction to be applied on
the remote side, not just written to disk. That means that you can
count on a transaction started on the standby to see all commits
previously acknowledged by the master.
To make this work, the standby sends a reply after replaying each
commit record generated with synchronous_commit >= 'remote_apply'.
This introduces a small inefficiency: the extra replies will be sent
even by standbys that aren't the current synchronous standby. But
previously-existing synchronous_commit levels make no attempt at all
to optimize which replies are sent based on what the primary cares
about, so this is no worse, and at least avoids any extra replies for
people not using the feature at all.
Thomas Munro, reviewed by Michael Paquier and by me. Some additional
tweaks by me.
With the original SQL-function implementation, such cases failed because
we don't support infinite intervals. Converting the function to C lets
us bypass the interval representation, which should be a bit faster as
well as more flexible.
Vitaly Burovoy, reviewed by Anastasia Lubennikova
This is necessary infrastructure for supporting parallel aggregation
for aggregates whose transition type is "internal". Such values
can't be passed between cooperating processes, because they are
just pointers.
David Rowley, reviewed by Tomas Vondra and by me.
Per discussion, the new extensible node framework is thought to be
better designed than the custom path/scan/scanstate stuff we added
in PostgreSQL 9.5. Rework the latter to be more like the former.
This is not backward-compatible, but we generally don't promise that
for C APIs, and there probably aren't many people using this yet
anyway.
KaiGai Kohei, reviewed by Petr Jelinek and me. Some further
cosmetic changes by me.
Commit fbe5a3fb73 accidentally changed
this behavior; put things back the way they were, and add some
regression tests.
Report by Andres Freund; patch by Ashutosh Bapat, with a bit of
kibitzing by me.
Up until now, we've been using timezone and tzname, but Visual Studio
2015 (for which we wish to add support) no longer declares those
symbols. All versions since Visual Studio 2003 apparently support the
underscore-equipped names, and we don't support anything older than
Visual Studio 2005, so this should work OK everywhere. But let's see
what the buildfarm thinks.
Michael Paquier, reviewed by Petr Jelinek
Returning the direct result of bit arithmetic, in a macro intended to be
used in a boolean manner, can be problematic if the return value is
stored in a variable of type 'bool'. If bool is implemented using C99's
_Bool, that can lead to comparison failures if the variable is then
compared again with the expression (see ginStepRight() for an example
that fails), as _Bool forces the result to be 0/1. That happens in some
configurations of newer MSVC compilers. It's also problematic when
storing the result of such an expression in a narrower type.
Several gin macros have been declared in that style since gin's initial
commit in 8a3631f8d8.
There's a lot more macros like this, but this is the only one causing
regression test failures; and I don't want to commit and backpatch a
larger patch with lots of conflicts just before the next set of minor
releases.
Discussion: 20150811154237.GD17575@awork2.anarazel.de
Backpatch: All supported branches
This avoids leaving dangling links in pg_operator; which while fairly
harmless are also unsightly.
While we're at it, simplify OperatorUpd, which went through
heap_modify_tuple for no very good reason considering it had already made
a tuple copy it could just scribble on.
Roma Sokolov, reviewed by Tomas Vondra, additional hacking by Robert Haas
and myself.
This completes (at least for now) the project of getting rid of ad-hoc
linkages among the src/bin/ subdirectories. Everything they share is now
in src/fe_utils/ and is included from a static library at link time.
A side benefit is that we can restore the FLEX_NO_BACKUP check for
psqlscanslash.l. We might need to think of another way to do that check
if we ever need to build two lexers with that property in the same source
directory, but there's no foreseeable reason to need that.
Per discussion, we want to create a static library and put the stuff into
it that until now has been shared across src/bin/ directories by ad-hoc
methods like symlinking a source file. This commit creates the library and
populates it with a couple of files that contain the widely-useful portions
of pg_dump's dumputils.c file. dumputils.c survives, because it has some
stuff that didn't seem appropriate for fe_utils, but it's significantly
smaller and is no longer referenced from any other directory.
Follow-on patches will move more stuff into fe_utils.
The Mkvcbuild.pm hacking here is just a best guess; we'll see how the
buildfarm likes it.
This enables external code to create access methods. This is useful so
that extensions can add their own access methods which can be formally
tracked for dependencies, so that DROP operates correctly. Also, having
explicit support makes pg_dump work correctly.
Currently only index AMs are supported, but we expect different types to
be added in the future.
Authors: Alexander Korotkov, Petr Jelínek
Reviewed-By: Teodor Sigaev, Petr Jelínek, Jim Nasby
Commitfest-URL: https://commitfest.postgresql.org/9/353/
Discussion: https://www.postgresql.org/message-id/CAPpHfdsXwZmojm6Dx+TJnpYk27kT4o7Ri6X_4OSWcByu1Rm+VA@mail.gmail.com
Now that we have src/common/ for code shared between frontend and backend,
we can get rid of (most of) the klugy ways that the keyword table and
keyword lookup code were formerly shared between different uses.
This is a first step towards a more general plan of getting rid of
special-purpose kluges for sharing code in src/bin/.
I chose to merge kwlookup.c back into keywords.c, as it once was, and
always has been so far as keywords.h is concerned. We could have
kept them separate, but there is noplace that uses ScanKeywordLookup
without also wanting access to the backend's keyword list, so there
seems little point.
ecpg is still a bit weird, but at least now the trickiness is documented.
I think that the MSVC build script should require no adjustments beyond
what's done here ... but we'll soon find out.
Parallel workers can now partially aggregate the data and pass the
transition values back to the leader, which can combine the partial
results to produce the final answer.
David Rowley, based on earlier work by Haribabu Kommi. Reviewed by
Álvaro Herrera, Tomas Vondra, Amit Kapila, James Sewell, and me.
Commit ac1d794 ("Make idle backends exit if the postmaster dies.")
introduced a regression on, at least, large linux systems. Constantly
adding the same postmaster_alive_fds to the OSs internal datastructures
for implementing poll/select can cause significant contention; leading
to a performance regression of nearly 3x in one example.
This can be avoided by using e.g. linux' epoll, which avoids having to
add/remove file descriptors to the wait datastructures at a high rate.
Unfortunately the current latch interface makes it hard to allocate any
persistent per-backend resources.
Replace, with a backward compatibility layer, WaitLatchOrSocket with a
new WaitEventSet API. Users can allocate such a Set across multiple
calls, and add more than one file-descriptor to wait on. The latter has
been added because there's upcoming postgres features where that will be
helpful.
In addition to the previously existing poll(2), select(2),
WaitForMultipleObjects() implementations also provide an epoll_wait(2)
based implementation to address the aforementioned performance
problem. Epoll is only available on linux, but that is the most likely
OS for machines large enough (four sockets) to reproduce the problem.
To actually address the aforementioned regression, create and use a
long-lived WaitEventSet for FE/BE communication. There are additional
places that would benefit from a long-lived set, but that's a task for
another day.
Thanks to Amit Kapila, who helped make the windows code I blindly wrote
actually work.
Reported-By: Dmitry Vasilyev Discussion:
CAB-SwXZh44_2ybvS5Z67p_CDz=XFn4hNAD=CnMEF+QqkXwFrGg@mail.gmail.com20160114143931.GG10941@awork2.anarazel.de
Previously latches for windows and unix had been implemented in
different files. A later patch introduce an expanded wait
infrastructure, keeping the implementation separate would introduce too
much duplication.
This basically just moves the functions, without too much change. The
reason to keep this separate is that it allows blame to continue working
a little less badly; and to make review a tiny bit easier.
Discussion: 20160114143931.GG10941@awork2.anarazel.de
The distinction between "archive" and "hot_standby" existed only because
at the time "hot_standby" was added, there was some uncertainty about
stability. This is now a long time ago. We would like to move forward
with simplifying the replication configuration, but this distinction is
in the way, because a primary server cannot tell (without asking a
standby or predicting the future) which one of these would be the
appropriate level.
Pick a new name for the combined setting to make it clearer that it
covers all (non-logical) backup and replication uses. The old values
are still accepted but are converted internally.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: David Steele <david@pgmasters.net>
postgres_fdw can now sent an UPDATE or DELETE statement directly to
the foreign server in simple cases, rather than sending a SELECT FOR
UPDATE statement and then updating or deleting rows one-by-one.
Etsuro Fujita, reviewed by Rushabh Lathia, Shigeru Hanada, Kyotaro
Horiguchi, Albe Laurenz, Thom Brown, and me.
- allow to use non-ascii characters as affix flag. Non-numeric affix flags now
are stored as string instead of numeric value of character.
- allow to use 0 as affix flag in numeric encoded affixes
That adds support for arabian, hungarian, turkish and
brazilian portuguese languages.
Author: Artur Zakirov with heavy editorization by me
Tighten the semantics of boundary-case timestamptz so that we allow
timestamps >= '4714-11-24 00:00+00 BC' and < 'ENDYEAR-01-01 00:00+00 AD'
exactly, no more and no less, but it is allowed to enter timestamps
within that range using non-GMT timezone offsets (which could make the
nominal date 4714-11-23 BC or ENDYEAR-01-01 AD). This eliminates
dump/reload failure conditions for timestamps near the endpoints.
To do this, separate checking of the inputs for date2j() from the
final range check, and allow the Julian date code to handle a range
slightly wider than the nominal range of the datatypes.
Also add a bunch of checks to detect out-of-range dates and timestamps
that formerly could be returned by operations such as date-plus-integer.
All C-level functions that return date, timestamp, or timestamptz should
now be proof against returning a value that doesn't pass IS_VALID_DATE()
or IS_VALID_TIMESTAMP().
Vitaly Burovoy, reviewed by Anastasia Lubennikova, and substantially
whacked around by me
There's a lot more that could be done here yet - in particular, this
reports only very coarse-grained information about the index vacuuming
phase - but even as it stands, the new pg_stat_progress_vacuum can
tell you quite a bit about what a long-running vacuum is actually
doing.
Amit Langote and Robert Haas, based on earlier work by Vinayak Pokale
and Rahila Syed.
Previously, we included <xlocale.h> only if necessary to get the definition
of type locale_t. According to notes in PGAC_TYPE_LOCALE_T, this is
important because on some versions of glibc that file supplies an
incompatible declaration of locale_t. (This info may be obsolete, because
on my RHEL6 box that seems to be the *only* definition of locale_t; but
there may still be glibc's in the wild for which it's a live concern.)
It turns out though that on FreeBSD and maybe other BSDen, you can get
locale_t from stdlib.h or locale.h but mbstowcs_l() and friends only from
<xlocale.h>. This was leaving us compiling calls to mbstowcs_l() and
friends with no visible prototype, which causes a warning and could
possibly cause actual trouble, since it's not declared to return int.
Hence, adjust the configure checks so that we'll include <xlocale.h>
either if it's necessary to get type locale_t or if it's necessary to
get a declaration of mbstowcs_l().
Report and patch by Aleksander Alekseev, somewhat whacked around by me.
Back-patch to all supported branches, since we have been using
mbstowcs_l() since 9.1.
This is basically like the just-added create_upper_paths_hook, but
control is funneled only to the FDW responsible for all the baserels
of the current query; so providing such a callback is much less likely
to add useless overhead than using the hook function is.
The documentation is a bit sketchy. We'll likely want to improve it,
and/or adjust the call conventions, when we get some experience with
actually using this callback. Hopefully somebody will find time to
experiment with it before 9.6 feature freeze.
In the initial revision of the upper-planner pathification work, the only
available way for an FDW or custom-scan provider to inject Paths
representing post-scan-join processing was to insert them during scan-level
GetForeignPaths or similar processing. While that's not impossible, it'd
require quite a lot of duplicative processing to look forward and see if
the extension would be capable of implementing the whole query. To improve
matters for custom-scan providers, provide a hook function at the point
where the core code is about to start filling in upperrel Paths. At this
point Paths are available for the whole scan/join tree, which should reduce
the amount of redundant effort considerably.
(An alternative design that was suggested was to provide a separate hook
for each post-scan-join processing step, but that seems messy and not
clearly more useful.)
Following our time-honored tradition, there's no documentation for this
hook outside the source code.
As-is, this hook is only meant for custom scan providers, which we can't
assume very much about. A followon patch will implement an FDW callback
to let FDWs do the same thing in a somewhat more structured fashion.
Although the default choice of rel->reltarget should typically be
sufficient for scan or join paths, it's not at all sufficient for the
purposes PathTargets were invented for; in particular not for
upper-relation Paths. So break API compatibility by adding a PathTarget
argument to create_foreignscan_path(). To ease updating of existing
code, accept a NULL value of the argument as selecting rel->reltarget.
In commit 19a541143a I did not make PathTarget a subtype of Node,
and embedded a RelOptInfo's reltarget directly into it rather than having
a separately-allocated Node. In hindsight that was misguided
micro-optimization, enabled by the fact that at that point we didn't have
any Paths with custom PathTargets. Now that PathTarget processing has
been fleshed out some more, it's easier to see that it's better to have
PathTarget as an indepedent Node type, even if it does cost us one more
palloc to create a RelOptInfo. So change it while we still can.
This commit just changes the representation, without doing anything more
interesting than that.
This patch widens SPI_processed, EState's es_processed field, PortalData's
portalPos field, FuncCallContext's call_cntr and max_calls fields,
ExecutorRun's count argument, PortalRunFetch's result, and the max number
of rows in a SPITupleTable to uint64, and deals with (I hope) all the
ensuing fallout. Some of these values were declared uint32 before, and
others "long".
I also removed PortalData's posOverflow field, since that logic seems
pretty useless given that portalPos is now always 64 bits.
The user-visible results are that command tags for SELECT etc will
correctly report tuple counts larger than 4G, as will plpgsql's GET
GET DIAGNOSTICS ... ROW_COUNT command. Queries processing more tuples
than that are still not exactly the norm, but they're becoming more
common.
Most values associated with FETCH/MOVE distances, such as PortalRun's count
argument and the count argument of most SPI functions that have one, remain
declared as "long". It's not clear whether it would be worth promoting
those to int64; but it would definitely be a large dollop of additional
API churn on top of this, and it would only help 32-bit platforms which
seem relatively less likely to see any benefit.
Andreas Scherbaum, reviewed by Christian Ullrich, additional hacking by me
CitusDB is using these and don't wish to redesign their code right now.
I am not on board with this being a good idea, or a good precedent,
but I lack the energy to fight about it.
Adds several tsvector editting function: convert tsvector to/from text array,
set weight for given lexemes, delete lexeme(s), unnest, filter lexemes
with given weights
Author: Stas Kelvich with some editorization by me
Reviewers: Tomas Vondram, Teodor Sigaev
Teach make_group_input_target() and make_window_input_target() to work
entirely with the PathTarget representation of tlists, rather than
constructing a tlist and immediately deconstructing it into PathTarget
format. In itself this only saves a few palloc's; the bigger picture is
that it opens the door for sharing cost_qual_eval work across all of
planner.c's constructions of PathTargets. I'll come back to that later.
In support of this, flesh out tlist.c's infrastructure for PathTargets
a bit more.
emit_log_hook could only see the translated text, making it harder to identify
which message was being sent. Pass original text to allow the exact message to
be identified, whichever language is used for logging.
Discussion: 20160216.184755.59721141.horiguchi.kyotaro@lab.ntt.co.jp
Author: Kyotaro Horiguchi
Up to now checkpoints were written in the order they're in the
BufferDescriptors. That's nearly random in a lot of cases, which
performs badly on rotating media, but even on SSDs it causes slowdowns.
To avoid that, sort checkpoints before writing them out. We currently
sort by tablespace, relfilenode, fork and block number.
One of the major reasons that previously wasn't done, was fear of
imbalance between tablespaces. To address that balance writes between
tablespaces.
The other prime concern was that the relatively large allocation to sort
the buffers in might fail, preventing checkpoints from happening. Thus
pre-allocate the required memory in shared memory, at server startup.
This particularly makes it more efficient to have checkpoint flushing
enabled, because that'll often result in a lot of writes that can be
coalesced into one flush.
Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
Currently writes to the main data files of postgres all go through the
OS page cache. This means that some operating systems can end up
collecting a large number of dirty buffers in their respective page
caches. When these dirty buffers are flushed to storage rapidly, be it
because of fsync(), timeouts, or dirty ratios, latency for other reads
and writes can increase massively. This is the primary reason for
regular massive stalls observed in real world scenarios and artificial
benchmarks; on rotating disks stalls on the order of hundreds of seconds
have been observed.
On linux it is possible to control this by reducing the global dirty
limits significantly, reducing the above problem. But global
configuration is rather problematic because it'll affect other
applications; also PostgreSQL itself doesn't always generally want this
behavior, e.g. for temporary files it's undesirable.
Several operating systems allow some control over the kernel page
cache. Linux has sync_file_range(2), several posix systems have msync(2)
and posix_fadvise(2). sync_file_range(2) is preferable because it
requires no special setup, whereas msync() requires the to-be-flushed
range to be mmap'ed. For the purpose of flushing dirty data
posix_fadvise(2) is the worst alternative, as flushing dirty data is
just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
from the page cache. Thus the feature is enabled by default only on
linux, but can be enabled on all systems that have any of the above
APIs.
While desirable and likely possible this patch does not contain an
implementation for windows.
With the infrastructure added, writes made via checkpointer, bgwriter
and normal user backends can be flushed after a configurable number of
writes. Each of these sources of writes controlled by a separate GUC,
checkpointer_flush_after, bgwriter_flush_after and backend_flush_after
respectively; they're separate because the number of flushes that are
good are separate, and because the performance considerations of
controlled flushing for each of these are different.
A later patch will add checkpoint sorting - after that flushes from the
ckeckpoint will almost always be desirable. Bgwriter flushes are most of
the time going to be random, which are slow on lots of storage hardware.
Flushing in backends works well if the storage and bgwriter can keep up,
but if not it can have negative consequences. This patch is likely to
have negative performance consequences without checkpoint sorting, but
unfortunately so has sorting without flush control.
Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
All along, this function should have treated WindowFuncs in a manner
similar to Aggrefs, ie with an option whether or not to recurse into them.
By not considering the case, it was always recursing, which is OK for most
callers (although I suspect that the case in prepare_sort_from_pathkeys
might represent a bug). But now we need return-without-recursing behavior
as well. There are also more than a few callers that should never see a
WindowFunc, and now we'll get some error checking on that.
In commit 1d97c19a0f and later c1d9579dd8, we extended
pull_var_clause's API by adding enum-type arguments. That's sort of a pain
to maintain, though, because it means every time we add a new behavior we
must touch every last one of the call sites, even if there's a reasonable
default behavior that most of them could use. Let's switch over to using a
bitmask of flags, instead; that seems more maintainable and might save a
nanosecond or two as well. This commit changes no behavior in itself,
though I'm going to follow it up with one that does add a new behavior.
In passing, remove flatten_tlist(), which has not been used since 9.1
and would otherwise need the same API changes.
Removing these enums means that optimizer/tlist.h no longer needs to
depend on optimizer/var.h. Changing that caused a number of C files to
need addition of #include "optimizer/var.h" (probably we can thank old
runs of pgrminclude for that); but on balance it seems like a good change
anyway.
Earlier version committed in 9.0 caused spurious waits in some cases.
New infrastructure for lock waits in 9.3 used to correct and improve this.
Jeff Janes based upon a proposal by Simon Riggs, who also reviewed
Additional review comments from Amit Kapila
When a process is waiting for a heavyweight lock, we will now indicate
the type of heavyweight lock for which it is waiting. Also, you can
now see when a process is waiting for a lightweight lock - in which
case we will indicate the individual lock name or the tranche, as
appropriate - or for a buffer pin.
Amit Kapila, Ildus Kurbangaliev, reviewed by me. Lots of helpful
discussion and suggestions by many others, including Alexander
Korotkov, Vladimir Borodin, and many others.
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes; especially on filesystems like xfs and ext4 when mounted
with data=writeback. To be certain that a rename() atomically replaces
the previous file contents in the face of crashes and different
filesystems, one has to fsync the old filename, rename the file, fsync
the new filename, fsync the containing directory. This sequence is not
generally adhered to currently; which exposes us to data loss risks. To
avoid having to repeat this arduous sequence, introduce
durable_rename(), which wraps all that.
Also add durable_link_or_rename(). Several places use link() (with a
fallback to rename()) to rename a file, trying to avoid replacing the
target file out of paranoia. Some of those rename sequences need to be
durable as well. There seems little reason extend several copies of the
same logic, so centralize the link() callers.
This commit does not yet make use of the new functions; they're used in
a followup commit.
Author: Michael Paquier, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.
As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else. A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting. But this gets the basic framework in place.
Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
Refactor so that the internal APIs in planner.c deal in PathTargets not
targetlists, and establish a more regular structure for deriving the
targets needed for successive steps.
There is more that could be done here; calculating the eval costs of each
successive target independently is both inefficient and wrong in detail,
since we won't actually recompute values available from the input node's
tlist. But it's no worse than what happened before the pathification
rewrite. In any case this seems like a good starting point for considering
how to handle Konstantin Knizhnik's function-evaluation-postponement patch.
Instead of having planner.c compute a groupColIdx array and store it in
GroupingSetsPaths, make create_groupingsets_plan() find the grouping
columns by searching in the child plan node's tlist. Although that's
probably a bit slower for create_groupingsets_plan(), it's more like
the way every other plan node type does this, and it provides positive
confirmation that we know which child output columns we're supposed to be
grouping on. (Indeed, looking at this now, I'm not at all sure that it
wasn't broken before, because create_groupingsets_plan() isn't demanding
an exact tlist match from its child node.) Also, this allows substantial
simplification in planner.c, because it no longer needs to compute the
groupColIdx array at all; no other cases were using it.
I'd intended to put off this refactoring until later (like 9.7), but
in view of the likely bug fix and the need to rationalize planner.c's
tlist handling so we can do something sane with Konstantin Knizhnik's
function-evaluation-postponement patch, I think it can't wait.
This patch removes some redundant cost calculations that I left for later
cleanup in commit 3fc6e2d7f5. There's now a uniform policy that the
make_foo() convenience functions don't do any cost calculations. Most of
their callers copy costs from the source Path node, and for those that
don't, the calculation in the make_foo() function wasn't necessarily right
anyhow. (make_result() was particularly a mess, as it was serving multiple
callers using cost calcs designed for only the first one or two that had
ever existed.) Aside from saving a few cycles, this ensures that what
EXPLAIN prints matches the costs we used for planning purposes. It does
not change any planner decisions, since the decisions are already made.
For example, if you want to perform an ioctl() on a file descriptor
opened through the fd.c routines, there's no way to do that without
being able to get at the underlying fd.
KaiGai Kohei
Commit a892234f83 added a second bit per
page to the visibility map, which still seems like a good idea, but it
also added a second page-level bit alongside PD_ALL_VISIBLE to track
whether the visibility map bit was set. That no longer seems like a
clever plan, because we don't really need that bit for anything. We
always clear both bits when the page is modified anyway.
Patch by me, reviewed by Kyotaro Horiguchi and Masahiko Sawada.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
Logical decoding's reorderbuffer keeps transactions in an LSN ordered
list for efficiency. To make that's efficiently possible upper-level
xids are forced to be logged before nested subtransaction xids. That
only works though if these records are all looked at: Unfortunately we
didn't do so for e.g. row level locks, which are otherwise uninteresting
for logical decoding.
This could lead to errors like:
"ERROR: subxact logged without previous toplevel record".
It's not sufficient to just look at row locking records, the xid could
appear first due to a lot of other types of records (which will trigger
the transaction to be marked logged with MarkCurrentTransactionIdLoggedIfAny).
So invent infrastructure to tell reorderbuffer about xids seen, when
they'd otherwise not pass through reorderbuffer.c.
Reported-By: Jarred Ward
Bug: #13844
Discussion: 20160105033249.1087.66040@wrigleys.postgresql.org
Backpatch: 9.4, where logical decoding was added
Add four new SQL accessible functions: pg_control_system(),
pg_control_checkpoint(), pg_control_recovery(), and pg_control_init()
which expose a subset of the control file data.
Along the way move the code to read and validate the control file to
src/common, where it can be shared by the new backend functions
and the original pg_controldata frontend program.
Patch by me, significant input, testing, and review by Michael Paquier.
Now it's possible to load recent version of Hunspell for several languages.
To handle these dictionaries Hunspell patch adds support for:
* FLAG long - sets the double extended ASCII character flag type
* FLAG num - sets the decimal number flag type (from 1 to 65535)
* AF parameter - alias for flag's set
Also it moves test dictionaries into separate directory.
Author: Artur Zakirov with editorization by me
606c0123d6 attempted to reduce cost of index scans using > and <
strategies, though got that completely wrong in a few complex cases.
Revert whole patch until we find a safe optimization.
Commits 9ff60273e3 and dbe2328959 adjusted the declarations
of some core functions referenced by contrib/tsearch2's install script,
forgetting that in a pg_upgrade situation, we'll be trying to restore
operator class definitions that reference the old signatures. We've
hit this problem before; solve it in the same way as before, namely by
installing stub functions that have the expected signature and just
invoke the correct function. Per report from Jeff Janes.
(Someday we ought to stop supporting contrib/tsearch2, but I'm not
sure today is that day.)
The new bit indicates whether every tuple on the page is already frozen.
It is cleared only when the all-visible bit is cleared, and it can be
set only when we vacuum a page and find that every tuple on that page is
both visible to every transaction and in no need of any future
vacuuming.
A future commit will use this new bit to optimize away full-table scans
that would otherwise be triggered by XID wraparound considerations. A
page which is merely all-visible must still be scanned in that case, but
a page which is all-frozen need not be. This commit does not attempt
that optimization, although that optimization is the goal here. It
seems better to get the basic infrastructure in place first.
Per discussion, it's very desirable for pg_upgrade to automatically
migrate existing VM forks from the old format to the new format. That,
too, will be handled in a follow-on patch.
Masahiko Sawada, reviewed by Kyotaro Horiguchi, Fujii Masao, Amit
Kapila, Simon Riggs, Andres Freund, and others, and substantially
revised by me.
This is basically a bug fix; the old code assumes that a ForeignScan
is always parallel-safe, but for postgres_fdw, for example, this is
definitely false. It should be true for file_fdw, though, since a
worker can read a file from the filesystem just as well as any other
backend process.
Original patch by Thomas Munro. Documentation, and changes to the
comments, by me.
This patch introduces "pg_blocking_pids(int) returns int[]", which returns
the PIDs of any sessions that are blocking the session with the given PID.
Historically people have obtained such information using a self-join on
the pg_locks view, but it's unreasonably tedious to do it that way with any
modicum of correctness, and the addition of parallel queries has pretty
much broken that approach altogether. (Given some more columns in the view
than there are today, you could imagine handling parallel-query cases with
a 4-way join; but ugh.)
The new function has the following behaviors that are painful or impossible
to get right via pg_locks:
1. Correctly understands which lock modes block which other ones.
2. In soft-block situations (two processes both waiting for conflicting lock
modes), only the one that's in front in the wait queue is reported to
block the other.
3. In parallel-query cases, reports all sessions blocking any member of
the given PID's lock group, and reports a session by naming its leader
process's PID, which will be the pg_backend_pid() value visible to
clients.
The motivation for doing this right now is mostly to fix the isolation
tests. Commit 38f8bdcac4 lobotomized
isolationtester's is-it-waiting query by removing its ability to recognize
nonconflicting lock modes, as a crude workaround for the inability to
handle soft-block situations properly. But even without the lock mode
tests, the old query was excessively slow, particularly in
CLOBBER_CACHE_ALWAYS builds; some of our buildfarm animals fail the new
deadlock-hard test because the deadlock timeout elapses before they can
probe the waiting status of all eight sessions. Replacing the pg_locks
self-join with use of pg_blocking_pids() is not only much more correct, but
a lot faster: I measure it at about 9X faster in a typical dev build with
Asserts, and 3X faster in CLOBBER_CACHE_ALWAYS builds. That should provide
enough headroom for the slower CLOBBER_CACHE_ALWAYS animals to pass the
test, without having to lengthen deadlock_timeout yet more and thus slow
down the test for everyone else.
We don't really need this field, because it's either zero or redundant with
PGPROC.pid. The use of zero to mark "not a group leader" is not necessary
since we can just as well test whether lockGroupLeader is NULL. This does
not save very much, either as to code or data, but the simplification seems
worthwhile anyway.
Coverity griped about use of unchecked strcpy() into a local variable.
There's unlikely to be any actual bug there, since no caller would be
passing a path longer than MAXPGPATH, but nonetheless use of strlcpy()
seems preferable.
While at it, get rid of unmaintainable separation between list of
field names and list of field values in favor of initializing them
in parallel. And we might as well declare get_configdata()'s path
argument as const char *, even though no current caller needs that.
Reflow text in lock manager README so that it fits within 80 columns.
Correct some mistakes. Expand the README to explain not only why group
locking exists but also the data structures that support it. Improve
comments related to group locking several files. Change the name of a
macro argument for improved clarity.
Most of these problems were reported by Tom Lane, but I found a few
of them myself.
Robert Haas and Tom Lane
This will parse strings in the format produced by pg_size_pretty() and
return sizes in bytes. This allows queries to be written with clauses
like "pg_total_relation_size(oid) > pg_size_bytes('10 GB')".
Author: Pavel Stehule with various improvements by Vitaly Burovoy
Discussion: http://www.postgresql.org/message-id/CAFj8pRD-tGoDKnxdYgECzA4On01_uRqPrwF-8LdkSE-6bDHp0w@mail.gmail.com
Reviewed-by: Vitaly Burovoy, Oleksandr Shulgin, Kyotaro Horiguchi,
Michael Paquier and Robert Haas
Architecture reference material specifies this order, and s_lock.h
inline assembly agrees. The former order failed to provide mutual
exclusion to lwlock.c and perhaps to other clients. The two xlc
buildfarm members, hornet and mandrill, have failed sixteen times with
duplicate key errors involving pg_class_oid_index or pg_type_oid_index.
Back-patch to 9.5, where commit b64d92f1a5
introduced atomics.
Reviewed by Andres Freund and Tom Lane.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
Move and refactor the underlying code for the pg_config client
application to src/common in support of sharing it with a new
system information SRF called pg_config() which makes the same
information available via SQL. Additionally wrap the SRF with a
new system view, as called pg_config.
Patch by me with extensive input and review by Michael Paquier
and additional review by Alvaro Herrera.
When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.
Peter Geoghegan, reviewd by Andreas Karlsson and by me.
In commit 7b4bfc87 the DATA and DESCR entries for the new
row_security_active() function were inadvertantly put after
the PROVOLATILE defines, rather than before as they should
have been placed. Move them up where they belong.
Backpatch to 9.5 where the new entries were introduced.
Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al. But unfortunately the
increased flush rate can have a significant negative performance impact,
I have measured up to a factor of ~4. The reason for this slowdown is
that if there are independent writes to the underlying devices, for
example because shared buffers is a lot smaller than the hot data set,
or because a checkpoint is ongoing, the fdatasync() calls force cache
flushes to be emitted to the storage.
This is achieved by flushing WAL only if the last flush was longer than
wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC)
unflushed blocks are pending. Based on some tests the default for
wal_writer_delay is 1MB, which seems to work well both on SSD and
rotational media.
To avoid negative performance impact due to 4de82f7d7 an earlier
commit (db76b1e) made SetHintBits() more likely to succeed; preventing
performance regressions in the pgbench tests I performed.
Discussion: 20160118163908.GW10941@awork2.anarazel.de
Reportedly, some compilers warn about tests like "c < 0" if c is unsigned,
and hence complain about the character range checks I added in commit
3bb3f42f37. This is a bit of a pain since
the regex library doesn't really want to assume that chr is unsigned.
However, since any such reconfiguration would involve manual edits of
regcustom.h anyway, we can put it on the shoulders of whoever wants to
do that to adjust this new range-checking macro correctly.
Per gripes from Coverity and Andres.
An extensible node is always tagged T_Extensible, but the extnodename
field identifies it more specifically; it may also include arbitrary
private data. Extensible nodes can be copied, tested for equality,
serialized, and deserialized, but the core system doesn't know
anything about them otherwise. Some extensions may find it useful to
include these nodes in fdw_private or custom_private lists in lieu of
arm-wrestling their data into a format that the core code can
understand.
Along the way, so as not to burden the authors of such extensible
node types too much, expose the functions for writing serialized
tokens, and for serializing and deserializing bitmapsets.
KaiGai Kohei, per a design suggested by me. Reviewed by Andres Freund
and by me, and further edited by me.
If a GROUP BY clause includes all columns of a non-deferred primary key,
as well as other columns of the same relation, those other columns are
redundant and can be dropped from the grouping; the pkey is enough to
ensure that each row of the table corresponds to a separate group.
Getting rid of the excess columns will reduce the cost of the sorting or
hashing needed to implement GROUP BY, and can indeed remove the need for
a sort step altogether.
This seems worth testing for since many query authors are not aware of
the GROUP-BY-primary-key exception to the rule about queries not being
allowed to reference non-grouped-by columns in their targetlists or
HAVING clauses. Thus, redundant GROUP BY items are not uncommon. Also,
we can make the test pretty cheap in most queries where it won't help
by not looking up a rel's primary key until we've found that at least
two of its columns are in GROUP BY.
David Rowley, reviewed by Julien Rouhaud
A pending patch requires exporting a function returning Bitmapset from
catalog/pg_constraint.c. As things stand, that would mean including
nodes/bitmapset.h in pg_constraint.h, which might be hazardous for the
client-side includability of that header. It's not entirely clear whether
any client-side code needs to include pg_constraint.h, but it seems prudent
to assume that there is some such code somewhere. Therefore, split off the
function definitions into a new file pg_constraint_fn.h, similarly to what
we've done for some other catalog header files.
This finishes the work - spread across many commits over the last
several months - of putting each type of lock other than the named
individual locks into a separate tranche.
Amit Kapila
Commit 0e141c0fbb introduced a new
facility to reduce ProcArrayLock contention by clearing several XIDs
from the ProcArray under a single lock acquisition. The names
initially chosen were deemed not to be very good choices, so commit
4aec49899e renamed them. But now it
seems like we still didn't get it right. A pending patch wants to
add similar infrastructure for batching CLOG updates, so the names
need to be clear enough to allow a new set of structure members with
a related purpose.
Amit Kapila
As of commit c1772ad922, there's no
longer any way of requesting additional LWLocks in the main tranche,
so we don't need NumLWLocks() or LWLockAssign() any more. Also,
some of the allocation counters that we had previously aren't needed
any more either.
Amit Kapila
This is a quick hack, due to be reverted when its purpose has been served,
to try to gather information about why some of the buildfarm critters
regularly fail with "postmaster does not shut down" complaints. Maybe they
are just really overloaded, but maybe something else is going on. Hence,
instrument pg_ctl to print the current time when it starts waiting for
postmaster shutdown and when it gives up, and add a lot of logging of the
current time in the server's checkpoint and shutdown code paths.
No attempt has been made to make this pretty. I'm not even totally sure
if it will build on Windows, but we'll soon find out.
Previously, our regex code defined CHR_MAX as 0xfffffffe, which is a
bad choice because it is outside the range of type "celt" (int32).
Characters approaching that limit could lead to infinite loops in logic
such as "for (c = a; c <= b; c++)" where c is of type celt but the
range bounds are chr. Such loops will work safely only if CHR_MAX+1
is representable in celt, since c must advance to beyond b before the
loop will exit.
Fortunately, there seems no reason not to restrict CHR_MAX to 0x7ffffffe.
It's highly unlikely that Unicode will ever assign codes that high, and
none of our other backend encodings need characters beyond that either.
In addition to modifying the macro, we have to explicitly enforce character
range restrictions on the values of \u, \U, and \x escape sequences, else
the limit is trivially bypassed.
Also, the code for expanding case-independent character ranges in bracket
expressions had a potential integer overflow in its calculation of the
number of characters it could generate, which could lead to allocating too
small a character vector and then overwriting memory. An attacker with the
ability to supply arbitrary regex patterns could easily cause transient DOS
via server crashes, and the possibility for privilege escalation has not
been ruled out.
Quite aside from the integer-overflow problem, the range expansion code was
unnecessarily inefficient in that it always produced a result consisting of
individual characters, abandoning the knowledge that we had a range to
start with. If the input range is large, this requires excessive memory.
Change it so that the original range is reported as-is, and then we add on
any case-equivalent characters that are outside that range. With this
approach, we can bound the number of individual characters allowed without
sacrificing much. This patch allows at most 100000 individual characters,
which I believe to be more than the number of case pairs existing in
Unicode, so that the restriction will never be hit in practice.
It's still possible for range() to take awhile given a large character code
range, so also add statement-cancel detection to its loop. The downstream
function dovec() also lacked cancel detection, and could take a long time
given a large output from range().
Per fuzz testing by Greg Stark. Back-patch to all supported branches.
Security: CVE-2016-0773
When force_parallel_mode = true, we enable the parallel mode restrictions
for all queries for which this is believed to be safe. For the subset of
those queries believed to be safe to run entirely within a worker, we spin
up a worker and run the query there instead of running it in the
original process. When force_parallel_mode = regress, make additional
changes to allow the regression tests to run cleanly even though parallel
workers have been injected under the hood.
Taken together, this facilitates both better user testing and better
regression testing of the parallelism code.
Robert Haas, with help from Amit Kapila and Rushabh Lathia.
For locking purposes, we now regard heavyweight locks as mutually
non-conflicting between cooperating parallel processes. There are some
possible pitfalls to this approach that are not to be taken lightly,
but it works OK for now and can be changed later if we find a better
approach. Without this, it's very easy for parallel queries to
silently self-deadlock if the user backend holds strong relation locks.
Robert Haas, with help from Amit Kapila. Thanks to Noah Misch and
Andres Freund for extensive discussion of possible issues with this
approach.
It seems that sprintf(), at least in glibc's version, is unreasonably slow
compared to hand-rolled code for printing integers. Replacing most uses of
sprintf() in the datetime.c output functions with special-purpose code
turns out to give more than a 2X speedup in COPY of a table with a single
timestamp column; which is pretty impressive considering all the other
logic in that code path.
David Rowley and Andres Freund, reviewed by Peter Geoghegan and myself
An example use-case is "CHECK(num_nonnulls(a,b,c) = 1)" to assert that
exactly one of a,b,c isn't NULL. The functions are variadic, so they
can also be pressed into service to count the number of null or nonnull
elements in an array.
Marko Tiikkaja, reviewed by Pavel Stehule
GetExistingLocalJoinPath() is useful for handling EvalPlanQual rechecks
properly, and GetUserMappingById() is needed to make sure you're using
the right credentials.
Shigeru Hanada, Etsuro Fujita, Ashutosh Bapat, Robert Haas
The previous RequestAddinLWLocks() method had several disadvantages.
First, the locks would be in the main tranche; we've recently decided
that it's useful for LWLocks used for separate purposes to have
separate tranche IDs. Second, there wasn't any correlation between
what code called RequestAddinLWLocks() and what code called
LWLockAssign(); when multiple modules are in use, it could become
quite difficult to troubleshoot problems where LWLockAssign() ran out
of locks. To fix, create a concept of named LWLock tranches which
can be used either by extension or by core code.
Amit Kapila and Robert Haas
Have varlena.c expose an interface that allows the char(n), bytea, and
bpchar types to piggyback on a now-generalized SortSupport for text.
This pushes a little more knowledge of the bpchar/char(n) type into
varlena.c than might be preferred, but that seems like the approach
that creates least friction. Also speed things up for index builds
that use text_pattern_ops or varchar_pattern_ops.
This patch does quite a bit of renaming, but it seems likely to be
worth it, so as to avoid future confusion about the fact that this code
is now more generally used than the old names might have suggested.
Peter Geoghegan, reviewed by Álvaro Herrera and Andreas Karlsson,
with small tweaks by me.
This patch doesn't put the new infrastructure to use anywhere, and
indeed it's not clear how it could ever be used for something like
postgres_fdw which has to send an SQL query and wait for a reply,
but there might be FDWs or custom scan providers that are CPU-bound,
so let's give them a way to join club parallel.
KaiGai Kohei, reviewed by me.
You can't really do anything useful with this in the form it currently
exists; among other problems, there's no way to reread whatever
information might be produced when the path is output. Work is
underway to replace this with a more useful and more general system of
extensible nodes, but let's start by getting rid of this bit.
Extracted from a larger patch by KaiGai Kohei.
Insert sd_notify() calls at server start and stop for integration with
systemd. This allows the use of systemd service units of type "notify",
which greatly simplifies the systemd configuration.
Reviewed-by: Pavel Stěhule <pavel.stehule@gmail.com>
Apparently at least one committer hasn't gotten the word that these do not
need to be maintained by hand, since initdb will create them automatically.
Noted while fixing bug #13905.
No catversion bump since the post-initdb state is exactly the same either
way. I don't see a need for back-patch, either.
All the other jsonb function descriptions refer to the arguments as being
"jsonb", but these two said "json". Make it consistent. Per bug #13905
from Petru Florin Mihancea.
No catversion bump --- we can't force one in the back branches, and this
isn't very critical anyway.
This is following in a long train of similar changes and for the same
reasons - see b319356f0e and
fe702a7b3f inter alia.
Author: Amit Kapila
Reviewed-by: Alexander Korotkov, Robert Haas
Previously, each PGPROC's backendLock was part of the main tranche,
and the PGPROC just contained a pointer. Now, the actual LWLock is
part of the PGPROC.
As with previous, similar patches, this makes it significantly easier
to identify these lwlocks in LWLOCK_STATS or Trace_lwlocks output
and improves modularity.
Author: Ildus Kurbangaliev
Reviewed-by: Amit Kapila, Robert Haas
Previously, the foreign join pushdown infrastructure left the question
of security entirely up to individual FDWs, but it would be easy for
a foreign data wrapper to inadvertently open up subtle security holes
that way. So, make it the core code's job to determine which user
mapping OID is relevant, and don't attempt join pushdown unless it's
the same for all relevant relations.
Per a suggestion from Tom Lane. Shigeru Hanada and Ashutosh Bapat,
reviewed by Etsuro Fujita and KaiGai Kohei, with some further
changes by me.
Previously, postgres_fdw's connection cache was keyed by user OID and
server OID, but this can lead to multiple connections when it's not
really necessary. In particular, if all relevant users are mapped to
the public user mapping, then their connection options are certainly
the same, so one connection can be used for all of them.
While we're cleaning things up here, drop the "server" argument to
GetConnection(), which isn't really needed. This saves a few cycles
because callers no longer have to look this up; the function itself
does, but only when establishing a new connection, not when reusing
an existing one.
Ashutosh Bapat, with a few small changes by me.
This function cleans up the pending list of the GIN index by
moving entries in it to the main GIN data structure in bulk.
It returns the number of pages cleaned up from the pending list.
This function is useful, for example, when the pending list
needs to be cleaned up *quickly* to improve the performance of
the search using GIN index. VACUUM can do the same thing, too,
but it may take days to run on a large table.
Jeff Janes,
reviewed by Julien Rouhaud, Jaime Casanova, Alvaro Herrera and me.
Discussion: CAMkU=1x8zFkpfnozXyt40zmR3Ub_kHu58LtRmwHUKRgQss7=iQ@mail.gmail.com
pg_size_pretty function should be marked immutable rather than volatile
because it always returns the same result given the same argument.
Pavel Stehule
The implementations go to some lengths to deliver exact results for values
where an exact result can be expected, such as sind(30) = 0.5 exactly.
Dean Rasheed, reviewed by Michael Paquier
Commit e529cd4ffa introduced an Assert requiring NAMEDATALEN to be
less than MAX_LEVENSHTEIN_STRLEN, which has been 255 for a long time.
Since up to that instant we had always allowed NAMEDATALEN to be
substantially more than that, this was ill-advised.
It's debatable whether we need MAX_LEVENSHTEIN_STRLEN at all (versus
putting a CHECK_FOR_INTERRUPTS into the loop), or whether it has to be
so tight; but this patch takes the narrower approach of just not applying
the MAX_LEVENSHTEIN_STRLEN limit to calls from the parser.
Trusting the parser for this seems reasonable, first because the strings
are limited to NAMEDATALEN which is unlikely to be hugely more than 256,
and second because the maximum distance is tightly constrained by
MAX_FUZZY_DISTANCE (though we'd forgotten to make use of that limit in one
place). That means the cost is not really O(mn) but more like O(max(m,n)).
Relaxing the limit for user-supplied calls is left for future research;
given the lack of complaints to date, it doesn't seem very high priority.
In passing, fix confusion between lengths-in-bytes and lengths-in-chars
in comments and error messages.
Per gripe from Kevin Day; solution suggested by Robert Haas. Back-patch
to 9.5 where the unwanted restriction was introduced.
The amvalidate functions added in commit 65c5fcd353 were on the
crude side. Improve them in a few ways:
* Perform signature checking for operators and support functions.
* Apply more thorough checks for missing operators and functions,
where possible.
* Instead of reporting problems as ERRORs, report most problems as INFO
messages and make the amvalidate function return FALSE. This allows
more than one problem to be discovered per run.
* Report object names rather than OIDs, and work a bit harder on making
the messages understandable.
Also, remove a few more opr_sanity regression test queries that are
now superseded by the amvalidate checks.
Putting a reference to an expanded-format value into a Const node would be
a bad idea for a couple of reasons. It'd be possible for the supposedly
immutable Const to change value, if something modified the referenced
variable ... in fact, if the Const's reference were R/W, any function that
has the Const as argument might itself change it at runtime. Also, because
datumIsEqual() is pretty simplistic, the Const might fail to compare equal
to other Consts that it should compare equal to, notably including copies
of itself. This could lead to unexpected planner behavior, such as "could
not find pathkey item to sort" errors or inferior plans.
I have not been able to find any way to get an expanded value into a Const
within the existing core code; but Paul Ramsey was able to trigger the
problem by writing a datatype input function that returns an expanded
value.
The best fix seems to be to establish a rule that varlena values being
placed into Const nodes should be passed through pg_detoast_datum().
That will do nothing (and cost little) in normal cases, but it will flatten
expanded values and thereby avoid the above problems. Also, it will
convert short-header or compressed values into canonical format, which will
avoid possible unexpected lack-of-equality issues for those cases too.
And it provides a last-ditch defense against putting a toasted value into
a Const, which we already knew was dangerous, cf commit 2b0c86b665.
(In the light of this discussion, I'm no longer sure that that commit
provided 100% protection against such cases, but this fix should do it.)
The test added in commit 65c3d05e18 to catch datatype input functions
with unstable results would fail for functions that returned expanded
values; but it seems a bit uncharitable to deem a result unstable just
because it's expressed in expanded form, so revise the coding so that we
check for bitwise equality only after applying pg_detoast_datum(). That's
a sufficient condition anyway given the new rule about detoasting when
forming a Const.
Back-patch to 9.5 where the expanded-object facility was added. It's
possible that this should go back further; but in the absence of clear
evidence that there's any live bug in older branches, I'll refrain for now.
2PC state info is written only to WAL at PREPARE, then read back from WAL at
COMMIT PREPARED/ABORT PREPARED. Prepared transactions that live past one bufmgr
checkpoint cycle will be written to disk in the same form as previously. Crash
recovery path is not altered. Measured performance gains of 50-100% for short
2PC transactions by completely avoiding writing files and fsyncing. Other
optimizations still available, further patches in related areas expected.
Stas Kelvich and heavily edited by Simon Riggs
Based upon earlier ideas and patches by Michael Paquier and Heikki Linnakangas,
a concrete example of how Postgres-XC has fed back ideas into PostgreSQL.
Reviewed by Michael Paquier, Jeff Janes and Andres Freund
Performance testing by Jesper Pedersen
Previously we didn’t have a generic WAL page read callback function,
surprisingly. Logical decoding has logical_read_local_xlog_page(), which was
actually generic, so move that to xlogfunc.c and rename to
read_local_xlog_page().
Maintain logical_read_local_xlog_page() so existing callers still work.
As requested by Michael Paquier, Alvaro Herrera and Andres Freund
The core innovation of this patch is the introduction of the concept
of a partial path; that is, a path which if executed in parallel will
generate a subset of the output rows in each process. Gathering a
partial path produces an ordinary (complete) path. This allows us to
generate paths for parallel joins by joining a partial path for one
side (which at the baserel level is currently always a Partial Seq
Scan) to an ordinary path on the other side. This is subject to
various restrictions at present, especially that this strategy seems
unlikely to be sensible for merge joins, so only nested loops and
hash joins paths are generated.
This also allows an Append node to be pushed below a Gather node in
the case of a partitioned table.
Testing revealed that early versions of this patch made poor decisions
in some cases, which turned out to be caused by the fact that the
original cost model for Parallel Seq Scan wasn't very good. So this
patch tries to make some modest improvements in that area.
There is much more to be done in the area of generating good parallel
plans in all cases, but this seems like a useful step forward.
Patch by me, reviewed by Dilip Kumar and Amit Kapila.
Aggregate nodes now have two new modes: a "partial" mode where they
output the unfinalized transition state, and a "finalize" mode where
they accept unfinalized transition states rather than individual
values as input.
These new modes are not used anywhere yet, but they will be necessary
for parallel aggregation. The infrastructure also figures to be
useful for cases where we want to aggregate local data and remote
data via the FDW interface, and want to bring back partial aggregates
from the remote side that can then be combined with locally generated
partial aggregates to produce the final value. It may also be useful
even when neither FDWs nor parallelism are in play, as explained in
the comments in nodeAgg.c.
David Rowley and Simon Riggs, reviewed by KaiGai Kohei, Heikki
Linnakangas, Haribabu Kommi, and me.
GIN had some minor issues too, mostly using "internal" where something
else would be more appropriate. I went with the same approach as in
9ff60273e3, namely preferring the opclass' indexed datatype for
arguments that receive an operator RHS value, even if that's not
necessarily what they really are.
Again, this is with an eye to having a uniform rule for ginvalidate()
to check support function signatures.
The conventions specified by the GiST SGML documentation were widely
ignored. For example, the strategy-number argument for "consistent" and
"distance" functions is specified to be a smallint, but most of the
built-in support functions declared it as an integer, and for that matter
the core code passed it using Int32GetDatum not Int16GetDatum. None of
that makes any real difference at runtime, but it's quite confusing for
newcomers to the code, and it makes it very hard to write an amvalidate()
function that checks support function signatures. So let's try to instill
some consistency here.
Another similar issue is that the "query" argument is not of a single
well-defined type, but could have different types depending on the strategy
(corresponding to search operators with different righthand-side argument
types). Some of the functions threw up their hands and declared the query
argument as being of "internal" type, which surely isn't right ("any" would
have been more appropriate); but the majority position seemed to be to
declare it as being of the indexed data type, corresponding to a search
operator with both input types the same. So I've specified a convention
that that's what to do always.
Also, the result of the "union" support function actually must be of the
index's storage type, but the documentation suggested declaring it to
return "internal", and some of the functions followed that. Standardize
on telling the truth, instead.
Similarly, standardize on declaring the "same" function's inputs as
being of the storage type, not "internal".
Also, somebody had forgotten to add the "recheck" argument to both
the documentation of the "distance" support function and all of their
SQL declarations, even though the C code was happily using that argument.
Clean that up too.
Fix up some other omissions in the docs too, such as documenting that
union's second input argument is vestigial.
So far as the errors in core function declarations go, we can just fix
pg_proc.h and bump catversion. Adjusting the erroneous declarations in
contrib modules is more debatable: in principle any change in those
scripts should involve an extension version bump, which is a pain.
However, since these changes are purely cosmetic and make no functional
difference, I think we can get away without doing that.
This patch reduces pg_am to just two columns, a name and a handler
function. All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function. This is similar to
the designs we've adopted for FDWs and tablesample methods. There
are multiple advantages. For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.
A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL. We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.
Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
A scan for missed proisstrict markings in the core code turned up
these functions:
brin_summarize_new_values
pg_stat_reset_single_table_counters
pg_stat_reset_single_function_counters
pg_create_logical_replication_slot
pg_create_physical_replication_slot
pg_drop_replication_slot
The first three of these take OID, so a null argument will normally look
like a zero to them, resulting in "ERROR: could not open relation with OID
0" for brin_summarize_new_values, and no action for the pg_stat_reset_XXX
functions. The other three will dump core on a null argument, though this
is mitigated by the fact that they won't do so until after checking that
the caller is superuser or has rolreplication privilege.
In addition, the pg_logical_slot_get/peek[_binary]_changes family was
intentionally marked nonstrict, but failed to make nullness checks on all
the arguments; so again a null-pointer-dereference crash is possible but
only for superusers and rolreplication users.
Add the missing ARGISNULL checks to the latter functions, and mark the
former functions as strict in pg_proc. Make that change in the back
branches too, even though we can't force initdb there, just so that
installations initdb'd in future won't have the issue. Since none of these
bugs rise to the level of security issues (and indeed the pg_stat_reset_XXX
functions hardly misbehave at all), it seems sufficient to do this.
In addition, fix some order-of-operations oddities in the slot_get_changes
family, mostly cosmetic, but not the part that moves the function's last
few operations into the PG_TRY block. As it stood, there was significant
risk for an error to exit without clearing historical information from
the system caches.
The slot_get_changes bugs go back to 9.4 where that code was introduced.
Back-patch appropriate subsets of the pg_proc changes into all active
branches, as well.
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.
This commit skips the “pin scan” that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin scan
is skipped only in replay; the VACUUM code path on master is not touched here.
The current commit still performs the pin scan for toast indexes, though this
can also be avoided if we recheck scans on toast indexes. Later patch will
address this.
No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.
This new view provides insight into the state of a running WAL receiver
in a HOT standby node.
The information returned includes the PID of the WAL receiver process,
its status (stopped, starting, streaming, etc), start LSN and TLI, last
received LSN and TLI, timestamp of last message send and receipt, latest
end-of-WAL LSN and time, and the name of the slot (if any).
Access to the detailed data is only granted to superusers; others only
get the PID.
Author: Michael Paquier
Reviewer: Haribabu Kommi
pg_ctl is using isatty() to verify whether the process is running in a
terminal, and if not it sends its output to Windows' Event Log ... which
does the wrong thing when the output has been redirected to a pipe, as
reported in bug #13592.
To fix, make pg_ctl use the code we already have to detect service-ness:
in the master branch, move src/backend/port/win32/security.c to src/port
(with suitable tweaks so that it runs properly in backend and frontend
environments); pg_ctl already has access to pgport so it Just Works. In
older branches, that's likely to cause trouble, so instead duplicate the
required code in pg_ctl.c.
Author: Michael Paquier
Bug report and diagnosis: Egon Kocjan
Backpatch: all supported branches
Using cstring as the input type was a poor decision, because that's not
really a full-fledged type. In particular, it lacks implicit coercions
from text or varchar, meaning that usages like to_regproc('foo'||'bar')
wouldn't work; basically the only case that did work without explicit
casting was a simple literal constant argument.
The lack of field complaints about this suggests that hardly anyone
is using these functions, so hopefully fixing it won't cause much of
a compatibility problem. They've only been there since 9.4, anyway.
Petr Korobeinikov
While the in-core authentication mechanism doesn't need to access
pg_shseclabel at all, it's reasonable to think that an authentication
hook will want to look at the label for the role logging in, or for rows
in other catalogs used during the authentication phase of startup.
Catalog version bumped, because this changes the "is nailed" status for
pg_shseclabel.
Author: Adam Brightwell
Fix an oversight in commit 321eed5f0f: replacing an operator's
selectivity functions needs to result in a corresponding update in
pg_depend. We have a function that can handle that, but it was not
called by AlterOperator().
To fix this without enlarging pg_operator.h's #include list beyond
what clients can safely include, split off the function definitions
into a new file pg_operator_fn.h, similarly to what we've done for
some other catalog header files. It's not entirely clear whether
any client-side code needs to include pg_operator.h, but it seems
prudent to assume that there is some such code somewhere.
The variables newestCommitTs and oldestCommitTs sound as if they are
timestamps, but in fact they are the transaction Ids that correspond
to the newest and oldest timestamps rather than the actual timestamps.
Rename these variables to reflect that they are actually xids: to wit
newestCommitTsXid and oldestCommitTsXid respectively. Also modify
related code in a similar fashion, particularly the user facing output
emitted by pg_controldata and pg_resetxlog.
Complaint and patch by me, review by Tom Lane and Alvaro Herrera.
Backpatch to 9.5 where these variables were first introduced.
Omitted boundaries represent the upper or lower limit of the corresponding
array subscript. This allows simpler specification of many common
use-cases.
(Revised version of commit 9246af6799)
YUriy Zhuravlev
When use_remote_estimate is enabled, consider adding ORDER BY to the
query we sending to the remote server so that we can use that ordered
data for a merge join. Commit f18c944b61
arranges to push down the query pathkeys, which seems like the case
mostly likely to be a win, but testing shows this can sometimes win,
too.
For a regular table, we know which indexes are present and therefore
test whether the ordering provided by each such index is useful. Here,
we take the opposite approach: guess what orderings would be useful if
they could be generated cheaply, and then ask the remote side what those
will cost.
Ashutosh Bapat, with very substantial cosmetic revisions by me. Also
reviewed by Rushabh Lathia.
Previously, -j caused the entire input file to be read in and executed as
a single command string. That's undesirable, not least because any error
causes the entire file to be regurgitated as the "failing query". Some
experimentation suggests a better rule: end the command string when we see
a semicolon immediately followed by two newlines, ie, an empty line after
a query. This serves nicely to break up the existing examples such as
information_schema.sql and system_views.sql. A limitation is that it's
no longer possible to write such a sequence within a string literal or
multiline comment in a file meant to be read with -j; but there are no
instances of such a problem within the data currently used by initdb.
(If someone does make such a mistake in future, it'll be obvious because
they'll get an unterminated-literal or unterminated-comment syntax error.)
Other than that, there shouldn't be any negative consequences; you're not
forced to end statements that way, it's just a better idea in most cases.
In passing, remove src/include/tcop/tcopdebug.h, which is dead code
because it's not included anywhere, and hasn't been for more than
ten years. One of the debug-support symbols it purported to describe
has been unreferenced for at least the same amount of time, and the
other is removed by this commit on the grounds that it was useless:
forcing -j mode all the time would have broken initdb. The lack of
complaints about that, or about the missing inclusion, shows that
no one has tried to use TCOP_DONTUSENEWLINE in many years.
This is necessary so that REASSIGN OWNED does the right thing with
composite types, to wit, that it also alters ownership of the type's
pg_class entry -- previously, the pg_class entry remained owned by the
original user, which caused later other failures such as the new owner's
inability to use ALTER TYPE to rename an attribute of the affected
composite. Also, if the original owner is later dropped, the pg_class
entry becomes owned by a non-existant user which is bogus.
To fix, create a new routine AlterTypeOwner_oid which knows whether to
pass the request to ATExecChangeOwner or deal with it directly, and use
that in shdepReassignOwner rather than calling AlterTypeOwnerInternal
directly. AlterTypeOwnerInternal is now simpler in that it only
modifies the pg_type entry and recurses to handle a possible array type;
higher-level tasks are handled by either AlterTypeOwner directly or
AlterTypeOwner_oid.
I took the opportunity to add a few more objects to the test rig for
REASSIGN OWNED, so that more cases are exercised. Additional ones could
be added for superuser-only-ownable objects (such as FDWs and event
triggers) but I didn't want to push my luck by adding a new superuser to
the tests on a backpatchable bug fix.
Per bug #13666 reported by Chris Pacejo.
Backpatch to 9.5.
(I would back-patch this all the way back, except that it doesn't apply
cleanly in 9.4 and earlier because 59367fdf9 wasn't backpatched. If we
decide that we need this in earlier branches too, we should backpatch
both.)
It emerges that libreadline doesn't notice terminal window size change
events unless they occur while collecting input. This is easy to stumble
over if you resize the window while using a pager to look at query output,
but it can be demonstrated without any pager involvement. The symptom is
that queries exceeding one line are misdisplayed during subsequent input
cycles, because libreadline has the wrong idea of the screen dimensions.
The safest, simplest way to fix this is to call rl_reset_screen_size()
just before calling readline(). That causes an extra ioctl(TIOCGWINSZ)
for every command; but since it only happens when reading from a tty, the
performance impact should be negligible. A more valid objection is that
this still leaves a tiny window during entry to readline() wherein delivery
of SIGWINCH will be missed; but the practical consequences of that are
probably negligible. In any case, there doesn't seem to be any good way to
avoid the race, since readline exposes no functions that seem safe to call
from a generic signal handler --- rl_reset_screen_size() certainly isn't.
It turns out that we also need an explicit rl_initialize() call, else
rl_reset_screen_size() dumps core when called before the first readline()
call.
rl_reset_screen_size() is not present in old versions of libreadline,
so we need a configure test for that. (rl_initialize() is present at
least back to readline 4.0, so we won't bother with a test for it.)
We would need a configure test anyway since libedit's emulation of
libreadline doesn't currently include such a function. Fortunately,
libedit seems not to have any corresponding bug.
Merlin Moncure, adjusted a bit by me
Move the content lock directly into the BufferDesc, so that locking and
pinning a buffer touches only one cache line rather than two. Adjust
the definition of BufferDesc slightly so that this doesn't make the
BufferDesc any larger than one cache line (at least on platforms where
a spinlock is only 1 or 2 bytes).
We can't fit the I/O locks into the BufferDesc and stay within one
cache line, so move those to a completely separate tranche. This
leaves a relatively limited number of LWLocks in the main tranche, so
increase the padding of those remaining locks to a full cache line,
rather than allowing adjacent locks to share a cache line, hopefully
reducing false sharing.
Performance testing shows that these changes make little difference
on laptop-class machines, but help significantly on larger servers,
especially those with more than 2 sockets.
Andres Freund, originally based on an earlier patch by Simon Riggs.
Review and cosmetic adjustments (including heavy rewriting of the
comments) by me.
It's a bit cumbersome to use LWLockNewTrancheId(), because the returned
value needs to be shared between backends so that each backend can call
LWLockRegisterTranche() with the correct ID. So, for built-in tranches,
use a hard-coded value instead.
This is motivated by an upcoming patch adding further built-in tranches.
Andres Freund and Robert Haas