A scan for missed proisstrict markings in the core code turned up
these functions:
brin_summarize_new_values
pg_stat_reset_single_table_counters
pg_stat_reset_single_function_counters
pg_create_logical_replication_slot
pg_create_physical_replication_slot
pg_drop_replication_slot
The first three of these take OID, so a null argument will normally look
like a zero to them, resulting in "ERROR: could not open relation with OID
0" for brin_summarize_new_values, and no action for the pg_stat_reset_XXX
functions. The other three will dump core on a null argument, though this
is mitigated by the fact that they won't do so until after checking that
the caller is superuser or has rolreplication privilege.
In addition, the pg_logical_slot_get/peek[_binary]_changes family was
intentionally marked nonstrict, but failed to make nullness checks on all
the arguments; so again a null-pointer-dereference crash is possible but
only for superusers and rolreplication users.
Add the missing ARGISNULL checks to the latter functions, and mark the
former functions as strict in pg_proc. Make that change in the back
branches too, even though we can't force initdb there, just so that
installations initdb'd in future won't have the issue. Since none of these
bugs rise to the level of security issues (and indeed the pg_stat_reset_XXX
functions hardly misbehave at all), it seems sufficient to do this.
In addition, fix some order-of-operations oddities in the slot_get_changes
family, mostly cosmetic, but not the part that moves the function's last
few operations into the PG_TRY block. As it stood, there was significant
risk for an error to exit without clearing historical information from
the system caches.
The slot_get_changes bugs go back to 9.4 where that code was introduced.
Back-patch appropriate subsets of the pg_proc changes into all active
branches, as well.
This is necessary so that REASSIGN OWNED does the right thing with
composite types, to wit, that it also alters ownership of the type's
pg_class entry -- previously, the pg_class entry remained owned by the
original user, which caused later other failures such as the new owner's
inability to use ALTER TYPE to rename an attribute of the affected
composite. Also, if the original owner is later dropped, the pg_class
entry becomes owned by a non-existant user which is bogus.
To fix, create a new routine AlterTypeOwner_oid which knows whether to
pass the request to ATExecChangeOwner or deal with it directly, and use
that in shdepReassignOwner rather than calling AlterTypeOwnerInternal
directly. AlterTypeOwnerInternal is now simpler in that it only
modifies the pg_type entry and recurses to handle a possible array type;
higher-level tasks are handled by either AlterTypeOwner directly or
AlterTypeOwner_oid.
I took the opportunity to add a few more objects to the test rig for
REASSIGN OWNED, so that more cases are exercised. Additional ones could
be added for superuser-only-ownable objects (such as FDWs and event
triggers) but I didn't want to push my luck by adding a new superuser to
the tests on a backpatchable bug fix.
Per bug #13666 reported by Chris Pacejo.
This is a backpatch of commit 756e7b4c9d to branches 9.1 -- 9.4.
It emerges that libreadline doesn't notice terminal window size change
events unless they occur while collecting input. This is easy to stumble
over if you resize the window while using a pager to look at query output,
but it can be demonstrated without any pager involvement. The symptom is
that queries exceeding one line are misdisplayed during subsequent input
cycles, because libreadline has the wrong idea of the screen dimensions.
The safest, simplest way to fix this is to call rl_reset_screen_size()
just before calling readline(). That causes an extra ioctl(TIOCGWINSZ)
for every command; but since it only happens when reading from a tty, the
performance impact should be negligible. A more valid objection is that
this still leaves a tiny window during entry to readline() wherein delivery
of SIGWINCH will be missed; but the practical consequences of that are
probably negligible. In any case, there doesn't seem to be any good way to
avoid the race, since readline exposes no functions that seem safe to call
from a generic signal handler --- rl_reset_screen_size() certainly isn't.
It turns out that we also need an explicit rl_initialize() call, else
rl_reset_screen_size() dumps core when called before the first readline()
call.
rl_reset_screen_size() is not present in old versions of libreadline,
so we need a configure test for that. (rl_initialize() is present at
least back to readline 4.0, so we won't bother with a test for it.)
We would need a configure test anyway since libedit's emulation of
libreadline doesn't currently include such a function. Fortunately,
libedit seems not to have any corresponding bug.
Merlin Moncure, adjusted a bit by me
More fuzz testing by Andreas Seltenreich exposed that the planner did not
cope well with chains of lateral references. If relation X references Y
laterally, and Y references Z laterally, then we will have to scan X on the
inside of a nestloop with Z, so for all intents and purposes X is laterally
dependent on Z too. The planner did not understand this and would generate
intermediate joins that could not be used. While that was usually harmless
except for wasting some planning cycles, under the right circumstances it
would lead to "failed to build any N-way joins" or "could not devise a
query plan" planner failures.
To fix that, convert the existing per-relation lateral_relids and
lateral_referencers relid sets into their transitive closures; that is,
they now show all relations on which a rel is directly or indirectly
laterally dependent. This not only fixes the chained-reference problem
but allows some of the relevant tests to be made substantially simpler
and faster, since they can be reduced to simple bitmap manipulations
instead of searches of the LateralJoinInfo list.
Also, when a PlaceHolderVar that is due to be evaluated at a join contains
lateral references, we should treat those references as indirect lateral
dependencies of each of the join's base relations. This prevents us from
trying to join any individual base relations to the lateral reference
source before the join is formed, which again cannot work.
Andreas' testing also exposed another oversight in the "dangerous
PlaceHolderVar" test added in commit 85e5e222b1. Simply rejecting
unsafe join paths in joinpath.c is insufficient, because in some cases
we will end up rejecting *all* possible paths for a particular join, again
leading to "could not devise a query plan" failures. The restriction has
to be known also to join_is_legal and its cohort functions, so that they
will not select a join for which that will happen. I chose to move the
supporting logic into joinrels.c where the latter functions are.
Back-patch to 9.3 where LATERAL support was introduced.
At the end of crash recovery, unlogged relations are reset to the empty
state, using their init fork as the template. The init fork is copied to
the main fork without going through shared buffers. Unfortunately WAL
replay so far has not necessarily flushed writes from shared buffers to
disk at that point. In normal crash recovery, and before the
introduction of 'fast promotions' in fd4ced523 / 9.3, the
END_OF_RECOVERY checkpoint flushes the buffers out in time. But with
fast promotions that's not the case anymore.
To fix, force WAL writes targeting the init fork to be flushed
immediately (using the new FlushOneBuffer() function). In 9.5+ that
flush can centrally be triggered from the code dealing with restoring
full page writes (XLogReadBufferForRedoExtended), in earlier releases
that responsibility is in the hands of XLOG_HEAP_NEWPAGE's replay
function.
Backpatch to 9.1, even if this currently is only known to trigger in
9.3+. Flushing earlier is more robust, and it is advantageous to keep
the branches similar.
Typical symptoms of this bug are errors like
'ERROR: index "..." contains unexpected zero page at block 0'
shortly after promoting a node.
Reported-By: Thom Brown
Author: Andres Freund and Michael Paquier
Discussion: 20150326175024.GJ451@alap3.anarazel.de
Backpatch: 9.1-
While convincing myself that commit 7e19db0c09 would solve both of
the problems recently reported by Andreas Seltenreich, I realized that
add_paths_to_joinrel's handling of LATERAL restrictions could be made
noticeably simpler and faster if we were to retain the minimum possible
parameterization for each joinrel (that is, the set of relids supplying
unsatisfied lateral references in it). We already retain that for
baserels, in RelOptInfo.lateral_relids, so we can use that field for
joinrels too.
This is a back-port of commit edca44b152.
I originally intended not to back-patch that, but additional hacking
in this area turns out to be needed, making it necessary not optional
to compute lateral_relids for joinrels. In preparation for those fixes,
sync the relevant code with HEAD as much as practical. (I did not risk
rearranging fields of RelOptInfo in released branches, however.)
It was possible for the planner to decide to join a LATERAL subquery to
the outer side of an outer join before the outer join itself is completed.
Normally that's fine because of the associativity rules, but it doesn't
work if the subquery contains a lateral reference to the inner side of the
outer join. In such a situation the outer join *must* be done first.
join_is_legal() missed this consideration and would allow the join to be
attempted, but the actual path-building code correctly decided that no
valid join path could be made, sometimes leading to planner errors such as
"failed to build any N-way joins".
Per report from Andreas Seltenreich. Back-patch to 9.3 where LATERAL
support was added.
The POSIX standard for tar headers requires archive member sizes to be
printed in octal with at most 11 digits, limiting the representable file
size to 8GB. However, GNU tar and apparently most other modern tars
support a convention in which oversized values can be stored in base-256,
allowing any practical file to be a tar member. Adopt this convention
to remove two limitations:
* pg_dump with -Ft output format failed if the contents of any one table
exceeded 8GB.
* pg_basebackup failed if the data directory contained any file exceeding
8GB. (This would be a fatal problem for installations configured with a
table segment size of 8GB or more, and it has also been seen to fail when
large core dump files exist in the data directory.)
File sizes under 8GB are still printed in octal, so that no compatibility
issues are created except in cases that would have failed entirely before.
In addition, this patch fixes several bugs in the same area:
* In 9.3 and later, we'd defined tarCreateHeader's file-size argument as
size_t, which meant that on 32-bit machines it would write a corrupt tar
header for file sizes between 4GB and 8GB, even though no error was raised.
This broke both "pg_dump -Ft" and pg_basebackup for such cases.
* pg_restore from a tar archive would fail on tables of size between 4GB
and 8GB, on machines where either "size_t" or "unsigned long" is 32 bits.
This happened even with an archive file not affected by the previous bug.
* pg_basebackup would fail if there were files of size between 4GB and 8GB,
even on 64-bit machines.
* In 9.3 and later, "pg_basebackup -Ft" failed entirely, for any file size,
on 64-bit big-endian machines.
In view of these potential data-loss bugs, back-patch to all supported
branches, even though removal of the documented 8GB limit might otherwise
be considered a new feature rather than a bug fix.
Commit bda76c1c8c caused both plus and
minus infinity to be rendered as "infinity", which is not only wrong
but inconsistent with the pre-9.4 behavior of to_json(). Fix that by
duplicating the coding in date_out/timestamp_out/timestamptz_out more
closely. Per bug #13687 from Stepan Perlov. Back-patch to 9.4, like
the previous commit.
In passing, also re-pgindent json.c, since it had gotten a bit messed up by
recent patches (and I was already annoyed by indentation-related problems
in back-patching this fix ...)
This code previously counted the number of NFA states it created, and
complained if a limit was exceeded, so as to prevent bizarre regex patterns
from consuming unreasonable time or memory. That's fine as far as it went,
but the code paid no attention to how many arcs linked those states. Since
regexes can be contrived that have O(N) states but will need O(N^2) arcs
after fixempties() processing, it was still possible to blow out memory,
and take a long time doing it too. To fix, modify the bookkeeping to count
space used by both states and arcs.
I did not bother with including the "color map" in the accounting; it
can only grow to a few megabytes, which is not a lot in comparison to
what we're allowing for states+arcs (about 150MB on 64-bit machines
or half that on 32-bit machines).
Looking at some of the larger real-world regexes captured in the Tcl
regression test suite suggests that the most that is likely to be needed
for regexes found in the wild is under 10MB, so I believe that the current
limit has enough headroom to make it okay to keep it as a hard-wired limit.
In connection with this, redefine REG_ETOOBIG as meaning "regular
expression is too complex"; the previous wording of "nfa has too many
states" was already somewhat inapropos because of the error code's use
for stack depth overrun, and it was not very user-friendly either.
Back-patch to all supported branches.
Change the singly-linked in-arc and out-arc lists to be doubly-linked,
so that arc deletion is constant time rather than having worst-case time
proportional to the number of other arcs on the connected states.
Modify the bulk arc transfer operations copyins(), copyouts(), moveins(),
moveouts() so that they use a sort-and-merge algorithm whenever there's
more than a small number of arcs to be copied or moved. The previous
method is O(N^2) in the number of arcs involved, because it performs
duplicate checking independently for each copied arc. The new method may
change the ordering of existing arcs for the destination state, but nothing
really cares about that.
Provide another bulk arc copying method mergeins(), which is unused as
of this commit but is needed for the next one. It basically is like
copyins(), but the source arcs might not all come from the same state.
Replace the O(N^2) bubble-sort algorithm used in carcsort() with a qsort()
call.
These changes greatly improve the performance of regex compilation for
large or complex regexes, at the cost of extra space for arc storage during
compilation. The original tradeoff was probably fine when it was made, but
now we care more about speed and less about memory consumption.
Back-patch to all supported branches.
Postmaster child processes that aren't supposed to be attached to shared
memory were not bothering to close the shared memory mapping handle they
inherit from the postmaster process. That's mostly harmless, since the
handle vanishes anyway when the child process exits -- but the syslogger
process, if used, doesn't get killed and restarted during recovery from a
backend crash. That meant that Windows doesn't see the shared memory
mapping as becoming free, so it doesn't delete it and the postmaster is
unable to create a new one, resulting in failure to recover from crashes
whenever logging_collector is turned on.
Per report from Dmitry Vasilyev. It's a bit astonishing that we'd not
figured this out long ago, since it's been broken from the very beginnings
of out native Windows support; probably some previously-unexplained trouble
reports trace to this.
A secondary problem is that on Cygwin (perhaps only in older versions?),
exec() may not detach from the shared memory segment after all, in which
case these child processes did remain attached to shared memory, posing
the risk of an unexpected shared memory clobber if they went off the rails
somehow. That may be a long-gone bug, but we can deal with it now if it's
still live, by detaching within the infrastructure introduced here to deal
with closing the handle.
Back-patch to all supported branches.
Tom Lane and Amit Kapila
The postmaster now checks every minute or so (worst case, at most two
minutes) that postmaster.pid is still there and still contains its own PID.
If not, it performs an immediate shutdown, as though it had received
SIGQUIT.
The original goal behind this change was to ensure that failed buildfarm
runs would get fully cleaned up, even if the test scripts had left a
postmaster running, which is not an infrequent occurrence. When the
buildfarm script removes a test postmaster's $PGDATA directory, its next
check on postmaster.pid will fail and cause it to exit. Previously, manual
intervention was often needed to get rid of such orphaned postmasters,
since they'd block new test postmasters from obtaining the expected socket
address.
However, by checking postmaster.pid and not something else, we can provide
additional robustness: manual removal of postmaster.pid is a frequent DBA
mistake, and now we can at least limit the damage that will ensue if a new
postmaster is started while the old one is still alive.
Back-patch to all supported branches, since we won't get the desired
improvement in buildfarm reliability otherwise.
Some of the functions in regex compilation and execution recurse, and
therefore could in principle be driven to stack overflow. The Tcl crew
has seen this happen in practice in duptraverse(), though their fix was
to put in a hard-wired limit on the number of recursive levels, which is
not too appetizing --- fortunately, we have enough infrastructure to check
the actually available stack. Greg Stark has also seen it in other places
while fuzz testing on a machine with limited stack space. Let's put guards
in to prevent crashes in all these places.
Since the regex code would leak memory if we simply threw elog(ERROR),
we have to introduce an API that checks for stack depth without throwing
such an error. Fortunately that's not difficult.
After an internal failure in shortest() or longest() while pinning down the
exact location of a match, find() forgot to free the DFA structure before
returning. This is pretty unlikely to occur, since we just successfully
ran the "search" variant of the DFA; but it could happen, and it would
result in a session-lifespan memory leak since this code uses malloc()
directly. Problem seems to have been aboriginal in Spencer's library,
so back-patch all the way.
In passing, correct a thinko in a comment I added awhile back about the
meaning of the "ntree" field.
I happened across these issues while comparing our code to Tcl's version
of the library.
This commit makes postmaster forcibly remove the files signaling
a standby promotion request. Otherwise, the existence of those files
can trigger a promotion too early, whether a user wants that or not.
This removal of files is usually unnecessary because they can exist
only during a few moments during a standby promotion. However
there is a race condition: if pg_ctl promote is executed and creates
the files during a promotion, the files can stay around even after
the server is brought up to new master. Then, if new standby starts
by using the backup taken from that master, the files can exist
at the server startup and should be removed in order to avoid
an unexpected promotion.
Back-patch to 9.1 where promote signal file was introduced.
Problem reported by Feike Steenbergen.
Original patch by Michael Paquier, modified by me.
Discussion: 20150528100705.4686.91426@wrigleys.postgresql.org
Formerly, we treated only portals created in the current subtransaction as
having failed during subtransaction abort. However, if the error occurred
while running a portal created in an outer subtransaction (ie, a cursor
declared before the last savepoint), that has to be considered broken too.
To allow reliable detection of which ones those are, add a bookkeeping
field to struct Portal that tracks the innermost subtransaction in which
each portal has actually been executed. (Without this, we'd end up
failing portals containing functions that had called the subtransaction,
thereby breaking plpgsql exception blocks completely.)
In addition, when we fail an outer-subtransaction Portal, transfer its
resources into the subtransaction's resource owner, so that they're
released early in cleanup of the subxact. This fixes a problem reported by
Jim Nasby in which a function executed in an outer-subtransaction cursor
could cause an Assert failure or crash by referencing a relation created
within the inner subtransaction.
The proximate cause of the Assert failure is that AtEOSubXact_RelationCache
assumed it could blow away a relcache entry without first checking that the
entry had zero refcount. That was a bad idea on its own terms, so add such
a check there, and to the similar coding in AtEOXact_RelationCache. This
provides an independent safety measure in case there are still ways to
provoke the situation despite the Portal-level changes.
This has been broken since subtransactions were invented, so back-patch
to all supported branches.
Tom Lane and Michael Paquier
It's only necessary to fix this in 9.4; later versions don't have this
code (because we ripped out Alpha support entirely), while earlier ones
aren't actually using pg_read_barrier() anywhere.
Per rather belated report from Christoph Berg.
On recent AIX it's necessary to configure gcc to use the native assembler
(because the GNU assembler hasn't been updated to handle AIX 6+). This
caused PG builds to fail with assembler syntax errors, because we'd try
to compile s_lock.h's gcc asm fragment for PPC, and that assembly code
relied on GNU-style local labels. We can't substitute normal labels
because it would fail in any file containing more than one inlined use of
tas(). Fortunately, that code is stable enough, and the PPC ISA is simple
enough, that it doesn't seem like too much of a maintenance burden to just
hand-code the branch offsets, removing the need for any labels.
Note that the AIX assembler only accepts "$" for the location counter
pseudo-symbol. The usual GNU convention is "."; but it appears that all
versions of gas for PPC also accept "$", so in theory this patch will not
break any other PPC platforms.
This has been reported by a few people, but Steve Underwood gets the credit
for being the first to pursue the problem far enough to understand why it
was failing. Thanks also to Noah Misch for additional testing.
Formerly, this function would always return "true" for an appendrel child
relation, because it would think that the appendrel parent was a potential
join target for the child. In principle that should only lead to some
inefficiency in planning, but fuzz testing by Andreas Seltenreich disclosed
that it could lead to "could not find pathkey item to sort" planner errors
in odd corner cases. Specifically, we would think that all columns of a
child table's multicolumn index were interesting pathkeys, causing us to
generate a MergeAppend path that sorts by all the columns. However, if any
of those columns weren't actually used above the level of the appendrel,
they would not get added to that rel's targetlist, which would result in
being unable to resolve the MergeAppend's sort keys against its targetlist
during createplan.c.
Backpatch to 9.3. In older versions, columns of an appendrel get added
to its targetlist even if they're not mentioned above the scan level,
so that the failure doesn't occur. It might be worth back-patching this
fix to older versions anyway, but I'll refrain for the moment.
The tuplesort/tuplestore memory management logic assumed that the chunk
allocation overhead for its memtuples array could not increase when
increasing the array size. This is and always was true for tuplesort,
but we (I, I think) blindly copied that logic into tuplestore.c without
noticing that the assumption failed to hold for the much smaller array
elements used by tuplestore. Given rather small work_mem, this could
result in an improper complaint about "unexpected out-of-memory situation",
as reported by Brent DeSpain in bug #13530.
The easiest way to fix this is just to increase tuplestore's initial
array size so that the assumption holds. Rather than relying on magic
constants, though, let's export a #define from aset.c that represents
the safe allocation threshold, and make tuplestore's calculation depend
on that.
Do the same in tuplesort.c to keep the logic looking parallel, even though
tuplesort.c isn't actually at risk at present. This will keep us from
breaking it if we ever muck with the allocation parameters in aset.c.
Back-patch to all supported versions. The error message doesn't occur
pre-9.3, not so much because the problem can't happen as because the
pre-9.3 tuplestore code neglected to check for it. (The chance of
trouble is a great deal larger as of 9.3, though, due to changes in the
array-size-increasing strategy.) However, allowing LACKMEM() to become
true unexpectedly could still result in less-than-desirable behavior,
so let's patch it all the way back.
Commit c9b0cbe98b accidentally broke the
order of operations during postmaster shutdown: it resulted in removing
the per-socket lockfiles after, not before, postmaster.pid. This creates
a race-condition hazard for a new postmaster that's started immediately
after observing that postmaster.pid has disappeared; if it sees the
socket lockfile still present, it will quite properly refuse to start.
This error appears to be the explanation for at least some of the
intermittent buildfarm failures we've seen in the pg_upgrade test.
Another problem, which has been there all along, is that the postmaster
has never bothered to close() its listen sockets, but has just allowed them
to close at process death. This creates a different race condition for an
incoming postmaster: it might be unable to bind to the desired listen
address because the old postmaster is still incumbent. This might explain
some odd failures we've seen in the past, too. (Note: this is not related
to the fact that individual backends don't close their client communication
sockets. That behavior is intentional and is not changed by this patch.)
Fix by adding an on_proc_exit function that closes the postmaster's ports
explicitly, and (in 9.3 and up) reshuffling the responsibility for where
to unlink the Unix socket files. Lock file unlinking can stay where it
is, but teach it to unlink the lock files in reverse order of creation.
SP-GiST initialized an all-zeros page at vacuum, but that was not
WAL-logged, which is not safe. You might get a torn page write, when it gets
flushed to disk, and end-up with a half-initialized index page. To fix,
leave it in the all-zeros state, and add it to the FSM. It will be
initialized when reused. Also don't set the page-deleted flag when recycling
an empty page. That was also not WAL-logged, and a torn write of that would
cause the page to have an invalid checksum.
Backpatch to 9.2, where SP-GiST indexes were added.
In 9.4, fix a 9.4.1 regression that allowed multiple entries for a
PGC_POSTMASTER variable to cause bogus complaints in the postmaster log.
(The issue here was that commit bf007a27ac unintentionally reverted
3e3f65973a, which suppressed any duplicate entries within
ParseConfigFp. Back-patch the reimplementation just made in HEAD, which
makes use of an "ignore" field to prevent application of superseded items.)
Add missed failure check in AlterSystemSetConfigFile(). We don't really
expect ParseConfigFp() to fail, but that's not an excuse for not checking.
In both 9.3 and 9.4, remove mistaken assignment to ConfigFileLineno that
caused line counting after an include_dir directive to be completely wrong.
Commit b89e151054 added the
ResolveCminCmaxDuringDecoding declaration to tqual.h, which uses an
HTAB parameter, without declaring HTAB. It accidentally fails to
fail to build with current sources because a declaration happens to
be included, directly or indirectly, in all source files that
currently use tqual.h before tqual.h is first included, but we
shouldn't count on that. Since an opaque declaration is enough
here, just use that, as was done in snapmgr.h.
Backpatch to 9.4, where the HTAB reference was added to tqual.h.
Commit f3b5565dd4 was a couple of bricks shy
of a load; specifically, it missed putting pg_trigger_tgrelid_tgname_index
into the relcache init file, because that index is not used by any
syscache. However, we have historically nailed that index into cache for
performance reasons. The upshot was that load_relcache_init_file always
decided that the init file was busted and silently ignored it, resulting
in a significant hit to backend startup speed.
To fix, reinstantiate RelationIdIsInInitFile() as a wrapper around
RelationSupportsSysCache(), which can know about additional relations
that should be in the init file despite being unknown to syscache.c.
Also install some guards against future mistakes of this type: make
write_relcache_init_file Assert that all nailed relations get written to
the init file, and make load_relcache_init_file emit a WARNING if it takes
the "wrong number of nailed relations" exit path. Now that we remove the
init files during postmaster startup, that case should never occur in the
field, even if we are starting a minor-version update that added or removed
rels from the nailed set. So the warning shouldn't ever be seen by end
users, but it will show up in the regression tests if somebody breaks this
logic.
Back-patch to all supported branches, like the previous commit.
When we invalidate the relcache entry for a system catalog or index, we
must also delete the relcache "init file" if the init file contains a copy
of that rel's entry. The old way of doing this relied on a specially
maintained list of the OIDs of relations present in the init file: we made
the list either when reading the file in, or when writing the file out.
The problem is that when writing the file out, we included only rels
present in our local relcache, which might have already suffered some
deletions due to relcache inval events. In such cases we correctly decided
not to overwrite the real init file with incomplete data --- but we still
used the incomplete initFileRelationIds list for the rest of the current
session. This could result in wrong decisions about whether the session's
own actions require deletion of the init file, potentially allowing an init
file created by some other concurrent session to be left around even though
it's been made stale.
Since we don't support changing the schema of a system catalog at runtime,
the only likely scenario in which this would cause a problem in the field
involves a "vacuum full" on a catalog concurrently with other activity, and
even then it's far from easy to provoke. Remarkably, this has been broken
since 2002 (in commit 7863404417), but we had
never seen a reproducible test case until recently. If it did happen in
the field, the symptoms would probably involve unexpected "cache lookup
failed" errors to begin with, then "could not open file" failures after the
next checkpoint, as all accesses to the affected catalog stopped working.
Recovery would require manually removing the stale "pg_internal.init" file.
To fix, get rid of the initFileRelationIds list, and instead consult
syscache.c's list of relations used in catalog caches to decide whether a
relation is included in the init file. This should be a tad more efficient
anyway, since we're replacing linear search of a list with ~100 entries
with a binary search. It's a bit ugly that the init file contents are now
so directly tied to the catalog caches, but in practice that won't make
much difference.
Back-patch to all supported branches.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
Commit 2ce439f337 introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.
After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/. The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.
This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.
I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().
The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.
Back-patch to all active branches, like the prior commit.
Abhijit Menon-Sen and Tom Lane
pushJsonbValue was accepting jbvBinary objects passed as WJB_ELEM or
WJB_VALUE data. While this succeeded, when those objects were later
encountered in attempting to convert the result to Jsonb, errors
occurred. With this change we ghuarantee that a JSonbValue constructed
from calls to pushJsonbValue does not contain any jbvBinary objects.
This cures a problem observed with jsonb_delete.
This means callers of pushJsonbValue no longer need to perform this
unpacking themselves. A subsequent patch will perform some cleanup in
that area.
The error was not triggered by any 9.4 code, but this is a publicly
visible routine, and so the error could be exercised by third party
code, therefore backpatch to 9.4.
Bug report from Peter Geoghegan, fix by me.
This reverts commit 16304a0134, except
for its changes in src/port/snprintf.c; as well as commit
cac18a76bb which is no longer needed.
Fujii Masao reported that the previous commit caused failures in psql on
OS X, since if one exits the pager program early while viewing a query
result, psql sees an EPIPE error from fprintf --- and the wrapper function
thought that was reason to panic. (It's a bit surprising that the same
does not happen on Linux.) Further discussion among the security list
concluded that the risk of other such failures was far too great, and
that the one-size-fits-all approach to error handling embodied in the
previous patch is unlikely to be workable.
This leaves us again exposed to the possibility of the type of failure
envisioned in CVE-2015-3166. However, that failure mode is strictly
hypothetical at this point: there is no concrete reason to believe that
an attacker could trigger information disclosure through the supposed
mechanism. In the first place, the attack surface is fairly limited,
since so much of what the backend does with format strings goes through
stringinfo.c or psprintf(), and those already had adequate defenses.
In the second place, even granting that an unprivileged attacker could
control the occurrence of ENOMEM with some precision, it's a stretch to
believe that he could induce it just where the target buffer contains some
valuable information. So we concluded that the risk of non-hypothetical
problems induced by the patch greatly outweighs the security risks.
We will therefore revert, and instead undertake closer analysis to
identify specific calls that may need hardening, rather than attempt a
universal solution.
We have kept the portion of the previous patch that improved snprintf.c's
handling of errors when it calls the platform's sprintf(). That seems to
be an unalloyed improvement.
Security: CVE-2015-3166
All known standard library implementations of these functions can fail
with ENOMEM. A caller neglecting to check for failure would experience
missing output, information exposure, or a crash. Check return values
within wrappers and code, currently just snprintf.c, that bypasses the
wrappers. The wrappers do not return after an error, so their callers
need not check. Back-patch to 9.0 (all supported versions).
Popular free software standard library implementations do take pains to
bypass malloc() in simple cases, but they risk ENOMEM for floating point
numbers, positional arguments, large field widths, and large precisions.
No specification demands such caution, so this commit regards every call
to a printf family function as a potential threat.
Injecting the wrappers implicitly is a compromise between patch scope
and design goals. I would prefer to edit each call site to name a
wrapper explicitly. libpq and the ECPG libraries would, ideally, convey
errors to the caller rather than abort(). All that would be painfully
invasive for a back-patched security fix, hence this compromise.
Security: CVE-2015-3166
The logic introduced in commit b69bf30b9b
and repaired in commits 669c7d20e6 and
7be47c56af helps to ensure that we don't
overwrite old multixact member information while it is still needed,
but a user who creates many large multixacts can still exhaust the
member space (and thus start getting errors) while autovacuum stands
idly by.
To fix this, progressively ramp down the effective value (but not the
actual contents) of autovacuum_multixact_freeze_max_age as member space
utilization increases. This makes autovacuum more aggressive and also
reduces the threshold for a manual VACUUM to perform a full-table scan.
This patch leaves unsolved the problem of ensuring that emergency
autovacuums are triggered even when autovacuum=off. We'll need to fix
that via a separate patch.
Thomas Munro and Robert Haas
Otherwise, if there's another crash, some writes from after the first
crash might make it to disk while writes from before the crash fail
to make it to disk. This could lead to data corruption.
Back-patch to all supported versions.
Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised
by me.
After a timeline switch, we would leave behind recycled WAL segments that
are in the future, but on the old timeline. After promotion, and after they
become old enough to be recycled again, we would notice that they don't have
a .ready or .done file, create a .ready file for them, and archive them.
That's bogus, because the files contain garbage, recycled from an older
timeline (or prealloced as zeros). We shouldn't archive such files.
This could happen when we're following a timeline switch during replay, or
when we switch to new timeline at end-of-recovery.
To fix, whenever we switch to a new timeline, scan the data directory for
WAL segments on the old timeline, but with a higher segment number, and
remove them. Those don't belong to our timeline history, and are most
likely bogus recycled or preallocated files. They could also be valid files
that we streamed from the primary ahead of time, but in any case, they're
not needed to recover to the new timeline.
Since commit ba7c5975ad, port/dirmod.c
has contained only Windows-specific functions. Most platforms don't
seem to mind uselessly building an empty file, but OS X for one issues
warnings. Hence, treat dirmod.c as a Windows-specific file selected
by configure rather than one that's always built. We can revert this
change if dirmod.c ever gains any non-Windows functionality again.
Back-patch to 9.4 where the mentioned commit appeared.
In investigating yesterday's crash report from Hugo Osvaldo Barrera, I only
looked back as far as commit f3aec2c7f5 where the breakage occurred
(which is why I thought the IPv4-in-IPv6 business was undocumented). But
actually the logic dates back to commit 3c9bb8886d and was simply
broken by erroneous refactoring in the later commit. A bit of archives
excavation shows that we added the whole business in response to a report
that some 2003-era Linux kernels would report IPv4 connections as having
IPv4-in-IPv6 addresses. The fact that we've had no complaints since 9.0
seems to be sufficient confirmation that no modern kernels do that, so
let's just rip it all out rather than trying to fix it.
Do this in the back branches too, thus essentially deciding that our
effective behavior since 9.0 is correct. If there are any platforms on
which the kernel reports IPv4-in-IPv6 addresses as such, yesterday's fix
would have made for a subtle and potentially security-sensitive change in
the effective meaning of IPv4 pg_hba.conf entries, which does not seem like
a good thing to do in minor releases. So let's let the post-9.0 behavior
stand, and change the documentation to match it.
In passing, I failed to resist the temptation to wordsmith the description
of pg_hba.conf IPv4 and IPv6 address entries a bit. A lot of this text
hasn't been touched since we were IPv4-only.
Fix some issues I noticed while fooling with an extension to allow an
additional kind of toast pointer. Much of this is just comment
improvement, but there are a couple of actual bugs, which might or might
not be reachable today depending on what can happen during logical
decoding. An example is that toast_flatten_tuple() failed to cover the
possibility of an indirection pointer in its input. Back-patch to 9.4
just in case that is reachable now.
In HEAD, also correct some really minor issues with recent compression
reorganization, such as dangerously underparenthesized macros.
If any error occurred while we were in the middle of reading a protocol
message from the client, we could lose sync, and incorrectly try to
interpret a part of another message as a new protocol message. That will
usually lead to an "invalid frontend message" error that terminates the
connection. However, this is a security issue because an attacker might
be able to deliberately cause an error, inject a Query message in what's
supposed to be just user data, and have the server execute it.
We were quite careful to not have CHECK_FOR_INTERRUPTS() calls or other
operations that could ereport(ERROR) in the middle of processing a message,
but a query cancel interrupt or statement timeout could nevertheless cause
it to happen. Also, the V2 fastpath and COPY handling were not so careful.
It's very difficult to recover in the V2 COPY protocol, so we will just
terminate the connection on error. In practice, that's what happened
previously anyway, as we lost protocol sync.
To fix, add a new variable in pqcomm.c, PqCommReadingMsg, that is set
whenever we're in the middle of reading a message. When it's set, we cannot
safely ERROR out and continue running, because we might've read only part
of a message. PqCommReadingMsg acts somewhat similarly to critical sections
in that if an error occurs while it's set, the error handler will force the
connection to be terminated, as if the error was FATAL. It's not
implemented by promoting ERROR to FATAL in elog.c, like ERROR is promoted
to PANIC in critical sections, because we want to be able to use
PG_TRY/CATCH to recover and regain protocol sync. pq_getmessage() takes
advantage of that to prevent an OOM error from terminating the connection.
To prevent unnecessary connection terminations, add a holdoff mechanism
similar to HOLD/RESUME_INTERRUPTS() that can be used hold off query cancel
interrupts, but still allow die interrupts. The rules on which interrupts
are processed when are now a bit more complicated, so refactor
ProcessInterrupts() and the calls to it in signal handlers so that the
signal handlers always call it if ImmediateInterruptOK is set, and
ProcessInterrupts() can decide to not do anything if the other conditions
are not met.
Reported by Emil Lenngren. Patch reviewed by Noah Misch and Andres Freund.
Backpatch to all supported versions.
Security: CVE-2015-0244
calc_rangesel() failed outright when comparing range variables to empty
constant ranges with < or >=, as a result of missing cases in a switch.
It also produced a bogus estimate for > comparison to an empty range.
On top of that, the >= and > cases were mislabeled throughout. For
nonempty constant ranges, they managed to produce the right answers
anyway as a result of counterbalancing typos.
Also, default_range_selectivity() omitted cases for elem <@ range,
range &< range, and range &> range, so that rather dubious defaults
were applied for these operators.
In passing, rearrange the code in rangesel() so that the elem <@ range
case is handled in a less opaque fashion.
Report and patch by Emre Hasegeli, some additional work by me
The requiredEntries / additionalEntries arrays were not freed in
freeScanKeys() like other per-key stuff.
It's not obvious, but startScanKey() was only ever called after the keys
have been initialized with ginNewScanKey(). That's why it doesn't need to
worry about freeing existing arrays. The ginIsNewKey() test in gingetbitmap
was never true, because ginrescan free's the existing keys, and it's not OK
to call gingetbitmap twice in a row without calling ginrescan in between.
To make that clear, remove the unnecessary ginIsNewKey(). And just to be
extra sure that nothing funny happens if there is an existing key after all,
call freeScanKeys() to free it if it exists. This makes the code more
straightforward.
(I'm seeing other similar leaks in testing a query that rescans an GIN index
scan, but that's a different issue. This just fixes the obvious leak with
those two arrays.)
Backpatch to 9.4, where GIN fast scan was added.
As of 9.3, ruleutils.c goes to some lengths to ensure that table and column
aliases used in its output are unique. Of course this takes more time than
was required before, which in itself isn't fatal. However, EXPLAIN was set
up so that recalculation of the unique aliases was repeated for each
subexpression printed in a plan. That results in O(N^2) time and memory
consumption for large plan trees, which did not happen in older branches.
Fortunately, the expensive work is the same across a whole plan tree,
so there is no need to repeat it; we can do most of the initialization
just once per query and re-use it for each subexpression. This buys
back most (not all) of the performance loss since 9.2.
We need an extra ExplainState field to hold the precalculated deparse
context. That's no problem in HEAD, but in the back branches, expanding
sizeof(ExplainState) seems risky because third-party extensions might
have local variables of that struct type. So, in 9.4 and 9.3, introduce
an auxiliary struct to keep sizeof(ExplainState) the same. We should
refactor the APIs to avoid such local variables in future, but that's
material for a separate HEAD-only commit.
Per gripe from Alexey Bashtanov. Back-patch to 9.3 where the issue
was introduced.
Some users run their applications in chroot environments that lack an
/etc/passwd file. This means that the current UID's user name and home
directory are not obtainable. libpq used to be all right with that,
so long as the database role name to use was specified explicitly.
But commit a4c8f14364 broke such cases by
causing any failure of pg_fe_getauthname() to be treated as a hard error.
In any case it did little to advance its nominal goal of causing errors
in pg_fe_getauthname() to be reported better. So revert that and instead
put some real error-reporting code in place. This requires changes to the
APIs of pg_fe_getauthname() and pqGetpwuid(), since the latter had
departed from the POSIX-specified API of getpwuid_r() in a way that made
it impossible to distinguish actual lookup errors from "no such user".
To allow such failures to be reported, while not failing if the caller
supplies a role name, add a second call of pg_fe_getauthname() in
connectOptions2(). This is a tad ugly, and could perhaps be avoided with
some refactoring of PQsetdbLogin(), but I'll leave that idea for later.
(Note that the complained-of misbehavior only occurs in PQsetdbLogin,
not when using the PQconnect functions, because in the latter we will
never bother to call pg_fe_getauthname() if the user gives a role name.)
In passing also clean up the Windows-side usage of GetUserName(): the
recommended buffer size is 257 bytes, the passed buffer length should
be the buffer size not buffer size less 1, and any error is reported
by GetLastError() not errno.
Per report from Christoph Berg. Back-patch to 9.4 where the chroot
failure case was introduced. The generally poor reporting of errors
here is of very long standing, of course, but given the lack of field
complaints about it we won't risk changing these APIs further back
(even though they're theoretically internal to libpq).
Darwin --enable-nls builds use a substitute setlocale() that may start a
thread. Buildfarm member orangutan experienced BackendList corruption
on account of different postmaster threads executing signal handlers
simultaneously. Furthermore, a multithreaded postmaster risks undefined
behavior from sigprocmask() and fork(). Emit LOG messages about the
problem and its workaround. Back-patch to 9.0 (all supported versions).
Backpatch to 9.3 where src/common was introduce, because a bugfix that
needs to be backpatched, requires the function. Earlier branches will
have to duplicate the code.
Commit 36a35c55 changed the divisor from 3 to 6, for no apparent reason.
Reducing GinMaxItemSize like that created a dump/reload hazard: loading a
9.3 database to 9.4 might fail with "index row size XXX exceeds maximum 1352
for index ..." error. Revert the change.
While we're at it, make the calculation slightly more accurate. It used to
divide the available space on page by three, then subtract
sizeof(ItemIdData), and finally round down. That's not totally accurate; the
item pointers for the three items are packed tight right after the page
header, but there is alignment padding after the item pointers. Change the
calculation to reflect that, like BTMaxItemSize does. I tested this with
different block sizes on systems with 4- and 8-byte alignment, and the value
after the final MAXALIGN_DOWN was the same with both methods on all
configurations. So this does not make any difference currently, but let's be
tidy.
Also add a comment explaining what the macro does.
This fixes bug #12292 reported by Robert Thaler. Backpatch to 9.4, where the
bug was introduced.
MapArrayTypeName would copy up to NAMEDATALEN-1 bytes of the base type
name, which of course is wrong: after prepending '_' there is only room for
NAMEDATALEN-2 bytes. Aside from being the wrong result, this case would
lead to overrunning the statically allocated work buffer. This would be a
security bug if the function were ever used outside bootstrap mode, but it
isn't, at least not in any currently supported branches.
Aside from fixing the off-by-one loop logic, this patch gets rid of the
static work buffer by having MapArrayTypeName pstrdup its result; the sole
caller was already doing that, so this just requires moving the pstrdup
call. This saves a few bytes but mainly it makes the API a lot cleaner.
Back-patch on the off chance that there is some third-party code using
MapArrayTypeName with less-secure input. Pushing pstrdup into the function
should not cause any serious problems for such hypothetical code; at worst
there might be a short term memory leak.
Per Coverity scanning.
Since this is not something that a user should change,
pg_config_manual.h was an inappropriate place for it.
In initdb.c, remove the use of the macro, because utils/guc.h can't be
included by non-backend code. But we hardcode all the other
configuration file names there, so this isn't a disaster.
The logical decoding patchset introduced PROC_IN_LOGICAL_DECODING flag
PGXACT flag, that allows such backends to be skipped when computing
the xmin horizon/snapshots. That's fine and sensible for walsenders
streaming out logical changes, but not at all fine for SQL backends
doing logical decoding. If the latter set that flag any change they
have performed outside of logical decoding will not be regarded as
visible - which e.g. can lead to that change being vacuumed away.
Note that not setting the flag for SQL backends isn't particularly
bothersome - the SQL backend doesn't do streaming, so it only runs for
a limited amount of time.
Per buildfarm member 'tick' and Alvaro.
Backpatch to 9.4, where logical decoding was introduced.
We expose a function IsValidJsonNumber that internally calls the lexer
for json numbers. That allows us to use the same test everywhere,
instead of inventing a broken test for hstore conversions. The new
function is also used in datum_to_json, replacing the code that is now
moved to the new function.
Backpatch to 9.3 where hstore_to_json_loose was introduced.
There was a window in RestoreBackupBlock where a page would be zeroed out,
but not yet locked. If a backend pinned and locked the page in that window,
it saw the zeroed page instead of the old page or new page contents, which
could lead to missing rows in a result set, or errors.
To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins,
zeroes, and locks the page, if it's not in the buffer cache already.
In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE,
to avoid breaking any 3rd party extensions that might use RBM_ZERO. More
importantly, this avoids renumbering the other enum values, which would
cause even bigger confusion in extensions that use ReadBufferExtended, but
haven't been recompiled.
Backpatch to all supported versions; this has been racy since hot standby
was introduced.
This only happens if a client issues a Parse message with an empty query
string, which is a bit odd; but since it is explicitly called out as legal
by our FE/BE protocol spec, we'd probably better continue to allow it.
Fix by adding tests everywhere that the raw_parse_tree field is passed to
functions that don't or shouldn't accept NULL. Also make it clear in the
relevant comments that NULL is an expected case.
This reverts commits a73c9dbab0 and
2e9650cbcf, which fixed specific crash
symptoms by hacking things at what now seems to be the wrong end, ie the
callee functions. Making the callees allow NULL is superficially more
robust, but it's not always true that there is a defensible thing for the
callee to do in such cases. The caller has more context and is better
able to decide what the empty-query case ought to do.
Per followup discussion of bug #11335. Back-patch to 9.2. The code
before that is sufficiently different that it would require development
of a separate patch, which doesn't seem worthwhile for what is believed
to be an essentially cosmetic change.
At one time it wasn't terribly important what column names were associated
with the fields of a composite Datum, but since the introduction of
operations like row_to_json(), it's important that looking up the rowtype
ID embedded in the Datum returns the column names that users would expect.
That did not work terribly well before this patch: you could get the column
names of the underlying table, or column aliases from any level of the
query, depending on minor details of the plan tree. You could even get
totally empty field names, which is disastrous for cases like row_to_json().
To fix this for whole-row Vars, look to the RTE referenced by the Var, and
make sure its column aliases are applied to the rowtype associated with
the result Datums. This is a tad scary because we might have to return
a transient RECORD type even though the Var is declared as having some
named rowtype. In principle it should be all right because the record
type will still be physically compatible with the named rowtype; but
I had to weaken one Assert in ExecEvalConvertRowtype, and there might be
third-party code containing similar assumptions.
Similarly, RowExprs have to be willing to override the column names coming
from a named composite result type and produce a RECORD when the column
aliases visible at the site of the RowExpr differ from the underlying
table's column names.
In passing, revert the decision made in commit 398f70ec07 to add
an alias-list argument to ExecTypeFromExprList: better to provide that
functionality in a separate function. This also reverts most of the code
changes in d685814835, which we don't need because we're no longer
depending on the tupdesc found in the child plan node's result slot to be
blessed.
Back-patch to 9.4, but not earlier, since this solution changes the results
in some cases that users might not have realized were buggy. We'll apply a
more restricted form of this patch in older branches.
As noted by Noah Misch, my initial cut at fixing bug #11638 didn't cover
all cases where ANALYZE might be invoked in an unsafe context. We need to
test the result of IsInTransactionChain not IsTransactionBlock; which is
notationally a pain because IsInTransactionChain requires an isTopLevel
flag, which would have to be passed down through several levels of callers.
I chose to pass in_outer_xact (ie, the result of IsInTransactionChain)
rather than isTopLevel per se, as that seemed marginally more apropos
for the intermediate functions to know about.
Nobody seemed concerned about this naming when it originally went in,
but there's a pending patch that implements the opposite of
dsm_keep_mapping, and the term "unkeep" was judged unpalatable.
"unpin" has existing precedent in the PostgreSQL code base, and the
English language, so use this terminology instead.
Per discussion, back-patch to 9.4.
CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source
database directory on the filesystem level. To ensure the on disk
state is consistent they block out users of the affected database and
force a checkpoint to flush out all data to disk. Unfortunately, up to
now, that checkpoint didn't flush out dirty buffers from unlogged
relations.
That bug means there could be leftover dirty buffers in either the
template database, or the database in its old location. Leading to
problems when accessing relations in an inconsistent state; and to
possible problems during shutdown in the SET TABLESPACE case because
buffers belonging files that don't exist anymore are flushed.
This was reported in bug #10675 by Maxim Boguk.
Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and
Fujii Masao.
Backpatch to 9.1 where unlogged tables were introduced.
json_agg and json_object_agg and their associated transition functions
should have been marked as stable rather than immutable, as they call IO
functions indirectly. Changing this probably isn't going to make much
difference, as you can't use an aggregate function in an index
expression, but we should be correct nevertheless.
json_object, on the other hand, should be marked immutable rather than
stable, as it does not call IO functions.
As discussed on -hackers, this change is being made without bumping the
catalog version, as we don't want to do that at this stage of the cycle,
and the changes are very unlikely to affect anyone.
Follow our usual style of providing an "extern" for a standard library
function only when we're also providing the implementation. This avoids
issues when the system headers declare the function slightly differently
than we do, as noted by Caleb Welton.
We might have to go to the extent of probing to see if the system headers
declare the function, but let's not do that until it's demonstrated to be
necessary.
Oversight in commit 9e6b1bf258. Back-patch
to all supported branches, as that was.
Up to now, PG has assumed that any given timezone abbreviation (such as
"EDT") represents a constant GMT offset in the usage of any particular
region; we had a way to configure what that offset was, but not for it
to be changeable over time. But, as with most things horological, this
view of the world is too simplistic: there are numerous regions that have
at one time or another switched to a different GMT offset but kept using
the same timezone abbreviation. Almost the entire Russian Federation did
that a few years ago, and later this month they're going to do it again.
And there are similar examples all over the world.
To cope with this, invent the notion of a "dynamic timezone abbreviation",
which is one that is referenced to a particular underlying timezone
(as defined in the IANA timezone database) and means whatever it currently
means in that zone. For zones that use or have used daylight-savings time,
the standard and DST abbreviations continue to have the property that you
can specify standard or DST time and get that time offset whether or not
DST was theoretically in effect at the time. However, the abbreviations
mean what they meant at the time in question (or most recently before that
time) rather than being absolutely fixed.
The standard abbreviation-list files have been changed to use this behavior
for abbreviations that have actually varied in meaning since 1970. The
old simple-numeric definitions are kept for abbreviations that have not
changed, since they are a bit faster to resolve.
While this is clearly a new feature, it seems necessary to back-patch it
into all active branches, because otherwise use of Russian zone
abbreviations is going to become even more problematic than it already was.
This change supersedes the changes in commit 513d06ded et al to modify the
fixed meanings of the Russian abbreviations; since we've not shipped that
yet, this will avoid an undesirably incompatible (not to mention incorrect)
change in behavior for timestamps between 2011 and 2014.
This patch makes some cosmetic changes in ecpglib to keep its usage of
datetime lookup tables as similar as possible to the backend code, but
doesn't do anything about the increasingly obsolete set of timezone
abbreviation definitions that are hard-wired into ecpglib. Whatever we
do about that will likely not be appropriate material for back-patching.
Also, a potential free() of a garbage pointer after an out-of-memory
failure in ecpglib has been fixed.
This patch also fixes pre-existing bugs in DetermineTimeZoneOffset() that
caused it to produce unexpected results near a timezone transition, if
both the "before" and "after" states are marked as standard time. We'd
only ever thought about or tested transitions between standard and DST
time, but that's not what's happening when a zone simply redefines their
base GMT offset.
In passing, update the SGML documentation to refer to the Olson/zoneinfo/
zic timezone database as the "IANA" database, since it's now being
maintained under the auspices of IANA.
We've gotten enough push-back on that change to make it clear that it
wasn't an especially good idea to do it like that. Revert plain EXPLAIN
to its previous behavior, but keep the extra output in EXPLAIN ANALYZE.
Per discussion.
Internally, I set this up as a separate flag ExplainState.summary that
controls printing of planning time and execution time. For now it's
just copied from the ANALYZE option, but we could consider exposing it
to users.
As of commit a87c72915 (which later got backpatched as far as 9.1),
we're explicitly supporting the notion that append relations can be
nested; this can occur when UNION ALL constructs are nested, or when
a UNION ALL contains a table with inheritance children.
Bug #11457 from Nelson Page, as well as an earlier report from Elvis
Pranskevichus, showed that there were still nasty bugs associated with such
cases: in particular the EquivalenceClass mechanism could try to generate
"join" clauses connecting an appendrel child to some grandparent appendrel,
which would result in assertion failures or bogus plans.
Upon investigation I concluded that all current callers of
find_childrel_appendrelinfo() need to be fixed to explicitly consider
multiple levels of parent appendrels. The most complex fix was in
processing of "broken" EquivalenceClasses, which are ECs for which we have
been unable to generate all the derived equality clauses we would like to
because of missing cross-type equality operators in the underlying btree
operator family. That code path is more or less entirely untested by
the regression tests to date, because no standard opfamilies have such
holes in them. So I wrote a new regression test script to try to exercise
it a bit, which turned out to be quite a worthwhile activity as it exposed
existing bugs in all supported branches.
The present patch is essentially the same as far back as 9.2, which is
where parameterized paths were introduced. In 9.0 and 9.1, we only need
to back-patch a small fragment of commit 5b7b5518d, which fixes failure to
propagate out the original WHERE clauses when a broken EC contains constant
members. (The regression test case results show that these older branches
are noticeably stupider than 9.2+ in terms of the quality of the plans
generated; but we don't really care about plan quality in such cases,
only that the plan not be outright wrong. A more invasive fix in the
older branches would not be a good idea anyway from a plan-stability
standpoint.)
I left the GUC in place for the beta period, so that people could experiment
with different values. No-one's come up with any data that a different value
would be better under some circumstances, so rather than try to document to
users what the GUC, let's just hard-code the current value, 8.
As noted in http://bugs.debian.org/763098 there is a conflict between
postgres' definition of CACHE_LINE_SIZE and the definition by various
*bsd platforms. It's debatable who has the right to define such a
name, but postgres' use was only introduced in 375d8526f2 (9.4), so
it seems like a good idea to rename it.
Discussion: 20140930195756.GC27407@msg.df7cb.de
Per complaint of Christoph Berg in the above email, although he's not
the original bug reporter.
Backpatch to 9.4 where the define was introduced.
The original design used an array of offsets into the variable-length
portion of a JSONB container. However, such an array is basically
uncompressible by simple compression techniques such as TOAST's LZ
compressor. That's bad enough, but because the offset array is at the
front, it tended to trigger the give-up-after-1KB heuristic in the TOAST
code, so that the entire JSONB object was stored uncompressed; which was
the root cause of bug #11109 from Larry White.
To fix without losing the ability to extract a random array element in O(1)
time, change this scheme so that most of the JEntry array elements hold
lengths rather than offsets. With data that's compressible at all, there
tend to be fewer distinct element lengths, so that there is scope for
compression of the JEntry array. Every N'th entry is still an offset.
To determine the length or offset of any specific element, we might have
to examine up to N preceding JEntrys, but that's still O(1) so far as the
total container size is concerned. Testing shows that this cost is
negligible compared to other costs of accessing a JSONB field, and that
the method does largely fix the incompressible-data problem.
While at it, rearrange the order of elements in a JSONB object so that
it's "all the keys, then all the values" not alternating keys and values.
This doesn't really make much difference right at the moment, but it will
allow providing a fast path for extracting individual object fields from
large JSONB values stored EXTERNAL (ie, uncompressed), analogously to the
existing optimization for substring extraction from large EXTERNAL text
values.
Bump catversion to denote the incompatibility in on-disk format.
We will need to fix pg_upgrade to disallow upgrading jsonb data stored
with 9.4 betas 1 and 2.
Heikki Linnakangas and Tom Lane
x86's memory barrier assembly was marked as clobbering "memory" but
not "cc" even though 'addl' sets various flags. As it turns out gcc on
x86 implicitly assumes "cc" on every inline assembler statement, so
it's not a bug. But as that's poorly documented and might get copied
to architectures or compilers where that's not the case, it seems
better to be precise.
Discussion: 20140919100016.GH4277@alap3.anarazel.de
To keep the code common, backpatch to 9.2 where explicit memory
barriers were introduced.
They were marked to return a boolean, but they actually return a
GinTernaryValue, which is more like a "char". It makes no practical
difference, as the triConsistent functions cannot be called directly from
SQL because they have "internal" arguments, but this nevertheless seems
more correct.
Also fix the GinTernaryValue name in the documentation. I renamed the enum
earlier, but neglected the docs.
Alexander Korotkov. This is new in 9.4, so backpatch there.
Some Sparc CPUs can be run in various coherence models, ranging from
RMO (relaxed) over PSO (partial) to TSO (total). Solaris has always
run CPUs in TSO mode while in userland, but linux didn't use to and
the various *BSDs still don't. Unfortunately the sparc TAS/S_UNLOCK
were only correct under TSO. Fix that by adding the necessary memory
barrier instructions. On sparcv8+, which should be all relevant CPUs,
these are treated as NOPs if the current consistency model doesn't
require the barriers.
Discussion: 20140630222854.GW26930@awork2.anarazel.de
Will be backpatched to all released branches once a few buildfarm
cycles haven't shown up problems. As I've no access to sparc, this is
blindly written.
If SELECT FOR UPDATE NOWAIT tries to lock a tuple that is concurrently
being updated, it might fail to honor its NOWAIT specification and block
instead of raising an error.
Fix by adding a no-wait flag to EvalPlanQualFetch which it can pass down
to heap_lock_tuple; also use it in EvalPlanQualFetch itself to avoid
blocking while waiting for a concurrent transaction.
Authors: Craig Ringer and Thomas Munro, tweaked by Álvaro
http://www.postgresql.org/message-id/51FB6703.9090801@2ndquadrant.com
Per Thomas Munro in the course of his SKIP LOCKED feature submission,
who also provided one of the isolation test specs.
Backpatch to 9.4, because that's as far back as it applies without
conflicts (although the bug goes all the way back). To that branch also
backpatch Thomas Munro's new NOWAIT test cases, committed in master by
Heikki as commit 9ee16b49f0 .
As 'ALTER TABLESPACE .. MOVE ALL' really didn't change the tablespace
but instead changed objects inside tablespaces, it made sense to
rework the syntax and supporting functions to operate under the
'ALTER (TABLE|INDEX|MATERIALIZED VIEW)' syntax and to be in
tablecmds.c.
Pointed out by Alvaro, who also suggested the new syntax.
Back-patch to 9.4.
<@ and @> are each other's commutators, but they were incorrectly marked
as being each other's negators instead. (This was actually questioned
in a comment in the original commit, but nobody followed through :-(.)
Per bug #11178 from Christian Pronovost.
In passing, fix some JSONB operator descriptions that were randomly
different from the phrasing of every other similar description.
catversion bump for pg_catalog contents change.
Previously, TOAST tables only required in the new cluster could cause
oid conflicts if they were auto-numbered and a later conflicting oid had
to be assigned.
Backpatch through 9.3
There were several oversights in recovery code where COMMIT/ABORT PREPARED
records were ignored:
* pg_last_xact_replay_timestamp() (wasn't updated for 2PC commits)
* recovery_min_apply_delay (2PC commits were applied immediately)
* recovery_target_xid (recovery would not stop if the XID used 2PC)
The first of those was reported by Sergiy Zuban in bug #11032, analyzed by
Tom Lane and Andres Freund. The bug was always there, but was masked before
commit d19bd29f07, because COMMIT PREPARED
always created an extra regular transaction that was WAL-logged.
Backpatch to all supported versions (older versions didn't have all the
features and therefore didn't have all of the above bugs).
Apparently we still build against OpenSSL so old that it doesn't
have this function, so add an autoconf check for it to make the
buildfarm happy. If the function doesn't exist, always return
that compression is disabled, since presumably the actual
compression functionality is always missing.
For now, hardcode the function as present on MSVC, since we should
hopefully be well beyond those old versions on that platform.
Commit 1b86c81d2d fixed the decoding of toasted columns for the rows
contained in one xl_heap_multi_insert record. But that's not actually
enough, because heap_multi_insert() will actually first toast all
passed in rows and then emit several *_multi_insert records; one for
each page it fills with tuples.
Add a XLOG_HEAP_LAST_MULTI_INSERT flag which is set in
xl_heap_multi_insert->flag denoting that this multi_insert record is
the last emitted by one heap_multi_insert() call. Then use that flag
in decode.c to only set clear_toast_afterwards in the right situation.
Expand the number of rows inserted via COPY in the corresponding
regression test to make sure that more than one heap page is filled
with tuples by one heap_multi_insert() call.
Backpatch to 9.4 like the previous commit.
The old name wasn't very descriptive as of actual contents of the
directory, which are historical snapshots in the snapshots/
subdirectory and mappingdata for rewritten tuples in
mappings/. There's been a fair amount of discussion what would be a
good name. I'm settling for pg_logical because it's likely that
further data around logical decoding and replication will need saving
in the future.
Also add the missing entry for the directory into storage.sgml's list
of PGDATA contents.
Bumps catversion as the data directories won't be compatible.
Backpatch to 9.4, which I missed to do as Michael Paquier luckily
noticed. As there already has been a catversion bump after 9.4beta1,
there's no reasons for having 9.4 diverge from master.
When decoding the results of a HEAP2_MULTI_INSERT (currently only
generated by COPY FROM) toast columns for all but the last tuple
weren't replaced by their actual contents before being handed to the
output plugin. The reassembled toast datums where disregarded after
every REORDER_BUFFER_CHANGE_(INSERT|UPDATE|DELETE) which is correct
for plain inserts, updates, deletes, but not multi inserts - there we
generate several REORDER_BUFFER_CHANGE_INSERTs for a single
xl_heap_multi_insert record.
To solve the problem add a clear_toast_afterwards boolean to
ReorderBufferChange's union member that's used by modifications. All
row changes but multi_inserts always set that to true, but
multi_insert sets it only for the last change generated.
Add a regression test covering decoding of multi_inserts - there was
none at all before.
Backpatch to 9.4 where logical decoding was introduced.
Bug found by Petr Jelinek.
The previous design exposed the input and output ExprContexts of the
Agg plan node, but work on grouping sets has suggested that we'll regret
doing that. Instead provide more narrowly-defined APIs that can be
implemented in multiple ways, namely a way to get a short-term memory
context and a way to register an aggregate shutdown callback.
Back-patch to 9.4 where the bad APIs were introduced, since we don't
want third-party code using these APIs and then having to change in 9.5.
Andrew Gierth
The "false" case was really quite useless since all it did was to throw
an error; a definition not helped in the least by making it the default.
Instead let's just have the "true" case, which emits nested objects and
arrays in JSON syntax. We might later want to provide the ability to
emit sub-objects in Postgres record or array syntax, but we'd be best off
to drive that off a check of the target field datatype, not a separate
argument.
For the functions newly added in 9.4, we can just remove the flag arguments
outright. We can't do that for json_populate_record[set], which already
existed in 9.3, but we can ignore the argument and always behave as if it
were "true". It helps that the flag arguments were optional and not
documented in any useful fashion anyway.
Instead of truncating pg_multixact at vacuum time, do it only at
checkpoint time. The reason for doing it this way is twofold: first, we
want it to delete only segments that we're certain will not be required
if there's a crash immediately after the removal; and second, we want to
do it relatively often so that older files are not left behind if
there's an untimely crash.
Per my proposal in
http://www.postgresql.org/message-id/20140626044519.GJ7340@eldon.alvh.no-ip.org
we now execute the truncation in the checkpointer process rather than as
part of vacuum. Vacuum is in only charge of maintaining in shared
memory the value to which it's possible to truncate the files; that
value is stored as part of checkpoints also, and so upon recovery we can
reuse the same value to re-execute truncate and reset the
oldest-value-still-safe-to-use to one known to remain after truncation.
Per bug reported by Jeff Janes in the course of his tests involving
bug #8673.
While at it, update some comments that hadn't been updated since
multixacts were changed.
Backpatch to 9.3, where persistency of pg_multixact files was
introduced by commit 0ac5ad5134.
These should not have existed to begin with, but there was apparently some
misunderstanding of the purpose of the opr_sanity regression test item
that checks for operator implementation functions with their own comments.
The idea there is to check for unintentional violations of the rule that
operator implementation functions shouldn't be documented separately
.... but for these functions, that is in fact what we want, since the
variadic option is useful and not accessible via the operator syntax.
Get rid of the extra pg_proc entries and fix the regression test and
documentation to be explicit about what we're doing here.
ExecMakeTableFunctionResult evaluated the arguments for a function-in-FROM
in the query-lifespan memory context. This is insignificant in simple
cases where the function relation is scanned only once; but if the function
is in a sub-SELECT or is on the inside of a nested loop, any memory
consumed during argument evaluation can add up quickly. (The potential for
trouble here had been foreseen long ago, per existing comments; but we'd
not previously seen a complaint from the field about it.) To fix, create
an additional temporary context just for this purpose.
Per an example from MauMau. Back-patch to all active branches.
Commit ea9df812d8 failed to include
NUM_BUFFER_PARTITIONS in this offset, resulting in a bad offset.
Ultimately this threw off NUM_FIXED_LWLOCKS which is based on
earlier offsets, leading to memory allocation problems. It seems
likely to have also caused increased LWLOCK contention when
serializable transactions were used, because lightweight locks used
for that overlapped others.
Reported by Amit Kapila with analysis and fix.
Backpatch to 9.4, where the bug was introduced.
data_directory could be set both in postgresql.conf and postgresql.auto.conf so far.
This could cause some problematic situations like circular definition. To avoid such
situations, this commit forbids a user to set data_directory in postgresql.auto.conf.
Backpatch this to 9.4 where ALTER SYSTEM command was introduced.
Amit Kapila, reviewed by Abhijit Menon-Sen, with minor adjustments by me.
This function is pervasive on free software operating systems; import
NetBSD's implementation. Back-patch to 8.4, like the commit that will
harness it.
The previous naming broke the query that libpq's lo_initialize() uses
to collect the OIDs of the server-side functions it requires, because
that query effectively assumes that there is only one function named
lo_create in the pg_catalog schema (and likewise only one lo_open, etc).
While we should certainly make libpq more robust about this, the naive
query will remain in use in the field for the foreseeable future, so it
seems the only workable choice is to use a different name for the new
function. lo_from_bytea() won a small straw poll.
Back-patch into 9.4 where the new function was introduced.
Change the order of checks in similar functions to be the same; remove
a parameter that's not needed anymore; rename a memory context and
expand a couple of comments.
Per review comments from Amit Kapila
In a50d976254 I already changed this, but got it wrong for the case
where the number of members is larger than the number of entries that
fit in the last page of the last segment.
As reported by Serge Negodyuck in a followup to bug #8673.
It's critical that the backend's idea of LOBLKSIZE match the way data has
actually been divided up in pg_largeobject. While we don't provide any
direct way to adjust that value, doing so is a one-line source code change
and various people have expressed interest recently in changing it. So,
just as with TOAST_MAX_CHUNK_SIZE, it seems prudent to record the value in
pg_control and cross-check that the backend's compiled-in setting matches
the on-disk data.
Also tweak the code in inv_api.c so that fetches from pg_largeobject
explicitly verify that the length of the data field is not more than
LOBLKSIZE. Formerly we just had Asserts() for that, which is no protection
at all in production builds. In some of the call sites an overlength data
value would translate directly to a security-relevant stack clobber, so it
seems worth one extra runtime comparison to be sure.
In the back branches, we can't change the contents of pg_control; but we
can still make the extra checks in inv_api.c, which will offer some amount
of protection against running with the wrong value of LOBLKSIZE.
Previously there's been a mix between 'slotname' and 'slot_name'. It's
not nice to be unneccessarily inconsistent in a new feature. As a post
beta1 initdb now is required in the wake of eeca4cd35e, fix the
inconsistencies.
Most the changes won't affect usage of replication slots because the
majority of changes is around function parameter names. The prominent
exception to that is that the recovery.conf parameter
'primary_slotname' is now named 'primary_slot_name'.
The way the code was written, the padding was copied from uninitialized
memory areas.. Because the structs are local variables in the code where
the WAL records are constructed, making them larger and zeroing the padding
bytes would not make the code very pretty, so rather than fixing this
directly by zeroing out the padding bytes, it seems more clear to not try to
align the tuples in the WAL records. The redo functions are taught to copy
the tuple header to a local variable to avoid unaligned access.
Stable-branches have the same problem, but we can't change the WAL format
there, so fix in master only. Reading a few random extra bytes at the stack
is harmless in practice, so it's not worth crafting a different
back-patchable fix.
Per reports from Kevin Grittner and Andres Freund, using clang static
analyzer and Valgrind, respectively.
This is needed to allow ORDER BY, DISTINCT, etc to work as expected for
pg_lsn values.
We had previously decided to put this off for 9.5, but in view of commit
eeca4cd35e there's no reason to avoid a
catversion bump for 9.4beta2, and this does make a pretty significant
usability difference for pg_lsn.
Michael Paquier, with fixes from Andres Freund and Tom Lane
This should have been done in 6bc8ef0b7f
and/or 50e547096c, but better late than
never. If we don't change this then we risk 9.3 pg_controldata or
pg_resetxlog being inappropriately used against a 9.4 pg_control file,
or vice versa.
187492b6c2 changed pgstat.c so that
the stats files were saved into $PGDATA/pg_stat directory when the server
was shutdowned. But it accidentally forgot to change the location of
pg_stat_statements permanent stats file. This commit fixes pg_stat_statements
so that its stats file is also saved into $PGDATA/pg_stat at shutdown.
Since this fix changes the file layout, we don't back-patch it to 9.3
where this oversight was introduced.
In general it's not a good idea for built-in types in the 'U' category
to be marked preferred; they could draw behavior away from user-defined
types with similarly-named operators. pg_lsn is probably at low risk
of that right now given the lack of casts between it and other types,
but that doesn't make this marking OK.
Ordinarily we'd bump catversion when changing any predefined catalog
contents like this, but since we're past beta1, the costs of a forced
initdb seem to outweigh the benefits of guaranteed behavioral consistency.
There's not any known behavioral impact today anyway --- this is more
in the nature of being sure there's not problems in future.
Per an off-list complaint from Thomas Fanghaenel.
Allow the contrib/uuid-ossp extension to be built atop any one of these
three popular UUID libraries. (The extension's name is now arguably a
misnomer, but we'll keep it the same so as not to cause unnecessary
compatibility issues for users.)
We would not normally consider a change like this post-beta1, but the issue
has been forced by our upgrade to autoconf 2.69, whose more rigorous header
checks are causing OSSP's header files to be rejected on some platforms.
It's been foreseen for some time that we'd have to move away from depending
on OSSP UUID due to lack of upstream maintenance, so this is a down payment
on that problem.
While at it, add some simple regression tests, in hopes of catching any
major incompatibilities between the three implementations.
Matteo Beccati, with some further hacking by me
In yesterday's commit 2dc4f011fd, I tried
to force buffering of stdout/stderr in initdb to be what it is by
default when the program is run interactively on Unix (since that's how
most manual testing is done). This tripped over the fact that Windows
doesn't support _IOLBF mode. We dealt with that a long time ago in
syslogger.c by falling back to unbuffered mode on Windows. Export that
solution in port.h and use it in initdb.
Back-patch to 8.4, like the previous commit.
To lock a prepared transaction's shared memory entry, we used to mark it
with the XID of the backend. When the XID was no longer active according
to the proc array, the entry was implicitly considered as not locked
anymore. However, when preparing a transaction, the backend's proc array
entry was cleared before transfering the locks (and some other state) to
the prepared transaction's dummy PGPROC entry, so there was a window where
another backend could finish the transaction before it was in fact fully
prepared.
To fix, rewrite the locking mechanism of global transaction entries. Instead
of an XID, just have simple locked-or-not flag in each entry (we store the
locking backend's backend id rather than a simple boolean, but that's just
for debugging purposes). The backend is responsible for explicitly unlocking
the entry, and to make sure that that happens, install a callback to unlock
it on abort or process exit.
Backpatch to all supported versions.
rd_replidindex should be managed the same as rd_oidindex, and rd_keyattr
and rd_idattr should be managed like rd_indexattr. Omissions in this area
meant that the bitmapsets computed for rd_keyattr and rd_idattr would be
leaked during any relcache flush, resulting in a slow but permanent leak in
CacheMemoryContext. There was also a tiny probability of relcache entry
corruption if we ran out of memory at just the wrong point in
RelationGetIndexAttrBitmap. Otherwise, the fields were not zeroed where
expected, which would not bother the code any AFAICS but could greatly
confuse anyone examining the relcache entry while debugging.
Also, create an API function RelationGetReplicaIndex rather than letting
non-relcache code be intimate with the mechanisms underlying caching of
that value (we won't even mention the memory leak there).
Also, fix a relcache flush hazard identified by Andres Freund:
RelationGetIndexAttrBitmap must not assume that rd_replidindex stays valid
across index_open.
The aspects of this involving rd_keyattr date back to 9.3, so back-patch
those changes.
There's no longer much pressure to switch the default GIN opclass for
jsonb, but there was still some unhappiness with the name "jsonb_hash_ops",
since hashing is no longer a distinguishing property of that opclass,
and anyway it seems like a relatively minor detail. At the suggestion of
Heikki Linnakangas, we'll use "jsonb_path_ops" instead; that captures the
important characteristic that each index entry depends on the entire path
from the document root to the indexed value.
Also add a user-facing explanation of the implementation properties of
these two opclasses.
Fix JSONB_MAX_ELEMS and JSONB_MAX_PAIRS macros to use CB_MASK in the
calculation. JENTRY_POSMASK happens to have the same value at the moment,
but that's just coincidental.
Refactor jsonb iterator functions, for readability.
Get rid of the JENTRY_ISFIRST flag. Whenever we handle JEntrys, we have
access to the whole array and have enough context information to know
which entry is the first. This frees up one bit in the JEntry header for
future use. While we're at it, shuffle the JEntry bits so that boolean
true and false go together, for aesthetic reasons.
Bump catalog version as this changes the on-disk format slightly.
Change the key representation so that values that would exceed 127 bytes
are hashed into short strings, and so that the original JSON datatype of
each value is recorded in the index. The hashing rule eliminates the major
objection to having this opclass be the default for jsonb, namely that it
could fail for plausible input data (due to GIN's restrictions on maximum
key length). Preserving datatype information doesn't really buy us much
right now, but it requires no extra space compared to the previous way,
and it might be useful later.
Also, change the consistency-checking functions to request recheck for
exists (jsonb ? text) and related operators. The original analysis that
this is an exactly checkable query was incorrect, since the index does
not preserve information about whether a key appears at top level in
the indexed JSON object. Add a test case demonstrating the problem.
Make some other, mostly cosmetic improvements to the code in jsonb_gin.c
as well.
catversion bump due to on-disk data format change in jsonb_ops indexes.
Per discussion, the old value of 128MB is ridiculously small on modern
machines; in fact, it's not even any larger than the default value of
shared_buffers, which it certainly should be. Increase to 4GB, which
is unlikely to be any worse than the old default for anyone, and should
be noticeably better for most. Eventually we might have an autotuning
scheme for this setting, but the recent attempt crashed and burned,
so for now just do this.
This reverts commit ee1e5662d8, as well as
a remarkably large number of followup commits, which were mostly concerned
with the fact that the implementation didn't work terribly well. It still
doesn't: we probably need some rather basic work in the GUC infrastructure
if we want to fully support GUCs whose default varies depending on the
value of another GUC. Meanwhile, it also emerged that there wasn't really
consensus in favor of the definition the patch tried to implement (ie,
effective_cache_size should default to 4 times shared_buffers). So whack
it all back to where it was. In a followup commit, I'll do what was
recently agreed to, which is to simply change the default to a higher
value.
Previous commit was confused about the case we're handling: actually,
what the patch is dealing with is platforms that have optreset, *and*
have <getopt.h>, but the latter fails to declare the former. Because
we use a linking probe to set HAVE_INT_OPTRESET, we need to be sure we
have a declaration even if <getopt.h> doesn't think it exists.
Reportedly, some versions of mingw are like that, and it seems plausible
in general that older platforms might be that way. However, we'd
determined experimentally that just doing "extern int" conflicts with
the way Cygwin declares these variables, so explicitly exclude Cygwin.
Michael Paquier, tweaked by me to hopefully not break Cygwin
Just as we would start bgworkers immediately after an initial startup
of the server, we should restart them immediately when reinitializing.
Petr Jelinek and Robert Haas
The main target of this cleanup is the convertJsonb() function, but I also
touched a lot of other things that I spotted into in the process.
The new convertToJsonb() function uses an output buffer that's resized on
demand, so the code to estimate of the size of JsonbValue is removed.
The on-disk format was not changed, even though I refactored the structs
used to handle it. The term "superheader" is replaced with "container".
The jsonb_exists_any and jsonb_exists_all functions no longer sort the input
array. That was a premature optimization, the idea being that if there are
duplicates in the input array, you only need to check them once. Also,
sorting the array saves some effort in the binary search used to find a key
within an object. But there were drawbacks too: the sorting and
deduplicating obviously isn't free, and in the typical case there are no
duplicates to remove, and the gain in the binary search was minimal. Remove
all that, which makes the code simpler too.
This includes a bug-fix; the total length of the elements in a jsonb array
or object mustn't exceed 2^28. That is now checked.
Commit 261c7d4b65 removed the "m" field
from struct LINE, but neglected to make pg_type.h's idea of the type's
size match. This resulted in reading past the end of palloc'd LINE
values when inserting them into tuples etc. In principle that could
cause a SIGSEGV, though the odds of detectable problems seem low.
Bump catversion since this makes an incompatible on-disk format change.
Note that if the line type had been in use in the field, this would
break pg_upgrade'ability of databases containing line values; but
it seems unlikely that there are any (they'd have had to be compiled
with -DENABLE_LINE_TYPE).
Spotted by Andres Freund.
It's easy to forget using SYSTEMQUOTEs when constructing command strings
for system() or popen(). Even if we fix all the places missing it now, it is
bound to be forgotten again in the future. Introduce wrapper functions that
do the the extra quoting for you, and get rid of SYSTEMQUOTEs in all the
callers.
We previosly used SYSTEMQUOTEs in all the hard-coded command strings, and
this doesn't change the behavior of those. But user-supplied commands, like
archive_command, restore_command, COPY TO/FROM PROGRAM calls, as well as
pgbench's \shell, will now gain an extra pair of quotes. That is desirable,
but if you have existing scripts or config files that include an extra
pair of quotes, those might need to be adjusted.
Reviewed by Amit Kapila and Tom Lane
If we have an array of records stored on disk, the individual record fields
cannot contain out-of-line TOAST pointers: the tuptoaster.c mechanisms are
only prepared to deal with TOAST pointers appearing in top-level fields of
a stored row. The same applies for ranges over composite types, nested
composites, etc. However, the existing code only took care of expanding
sub-field TOAST pointers for the case of nested composites, not for other
structured types containing composites. For example, given a command such
as
UPDATE tab SET arraycol = ARRAY[(ROW(x,42)::mycompositetype] ...
where x is a direct reference to a field of an on-disk tuple, if that field
is long enough to be toasted out-of-line then the TOAST pointer would be
inserted as-is into the array column. If the source record for x is later
deleted, the array field value would become a dangling pointer, leading
to errors along the line of "missing chunk number 0 for toast value ..."
when the value is referenced. A reproducible test case for this was
provided by Jan Pecek, but it seems likely that some of the "missing chunk
number" reports we've heard in the past were caused by similar issues.
Code-wise, the problem is that PG_DETOAST_DATUM() is not adequate to
produce a self-contained Datum value if the Datum is of composite type.
Seen in this light, the problem is not just confined to arrays and ranges,
but could also affect some other places where detoasting is done in that
way, for example form_index_tuple().
I tried teaching the array code to apply toast_flatten_tuple_attribute()
along with PG_DETOAST_DATUM() when the array element type is composite,
but this was messy and imposed extra cache lookup costs whether or not any
TOAST pointers were present, indeed sometimes when the array element type
isn't even composite (since sometimes it takes a typcache lookup to find
that out). The idea of extending that approach to all the places that
currently use PG_DETOAST_DATUM() wasn't attractive at all.
This patch instead solves the problem by decreeing that composite Datum
values must not contain any out-of-line TOAST pointers in the first place;
that is, we expand out-of-line fields at the point of constructing a
composite Datum, not at the point where we're about to insert it into a
larger tuple. This rule is applied only to true composite Datums, not
to tuples that are being passed around the system as tuples, so it's not
as invasive as it might sound at first. With this approach, the amount
of code that has to be touched for a full solution is greatly reduced,
and added cache lookup costs are avoided except when there actually is
a TOAST pointer that needs to be inlined.
The main drawback of this approach is that we might sometimes dereference
a TOAST pointer that will never actually be used by the query, imposing a
rather large cost that wasn't there before. On the other side of the coin,
if the field value is used multiple times then we'll come out ahead by
avoiding repeat detoastings. Experimentation suggests that common SQL
coding patterns are unaffected either way, though. Applications that are
very negatively affected could be advised to modify their code to not fetch
columns they won't be using.
In future, we might consider reverting this solution in favor of detoasting
only at the point where data is about to be stored to disk, using some
method that can drill down into multiple levels of nested structured types.
That will require defining new APIs for structured types, though, so it
doesn't seem feasible as a back-patchable fix.
Note that this patch changes HeapTupleGetDatum() from a macro to a function
call; this means that any third-party code using that macro will not get
protection against creating TOAST-pointer-containing Datums until it's
recompiled. The same applies to any uses of PG_RETURN_HEAPTUPLEHEADER().
It seems likely that this is not a big problem in practice: most of the
tuple-returning functions in core and contrib produce outputs that could
not possibly be toasted anyway, and the same probably holds for third-party
extensions.
This bug has existed since TOAST was invented, so back-patch to all
supported branches.
Commit a730183926 created rather a mess by
putting dependencies on backend-only include files into include/common.
We really shouldn't do that. To clean it up:
* Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in
catalog/catalog.h. We won't consider this symbol part of the FE/BE API.
* Push enum ForkNumber from relfilenode.h into relpath.h. We'll consider
relpath.h as the source of truth for fork numbers, since relpath.c was
already partially serving that function, and anyway relfilenode.h was
kind of a random place for that enum.
* So, relfilenode.h now includes relpath.h rather than vice-versa. This
direction of dependency is fine. (That allows most, but not quite all,
of the existing explicit #includes of relpath.h to go away again.)
* Push forkname_to_number from catalog.c to relpath.c, just to centralize
fork number stuff a bit better.
* Push GetDatabasePath from catalog.c to relpath.c; it was rather odd
that the previous commit didn't keep this together with relpath().
* To avoid needing relfilenode.h in common/, redefine the underlying
function (now called GetRelationPath) as taking separate OID arguments,
and make the APIs using RelFileNode or RelFileNodeBackend into macro
wrappers. (The macros have a potential multiple-eval risk, but none of
the existing call sites have an issue with that; one of them had such a
risk already anyway.)
* Fix failure to follow the directions when "init" fork type was added;
specifically, the errhint in forkname_to_number wasn't updated, and neither
was the SGML documentation for pg_relation_size().
* Fix tablespace-path-too-long check in CreateTableSpace() to account for
fork-name component of maximum-length pathnames. This requires putting
FORKNAMECHARS into a header file, but it was rather useless (and
actually unreferenced) where it was.
The last couple of items are potentially back-patchable bug fixes,
if anyone is sufficiently excited about them; but personally I'm not.
Per a gripe from Christoph Berg about how include/common wasn't
self-contained.
pg_controldata includes postgres.h not postgres_fe.h, so utils/palloc.h
must be able to compile in a "#define FRONTEND" context. It appears that
Solaris Studio is smart enough to persuade us to define PG_USE_INLINE,
but not smart enough to not make a copy of unreferenced static functions;
which leads to an unsatisfied reference to CurrentMemoryContext. So we
need an #ifndef FRONTEND around that declaration. Per buildfarm.
This breaks the principle that common/ ought not depend on anything in the
server, not only code-wise but in the headers. The only arguable advantage
is avoidance of duplication of half a dozen extern declarations, and even
that is rather dubious, considering that the previous coding was wrong
about which declarations to duplicate: it exposed pnstrdup() to frontend
code even though no such function is provided in fe_memutils.c.
On the same principle, don't #include utils/memutils.h in the frontend
build of psprintf.c. This requires duplicating the definition of
MaxAllocSize, but that seems fine to me: there's no a-priori reason why
frontend code should use the same size limit as the backend anyway.
In passing, clean up some rather odd layout and ordering choices that
were imposed on palloc.h to reduce the number of #ifdefs required by
the previous approach.
Per gripe from Christoph Berg. There's still more work to do to make
include/common/ clean, but this part seems reasonably noncontroversial.
If a tuple is locked, and this lock is later upgraded either to an
update or to a stronger lock, and in the meantime some other process
tries to lock, update or delete the same tuple, it (the tuple) could end
up being updated twice, or having conflicting locks held.
The reason for this is that the second updater checks for a change in
Xmax value, or in the HEAP_XMAX_IS_MULTI infomask bit, after noticing
the first lock; and if there's a change, it restarts and re-evaluates
its ability to update the tuple. But it neglected to check for changes
in lock strength or in lock-vs-update status when those two properties
stayed the same. This would lead it to take the wrong decision and
continue with its own update, when in reality it shouldn't do so but
instead restart from the top.
This could lead to either an assertion failure much later (when a
multixact containing multiple updates is detected), or duplicate copies
of tuples.
To fix, make sure to compare the other relevant infomask bits alongside
the Xmax value and HEAP_XMAX_IS_MULTI bit, and restart from the top if
necessary.
Also, in the belt-and-suspenders spirit, add a check to
MultiXactCreateFromMembers that a multixact being created does not have
two or more members that are claimed to be updates. This should protect
against other bugs that might cause similar bogus situations.
Backpatch to 9.3, where the possibility of multixacts containing updates
was introduced. (In prior versions it was possible to have the tuple
lock upgraded from shared to exclusive, and an update would not restart
from the top; yet we're protected against a bug there because there's
always a sleep to wait for the locking transaction to complete before
continuing to do anything. Really, the fact that tuple locks always
conflicted with concurrent updates is what protected against bugs here.)
Per report from Andrew Dunstan and Josh Berkus in thread at
http://www.postgresql.org/message-id/534C8B33.9050807@pgexperts.com
Bug analysis by Andres Freund.
pg_sequence_parameters() and pg_identify_object() have had incorrect
proallargtypes entries since 9.1 and 9.3 respectively. This was mostly
masked by the correct information in proargtypes, but a few operations
such as pg_get_function_arguments() (and thus psql's \df display) would
show the wrong data types for these functions' input parameters.
In HEAD, fix the wrong info, bump catversion, and add an opr_sanity
regression test to catch future mistakes of this sort.
In the back branches, just fix the wrong info so that installations
initdb'd with future minor releases will have the right data. We
can't force an initdb, and it doesn't seem like a good idea to add
a regression test that will fail on existing installations.
Andres Freund
Before 9.4, such an aggregate couldn't be declared, because its final
function would have to have polymorphic result type but no polymorphic
argument, which CREATE FUNCTION would quite properly reject. The
ordered-set-aggregate patch found a workaround: allow the final function
to be declared as accepting additional dummy arguments that have types
matching the aggregate's regular input arguments. However, we failed
to notice that this problem applies just as much to regular aggregates,
despite the fact that we had a built-in regular aggregate array_agg()
that was known to be undeclarable in SQL because its final function
had an illegal signature. So what we should have done, and what this
patch does, is to decouple the extra-dummy-arguments behavior from
ordered-set aggregates and make it generally available for all aggregate
declarations. We have to put this into 9.4 rather than waiting till
later because it slightly alters the rules for declaring ordered-set
aggregates.
The patch turned out a bit bigger than I'd hoped because it proved
necessary to record the extra-arguments option in a new pg_aggregate
column. I'd thought we could just look at the final function's pronargs
at runtime, but that didn't work well for variadic final functions.
It's probably just as well though, because it simplifies life for pg_dump
to record the option explicitly.
While at it, fix array_agg() to have a valid final-function signature,
and add an opr_sanity test to notice future deviations from polymorphic
consistency. I also marked the percentile_cont() aggregates as not
needing extra arguments, since they don't.
When marking a branch as half-dead, a pointer to the top of the branch is
stored in the leaf block's hi-key. During normal operation, the high key
was left in place, and the block number was just stored in the ctid field
of the high key tuple, but in WAL replay, the high key was recreated as a
truncated tuple with zero columns. For the sake of easier debugging, also
truncate the tuple in normal operation, so that the page is identical
after WAL replay. Also, rename the 'downlink' field in the WAL record to
'topparent', as that seems like a more descriptive name. And make sure
it's set to invalid when unlinking the leaf page.
Some ancient comments claimed that fn_nargs could be -1 to indicate a
variable number of input arguments; but this was never implemented, and
is at variance with what we ultimately did with "variadic" functions.
Update the comments.
Because of gcc -Wmissing-prototypes, all functions in dynamically
loadable modules must have a separate prototype declaration. This is
meant to detect global functions that are not declared in header files,
but in cases where the function is called via dfmgr, this is redundant.
Besides filling up space with boilerplate, this is a frequent source of
compiler warnings in extension modules.
We can fix that by creating the function prototype as part of the
PG_FUNCTION_INFO_V1 macro, which such modules have to use anyway. That
makes the code of modules cleaner, because there is one less place where
the entry points have to be listed, and creates an additional check that
functions have the right prototype.
Remove now redundant prototypes from contrib and other modules.
These are natural complements to the functions added by commit
0886fc6a5c, but they weren't included
in the original patch for some reason. Add them.
Patch by me, per a complaint by Tom Lane. Review by Tatsuo
Ishii.
This allows squeezing out the unused space in full-page writes. And more
importantly, it can be a useful debugging aid.
In hindsight we should've done this back when GIN was added - we wouldn't
need the 'maxoff' field in the page opaque struct if we had used pd_lower
and pd_upper like on normal pages. But as long as there can be pages in the
index that have been binary-upgraded from pre-9.4 versions, we can't rely
on that, and have to continue using 'maxoff'.
Most of the code churn comes from renaming some macros, now that they're
used on internal pages, too.
This change is completely backwards-compatible, no effect on pg_upgrade.
Repositioning the tuplestore seek pointer in window_gettupleslot() turns
out to be a very significant expense when the window frame is sizable and
the frame end can move. To fix, introduce a tuplestore function for
skipping an arbitrary number of tuples in one call, parallel to the one we
introduced for tuplesort objects in commit 8d65da1f. This reduces the cost
of window_gettupleslot() to O(1) if the tuplestore has not spilled to disk.
As in the previous commit, I didn't try to do any real optimization of
tuplestore_skiptuples for the case where the tuplestore has spilled to
disk. There is probably no practical way to get the cost to less than O(N)
anyway, but perhaps someone can think of something later.
Also fix PersistHoldablePortal() to make use of this API now that we have
it.
Based on a suggestion by Dean Rasheed, though this turns out not to look
much like his patch.