The S/390 members of the buildfarm are showing failures indicating
that they're having trouble with the rint() calls I added yesterday.
There's no good reason for that, and I wonder if it is a compiler bug
similar to the one we worked around in d9476b838. Try to fix it using
the same method as before, namely to store the result of rint() back
into a "double" variable rather than immediately converting to int64.
(This isn't entirely waving a dead chicken, since on machines with
wider-than-double float registers, the extra store forces a width
conversion. I don't know if S/390 is like that, but it seems worth
trying.)
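In code form, the workaround amounts to this (a sketch with illustrative
variable names, not the verbatim patch):
    double      rounded;
    rounded = rint(val * USECS_PER_SEC);    /* store into a double first ... */
    usecs = (int64) rounded;                /* ... then convert to int64 */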
In passing, merge duplicate ereport() calls in float8_timestamptz().
Per buildfarm.
When converting a float value to integer microseconds, we should be careful
to round the value to the nearest integer, typically with rint(); simply
assigning to an int64 variable will truncate, causing apparently off-by-one
values in cases that should work. Most places in the datetime code got
this right, but not these two.
float8_timestamptz() is new as of commit e511d878f (9.6). Previous
versions effectively depended on interval_mul() to do roundoff correctly,
which it does, so this fixes an accuracy regression in 9.6.
The problem in make_interval() dates to its introduction in 9.4. Aside
from being careful to round not truncate, let's incorporate the hours and
minutes inputs into the result with exact integer arithmetic, rather than
risk introducing roundoff error where there need not have been any.
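Concretely, the result computation becomes something like this sketch
(using the usual timestamp.h macros; not the verbatim patch):
    /* hours and minutes contribute with exact integer arithmetic;
       only the float8 seconds input needs rounding */
    result->time = hours * USECS_PER_HOUR +
                   mins * USECS_PER_MINUTE +
                   (int64) rint(secs * USECS_PER_SEC);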
float8_timestamptz() problem reported by Erik Nordström, though this is
not his proposed patch. make_interval() problem found by me.
Discussion: https://postgr.es/m/CAHuQZDS76jTYk3LydPbKpNfw9KbACmD=49dC4BrzHcfPv6yA1A@mail.gmail.com
The problem with the original coding here is that we might receive (and
clear) a relcache invalidation signal for the target relation down inside
one of the index_open calls we're doing. Since the target is open, we
would not drop the relcache entry, just reset its rd_indexvalid and
rd_indexlist fields. But RelationGetIndexAttrBitmap() kept going, and
would eventually cache and return potentially-obsolete attribute bitmaps.
The case where this matters is where the inval signal was from a CREATE
INDEX CONCURRENTLY telling us about a new index on a formerly-unindexed
column. (In all other cases, the lock we hold on the target rel should
prevent any concurrent change in index state.) Even just returning the
stale attribute bitmap is not such a problem, because it shouldn't matter
during the transaction in which we receive the signal. What hurts is
caching the stale data, because it can survive into later transactions,
breaking CREATE INDEX CONCURRENTLY's expectation that later transactions
will not create new broken HOT chains. The upshot is that there's a window
for building corrupted indexes during CREATE INDEX CONCURRENTLY.
This patch fixes the problem by rechecking that the set of index OIDs
is still the same at the end of RelationGetIndexAttrBitmap() as it was
at the start. If not, we loop back and try again. That's a little
more than is strictly necessary to fix the bug --- in principle, we
could return the stale data but not cache it --- but it seems like a
bad idea on general principles for relcache to return data it knows
is stale.
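Schematically, the new control flow is (simplified from the actual code):
    restart:
        indexoidlist = RelationGetIndexList(relation);
        /* ... open each index, accumulate the attribute bitmaps ... */
        newindexoidlist = RelationGetIndexList(relation);
        if (!equal(indexoidlist, newindexoidlist))
        {
            /* an invalidation arrived mid-computation; start over */
            list_free(newindexoidlist);
            list_free(indexoidlist);
            goto restart;
        }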
There might be more hazards of the same ilk, or there might be a better
way to fix this one, but this patch definitely improves matters and seems
unlikely to make anything worse. So let's push it into today's releases
even as we continue to study the problem.
Pavan Deolasee and myself
Discussion: https://postgr.es/m/CABOikdM2MUq9cyZJi1KyLmmkCereyGp5JQ4fuwKoyKEde_mzkQ@mail.gmail.com
!foo means "the tsvector does not contain foo", and therefore it should
match an empty tsvector. ts_match_vq() overenthusiastically supposed
that an empty tsvector could never match any query, so it forcibly
returned FALSE, the wrong answer. Remove the premature optimization.
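The removed check looked roughly like this:
    if (!val->size || !query->size)
        PG_RETURN_BOOL(false);  /* wrong: '!foo' must match an empty tsvector */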
Our behavior on this point was inconsistent, because while seqscans and
GIST index searches both failed to match empty tsvectors, GIN index
searches would find them, since GIN scans don't rely on ts_match_vq().
That makes this certainly a bug, not a debatable definition disagreement,
so back-patch to all supported branches.
Report and diagnosis by Tom Dunstan (bug #14515); added test cases by me.
Discussion: https://postgr.es/m/20170126025524.1434.97828@wrigleys.postgresql.org
This fixes problems where a plan must change but fails to do so,
as seen in a bug report from Rajkumar Raghuwanshi.
For ALTER FOREIGN TABLE OPTIONS, do this through the standard method of
forcing a relcache flush on the table. For ALTER FOREIGN DATA WRAPPER
and ALTER SERVER, just flush the whole plan cache on any change in
pg_foreign_data_wrapper or pg_foreign_server. That matches the way
we handle some other low-probability cases such as opclass changes, and
it's unclear that the case arises often enough to be worth working harder.
Besides, that gives a patch that is simple enough to back-patch with
confidence.
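For the FDW/server half, the change is essentially just registering those
catalogs with the plan cache's existing invalidation machinery (sketch;
callback as in plancache.c):
    /* in InitPlanCache(), alongside the other syscache callbacks */
    CacheRegisterSyscacheCallback(FOREIGNDATAWRAPPEROID,
                                  PlanCacheSysCallback, (Datum) 0);
    CacheRegisterSyscacheCallback(FOREIGNSERVEROID,
                                  PlanCacheSysCallback, (Datum) 0);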
Back-patch to 9.3. In principle we could apply the code change to 9.2 as
well, but (a) we lack postgres_fdw to test it with, (b) it's doubtful that
anyone is doing anything exciting enough with FDWs that far back to need
this desperately, and (c) the patch doesn't apply cleanly.
Patch originally by Amit Langote, reviewed by Etsuro Fujita and Ashutosh
Bapat, who each contributed substantial changes as well.
Discussion: https://postgr.es/m/CAKcux6m5cA6rRPTKkqVdJ-R=KKDfe35Q_ZuUqxDSV_4hwga=og@mail.gmail.com
array_fill(..., array[0]) produced an empty array, which is probably
what users expect, but it was a one-dimensional zero-length array
which is not our standard representation of empty arrays. Also, for
no very good reason, it rejected empty input arrays; that case should
be allowed and produce an empty output array.
In passing, remove the restriction that the input array(s) have lower
bound 1. That seems rather pointless, and it would have needed extra
complexity to make the check deal with empty input arrays.
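With both changes, the empty cases reduce to a sketch like:
    /* any zero-element result is returned in the standard empty form */
    if (nitems <= 0)
        PG_RETURN_ARRAYTYPE_P(construct_empty_array(elmtype));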
Per bug #14487 from Andrew Gierth. It's been broken all along, so
back-patch to all supported branches.
Discussion: https://postgr.es/m/20170105152156.10135.64195@wrigleys.postgresql.org
In GetCachedPlan(), initialize 'plan' to silence a compiler warning, but
also add an Assert() to make sure we don't ever actually fall through
with 'plan' still being set to NULL, since we are about to dereference
it.
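That is, in outline:
    CachedPlan *plan = NULL;        /* silences "may be used uninitialized" */
    /* ... every branch below is expected to assign plan ... */
    Assert(plan != NULL);           /* we are about to dereference it */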
Back-patch to 9.2.
Author: Stephen Frost
Discussion: https://postgr.es/m/20161129152102.GR13284%40tamriel.snowman.net
interval_transform() contained two separate bugs that caused it to
sometimes mistakenly decide that a cast from interval to restricted
interval is a no-op and throw it away.
First, it was wrong to rely on dt.h's field type macros to have an
ordering consistent with the field's significance; in one case they do
not. This led to mistakenly treating YEAR as less significant than MONTH,
so that a cast from INTERVAL MONTH to INTERVAL YEAR was incorrectly
discarded.
Second, fls(1<<k) produces k+1 not k, so comparing its output directly
to SECOND was wrong. This led to supposing that a cast to INTERVAL
MINUTE was really a cast to INTERVAL SECOND and so could be discarded.
To fix, get rid of the use of fls(), and make a function based on
intervaltypmodout to produce a field ID code adapted to the need here.
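For reference, the off-by-one behind the second bug:
    /*
     * fls() returns 1-based bit positions:
     *    fls(1) == 1    (1 is 1<<0)
     *    fls(8) == 4    (8 is 1<<3, so fls(1 << k) is k+1, not k)
     */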
Per bug #14479 from Piotr Stefaniak. Back-patch to 9.2 where transform
functions were introduced, because this code was born broken.
Discussion: https://postgr.es/m/20161227172307.10135.7747@wrigleys.postgresql.org
When ts_rewrite()'s replacement argument is an empty tsquery, it's supposed
to simplify any operator nodes whose operand(s) become NULL; but it failed
to do that reliably, because dropvoidsubtree() only examined the top level
of the result tree. Rather than make a second recursive pass, let's just
give the responsibility to dofindsubquery() to simplify while it's doing
the main replacement pass. Per report from Andreas Seltenreich.
Artur Zakirov, with some cosmetic changes by me. Back-patch to all
supported branches.
Discussion: https://postgr.es/m/8737i01dew.fsf@credativ.de
When rebuilding an existing index, ALTER TABLE correctly kept the
physical file in the same tablespace, but it messed up the pg_class
entry if the index had been in the database's default tablespace
and "default_tablespace" was set to some non-default tablespace.
This led to an inaccessible index.
Fix by fixing pg_get_indexdef_string() to always include a tablespace
clause, whether or not the index is in the default tablespace. The
previous behavior was installed in commit 537e92e41, and I think it just
wasn't thought through very clearly; certainly the possible effect of
default_tablespace wasn't considered. There's some risk in changing the
behavior of this function, but there are no other call sites in the core
code. Even if it's being used by some third party extension, it's fairly
hard to envision a usage that is okay with a tablespace clause being
appended some of the time but can't handle it being appended all the time.
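The fix is essentially to always emit the clause, resolving InvalidOid to
the database default first (sketch, not the exact patch):
    tblspc = rel->rd_rel->reltablespace;
    if (!OidIsValid(tblspc))
        tblspc = MyDatabaseTableSpace;
    appendStringInfo(&buf, " TABLESPACE %s",
                     quote_identifier(get_tablespace_name(tblspc)));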
Back-patch to all supported versions.
Code fix by me, investigation and test cases by Michael Paquier.
Discussion: <1479294998857-5930602.post@n3.nabble.com>
The code was intentionally not very careful about leaking strdup'd
strings in case of an error. That was forgivable probably, but it
also failed to notice strdup() failures, which could lead to subsequent
null-pointer-dereference crashes, since many callers unsurprisingly
didn't check for null pointers in the struct lconv fields. An even
worse problem is that it could throw error while we were setlocale'd
to a non-C locale, causing unwanted behavior in subsequent libc calls.
Rewrite to ensure that we cannot throw elog(ERROR) until after we've
restored the previous locale settings, or at least attempted to.
(I'm sorely tempted to make restore failure be a FATAL error, but
will refrain for the moment.) Having done that, it's not much more
work to ensure that we clean up strdup'd storage on the way out, too.
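The resulting shape of the function is roughly:
    save_lc_monetary = pstrdup(setlocale(LC_MONETARY, NULL));
    setlocale(LC_MONETARY, locale_monetary);
    /* ... strdup() out each struct lconv field, remembering any failure ... */
    setlocale(LC_MONETARY, save_lc_monetary);   /* restore FIRST */
    if (strdup_failed)
        ereport(ERROR, ...);                    /* only now is throwing safe */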
This code is substantially the same in all supported branches, so
back-patch all the way.
Michael Paquier and Tom Lane
Discussion: <CAB7nPqRMbGqa_mesopcn4MPyTs34eqtVEK7ELYxvvV=oqS00YA@mail.gmail.com>
The CatalogSnapshot was not plugged into SnapshotResetXmin()'s accounting
for whether MyPgXact->xmin could be cleared or advanced. In normal
transactions this was masked by the fact that the transaction snapshot
would be older, but during backend startup and certain utility commands
it was possible to re-use the CatalogSnapshot after MyPgXact->xmin had
been cleared, meaning that recently-deleted rows could be pruned even
though this snapshot could still see them, causing unexpected catalog
lookup failures. This effect appears to be the explanation for a recent
failure on buildfarm member piculet.
To fix, add the CatalogSnapshot to the RegisteredSnapshots heap whenever
it is valid.
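In snapmgr.c terms, that is approximately:
    /* in GetCatalogSnapshot(), after (re)computing the snapshot */
    CatalogSnapshot = GetSnapshotData(&CatalogSnapshotData);
    pairingheap_add(&RegisteredSnapshots, &CatalogSnapshot->ph_node);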
In the previous logic, it was possible for the CatalogSnapshot to remain
valid across waits for client input, but with this change that would mean
it delays advance of global xmin in cases where it did not before. To
avoid possibly causing new table-bloat problems with clients that sit idle
for long intervals, add code to invalidate the CatalogSnapshot before
waiting for client input. (When the backend is busy, it's unlikely that
the CatalogSnapshot would be the oldest snap for very long, so we don't
worry about forcing early invalidation of it otherwise.)
In passing, remove the CatalogSnapshotStale flag in favor of using
"CatalogSnapshot != NULL" to represent validity, as we do for the other
special snapshots in snapmgr.c. And improve some obsolete comments.
No regression test because I don't know a deterministic way to cause this
failure. But the stress test shown in the original discussion provokes
"cache lookup failed for relation 1255" within a few dozen seconds for me.
Back-patch to 9.4 where MVCC catalog scans were introduced. (Note: it's
quite easy to produce similar failures with the same test case in branches
before 9.4. But MVCC catalog scans were supposed to fix that.)
Discussion: <16447.1478818294@sss.pgh.pa.us>
tsquery_rewrite() tries to find matches to subsets of AND/OR conditions;
for example, in the query 'a | b | c' the substitution subquery 'a | c'
should match and lead to replacement of the first and third items.
That's fine, but the matching algorithm apparently takes about O(2^N)
time for an N-clause query (I say "apparently" because the code is also
both unintelligible and uncommented). We could probably do better than that
even without any extra assumptions --- but actually, we know that the
subclauses are sorted, indeed are depending on that elsewhere in this very
same function. So we can just scan the two lists a single time to detect
matches, as though we were doing a merge join.
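In outline, the matching loop becomes just (sketch; array and count names
are illustrative):
    int     i = 0,
            j = 0;
    while (i < nquery && j < nsubst)
    {
        int     cmp = QTNodeCompare(query_items[i], subst_items[j]);
        if (cmp == 0)
            i++, j++;           /* this substitution item is matched */
        else if (cmp < 0)
            i++;                /* extra query item: skip it */
        else
            return false;       /* substitution item not present */
    }
    return (j == nsubst);       /* true if every substitution item matched */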
Also do a re-flattening call (QTNTernary()) in tsquery_rewrite_query, just
to make sure that the tree fits the expectations of the next search cycle.
I didn't try to devise a test case for this, but I'm pretty sure that the
oversight could have led to failure to match in some cases where a match
would be expected.
Improve comments, and also stick a CHECK_FOR_INTERRUPTS into
dofindsubquery, just in case it's still too slow for somebody.
Per report from Andreas Seltenreich. Back-patch to all supported branches.
Discussion: <8760oasf2y.fsf@credativ.de>
QTNTernary() contains logic to flatten, eg, '(a & b) & c' into 'a & b & c',
which is all well and good, but it tries to do that to NOT nodes as well,
so that '!!a' gets changed to '!a'. Explicitly restrict the conversion to
be done only on AND and OR nodes, and add a test case illustrating the bug.
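The restriction is a one-line guard, approximately:
    /* flatten only AND/OR; a NOT node must keep its single child intact */
    if (in->valnode->type == QI_OPR &&
        (in->valnode->qoperator.oper == OP_AND ||
         in->valnode->qoperator.oper == OP_OR))
        /* ... merge same-operator children into this node ... */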
In passing, provide some comments for the sadly naked functions in
tsquery_util.c, and simplify some baroque logic in QTNFree(), which
I think may have been leaking some items it intended to free.
Noted while investigating a complaint from Andreas Seltenreich.
Back-patch to all supported versions.
Despite the argumentation I wrote in commit 7a2fe85b0, it's unsafe to do
this, because in corner cases it's possible for HeapTupleSatisfiesSelf
to try to set hint bits on the target tuple; and at least since 8.2 we
have required the buffer content lock to be held while setting hint bits.
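The discipline being (re)established, in outline:
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    /* visibility checks may set hint bits on the tuple */
    valid = HeapTupleSatisfiesVisibility(tuple, SnapshotSelf, buf);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);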
The added regression test exercises one such corner case. Unpatched, it
causes an assertion failure in assert-enabled builds, or otherwise would
cause a hint bit change in a buffer we don't hold lock on, which given
the right race condition could result in checksum failures or other data
consistency problems. The odds of a problem in the field are probably
pretty small, but nonetheless back-patch to all supported branches.
Report: <19391.1477244876@sss.pgh.pa.us>
IANA got rid of the really silly "abbreviation" and replaced it with one
that's only moderately silly. But it's still pointless, so keep on not
showing it.
CommandId is declared as uint32, and values up to 4G are indeed legal.
cidout() handles them properly by treating the value as unsigned int.
But cidin() was just using atoi(), which has platform-dependent behavior
for values outside the range of signed int, as reported by Bart Lengkeek
in bug #14379. Use strtoul() instead, as xidin() does.
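The parsing change is essentially one line:
    /* old: c = atoi(str);  -- platform-dependent above INT_MAX */
    c = (CommandId) strtoul(str, NULL, 0);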
In passing, make some purely cosmetic changes to make xidin/xidout
look more like cidin/cidout; the former didn't have a monopoly on
best practice IMO.
Neither xidin nor cidin makes any attempt to throw error for invalid input.
I didn't change that here, and am not sure it's worth worrying about
since neither is really a user-facing type. The point is just to ensure
that indubitably-valid inputs work as expected.
It's been like this for a long time, so back-patch to all supported
branches.
Report: <20161018152550.1413.6439@wrigleys.postgresql.org>
bitshiftright() and bitshiftleft() would recursively call each other
infinitely if the user passed INT_MIN for the shift amount, due to integer
overflow in negating the shift amount. To fix, clamp to -VARBITMAXLEN.
That doesn't change the results since any shift distance larger than the
input bit string's length produces an all-zeroes result.
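The clamp, in context (sketch):
    if (shft < 0)
    {
        /* negating INT_MIN would overflow; clamp, since any distance
           past the bit string's length yields all zeroes anyway */
        if (shft < -VARBITMAXLEN)
            shft = -VARBITMAXLEN;
        PG_RETURN_DATUM(DirectFunctionCall2(bitshiftleft,
                                            VarBitPGetDatum(arg),
                                            Int32GetDatum(-shft)));
    }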
Also fix some places that seemed inadequately paranoid about input typmods
exceeding VARBITMAXLEN. While a typmod accepted by anybit_typmodin() will
certainly be much less than that, at least some of these spots are
reachable with user-chosen integer values.
Andreas Seltenreich and Tom Lane
Discussion: <87d1j2zqtz.fsf@credativ.de>
Commit dd1a3bccc replaced a test on whether a subroutine returned a
null pointer with a test on whether &pointer->backendStatus was null.
This accidentally failed to fail, at least on common compilers, because
backendStatus is the first field in the struct; but it was surely trouble
waiting to happen. Commit f91feba87 then messed things up further,
changing the logic to
    local_beentry = pgstat_fetch_stat_local_beentry(curr_backend);
    if (!local_beentry)
        continue;
    beentry = &local_beentry->backendStatus;
    if (!beentry)
    {
where the second "if" is now dead code, so that the intended behavior of
printing a row with "<backend information not available>" cannot occur.
I suspect this is all moot because pgstat_fetch_stat_local_beentry
will never actually return null in this function's usage, but it's still
very poor coding. Repair back to 9.4 where the original problem was
introduced.
Previously, we threw an error if a dynamic timezone abbreviation did not
match any abbreviation recorded in the referenced IANA time zone entry.
That seemed like a good consistency check at the time, but it turns out
that a number of the abbreviations in the IANA database are things that
Olson and crew made up out of whole cloth. Their current policy is to
remove such names in favor of using simple numeric offsets. Perhaps
unsurprisingly, a lot of these made-up abbreviations have varied in meaning
over time, which meant that our commit b2cbced9e and later changes made
them into dynamic abbreviations. So with newer IANA database versions
that don't mention these abbreviations at all, we fail, as reported in bug
#14307 from Neil Anderson. It's worse than just a few unused-in-the-wild
abbreviations not working, because the pg_timezone_abbrevs view stops
working altogether (since its underlying function tries to compute the
whole view result in one call).
We considered deleting these abbreviations from our abbreviations list, but
the problem with that is that we can't stay ahead of possible future IANA
changes. Instead, let's leave the abbreviations list alone, and treat any
"orphaned" dynamic abbreviation as just meaning the referenced time zone.
It will behave a bit differently than it used to, in that you can't any
longer override the zone's standard vs. daylight rule by using the "wrong"
abbreviation of a pair, but that's better than failing entirely. (Also,
this solution can be interpreted as adding a small new feature, which is
that any abbreviation a user wants can be defined as referencing a time
zone name.)
Back-patch to all supported branches, since this problem affects all
of them when using tzdata 2016f or newer.
Report: <20160902031551.15674.67337@wrigleys.postgresql.org>
Discussion: <6189.1472820913@sss.pgh.pa.us>
NUMERIC_MAX_PRECISION is a purely arbitrary constraint on the precision
and scale you can write in a numeric typmod. It might once have had
something to do with the allowed range of a typmod-less numeric value,
but at least since 9.1 we've allowed, and documented that we allowed,
any value that would physically fit in the numeric storage format;
which is something over 100000 decimal digits, not 1000.
Hence, get rid of numeric_in()'s use of NUMERIC_MAX_PRECISION as a limit
on the allowed range of the exponent in scientific-format input. That was
especially silly in view of the fact that you can enter larger numbers as
long as you don't use 'e' to do it. Just constrain the value enough to
avoid localized overflow, and let make_result be the final arbiter of what
is too large. Likewise adjust ecpg's equivalent of this code.
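The remaining check is only an anti-overflow guard, roughly (exact bounds
are an assumption here):
    /* the true arbiter of range is make_result(); here, just avoid
       overflowing int arithmetic on the exponent */
    if (exponent >= INT_MAX / 2 || exponent <= -(INT_MAX / 2))
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("value overflows numeric format")));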
Also get rid of numeric_recv()'s use of NUMERIC_MAX_PRECISION to limit the
number of base-NBASE digits it would accept. That created a dump/restore
hazard for binary COPY without doing anything useful; the wire-format
limit on number of digits (65535) is about as tight as we would want.
In HEAD, also get rid of pg_size_bytes()'s unnecessary intimacy with what
the numeric range limit is. That code doesn't exist in the back branches.
Per gripe from Aravind Kumar. Back-patch to all supported branches,
since they all contain the documentation claim about allowed range of
NUMERIC (cf commit cabf5d84b).
Discussion: <2895.1471195721@sss.pgh.pa.us>
Several places in NUM_numpart_from_char(), which is called from the SQL
function to_number(text, text), could accidentally read one byte past
the end of the input buffer (which comes from the input text datum and
is not null-terminated).
1. One leading space character would be skipped, but there was no check
that the input was at least one byte long. This does not happen in
practice, but for defensiveness, add a check anyway.
2. Commit 4a3a1e2cf apparently accidentally duplicated the code that skips
one space character (so that two spaces might be skipped), but there
was no overflow check before skipping the second byte. Fix by
removing that duplicate code.
3. A logic error would allow a one-byte over-read when looking for a
trailing sign (S) placeholder.
In each case, the extra byte cannot be read out directly, but looking at
it might cause a crash.
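Each spot now needs a guard of this general shape (pointer names are
hypothetical):
    /* never consume a byte without checking that it exists */
    if (cur_ptr < end_ptr && *cur_ptr == ' ')
        cur_ptr++;              /* skip at most one leading space */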
The third item was discovered by Piotr Stefaniak, the first two were
found and analyzed by Tom Lane and Peter Eisentraut.
If ANALYZE found no repeated non-null entries in its sample, it set the
column's stadistinct value to -1.0, intending to indicate that the entries
are all distinct. But what this value actually means is that the number
of distinct values is 100% of the table's rowcount, and thus it was
overestimating the number of distinct values by however many nulls there
are. This could lead to very poor selectivity estimates, as for example
in a recent report from Andreas Joseph Krogh. We should discount the
stadistinct value by whatever we've estimated the nulls fraction to be.
(That is what will happen if we choose to use a negative stadistinct for
a column that does have repeated entries, so this code path was just
inconsistent.)
In addition to fixing the stadistinct entries stored by several different
ANALYZE code paths, adjust the logic where get_variable_numdistinct()
forces an "all distinct" estimate on the basis of finding a relevant unique
index. Unique indexes don't reject nulls, so there's no reason to assume
that the null fraction doesn't apply.
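The discounting itself is a one-liner in each affected path, of the form:
    /* sample was all-distinct: the distinct values are the non-null
       fraction of the rows, not all of them */
    stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);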
Back-patch to all supported branches. Back-patching is a bit of a judgment
call, but this problem seems to affect only a few users (else we'd have
identified it long ago), and it's bad enough when it does happen that
destabilizing plan choices in a worse direction seems unlikely.
Patch by me, with documentation wording suggested by Dean Rasheed
Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena>
Discussion: <16143.1470350371@sss.pgh.pa.us>
Commits 4452000f3 et al established semantics for NullTest.argisrow that
are a bit different from its initial conception: rather than being merely
a cache of whether we've determined the input to have composite type,
the flag now has the further meaning that we should apply field-by-field
testing as per the standard's definition of IS [NOT] NULL. If argisrow
is false and yet the input has composite type, the construct instead has
the semantics of IS [NOT] DISTINCT FROM NULL. Update the comments in
primnodes.h to clarify this, and fix ruleutils.c and deparse.c to print
such cases correctly. In the case of ruleutils.c, this merely results in
cosmetic changes in EXPLAIN output, since the case can't currently arise
in stored rules. However, it represents a live bug for deparse.c, which
would formerly have sent a remote query that had semantics different
from the local behavior. (From the user's standpoint, this means that
testing a remote nested-composite column for null-ness could have had
unexpected recursive behavior much like that fixed in 4452000f3.)
In a related but somewhat independent fix, make plancat.c set argisrow
to false in all NullTest expressions constructed to represent "attnotnull"
constructs. Since attnotnull is actually enforced as a simple null-value
check, this is a more accurate representation of the semantics; we were
previously overpromising what it meant for composite columns, which might
possibly lead to incorrect planner optimizations. (It seems that what the
SQL spec expects a NOT NULL constraint to mean is an IS NOT NULL test, so
arguably we are violating the spec and should fix attnotnull to do the
other thing. If we ever do, this part should get reverted.)
Back-patch, same as the previous commit.
Discussion: <10682.1469566308@sss.pgh.pa.us>
runtime.sgml used to contain a table of estimated shared memory consumption
rates for max_connections and some other GUCs. Commit 390bfc643 removed
that on the well-founded grounds that (a) we weren't maintaining the
entries well and (b) it no longer mattered so much once we got out from
under SysV shmem limits. But it missed that there were even-more-obsolete
versions of some of those numbers in comments in postgresql.conf.sample.
Remove those too. Back-patch to 9.3 where the aforesaid commit went in.
The Assert() here seems unreasonably optimistic. Andreas Seltenreich
found that it could fail with NaNs in the input geometries, and it
seems likely to me that it might fail in corner cases due to roundoff
error, even for ordinary input values. As a band-aid, make the function
return SQL NULL instead of crashing.
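The band-aid amounts to:
    if (isnan(result))
        PG_RETURN_NULL();       /* no meaningful answer for NaN inputs */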
Report: <87d1md1xji.fsf@credativ.de>
GiST index build could go into an infinite loop when presented with boxes
(or points, circles or polygons) containing NaN component values. This
happened essentially because the code assumed that x == x is true for any
"double" value x; but it's not true for NaNs. The looping behavior was not
the only problem though: we also attempted to sort the items using simple
double comparisons. Since NaNs violate the trichotomy law, qsort could
(in principle at least) get arbitrarily confused and mess up the sorting of
ordinary values as well as NaNs. And we based splitting choices on box size
calculations that could produce NaNs, again resulting in undesirable
behavior.
To fix, replace all comparisons of doubles in this logic with
float8_cmp_internal, which is NaN-aware and is careful to sort NaNs
consistently, higher than any non-NaN. Also rearrange the box size
calculation to not produce NaNs; instead it should produce an infinity
for a box with NaN on one side and not-NaN on the other.
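For instance, the sort comparators in gistproc.c now look approximately like:
    /* qsort comparator: NaN-aware, sorts NaNs above all ordinary values */
    static int
    interval_cmp_lower(const void *i1, const void *i2)
    {
        double  lower1 = ((const SplitInterval *) i1)->lower;
        double  lower2 = ((const SplitInterval *) i2)->lower;
        return float8_cmp_internal(lower1, lower2);
    }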
I don't by any means claim that this solves all problems with NaNs in
geometric values, but it should at least make GiST index insertion work
reliably with such data. It's likely that the index search side of things
still needs some work, and probably regular geometric operations too.
But with this patch we're laying down a convention for how such cases
ought to behave.
Per bug #14238 from Guang-Dih Lei. Back-patch to 9.2; the code used before
commit 7f3bd86843 is quite different and doesn't lock up on my simple
test case, nor on the submitter's dataset.
Report: <20160708151747.1426.60150@wrigleys.postgresql.org>
Discussion: <28685.1468246504@sss.pgh.pa.us>
We were merely Assert'ing that the Var matched the RTE it's supposedly
from. But if the user passes incorrect information to pg_get_expr(),
the RTE might in fact not match; this led either to Assert failures
or core dumps, as reported by Chris Hanks in bug #14220. To fix, just
convert the Asserts to test-and-elog. Adjust an existing test-and-elog
elsewhere in the same function to be consistent in wording.
(If we really felt these were user-facing errors, we might promote them to
ereport's; but I can't convince myself that they're worth translating.)
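The converted checks take roughly this form:
    /* the input is user-supplied, so test-and-elog, not Assert */
    if (attnum > list_length(rte->eref->colnames))
        elog(ERROR, "invalid attnum %d for relation \"%s\"",
             attnum, rte->eref->aliasname);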
Back-patch to 9.3; the problematic code doesn't exist before that, and
a quick check says that 9.2 doesn't crash on such cases.
Michael Paquier and Thomas Munro
Report: <20160629224349.1407.32667@wrigleys.postgresql.org>
After pg_upgrade, it is possible that some tuples' Xmax have multixacts
corresponding to the old installation; such multixacts cannot have
running members anymore. In many code sites we already know not to read
them and clobber them silently, but at least when VACUUM tries to freeze
a multixact or determine whether one needs freezing, there's an attempt
to resolve it to its member transactions by calling GetMultiXactIdMembers,
and if the multixact value is "in the future" with regards to the
current valid multixact range, an error like this is raised:
    ERROR:  MultiXactId 123 has not been created yet -- apparent wraparound
and vacuuming fails. Per discussion with Andrew Gierth, it is completely
bogus to try to resolve multixacts coming from before a pg_upgrade,
regardless of where they stand with regards to the current valid
multixact range.
It's possible to get out from under this problem by doing SELECT FOR UPDATE
of the problem tuples, but if tables are large, this is slow and
tedious, so a more thorough solution is desirable.
To fix, we realize that multixacts in xmax created in 9.2 and previous
have a specific bit pattern that is never used in 9.3 and later (we
already knew this, per comments and infomask tests sprinkled in various
places, but we weren't leveraging this knowledge appropriately).
Whenever the infomask of the tuple matches that bit pattern, we just
ignore the multixact completely as if Xmax wasn't set; or, in the case
of tuple freezing, we act as if an unwanted value is set and clobber it
without decoding. This guarantees that no errors will be raised, and
that the values will be progressively removed until all tables are
clean. Most callers of GetMultiXactIdMembers are patched to recognize
directly that the value is a removable "empty" multixact and avoid
calling GetMultiXactIdMembers altogether.
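The telltale bit pattern is captured by a macro of roughly this shape
(a lock-only multixact written by 9.3+ always carries a lock-strength
bit; pre-9.3 ones never do):
    #define HEAP_LOCKED_UPGRADED(infomask) \
        (((infomask) & HEAP_XMAX_IS_MULTI) != 0 && \
         ((infomask) & HEAP_XMAX_LOCK_ONLY) != 0 && \
         (((infomask) & (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)) == 0))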
To avoid changing the signature of GetMultiXactIdMembers() in back
branches, we keep the "allow_old" boolean flag but rename it to
"from_pgupgrade"; if the flag is true, we always return an empty set
instead of looking up the multixact. (I suppose we could remove the
argument in the master branch, but I chose not to do so in this commit).
This was broken all along, but the user-facing error first appeared
because of commit 8e9a16ab8f and was partially fixed in a25c2b7c4d.
This fix, backpatched all the way back to 9.3, goes approximately in the
same direction as a25c2b7c4d but should cover all cases.
Bug analysis by Andrew Gierth and Álvaro Herrera.
A number of public reports match this bug:
https://www.postgresql.org/message-id/20140330040029.GY4582@tamriel.snowman.net
https://www.postgresql.org/message-id/538F3D70.6080902@publicrelay.com
https://www.postgresql.org/message-id/556439CF.7070109@pscs.co.uk
https://www.postgresql.org/message-id/SG2PR06MB0760098A111C88E31BD4D96FB3540@SG2PR06MB0760.apcprd06.prod.outlook.com
https://www.postgresql.org/message-id/20160615203829.5798.4594@wrigleys.postgresql.org
The inet/cidr types sometimes failed to reject IPv6 inputs with too many
colon-separated fields, instead translating them to '::/0'. This is the
result of a thinko in the original ISC code that seems to be as yet
unreported elsewhere. Per bug #14198 from Stefan Kaltenbrunner.
Report: <20160616182222.5798.959@wrigleys.postgresql.org>
If both timeout indicators are set when we arrive at ProcessInterrupts,
we've historically just reported "lock timeout". However, some buildfarm
members have been observed to fail isolationtester's timeouts test by
reporting "lock timeout" when the statement timeout was expected to fire
first. The cause seems to be that the process is allowed to sleep longer
than expected (probably due to heavy machine load) so that the lock
timeout happens before we reach the point of reporting the error, and
then this arbitrary tiebreak rule does the wrong thing. We can improve
matters by comparing the scheduled timeout times to decide which error
to report.
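The tiebreak in ProcessInterrupts becomes approximately:
    /* prefer whichever timeout was scheduled to fire first */
    if (lock_timeout_occurred && stmt_timeout_occurred &&
        get_timeout_finish_time(STATEMENT_TIMEOUT) <
        get_timeout_finish_time(LOCK_TIMEOUT))
        lock_timeout_occurred = false;      /* report statement timeout */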
I had originally proposed greatly reducing the 1-second window between
the two timeouts in the test cases. On reflection that is a bad idea,
at least for the case where the lock timeout is expected to fire first,
because that would assume that it takes negligible time to get from
statement start to the beginning of the lock wait. Thus, this patch
doesn't completely remove the risk of test failures on slow machines.
Empirically, however, the case this handles is the one we are seeing
in the buildfarm. The explanation may be that the other case requires
the scheduler to take the CPU away from a busy process, whereas the
case fixed here only requires the scheduler to not give the CPU back
right away to a process that has been woken from a multi-second sleep
(and, perhaps, has been swapped out meanwhile).
Back-patch to 9.3 where the isolationtester timeouts test was added.
Discussion: <8693.1464314819@sss.pgh.pa.us>
to_timestamp() handles the TH/th format codes by advancing over two input
characters, whatever those are. It failed to notice whether there were
two characters available to be skipped, making it possible to advance
the pointer past the end of the input string and keep on parsing.
A similar risk existed in the handling of "Y,YYY" format: it would advance
over three characters after the "," whether or not three characters were
available.
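In formatting.c terms, the needed guard is roughly:
    if (S_THth(n->suffix))
    {
        /* advance over at most two bytes, and only if they exist */
        if (*s != '\0')
            s++;
        if (*s != '\0')
            s++;
    }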
In principle this might be exploitable to disclose contents of server
memory. But the security team concluded that it would be very hard to use
that way, because the parsing loop would stop upon hitting any zero byte,
and TH/th format codes can't be consecutive --- they have to follow some
other format code, which would have to match whatever data is there.
So it seems impractical to examine memory very much beyond the end of the
input string via this bug; and the input string will always be in local
memory not in disk buffers, making it unlikely that anything very
interesting is close to it in a predictable way. So this doesn't quite
rise to the level of needing a CVE.
Thanks to Wolf Roediger for reporting this bug.
NetBSD has seen fit to invent a libc function named strtoi(), which
conflicts with the long-established static functions of the same name in
datetime.c and ecpg's interval.c. While muttering darkly about intrusions
on application namespace, we'll rename our functions to avoid the conflict.
Back-patch to all supported branches, since this would affect attempts
to build any of them on recent NetBSD.
Thomas Munro
When we shoehorned "x op ANY (array)" into the SQL syntax, we created a
fundamental ambiguity as to the proper treatment of a sub-SELECT on the
righthand side: perhaps what's meant is to compare x against each row of
the sub-SELECT's result, or perhaps the sub-SELECT is meant as a scalar
sub-SELECT that delivers a single array value whose members should be
compared against x. The grammar resolves it as the former case whenever
the RHS is a select_with_parens, making the latter case hard to reach ---
but you can get at it, with tricks such as attaching a no-op cast to the
sub-SELECT. Parse analysis would throw away the no-op cast, leaving a
parsetree with an EXPR_SUBLINK SubLink directly under a ScalarArrayOpExpr.
ruleutils.c was not clued in on this fine point, and would naively emit
"x op ANY ((SELECT ...))", which would be parsed as the first alternative,
typically leading to errors like "operator does not exist: text = text[]"
during dump/reload of a view or rule containing such a construct. To fix,
emit a no-op cast when dumping such a parsetree. This might well be
exactly what the user wrote to get the construct accepted in the first
place; and even if she got there with some other dodge, it is a valid
representation of the parsetree.
Per report from Karl Czajkowski. He mentioned only a case involving
RLS policies, but actually the problem is very old, so back-patch to
all supported branches.
Report: <20160421001832.GB7976@moraine.isi.edu>
As noted by Julian Schauder in bug #14063, the configuration-file parser
doesn't support embedded newlines in string literals. While there might
someday be a good reason to remove that restriction, there doesn't seem
to be one right now. However, ALTER SYSTEM SET could accept strings
containing newlines, since many of the variable-specific value-checking
routines would just see a newline as whitespace. This led to writing a
postgresql.auto.conf file that was broken and had to be removed manually.
Pending a reason to work harder, just throw an error if someone tries this.
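The new check is essentially:
    if (strchr(value, '\n'))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("parameter value for ALTER SYSTEM must not contain a newline")));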
In passing, fix several places in the ALTER SYSTEM logic that failed to
provide an errcode() for an ereport(), and thus would falsely log the
failure as an internal XX000 error.
Back-patch to 9.4 where ALTER SYSTEM was introduced.
Modern Perl has removed psed from its core distribution, so it might not
be readily available on some build platforms. We therefore replace its
use with a Perl script generated by s2p, which is equivalent to the sed
script. The latter is retained for non-MSVC builds to avoid creating a
new hard dependency on Perl for non-Windows tarball builds.
Backpatch to all live branches.
Michael Paquier and me.
In HEAD, fix incorrect field width for hours part of OF when tm_gmtoff is
negative. This was introduced by commit 2d87eedc1d as a result of
falsely applying a pattern that's correct when + signs are omitted, which
is not the case for OF.
In 9.4, fix missing abs() call that allowed a sign to be attached to the
minutes part of OF. This was fixed in 9.5 by 9b43d73b3f, but for
inscrutable reasons not back-patched.
In all three versions, ensure that the sign of tm_gmtoff is correctly
reported even when the GMT offset is less than 1 hour.
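The sign has to come from tm_gmtoff itself, along these lines (sketch):
    /* take the sign from the full offset, so -30 min still prints '-' */
    sprintf(cp, "%c%02d", (tm->tm_gmtoff >= 0) ? '+' : '-',
            abs((int) tm->tm_gmtoff) / SECS_PER_HOUR);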
Add regression tests, which evidently we desperately need here.
Thomas Munro and Tom Lane, per report from David Fetter
Renaming a file using rename(2) is not guaranteed to be durable in the face
of crashes. Use the previously added durable_rename()/durable_link_or_rename()
in various places where we previously just renamed files.
Most of the changed call sites are arguably not critical, but it seems
better to err on the side of too much durability. The most prominent
known case where the previously missing fsyncs could cause data loss is
crashes at the end of a checkpoint. After the actual checkpoint has been
performed, old WAL files are recycled. When they're filled, their
contents are fdatasynced, but we did not fsync the containing
directory. An OS/hardware crash at an unfortunate moment could then end
up leaving that file with its old name, but new content; WAL replay
would thus not replay it.
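The change at each call site is mechanical, e.g. (sketch):
    /* was: rename(oldpath, newpath) -- not crash-safe */
    durable_rename(oldpath, newpath, ERROR);    /* fsyncs file and directory */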
Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
A thinko concerning nesting depth caused json_to_record() to produce bogus
output if a field of its input object contained a sub-object with a field
name matching one of the requested output column names. Per bug #13996
from Johann Visagie.
I added a regression test case based on his example, plus parallel tests
for json_to_recordset, jsonb_to_record, jsonb_to_recordset. The latter
three do not exhibit the same bug (which suggests that we may be missing
some opportunities to share code...) but testing seems like a good idea
in any case.
Back-patch to 9.4 where these functions were introduced.
It's harmless, but might confuse readers. Seems to have been introduced
in 6bc8ef0b7f. Back-patch, just to avoid cosmetic cross-branch
differences.
Amit Langote
A failure partway through PGLC_localeconv() led to a situation where
the next call would call free_struct_lconv() a second time, leading
to free() on already-freed strings, typically leading to a core dump.
Add a flag to remember whether we need to do that.
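Sketch of the flag logic:
    static struct lconv CurrentLocaleConv;
    static bool CurrentLocaleConvAllocated = false;
    ...
    if (CurrentLocaleConvAllocated)
    {
        free_struct_lconv(&CurrentLocaleConv);
        CurrentLocaleConvAllocated = false;
    }
    /* ... rebuild the struct; set the flag only once fully built ... */
    CurrentLocaleConvAllocated = true;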
Per report from Thom Brown. His example case only provokes the failure
as far back as 9.4, but nonetheless this code is obviously broken, so
back-patch to all supported branches.
Future PL/Java versions will close CVE-2016-0766 by making these GUCs
PGC_SUSET. This PostgreSQL change independently mitigates that PL/Java
vulnerability, helping sites that update PostgreSQL more frequently than
PL/Java. Back-patch to 9.1 (all supported versions).
Commit e09996ff8d was one brick shy of a load: it didn't insist
that the detected JSON number be the whole of the supplied string.
This allowed inputs such as "2016-01-01" to be misdetected as valid JSON
numbers. Per bug #13906 from Dmitry Ryabov.
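The tightened test in IsValidJsonNumber() is approximately:
    if (len <= 0)
        return false;               /* zero-length input is not a number */
    /* ... run the JSON number lexer over str ... */
    return !numeric_error && total_len == len;  /* whole string consumed */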
In passing, be more wary of zero-length input (I'm not sure this can
happen given current callers, but better safe than sorry), and do some
minor cosmetic cleanup.
This will enable PL/Java to be cleanly compiled, as dynloader.h is a
requirement.
Report by Chapman Flack
Patch by Michael Paquier
Backpatch through 9.1
flatten_reloptions() supposed that it didn't really need to do anything
beyond inserting commas between reloption array elements. However, in
principle the value of a reloption could be nearly anything, since the
grammar allows a quoted string there. Any restrictions on it would come
from validity checking appropriate to the particular option, if any.
A reloption value that isn't a simple identifier or number could thus lead
to dump/reload failures due to syntax errors in CREATE statements issued
by pg_dump. We've gotten away with not worrying about this so far with
the core-supported reloptions, but extensions might allow reloption values
that cause trouble, as in bug #13840 from Kouhei Sutou.
To fix, split the reloption array elements explicitly, and then convert
any value that doesn't look like a safe identifier to a string literal.
(The details of the quoting rule could be debated, but this way is safe
and requires little code.) While we're at it, also quote reloption names
if they're not safe identifiers; that may not be a likely problem in the
field, but we might as well try to be bulletproof here.
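Sketch of the per-element handling (quote_if_unsafe is a hypothetical
stand-in for the quoting rule, not a real helper):
    /* split each "name=value" element and re-quote both halves as needed */
    appendStringInfo(&buf, "%s=%s",
                     quote_identifier(name),
                     quote_if_unsafe(value));   /* hypothetical helper */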
It's been like this for a long time, so back-patch to all supported
branches.
Kouhei Sutou, adjusted some by me
A report from Andy Colson showed that gincostestimate() was not being
nearly paranoid enough about whether to believe the statistics it finds in
the index metapage. The problem is that the metapage stats (other than the
pending-pages count) are only updated by VACUUM, and in the worst case
could still reflect the index's original empty state even when it has grown
to many entries. We attempted to deal with that by scaling up the stats to
match the current index size, but if nEntries is zero then scaling it up
still gives zero. Moreover, the proportion of pages that are entry pages
vs. data pages vs. pending pages is unlikely to be estimated very well by
scaling if the index is now orders of magnitude larger than before.
We can improve matters by expanding the use of the rule-of-thumb estimates
I introduced in commit 7fb008c5ee: if the index has grown by more
than a cutoff amount (here set at 4X growth) since VACUUM, then use the
rule-of-thumb numbers instead of scaling. This might not be exactly right
but it seems much less likely to produce insane estimates.
I also improved both the scaling estimate and the rule-of-thumb estimate
to account for numPendingPages, since it's reasonable to expect that that
is accurate in any case, and certainly pages that are in the pending list
are not either entry or data pages.
As a somewhat separate issue, adjust the estimation equations that are
concerned with extra fetches for partial-match searches. These equations
suppose that a fraction partialEntries / numEntries of the entry and data
pages will be visited as a consequence of a partial-match search. Now,
it's physically impossible for that fraction to exceed one, but our
estimate of partialEntries is mostly bunk, and our estimate of numEntries
isn't exactly gospel either, so we could arrive at a silly value. In the
example presented by Andy we were coming out with a value of 100, leading
to insane cost estimates. Clamp the fraction to one to avoid that.
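The clamp itself (sketch):
    /* both inputs are rough estimates; their ratio must not pretend we
       visit more than all of the pages */
    partialScale = partialEntries / numEntries;
    partialScale = Min(partialScale, 1.0);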
Like the previous patch, back-patch to all supported branches; this
problem can be demonstrated in one form or another in all of them.
Apparently, there are bugs in this code that cause it to loop endlessly.
Those bugs still need more research, but in the meantime it's clear that
the loop is missing a check for interrupts so that it can be cancelled
in a timely fashion.
Backpatch to 9.1 -- this has been missing since 49475aab8d.
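The missing check is the standard idiom:
    for (;;)
    {
        CHECK_FOR_INTERRUPTS();     /* allow cancel/die while looping */
        /* ... existing loop body ... */
    }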