path_encode's "closed" argument used to take three values: TRUE, FALSE,
or -1, while being of type bool. Replace that with a three-valued enum
for more clarity.
Was broken by my xloginsert scaling patch. XLogCtl global variable needs
to be initialized in each process, as it's not inherited by fork() on
Windows.
This patch replaces WALInsertLock with a number of WAL insertion slots,
allowing multiple backends to insert WAL records to the WAL buffers
concurrently. This is particularly useful for parallel loading large amounts
of data on a system with many CPUs.
This has one user-visible change: switching to a new WAL segment with
pg_switch_xlog() now fills the remaining unused portion of the segment with
zeros. This potentially adds some overhead, but it has been a very common
practice by DBA's to clear the "tail" of the segment with an external
pg_clearxlogtail utility anyway, to make the WAL files compress better.
With this patch, it's no longer necessary to do that.
This patch adds a new GUC, xloginsert_slots, to tune the number of WAL
insertion slots. Performance testing suggests that the default, 8, works
pretty well for all kinds of worklods, but I left the GUC in place to allow
others with different hardware to test that easily. We might want to remove
that before release.
Reviewed by Andres Freund.
The code in set_append_rel_pathlist() for building parameterized paths
for append relations (inheritance and UNION ALL combinations) supposed
that the cheapest regular path for a child relation would still be cheapest
when reparameterized. Which might not be the case, particularly if the
added join conditions are expensive to compute, as in a recent example from
Jeff Janes. Fix it to compare child path costs *after* reparameterizing.
We can short-circuit that if the cheapest pre-existing path is already
parameterized correctly, which seems likely to be true often enough to be
worth checking for.
Back-patch to 9.2 where parameterized paths were introduced.
On some platforms, posix_fallocate() is available but may still return
EINVAL if the underlying filesystem does not support it. So, in case
of an error, fall through to the alternate implementation that just
writes zeros.
Per buildfarm failure and analysis by Tom Lane.
Commit 31a891857a added some tests in
plpgsql.sql that used a function rather unthinkingly named "foo()".
However, rangefuncs.sql has some much older tests that create a function
of that name, and since these test scripts run in parallel, there is a
chance of failures if the timing is just right. Use another name to
avoid that. Per buildfarm (failure seen today on "hamerkop", but
probably it's happened before and not been noticed).
The old implementation converted PostgreSQL numeric to Python float,
which was always considered a shortcoming. Now numeric is converted to
the Python Decimal object. Either the external cdecimal module or the
standard library decimal module are supported.
From: Szymon Guz <mabewlun@gmail.com>
From: Ronan Dunklau <rdunklau@gmail.com>
Reviewed-by: Steve Singer <steve@ssinger.info>
This function is more efficient than actually writing out zeroes to
the new file, per microbenchmarks by Jon Nelson. Also, it may reduce
the likelihood of WAL file fragmentation.
Jon Nelson, with review by Andres Freund, Greg Smith and me.
This value, now pg_stat_all_tables.n_mod_since_analyze, was already
tracked and used by autovacuum, but not exposed to the user.
Mark Kirkwood, review by Laurenz Albe
Commit 263865a489 switched tuplesort.c and
tuplestore.c variables representing memory usage from type "long" to
type "Size". This was unnecessary; I thought doing so avoided overflow
scenarios on 64-bit Windows, but guc.c already limited work_mem so as to
prevent the overflow. It was also incomplete, not touching the logic
that assumed a signed data type. Change the affected variables to
"int64". This is perfect for 64-bit platforms, and it reduces the need
to contemplate platform-specific overflow scenarios. It also puts us
close to being able to support work_mem over 2 GiB on 64-bit Windows.
Per report from Andres Freund.
In 9.3, there's no particular limit on the number of bgworkers;
instead, we just count up the number that are actually registered,
and use that to set MaxBackends. However, that approach causes
problems for Hot Standby, which needs both MaxBackends and the
size of the lock table to be the same on the standby as on the
master, yet it may not be desirable to run the same bgworkers in
both places. 9.3 handles that by failing to notice the problem,
which will probably work fine in nearly all cases anyway, but is
not theoretically sound.
A further problem with simply counting the number of registered
workers is that new workers can't be registered without a
postmaster restart. This is inconvenient for administrators,
since bouncing the postmaster causes an interruption of service.
Moreover, there are a number of applications for background
processes where, by necessity, the background process must be
started on the fly (e.g. parallel query). While this patch
doesn't actually make it possible to register new background
workers after startup time, it's a necessary prerequisite.
Patch by me. Review by Michael Paquier.
Treat TOAST index just the same as normal one and get the OID
of TOAST index from pg_index but not pg_class.reltoastidxid.
This change allows us to handle multiple TOAST indexes, and
which is required infrastructure for upcoming
REINDEX CONCURRENTLY feature.
Patch by Michael Paquier, reviewed by Andres Freund and me.
The collate.linux.utf8 test covers some of the same territory, but
isn't portable and so probably does not get run often, or on
non-Linux platforms. If this approach turns out to be sufficiently
portable, we may want to look at trimming the redundant tests out
of that file to avoid duplication.
Robins Tharakan, reviewed by Michael Paquier and Fabien Coelho,
with further changes and cleanup by me.
An INSERT into such a view should work just like an INSERT into its base
table, ie the insertion should go directly into that table ... not be
duplicated into each child table, as was happening before, per bug #8275
from Rushabh Lathia. On the other hand, the current behavior for
UPDATE/DELETE seems reasonable: the update/delete traverses the child
tables, or not, depending on whether the view specifies ONLY or not.
Add some regression tests covering this area.
Dean Rasheed
In patch 82233ce7ea, AbortStartTime wasn't being reset appropriately
after the restart sequence, causing subsequent iterations through
ServerLoop to malfunction.
Specifically, permit attaching them to the error in RAISE and retrieving
them from a caught error in GET STACKED DIAGNOSTICS. RAISE enforces
nothing about the content of the fields; for its purposes, they are just
additional string fields. Consequently, clarify in the protocol and
libpq documentation that the usual relationships between error fields,
like a schema name appearing wherever a table name appears, are not
universal. This freedom has other applications; consider a FDW
propagating an error from an RDBMS having no schema support.
Back-patch to 9.3, where core support for the error fields was
introduced. This prevents the confusion of having a release where libpq
exposes the fields and PL/pgSQL does not.
Pavel Stehule, lexical revisions by Noah Misch.
To that end, support tags rather than lengths for external datums.
As an example of how this can be used, add support or "indirect"
tuples which point to some externally allocated memory containing
a toast tuple. Similar infrastructure could be used for other
purposes, including, perhaps, support for alternative compression
algorithms.
Andres Freund, reviewed by Hitoshi Harada and myself
With -Wtype-limits, gcc correctly points out that size_t can never be < 0.
Backpatch to 9.3 and 9.2. It's been like this forever, but in <= 9.1 you got
a lot other warnings with -Wtype-limits anyway (at least with my version of
gcc).
Andres Freund
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
The dependencies on the spi and dummy_seclabel contrib modules were
incomplete, because they did not pick up automatically generated
dependencies on header files. This will manifest itself especially when
switching major versions, where the contrib modules would not be
recompiled to contain the new version number, leading to regression test
failures.
To fix this, use the submake approach already in use elsewhere, so that
the contrib modules are built using their full rules.
Add ability for to_char() to output the timezone's UTC offset (OF). We
already have the ability to return the timezone abbeviation (TZ/tz).
Per request from Andrew Dunstan
A VPATH build will be performed when the module's make file path is not
the current directory or when USE_VPATH is set.
This will assist packagers and others who prefer to build without
polluting the source directories.
There is still a bit of work to do here, notably documentation, but it's
probably a good idea to commit what we have so far and let people test
it out on their modules.
Cédric Villemain, with an addition from me.
The pglz compressor has a significant startup cost, because it has to
initialize to zeros the history-tracking hash table. On a 64-bit system, the
hash table was 64kB in size. While clearing memory is pretty fast, for very
short inputs the relative cost of that was quite large.
This patch alleviates that in two ways. First, instead of storing pointers
in the hash table, store 16-bit indexes into the hist_entries array. That
slashes the size of the hash table to 1/2 or 1/4 of the original, depending
on the pointer width. Secondly, adjust the size of the hash table based on
input size. For very small inputs, you don't need a large hash table to
avoid collisions.
Review by Amit Kapila.
We don't normally bother retrying when the number of bytes written by
write() is short of what was requested. It is generally assumed that a
write() to disk doesn't return short, unless you run out of disk space.
While writing the WAL, however, it seems prudent to try a bit harder,
because a failure leads to PANIC. The write() is also much larger than most
write()s in the backend (up to wal_buffers), so there's more room for
surprises.
Also retry on EINTR. All signals used in the backend are flagged SA_RESTART
nowadays, so it shouldn't happen, but better to be defensive.
On immediate shutdown, or during a restart-after-crash sequence,
postmaster used to send SIGQUIT (and then abandon ship if shutdown); but
this is not a good strategy if backends don't die because of that
signal. (This might happen, for example, if a backend gets tangled
trying to malloc() due to gettext(), as in an example illustrated by
MauMau.) This causes problems when later trying to restart the server,
because some processes are still attached to the shared memory segment.
Instead of just abandoning such backends to their fates, we now have
postmaster hang around for a little while longer, send a SIGKILL after
some reasonable waiting period, and then exit. This makes immediate
shutdown more reliable.
There is disagreement on whether it's best for postmaster to exit after
sending SIGKILL, or to stick around until all children have reported
death. If this controversy is resolved differently than what this patch
implements, it's an easy change to make.
Bug reported by MauMau in message 20DAEA8949EC4E2289C6E8E58560DEC0@maumau
MauMau and Álvaro Herrera
This results in a slightly less specific error message when OVER
is used in a context where we don't accept window functions, but
per discussion, it's worth it to get the benefit of not needing
to reserve this keyword any more. This same refactoring will
also let us avoid reserving some other keywords that we expect
to add in upcoming patches (specifically, IGNORE, RESPECT, and
FILTER).
Troels Nielsen, with minor changes by me