Previously, md.c and checkpointer.c were tightly integrated so that
fsync calls could be handed off and processed in the background.
Introduce a system of callbacks and file tags, so that other modules
can hand off fsync work in the same way.
For now only md.c uses the new interface, but other users are being
proposed. Since there may be use cases that are not strictly SMGR
implementations, use a new function table for sync handlers rather
than extending the traditional SMGR one.
Instead of using a bitmapset of segment numbers for each RelFileNode
in the checkpointer's hash table, make the segment number part of the
key. This requires sending explicit "forget" requests for every
segment individually when relations are dropped, but suits the file
layout schemes of proposed future users better (ie sparse or high
segment numbers).
Author: Shawn Debnath and Thomas Munro
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
This change makes it possible to specify sub-millisecond delays,
which work well on most modern platforms, though that was not true
when the cost-delay feature was designed.
To support this without breaking existing configuration entries,
improve guc.c to allow floating-point GUCs to have units. Also,
allow "us" (microseconds) as an input/output unit for time-unit GUCs.
(It's not allowed as a base unit, at least not yet.)
Likewise change the autovacuum_vacuum_cost_delay reloption to be
floating-point; this forces a catversion bump because the layout of
StdRdOptions changes.
This patch doesn't in itself change the default values or allowed
ranges for these parameters, and it should not affect the behavior
for any already-allowed setting for them.
Discussion: https://postgr.es/m/1798.1552165479@sss.pgh.pa.us
Background workers, including parallel workers, were generating
the same sequence of numbers in random(). This showed up as DSM
handle collisions when Parallel Hash created multiple segments,
but any code that calls random() in background workers could be
affected if it cares about different backends generating different
numbers.
Repair by making sure that all new processes initialize the seed
at the same time as they set MyProcPid and MyStartTime in a new
function InitProcessGlobals(), called by the postmaster, its
children and also standalone processes. Also add a new high
resolution MyStartTimestamp as a potentially useful by-product,
and remove SessionStartTime from struct Port as it is now
redundant.
No back-patch for now, as the known consequences so far are just
a bunch of harmless shm_open(O_EXCL) collisions.
Author: Thomas Munro
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAEepm%3D2eJj_6%3DB%2B2tEpGu2nf1BjthCf9nXXUouYvJJ4C5WSwhg%40mail.gmail.com
Currently there are two ways to trigger log rotation in logging collector
process: call pg_rotate_logfile() SQL-function or send SIGUSR1 signal directly
to logging collector process. However, it's nice to have more suitable way
for external tools to do that, which wouldn't require SQL connection or
knowledge of logging collector pid. This commit implements triggering log
rotation by "pg_ctl logrotate" command.
Discussion: https://postgr.es/m/20180416.115435.28153375.horiguchi.kyotaro%40lab.ntt.co.jp
Author: Kyotaro Horiguchi, Alexander Kuzmenkov, Alexander Korotkov
This reverts the backend sides of commit 1fde38beaa.
I have, at least for now, left the pg_verify_checksums tool in place, as
this tool can be very valuable without the rest of the patch as well,
and since it's a read-only tool that only runs when the cluster is down
it should be a lot safer.
This makes it possible to turn checksums on in a live cluster, without
the previous need for dump/reload or logical replication (and to turn it
off).
Enabling checkusm starts a background process in the form of a
launcher/worker combination that goes through the entire database and
recalculates checksums on each and every page. Only when all pages have
been checksummed are they fully enabled in the cluster. Any failure of
the process will revert to checksums off and the process has to be
started.
This adds a new WAL record that indicates the state of checksums, so
the process works across replicated clusters.
Authors: Magnus Hagander and Daniel Gustafsson
Review: Tomas Vondra, Michael Banck, Heikki Linnakangas, Andrey Borodin
THis adds a "flags" field to the BackgroundWorkerInitializeConnection()
and BackgroundWorkerInitializeConnectionByOid(). For now only one flag,
BGWORKER_BYPASS_ALLOWCONN, is defined, which allows the worker to ignore
datallowconn.
Fix the warnings created by the compiler warning options
-Wformat-overflow=2 -Wformat-truncation=2, supported since GCC 7. This
is a more aggressive variant of the fixes in
6275f5d28a, which GCC 7 warned about by
default.
The issues are all harmless, but some dubious coding patterns are
cleaned up.
One issue that is of external interest is that BGW_MAXLEN is increased
from 64 to 96. Apparently, the old value would cause the bgw_name of
logical replication workers to be truncated in some circumstances.
But this doesn't actually add those warning options. It appears that
the warnings depend a bit on compilation and optimization options, so it
would be annoying to have to keep up with that. This is more of a
once-in-a-while cleanup.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Autovacuum's 'workitem' request queue is of limited size, so requests
can fail if they arrive more quickly than autovacuum can process them.
Emit a log message when this happens, to provide better visibility of
this.
Backpatch to 10. While this represents an API change for
AutoVacuumRequestWork, that function is not yet prepared to deal with
external modules calling it, so there doesn't seem to be any risk (other
than log spam, that is.)
Author: Masahiko Sawada
Reviewed-by: Fabrízio Mello, Ildar Musin, Álvaro Herrera
Discussion: https://postgr.es/m/CAD21AoB1HrQhp6_4rTyHN5kWEJCEsG8YzsjZNt-ctoXSn5Uisw@mail.gmail.com
Add bgw_type field to background worker structure. It is intended to be
set to the same value for all workers of the same type, so they can be
grouped in pg_stat_activity, for example.
The backend_type column in pg_stat_activity now shows bgw_type for a
background worker. The ps listing also no longer calls out that a
process is a background worker but just show the bgw_type. That way,
being a background worker is more of an implementation detail now that
is not shown to the user. However, most log messages still refer to
'background worker "%s"'; otherwise constructing sensible and
translatable log messages would become tricky.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4d wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:
* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
than the expected column 33.
On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list. This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.
There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses. I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
To prevent future bugs along the lines of the one corrected by commit
8ff518699f, or find any that remain
in the current code, add an Assert() that the difference between
parallel_register_count and parallel_terminate_count is in a sane
range.
Kuntal Ghosh, with considerable tidying-up by me, per a suggestion
from Neha Khatri. Reviewed by Tomas Vondra.
Discussion: http://postgr.es/m/CAFO0U+-E8yzchwVnvn5BeRDPgX2z9vZUxQ8dxx9c0XFGBC7N1Q@mail.gmail.com
Previously, only VACUUM would cause a page range to get initially
summarized by BRIN indexes, which for some use cases takes too much time
since the inserts occur. To avoid the delay, have brininsert request a
summarization run for the previous range as soon as the first tuple is
inserted into the first page of the next range. Autovacuum is in charge
of processing these requests, after doing all the regular vacuuming/
analyzing work on tables.
This doesn't impose any new tasks on autovacuum, because autovacuum was
already in charge of doing summarizations. The only actual effect is to
change the timing, i.e. that it occurs earlier. For this reason, we
don't go any great lengths to record these requests very robustly; if
they are lost because of a server crash or restart, they will happen at
a later time anyway.
Most of the new code here is in autovacuum, which can now be told about
"work items" to process. This can be used for other things such as GIN
pending list cleaning, perhaps visibility map bit setting, both of which
are currently invoked during vacuum, but do not really depend on vacuum
taking place.
The requests are at the page range level, a granularity for which we did
not have SQL-level access; we only had index-level summarization
requests via brin_summarize_new_values(). It seems reasonable to add
SQL-level access to range-level summarization too, so add a function
brin_summarize_range() to do that.
Authors: Álvaro Herrera, based on sketch from Simon Riggs.
Reviewed-by: Thomas Munro.
Discussion: https://postgr.es/m/20170301045823.vneqdqkmsd4as4ds@alvherre.pgsql
On EXEC_BACKEND builds, this can fail if ASLR is in use.
Backpatch to 9.5. On master, completely remove the bgw_main field
completely, since there is no situation in which it is safe for an
EXEC_BACKEND build. On 9.6 and 9.5, leave the field intact to avoid
breaking things for third-party code that doesn't care about working
under EXEC_BACKEND. Prior to 9.5, there are no in-core bgworker
entrypoints.
Petr Jelinek, reviewed by me.
Discussion: http://postgr.es/m/09d8ad33-4287-a09b-a77f-77f8761adb5e@2ndquadrant.com
The syslogger will write out the current stderr and csvlog names, if
it's running and there are any, to a new file in the data directory
called "current_logfiles". We take care to remove this file when it
might no longer be valid (but not at shutdown). The function
pg_current_logfile() can be used to read the entries in the file.
Gilles Darold, reviewed and modified by Karl O. Pinc, Michael
Paquier, and me. Further review by Álvaro Herrera and Christoph Berg.
Tom Lane observed buildfarm failures caused by the select_parallel
regression test trying to launch new parallel queries before the
worker slots used by the previous ones were freed. Try to fix this by
having the postmaster free the worker slots before it sends the
SIGUSR1 notifications to the registering process. This doesn't
completely eliminate the possibility that the user backend might
(correctly) observe the worker as dead before the slot is free, but I
believe it should make the window significantly narrower.
Patch by me, per complaint from Tom Lane. Reviewed by Amit Kapila.
Discussion: http://postgr.es/m/30673.1487310734@sss.pgh.pa.us
Increase the default value of the existing max_worker_processes GUC
from 8 to 16, and add a new max_parallel_workers GUC with a maximum
of 8. This way, even if the maximum amount of parallel query is
happening, there is still room for background workers that do other
things, as originally envisioned when max_worker_processes was added.
Julien Rouhaud, reviewed by Amit Kapila and by revised by me.
"xlog" is not a particularly clear abbreviation for "write-ahead log",
and it sometimes confuses users into believe that the contents of the
"pg_xlog" directory are not critical data, leading to unpleasant
consequences. So, rename the directory to "pg_wal".
This patch modifies pg_upgrade and pg_basebackup to understand both
the old and new directory layouts; the former is necessary given the
purpose of the tool, while the latter merely avoids an unnecessary
backward-compatibility break.
We may wish to consider renaming other programs, switches, and
functions which still use the old "xlog" naming to also refer to
"wal". However, that's still under discussion, so let's do just this
much for now.
Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com
Michael Paquier
Commits 470d886c3 et al intended to fix the problem that the postmaster
selected the same "random" DSM control segment ID on every start. But
using PostmasterRandom() for that destroys the intended property that the
delay between random_start_time and random_stop_time will be unpredictable.
(Said delay is probably already more predictable than we could wish, but
that doesn't mean that reducing it by a couple orders of magnitude is OK.)
Revert the previous patch and add a comment warning against misuse of
PostmasterRandom. Fix the original problem by calling srandom() early in
PostmasterMain, using a low-security seed that will later be overwritten
by PostmasterRandom.
Discussion: <20789.1474390434@sss.pgh.pa.us>
Pinning/Unpinning a buffer is a very frequent operation; especially in
read-mostly cache resident workloads. Benchmarking shows that in various
scenarios the spinlock protecting a buffer header's state becomes a
significant bottleneck. The problem can be reproduced with pgbench -S on
larger machines, but can be considerably worse for queries which touch
the same buffers over and over at a high frequency (e.g. nested loops
over a small inner table).
To allow atomic operations to be used, cram BufferDesc's flags,
usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable;
that allows to manipulate them together using 32bit compare-and-swap
operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could
be lifted by using a 64bit field, but it's not a realistic configuration
atm).
As not all operations can easily implemented in a lockfree manner,
implement the previous buf_hdr_lock via a flag bit in the atomic
variable. That way we can continue to lock the header in places where
it's needed, but can get away without acquiring it in the more frequent
hot-paths. There's some additional operations which can be done without
the lock, but aren't in this patch; but the most important places are
covered.
As bufmgr.c now essentially re-implements spinlocks, abstract the delay
logic from s_lock.c into something more generic. It now has already two
users, and more are coming up; there's a follupw patch for lwlock.c at
least.
This patch is based on a proof-of-concept written by me, which Alexander
Korotkov made into a fully working patch; the committed version is again
revised by me. Benchmarking and testing has, amongst others, been
provided by Dilip Kumar, Alexander Korotkov, Robert Haas.
On a large x86 system improvements for readonly pgbench, with a high
client count, of a factor of 8 have been observed.
Author: Alexander Korotkov and Andres Freund
Discussion: 2400449.GjM57CE0Yg@dinodell
Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al. But unfortunately the
increased flush rate can have a significant negative performance impact,
I have measured up to a factor of ~4. The reason for this slowdown is
that if there are independent writes to the underlying devices, for
example because shared buffers is a lot smaller than the hot data set,
or because a checkpoint is ongoing, the fdatasync() calls force cache
flushes to be emitted to the storage.
This is achieved by flushing WAL only if the last flush was longer than
wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC)
unflushed blocks are pending. Based on some tests the default for
wal_writer_delay is 1MB, which seems to work well both on SSD and
rotational media.
To avoid negative performance impact due to 4de82f7d7 an earlier
commit (db76b1e) made SetHintBits() more likely to succeed; preventing
performance regressions in the pgbench tests I performed.
Discussion: 20160118163908.GW10941@awork2.anarazel.de
Up until now, the total amount of data that could be passed to a
background worker at startup was one datum, which can be a small as
4 bytes on some systems. That's enough to pass a dsm_handle or an
array index, but not much else. Add a bgw_extra flag to the
BackgroundWorker struct, allowing up to 128 bytes to be passed to
a new worker on any platform.
Use this to fix a problem I recently discovered with the parallel
context machinery added in 9.5: the master assigns each worker an
array index, and each worker subsequently assigns itself an array
index, and there's nothing to guarantee that the two sets of indexes
match, leading to chaos.
Normally, I would not back-patch the change to add bgw_extra, since it
is basically a feature addition. However, since 9.5 is still in beta
and there seems to be no other sensible way to repair the broken
parallel context machinery, back-patch to 9.5. Existing background
worker code can ignore the bgw_extra field without a problem, but
might need to be recompiled since the structure size has changed.
Report and patch by me. Review by Amit Kapila.
Previously, we would archive the possible-incomplete WAL segment with its
normal filename, but that causes trouble if the server owning that timeline
is still running, and tries to archive the same segment later. It's not nice
for the standby to trip up the master's archival like that. And it's pretty
confusing, anyway, to have an incomplete segment in the archive that's
indistinguishable from a normal, complete segment.
To avoid such confusion, add a .partial suffix to the file. Or to be more
precise, make a copy of the old segment under the .partial suffix, and
archive that instead of the original file. pg_receivexlog also uses the
.partial suffix for the same purpose, to tell apart incompletely streamed
files from complete ones.
There is no automatic mechanism to use the .partial files at recovery, so
they will go unused, unless the administrator manually copies to them to
the pg_xlog directory (and removes the .partial suffix). Recovery won't
normally need the WAL - when recovering to the new timeline, it will find
the same WAL on the first segment on the new timeline instead - but it
nevertheless feels better to archive the file with the .partial suffix, for
debugging purposes if nothing else.
This does four basic things. First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers. Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes. Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.
Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
This improves on commit bbfd7edae5 by
making two simple changes:
* pg_attribute_noreturn now takes parentheses, ie pg_attribute_noreturn().
Likewise pg_attribute_unused(), pg_attribute_packed(). This reduces
pgindent's tendency to misformat declarations involving them.
* attributes are now always attached to function declarations, not
definitions. Previously some places were taking creative shortcuts,
which were not merely candidates for bad misformatting by pgindent
but often were outright wrong anyway. (It does little good to put a
noreturn annotation where callers can't see it.) In any case, if
we would like to believe that these macros can be used with non-gcc
compilers, we should avoid gratuitous variance in usage patterns.
I also went through and manually improved the formatting of a lot of
declarations, and got rid of excessively repetitive (and now obsolete
anyway) comments informing the reader what pg_attribute_printf is for.
Until now __attribute__() was defined to be empty for all compilers but
gcc. That's problematic because it prevents using it in other compilers;
which is necessary e.g. for atomics portability. It's also just
generally dubious to do so in a header as widely included as c.h.
Instead add pg_attribute_format_arg, pg_attribute_printf,
pg_attribute_noreturn macros which are implemented in the compilers that
understand them. Also add pg_attribute_noreturn and pg_attribute_packed,
but don't provide fallbacks, since they can affect functionality.
This means that external code that, possibly unwittingly, relied on
__attribute__ defined to be empty on !gcc compilers may now run into
warnings or errors on those compilers. But there shouldn't be many
occurances of that and it's hard to work around...
Discussion: 54B58BA3.8040302@ohmu.fi
Author: Oskari Saarenmaa, with some minor changes by me.
Replace some bogus "x[1]" declarations with "x[FLEXIBLE_ARRAY_MEMBER]".
Aside from being more self-documenting, this should help prevent bogus
warnings from static code analyzers and perhaps compiler misoptimizations.
This patch is just a down payment on eliminating the whole problem, but
it gets rid of a lot of easy-to-fix cases.
Note that the main problem with doing this is that one must no longer rely
on computing sizeof(the containing struct), since the result would be
compiler-dependent. Instead use offsetof(struct, lastfield). Autoconf
also warns against spelling that offsetof(struct, lastfield[0]).
Michael Paquier, review and additional fixes by me.
Sometimes it's useful for a background worker to be able to initialize
its database connection by OID rather than by name, so provide a way
to do that.
Just as we would start bgworkers immediately after an initial startup
of the server, we should restart them immediately when reinitializing.
Petr Jelinek and Robert Haas
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
Although previously-introduced APIs allow the process that registers a
background worker to obtain the worker's PID, there's no way to prevent
a worker that is not currently running from being restarted. This
patch introduces a new API TerminateBackgroundWorker() that prevents
the background worker from being restarted, terminates it if it is
currently running, and causes it to be unregistered if or when it is
not running.
Patch by me. Review by Michael Paquier and KaiGai Kohei.
Using the infrastructure provided by this patch, it's possible either
to wait for the startup of a dynamically-registered background worker,
or to poll the status of such a worker without waiting. In either
case, the current PID of the worker process can also be obtained.
As usual, worker_spi is updated to demonstrate the new functionality.
Patch by me. Review by Andres Freund.
Previously one had to use slist_delete(), implying an additional scan of
the list, making this infrastructure considerably less efficient than
traditional Lists when deletion of element(s) in a long list is needed.
Modify the slist_foreach_modify() macro to support deleting the current
element in O(1) time, by keeping a "prev" pointer in addition to "cur"
and "next". Although this makes iteration with this macro a bit slower,
no real harm is done, since in any scenario where you're not going to
delete the current list element you might as well just use slist_foreach
instead. Improve the comments about when to use each macro.
Back-patch to 9.3 so that we'll have consistent semantics in all branches
that provide ilist.h. Note this is an ABI break for callers of
slist_foreach_modify().
Andres Freund and Tom Lane
Per discussion on pgsql-hackers, these aren't really needed. Interim
versions of the background worker patch had the worker starting with
signals already unblocked, which would have made this necessary.
But the final version does not, so we don't really need it; and it
doesn't work well with the new facility for starting dynamic background
workers, so just rip it out.
Also per discussion on pgsql-hackers, back-patch this change to 9.3.
It's best to get the API break out of the way before we do an
official release of this facility, to avoid more pain for extension
authors later.
There is a new API, RegisterDynamicBackgroundWorker, which allows
an ordinary user backend to register a new background writer during
normal running. This means that it's no longer necessary for all
background workers to be registered during processing of
shared_preload_libraries, although the option of registering workers
at that time remains available.
When a background worker exits and will not be restarted, the
slot previously used by that background worker is automatically
released and becomes available for reuse. Slots used by background
workers that are configured for automatic restart can't (yet) be
released without shutting down the system.
This commit adds a new source file, bgworker.c, and moves some
of the existing control logic for background workers there.
Previously, there was little enough logic that it made sense to
keep everything in postmaster.c, but not any more.
This commit also makes the worker_spi contrib module into an
extension and adds a new function, worker_spi_launch, which can
be used to demonstrate the new facility.
Commit da07a1e8 was broken for EXEC_BACKEND because I failed to realize
that the MaxBackends recomputation needed to be duplicated by
subprocesses in SubPostmasterMain. However, instead of having the value
be recomputed at all, it's better to assign the correct value at
postmaster initialization time, and have it be propagated to exec'ed
backends via BackendParameters.
MaxBackends stays as zero until after modules in
shared_preload_libraries have had a chance to register bgworkers, since
the value is going to be untrustworthy till that's finished.
Heikki Linnakangas and Álvaro Herrera
Background workers are postmaster subprocesses that run arbitrary
user-specified code. They can request shared memory access as well as
backend database connections; or they can just use plain libpq frontend
database connections.
Modules listed in shared_preload_libraries can register background
workers in their _PG_init() function; this is early enough that it's not
necessary to provide an extra GUC option, because the necessary extra
resources can be allocated early on. Modules can install more than one
bgworker, if necessary.
Care is taken that these extra processes do not interfere with other
postmaster tasks: only one such process is started on each ServerLoop
iteration. This means a large number of them could be waiting to be
started up and postmaster is still able to quickly service external
connection requests. Also, shutdown sequence should not be impacted by
a worker process that's reasonably well behaved (i.e. promptly responds
to termination signals.)
The current implementation lets worker processes specify their start
time, i.e. at what point in the server startup process they are to be
started: right after postmaster start (in which case they mustn't ask
for shared memory access), when consistent state has been reached
(useful during recovery in a HOT standby server), or when recovery has
terminated (i.e. when normal backends are allowed).
In case of a bgworker crash, actions to take depend on registration
data: if shared memory was requested, then all other connections are
taken down (as well as other bgworkers), just like it were a regular
backend crashing. The bgworker itself is restarted, too, within a
configurable timeframe (which can be configured to be never).
More features to add to this framework can be imagined without much
effort, and have been discussed, but this seems good enough as a useful
unit already.
An elementary sample module is supplied.
Author: Álvaro Herrera
This patch is loosely based on prior patches submitted by KaiGai Kohei,
and unsubmitted code by Simon Riggs.
Reviewed by: KaiGai Kohei, Markus Wanner, Andres Freund,
Heikki Linnakangas, Simon Riggs, Amit Kapila
Replace unix_socket_directory with unix_socket_directories, which is a list
of socket directories, and adjust postmaster's code to allow zero or more
Unix-domain sockets to be created.
This is mostly a straightforward change, but since the Unix sockets ought
to be created after the TCP/IP sockets for safety reasons (better chance
of detecting a port number conflict), AddToDataDirLockFile needs to be
fixed to support out-of-order updates of data directory lockfile lines.
That's a change that had been foreseen to be necessary someday anyway.
Honza Horak, reviewed and revised by Tom Lane
In all branches back to 8.3, this patch fixes a questionable assumption in
CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue that there are
no uninitialized pad bytes in the request queue structs. This would only
cause trouble if (a) there were such pad bytes, which could happen in 8.4
and up if the compiler makes enum ForkNumber narrower than 32 bits, but
otherwise would require not-currently-planned changes in the widths of
other typedefs; and (b) the kernel has not uniformly initialized the
contents of shared memory to zeroes. Still, it seems a tad risky, and we
can easily remove any risk by pre-zeroing the request array for ourselves.
In addition to that, we need to establish a coding rule that struct
RelFileNode can't contain any padding bytes, since such structs are copied
into the request array verbatim. (There are other places that are assuming
this anyway, it turns out.)
In 9.1 and up, the risk was a bit larger because we were also effectively
assuming that struct RelFileNodeBackend contained no pad bytes, and with
fields of different types in there, that would be much easier to break.
However, there is no good reason to ever transmit fsync or delete requests
for temp files to the bgwriter/checkpointer, so we can revert the request
structs to plain RelFileNode, getting rid of the padding risk and saving
some marginal number of bytes and cycles in fsync queue manipulation while
we are at it. The savings might be more than marginal during deletion of
a temp relation, because the old code transmitted an entirely useless but
nonetheless expensive-to-process ForgetRelationFsync request to the
background process, and also had the background process perform the file
deletion even though that can safely be done immediately.
In addition, make some cleanup of nearby comments and small improvements to
the code in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue.
There was a wild mix of calling conventions: Some were declared to
return void and didn't return, some returned an int exit code, some
claimed to return an exit code, which the callers checked, but
actually never returned, and so on.
Now all of these functions are declared to return void and decorated
with attribute noreturn and don't return. That's easiest, and most
code already worked that way.
Commit 6d90eaaa89 added a hibernation mode
to the bgwriter to reduce the server's idle-power consumption. However,
its interaction with the detailed behavior of BgBufferSync's feedback
control loop wasn't very well thought out. That control loop depends
primarily on the rate of buffer allocation, not the rate of buffer
dirtying, so the hibernation mode has to be designed to operate only when
no new buffer allocations are happening. Also, the check for whether the
system is effectively idle was not quite right and would fail to detect
a constant low level of activity, thus allowing the bgwriter to go into
hibernation mode in a way that would let the cycle time vary quite a bit,
possibly further confusing the feedback loop. To fix, move the wakeup
support from MarkBufferDirty and SetBufferCommitInfoNeedsSave into
StrategyGetBuffer, and prevent the bgwriter from entering hibernation mode
unless no buffer allocations have happened recently.
In addition, fix the delaying logic to remove the problem of possibly not
responding to signals promptly, which was basically caused by trying to use
the process latch's is_set flag for multiple purposes. I can't prove it
but I'm suspicious that that hack was responsible for the intermittent
"postmaster does not shut down" failures we've been seeing in the buildfarm
lately. In any case it did nothing to improve the readability or
robustness of the code.
In passing, express the hibernation sleep time as a multiplier on
BgWriterDelay, not a constant. I'm not sure whether there's any value in
exposing the longer sleep time as an independently configurable setting,
but we can at least make it act like this for little extra code.
bgwriter is now a much less important process, responsible for page
cleaning duties only. checkpointer is now responsible for checkpoints
and so has a key role in shutdown. Later patches will correct doc
references to the now old idea that bgwriter performs checkpoints.
Has beneficial effect on performance at high write rates, but mainly
refactoring to more easily allow changes for power reduction by
simplifying previously tortuous code around required to allow page
cleaning and checkpointing to time slice in the same process.
Patch by me, Review by Dickson Guedes
detect postmaster death. Postmaster keeps the write-end of the pipe open,
so when it dies, children get EOF in the read-end. That can conveniently
be waited for in select(), which allows eliminating some of the polling
loops that check for postmaster death. This patch doesn't yet change all
the loops to use the new mechanism, expect a follow-on patch to do that.
This changes the interface to WaitLatch, so that it takes as argument a
bitmask of events that it waits for. Possible events are latch set, timeout,
postmaster death, and socket becoming readable or writeable.
The pipe method behaves slightly differently from the kill() method
previously used in PostmasterIsAlive() in the case that postmaster has died,
but its parent has not yet read its exit code with waitpid(). The pipe
returns EOF as soon as the process dies, but kill() continues to return
true until waitpid() has been called (IOW while the process is a zombie).
Because of that, change PostmasterIsAlive() to use the pipe too, otherwise
WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while
PostmasterIsAlive() would claim it's still alive. That could easily lead to
busy-waiting while postmaster is in zombie state.
Peter Geoghegan with further changes by me, reviewed by Fujii Masao and
Florian Pflug.
This allows us to reliably remove all leftover temporary relation
files on cluster startup without reference to system catalogs or WAL;
therefore, we no longer include temporary relations in XLOG_XACT_COMMIT
and XLOG_XACT_ABORT WAL records.
Since these changes require including a backend ID in each
SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id
field has been reduced from two bytes to one, and the maximum number
of connections has been reduced from INT_MAX / 4 to 2^23-1. It would
be possible to remove these restrictions by increasing the size of
SharedInvalidationMessage by 4 bytes, but right now that doesn't seem
like a good trade-off.
Review by Jaime Casanova and Tom Lane.
Normally, we automatically restart after a backend crash, but in some
cases when PostgreSQL is invoked by clusterware it may be desirable to
suppress this behavior, so we provide an option which does this.
Since no existing GUC group quite fits, create a new group called
"error handling options" for this and the previously undocumented GUC
exit_on_error, which is now documented.
Review by Fujii Masao.
log files created by the syslogger process.
In passing, make unix_file_permissions display its value in octal, same
as log_file_mode now does.
Martin Pihlak
build actually attempts to advertise itself via Bonjour. Formerly it always
did so, which meant that packagers had to decide for their users whether
this behavior was wanted or not. The default is "off" to be on the safe
side, though this represents a change in the default behavior of a
Bonjour-enabled build. Per discussion.
via the "flat files" facility. This requires making it enough like a backend
to be able to run transactions; it's no longer an "auxiliary process" but
more like the autovacuum worker processes. Also, its signal handling has
to be brought into line with backends/workers. In particular, since it
now has to handle procsignal.c processing, the special autovac-launcher-only
signal conditions are moved to SIGUSR2.
Alvaro, with some cleanup from Tom
a backend has done exit(0) or exit(1) without having disengaged itself
from shared memory. We are at risk for this whenever third-party code is
loaded into a backend, since such code might not know it's supposed to go
through proc_exit() instead. Also, it is reported that under Windows
there are ways to externally kill a process that cause the status code
returned to the postmaster to be indistinguishable from a voluntary exit
(thank you, Microsoft). If this does happen then the system is probably
hosed --- for instance, the dead session might still be holding locks.
So the best recovery method is to treat this like a backend crash.
The dead man switch is armed for a particular child process when it
acquires a regular PGPROC, and disarmed when the PGPROC is released;
these should be the first and last touches of shared memory resources
in a backend, or close enough anyway. This choice means there is no
coverage for auxiliary processes, but I doubt we need that, since they
shouldn't be executing any user-provided code anyway.
This patch also improves the management of the EXEC_BACKEND
ShmemBackendArray array a bit, by reducing search costs.
Although this problem is of long standing, the lack of field complaints
seems to mean it's not critical enough to risk back-patching; at least
not till we get some more testing of this mechanism.
of multiple forks, and each fork can be created and grown separately.
The bulk of this patch is about changing the smgr API to include an extra
ForkNumber argument in every smgr function. Also, smgrscheduleunlink and
smgrdounlink no longer implicitly call smgrclose, because other forks might
still exist after unlinking one. The callers of those functions have been
modified to call smgrclose instead.
This patch in itself doesn't have any user-visible effect, but provides the
infrastructure needed for upcoming patches. The additional forks envisioned
are a rewritten FSM implementation that doesn't rely on a fixed-size shared
memory block, and a visibility map to allow skipping portions of a table in
VACUUM that have no dead tuples.
* stats_start_collector goes away; we always start the collector process,
unless prevented by a problem with setting up the stats UDP socket.
* stats_reset_on_server_start goes away; it seems useless in view of the
availability of pg_stat_reset().
* stats_block_level and stats_row_level are merged into a single variable
"track_counts", which controls all reports sent to the collector process.
* stats_command_string is renamed to track_activities.
* log_autovacuum is renamed to log_autovacuum_min_duration to better reflect
its meaning.
The log_autovacuum change is not a compatibility issue since it didn't exist
before 8.3 anyway. The other changes need to be release-noted.
and fsync WAL at convenient intervals. For the moment it just tries to
offload this work from backends, but soon it will be responsible for
guaranteeing a maximum delay before asynchronously-committed transactions
will be flushed to disk.
This is a portion of Simon Riggs' async-commit patch, committed to CVS
separately because a background WAL writer seems like it might be a good idea
independently of the async-commit feature. I rebased walwriter.c on
bgwriter.c because it seemed like a more appropriate way of handling signals;
while the startup/shutdown logic in postmaster.c is more like autovac because
we want walwriter to quit before we start the shutdown checkpoint.
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.
Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.
Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
by having the postmaster signal it when certain failures occur. This requires
the postmaster setting a flag in shared memory, but should be as safe as the
pmsignal.c code is.
Also make sure the launcher honor's a postgresql.conf change turning it off
on SIGHUP.
reassembled in the syslogger before writing to the log file. This prevents
partial messages from being written, which mucks up log rotation, and
messages from different backends being interleaved, which causes garbled
logs. Backport as far as 8.0, where the syslogger was introduced.
Tom Lane and Andrew Dunstan
processes to be running simultaneously. Also, now autovacuum processes do not
count towards the max_connections limit; they are counted separately from
regular processes, and are limited by the new GUC variable
autovacuum_max_workers.
The launcher now has intelligence to launch workers on each database every
autovacuum_naptime seconds, limited only on the max amount of worker slots
available.
Also, the global worker I/O utilization is limited by the vacuum cost-based
delay feature. Workers are "balanced" so that the total I/O consumption does
not exceed the established limit. This part of the patch was contributed by
ITAGAKI Takahiro.
Per discussion.