When pg_{enable|disable}_data_checksums is called while checksums are
being enabled or disabled, the already running launcher is detected
and the new desired state is recorded. Processing will then pick up
the new state and change its operation to fulfill the new request.
If the same state is requested but with different cost values, the
new cost values will take effect on the next relation processed.
The previous coding had a complex logic of starting a new launcher
for this, which is now avoided with the shared mem structure instead
used to signal current processing.
This makes the logic more robust, and fixes a bug where the launcher
would erroneously revert back to the "off" state.
Access to the shared memory is also protected with LWLocks in all
cases. Since the shmem structure is used for signalling between
the worker and the launcher, and there can be only one of each,
there were no concurrency issues detected but it's better to stick
to proper locking protocol should this ever be updated to handle
multiple workers.
Author: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>
Reviewed-by: Ayush Tiwari <ayushtiwari.slg01@gmail.com>
Discussion: https://postgr.es/m/9197F930-DDEB-4CAC-82A2-16FEC715CCE8@yesql.se
remove_self_join_rel() called adjust_relid_set() on all_result_relids
and leaf_result_relids but threw away the return value. Since
adjust_relid_set() returns a freshly-built Relids and does not modify
the input in place, the calls did nothing. This has been the case
since the SJE feature went in (commit fc069a3a6).
There has been no observable misbehavior, because the relid being
passed is guaranteed not to be a member of either set. At the point
remove_self_join_rel() runs, those sets contain only resultRelation;
inheritance children have not been added yet, as that happens later in
query_planner(), in expand_single_inheritance_child() called from
add_other_rels_to_query(). And remove_self_joins_recurse() rejects
parse->resultRelation as an SJE candidate to preserve the EvalPlanQual
mechanism. Even with the result assigned, the calls would be no-ops
in practice.
Rather than make the calls do the cleanup they pretend to do, replace
them with assertions of the invariant. Any future loosening of the
SJE candidate filter -- for instance to allow eliminating a result
relation under provable conditions -- will trip the assertion and
force whoever does it to revisit this code.
Additionally, decorate adjust_relid_set() with pg_nodiscard so that
any future accidental discard of its return value is caught at compile
time.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49fYQcqJfJ_Gtn8r1GFNoYtb1=2AUab4ieuqY4Zid9ocQ@mail.gmail.com
ExecEvalHashedScalarArrayOp(), when using a strict equality function,
performs a short-circuit when looking up NULL values. When the function
is non-strict, the code incorrectly looked up the hash table for a
zero-valued Datum, which could have resulted in an accidental true
return if the hash table contained zero valued Datum, or could result
in a crash for non-byval types.
Here we fix this by adding an extra step when we build the hash table to
check what the result of a NULL lookup would be. This requires looping
over the array and checking what the non-hashed version of the code
would do. We cache the results of that in the expression so that we can
reuse the result any time we're asked to search for a NULL value.
It's important to note that non-strict equality functions are free to
treat any NULL value as equal to any non-NULL value. For example,
someone may wish to design a type that treats an empty string and NULL
as equal.
All built-in types have strict equality functions, so this could affect
custom / user-defined types.
Author: Chengpeng Yan <chengpeng_yan@outlook.com>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/A16187AE-2359-4265-9F5E-71D015EC2B2D@outlook.com
Backpatch-through: 14
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in a few places. Most of
these inconsistencies were introduced during Postgres 19 development.
This commit was written with help from clang-tidy, by mechanically
applying the same rules as similar clean-up commits (the earliest such
commit was commit 035ce1fe).
This batch is similar to 462fe0ff62 and addresses a variety of code
style issues, including grammar mistakes, typos, inconsistent variable
names in function declarations, and incorrect function names in comments
and documentation. These fixes have accumulated on the community
mailing lists since the commit mentioned above.
Notably, Alexander Lakhin previously submitted a patch identifying many
of the trivial typos and grammar issues that had been reported on
pgsql-hackers. His patch covered a somewhat large portion of the issues
addressed here, though not all of them.
The documentation changes only affect HEAD.
The argument was marked as "const void *X", but that might rightly
give the compiler the idea that *X cannot be modified through the
resulting Datum, and make incorrect optimizations based on that. Some
functions use pointer Datums to pass output arguments, like GIN
support functions. Coverity started to complain after commit
6f5ad00ab7 that there's dead code in ginExtractEntries(), because it
didn't see that it passes PointerGetDatum(&nentries) to a function
that sets it.
This issue goes back to commit c8b2ef05f4 (version 16), which
changed PointerGetDatum() from a macro to a static inline function.
This commit changes it back to a macro, but uses a trick with a dummy
conditional expression to still produce a compiler error if you try to
pass a non-pointer as the argument.
Even though this goes back to v16, I'm only committing this to
'master' for now, to verify that this silences the Coverity warning.
If this works, we might want to introduce separate const and non-const
versions of PointerGetDatum() instead of this, but that's a bigger
patch. It's also not decided yet whether to back-patch this (or some
other fix), given that we haven't yet seen any hard evidence of
compilers actually producing buggy code because of this.
Discussion: https://www.postgresql.org/message-id/342012.1776017102@sss.pgh.pa.us
The backend running REPACK can check DecodingWorkerShared->initialized
before the worker could have the chance to initialize it, possibly
leading to wrong behavior.
While at it, remove DecodingWorkerShared->worker_dsm_segment, because
that doesn't actually need to be in shared memory; a simple local-memory
global variable is enough.
Oversights in commit 28d534e2ae.
Author: Antonin Houska <ah@cybertec.at>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/18181295-8375-4789-ad32-269d78d6001e@gmail.com
Commit 095c9d4cf06 added errdetail() reporting of the PID and UID of
the process that sent a termination signal. However, as noted by
Andres Freund, the implementation had architectural problems:
1. wrapper_handler() in pqsignal.c contained SIGTERM-specific logic
(setting ProcDieSenderPid/Uid), violating its role as a generic
signal dispatch wrapper.
2. Using globals to pass sender info between wrapper_handler and the
real handler is unsafe when signals nest on some platforms.
3. The syncrep.c errdetail used psprintf() to conditionally embed
text via %s, breaking translatability.
Adopt the approach proposed by Andres Freund: introduce a
pg_signal_info struct that is passed as an argument to all signal
handlers via the SIGNAL_ARGS macro. wrapper_handler populates it
from siginfo_t when SA_SIGINFO is available, or with zeros otherwise.
This keeps wrapper_handler fully generic and avoids any globals for
passing signal metadata.
Since pqsigfunc now has a different signature from the system's
signal handler type, SIG_IGN and SIG_DFL can no longer be passed
directly to pqsignal(). Introduce PG_SIG_IGN and PG_SIG_DFL macros
that cast to the new pqsigfunc type, and update all call sites.
The legacy pqsignal() in libpq retains its original signature via
a local typedef.
Only die() reads pg_siginfo today, copying the sender PID/UID into
ProcDieSenderPid/Uid for later use by ProcessInterrupts(). Only the
first SIGTERM's sender info is recorded.
Also fix the syncrep.c translatability issue by using separate ereport
calls with complete, independently translatable errdetail strings.
Also make the psql TAP test require the DETAIL line on platforms with
SA_SIGINFO, rather than making it unconditionally optional.
On Windows, pg_signal_info uses uint32_t for pid and uid fields
since pid_t/uid_t are not available early enough in the include
chain. The Windows signal dispatch in pgwin32_dispatch_queued_signals()
passes a zeroed pg_signal_info to handlers.
Author: Andres Freund <andres@anarazel.de>
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/cwyyryh2veejuxbj5ifzyaejw7jhhqc5mrdeq56xckknsdecn2@6hzfcxde2nm5
Discussion: https://postgr.es/m/jygesyr7mwg7ovdbxpmjvvbi3hccptpkcreqb645h7f56puwbz@hmkkwi3melfe
The NOTNULL_SOURCE_SYSCACHE code path in var_is_nonnullable() used
get_attnotnull() to check pg_attribute.attnotnull, which is true for
both valid and invalid (NOT VALID) NOT NULL constraints. An invalid
constraint does not guarantee the absence of NULLs, so this could lead
to incorrect results. For example, query_outputs_are_not_nullable()
could wrongly conclude that a subquery's output is non-nullable,
causing NOT IN to be incorrectly converted to an anti-join.
Fix by checking the attnullability field in the relation's tuple
descriptor instead, which correctly distinguishes valid from invalid
constraints, consistent with what the NOTNULL_SOURCE_HASHTABLE code
path already does.
While at it, rename NOTNULL_SOURCE_SYSCACHE to NOTNULL_SOURCE_CATALOG
to reflect that this code path no longer uses a syscache lookup, and
remove the now-unused get_attnotnull() function.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48ALW=mR0ydQ62dGS-Q+3D7WdDSh=EWDezcKp19xi=TUA@mail.gmail.com
Previously we were relying on a snapshot-based check to detect invalid
execution contexts. However, when WAIT FOR is wrapped into a stored
procedure or a DO block, it could pass this check, causing an error
elsewhere.
This commit implements an explicit isTopLevel check to reject WAIT FOR
when called from within a function, procedure, or DO block. The
isTopLevel check catches these cases early with a clear error message,
matching the pattern used by other utility commands like VACUUM and
REINDEX. The snapshot check is retained for the remaining case:
top-level execution within a transaction block using an isolation level
higher than READ COMMITTED.
Also adds tests for WAIT FOR LSN wrapped in a procedure and DO block,
complementing the existing test that uses a function wrapper. Relevant
documentation paragraph is also added.
Reported-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com>
Discussion: https://postgr.es/m/CAHg%2BQDcN-n3NUqgRtj%3DBQb9fFQmH8-DeEROCr%3DPDbw_BBRKOYA%40mail.gmail.com
Author: Satyanarayana Narlapuram <satyanarlapuram@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Commit 21b018e7ea lowered some logical decoding messages from LOG to DEBUG1.
However, per discussion on pgsql-hackers, messages from background activity
(e.g., walsender or slotsync worker) should remain at LOG, as they are less
frequent and more likely to indicate issues that DBAs should notice.
For foreground SQL functions (e.g., pg_logical_slot_peek_binary_changes()),
keep these messages at DEBUG1 to avoid excessive log noise. They can still be
enabled by lowering client_min_messages or log_min_messages for the session.
This commit updates logical decoding to log these messages at LOG for
background activity and at DEBUG1 for foreground execution.
Suggested-by: Robert Haas <robertmhaas@gmail.com>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CA+TgmoYsu2+YAo9eLGkDp5VP-pfQ-jOoX382vS4THKHeRTNgew@mail.gmail.com
Presently, relation_needs_vacanalyze() unconditionally frees the
pgstat entry returned by pgstat_fetch_stat_tabentry_ext(). This
behavior was first added by commit 02502c1bca to avoid memory
leakage in autovacuum. While this is fine for autovacuum since it
forces stats_fetch_consistency to "none", it is not okay for other
callers that use "cache" or "snapshot". This manifests as a
double-free when pg_stat_autovacuum_scores is called multiple times
in the same transaction.
To fix, add a "bool *may_free" parameter to
pgstat_fetch_stat_tabentry_ext() that returns whether it is safe
for the caller to explicitly pfree() the result. If a caller would
rather leave it to the memory context machinery to free the result,
it can pass NULL as the "may_free" argument (or just ignore its
value).
Oversight in commit 87f61f0c82.
Reported-by: Tender Wang <tndrwang@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAHewXNkJKdwb3D5OnksrdOqzqUnXUEMpDam1TPW0vfUkW%3D7jUw%40mail.gmail.com
Discussion: https://postgr.es/m/5684f479-858e-4c5d-b8f5-bcf05de1f909%40gmail.com
This logging level means not to emit the log, which is useful for
functions like relation_needs_vacanalyze(). This function accepts
a log level argument but not all callers want it to emit logs.
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3101163.1775676098%40sss.pgh.pa.us
When pulling up a subquery, its targetlist items may be wrapped in
PlaceHolderVars to enforce separate identity or as a result of outer
joins. This causes any upper-level WHERE clauses referencing these
outputs to contain PlaceHolderVars, which prevents partprune.c from
recognizing that they match partition key columns, defeating partition
pruning.
To fix, strip PlaceHolderVars from operands before comparing them to
partition keys. A PlaceHolderVar with empty phnullingrels appearing
in a relation-scan-level expression is effectively a no-op, so
stripping it is safe. This parallels the existing treatment in
indxpath.c for index matching.
In passing, rename strip_phvs_in_index_operand() to strip_noop_phvs()
and move it from indxpath.c to placeholder.c, since it is now a
general-purpose utility used by both index matching and partition
pruning code.
Back-patch to v18. Although this issue exists before that, changes in
that version made it common enough to notice. Given the lack of field
reports for older versions, I am not back-patching further. In the
v18 back-patch, strip_phvs_in_index_operand() is retained as a thin
wrapper around the new strip_noop_phvs() to avoid breaking third-party
extensions that may reference it.
Reported-by: Cándido Antonio Martínez Descalzo <candido@ninehq.com>
Diagnosed-by: David Rowley <dgrowleyml@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAH5YaUwVUWETTyVECTnhs7C=CVwi+uMSQH=cOkwAUqMdvXdwWA@mail.gmail.com
Backpatch-through: 18
ee642cccc4 has added syscache.h in inval.h and objectaddress.h,
enlarging by a lot the footprint of this header, particularly via
objectaddress.h. A change in syscache.h would cause a lot more files to
be recompiled.
This commit reduces the presence of syscache.h by switching to a direct
use of syscache_ids.h in inval.h and objectaddress.h, where the enum
SysCacheIdentifier is defined. genbki.pl gains an #ifndef block for
this header, so as its inclusion is more controlled.
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/vlcexdcimsmvu3aplt2yxpfndkgtuvjsrms2fdl46rbw3k2kug@drspkoxlaije
Our RADIUS implementation supported only the deprecated RADIUS/UDP
variant, without the recommended Message-Authenticator attribute to
mitigate against the Blast-RADIUS vulnerability. By now, popular RADIUS
servers are expected to generate loud warnings or reject our
authentication attempts outright.
Since there have been no user reports about this, it seems unlikely that
there are users.
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Michael Banck <mbanck@gmx.net>
Discussion: https://postgr.es/m/CA%2BhUKG%2BSH309V8KECU5%3DxuLP9Dks0v9f9UVS2W74fPAE5O21dg%40mail.gmail.com
Add a new FDW callback routine that allows importing remote statistics
for a foreign table directly to the local server, instead of collecting
statistics locally. The new callback routine is called at the beginning
of the ANALYZE operation on the table, and if the FDW failed to import
the statistics, the existing callback routine is called on the table to
collect statistics locally.
Also implement this for postgres_fdw. It is enabled by "restore_stats"
option both at the server and table level. Currently, it is the user's
responsibility to ensure remote statistics to import are up-to-date, so
the default is false.
Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Etsuro Fujita <etsuro.fujita@gmail.com>
Discussion: https://postgr.es/m/CADkLM%3DchrYAx%3DX2KUcDRST4RLaRLivYDohZrkW4LLBa0iBhb5w%40mail.gmail.com
The size of the I/O worker pool used to implement io_method=worker was
previously controlled by the io_workers setting, defaulting to 3. It
was hard to know how to tune it effectively. That is replaced with:
io_min_workers=2
io_max_workers=8 (up to 32)
io_worker_idle_timeout=60s
io_worker_launch_interval=100ms
The pool is automatically sized within the configured range according to
recent variation in demand. It grows when existing workers detect that
latency might be introduced by queuing, and shrinks when the
highest-numbered worker is idle for too long. Work was already
concentrated into low-numbered workers in anticipation of this logic.
The logic for waking extra workers now also tries to measure and reduce
the number of spurious wakeups, though they are not entirely eliminated.
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
This was useful before widespread Unicode adoption, and was based on the
internal encoding Emacs used to mix multiple sub-encodings. Emacs
itself has stopped using it, and our implementation hadn't been updated
with modern underlying standards. It is thought to be very unlikely
that anyone is still using it in the field. Since such a complex
encoding comes with costs and risks, we agreed to drop support.
Any existing database using this encoding would need to be dumped and
restored with a new encoding to upgrade to PostgreSQL 19, most likely
UTF8, since pg_upgrade would fail.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/CA%2BhUKGKXDXh-FdU0orjfv%2BF08f%3DD91BhV3Ra-4zL-q%2BJmGYqTA%40mail.gmail.com
Until now extensions that wanted to measure overall query execution could
create QueryDesc->totaltime, which the core executor would then start and
stop. That's a bit odd and composes badly, e.g. extensions always had to use
INSTRUMENT_ALL, because otherwise another extension might not get what they
need.
Instead this introduces a new field, QueryDesc->query_instr_options, that
extensions can use to indicate whether they need query level instrumentation
populated, and with which instrumentation options. Extensions should take care
to only add options they need, instead of replacing the options of others.
The prior name of the field, totaltime, sounded like it would only measure
time, but these days the instrumentation infrastructure can track more
resources. The secondary benefit is that this will make it obvious to
extensions that they may not create the Instrumentation struct themselves
anymore (often extensions build only against a postgres build without
assertions).
Adjust pg_stat_statements and auto_explain to match, and lower the
requested instrumentation level for auto_explain to INSTRUMENT_TIMER,
since the summary instrumentation it needs is only runtime.
The reason to push this now, rather in the PG 20 cycle, is that 5a79e78501
already required extensions using query level instrumentations to adjust their
code, and it seemed undesirable to require them to do so again for 20.
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53Pkyqsht+exJQYRsjhSWYKu+vFGHhPub7m6PmFD6Or0=p1g@mail.gmail.com
Previously, on standby promotion, the startup process sent SIGUSR1 to
the slotsync worker (or a backend performing slot synchronization) and
waited for it to exit. This worked in most cases, but if the process was
blocked waiting for a response from the primary (e.g., due to a network
failure), SIGUSR1 would not interrupt the wait. As a result, the process
could remain stuck, causing the startup process to wait for a long time
and delaying promotion.
This commit fixes the issue by introducing a new procsignal reason,
PROCSIG_SLOTSYNC_MESSAGE. On promotion, the startup process
sends this signal, and the handler sets interrupt flags so the process
exits (or errors out) promptly at CHECK_FOR_INTERRUPTS(), allowing
promotion to complete without delay.
Backpatch to v17, where slotsync was introduced.
Author: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com
Backpatch-through: 17
This moves the implementation of ExecProcNodeInstr, the ExecProcNode variant
that gets used when instrumentation is on, to be defined in instrument.c
instead of execProcNode.c, and marks functions it uses as inline.
This allows compilers to generate an optimized implementation, and shows a 4
to 12% reduction in instrumentation overhead for queries that move lots of
rows.
Author: Lukas Fittl <lukas@fittl.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
Adds support for EXPLAIN (IO) instrumentation for TidRange scans. This
requires adding shared instrumentation for parallel scans, using the
separate DSM approach introduced by dd78e69cfc.
Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
This adds support to pg_test_timing for the different timing sources added by
294520c444.
Author: Lukas Fittl <lukas@fittl.com>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> (in an earlier version)
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
Adds support for EXPLAIN (IO) instrumentation for sequential scans. This
requires adding shared instrumentation, using the separate DSM approach
introduced by dd78e69cfc.
Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
Allows collecting details about AIO / prefetch for scan nodes backed by
a ReadStream. This may be enabled by a new "IO" option in EXPLAIN, and
it shows information about the prefetch distance and I/O requests.
As of this commit this applies only to BitmapHeapScan, because that's
the only scan node using a ReadStream and collecting instrumentation
from workers in a parallel query. Support for SeqScan and TidRangeScan,
the other scan nodes using ReadStream, will be added in subsequent
commits.
The stats are collected only when required by EXPLAIN ANALYZE, with the
IO option (disabled by default). The amount of collected statistics is
very limited, but we don't want to clutter EXPLAIN with too much data.
The IOStats struct is stored in the scan descriptor as a field, next to
other fields used by table AMs. A pointer to the field is passed to the
ReadStream, and updated directly.
It's the responsibility of the table AM to allocate the struct (e.g. in
ambeginscan) whenever the flag SO_SCAN_INSTRUMENT flag is passed to the
scan, so that the executor and ReadStream has access to it.
The collected stats are designed for ReadStream, but are meant to be
reasonably generic in case a TAM manages I/Os in different ways.
Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
from the CPU using RDTSC/RDTSCP instructions, instead of APIs like
clock_gettime() on POSIX systems.
This reduces the overhead of EXPLAIN with ANALYZE and TIMING ON. Tests showed
that the overhead on top of actual runtime when instrumenting queries moving
lots of rows through the plan can be reduced from 2x as slow to 1.2x as slow
compared to the actual runtime. More complex workloads such as TPCH queries
have also shown ~20% gains when instrumented compared to before.
To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when reliable, for example
when running on modern Intel CPUs, or when running on Linux and the system
clocksource is reported as "tsc". The use of the operating system clock source
can be enforced by setting "system", or on x86-64 architectures the use of TSC
can be enforced by explicitly setting "tsc".
In order to use the TSC the frequency is first determined by use of CPUID, and
if not available, by running a short calibration loop at program start,
falling back to the system clock source if TSC values are not stable.
Note, that we split TSC usage into the RDTSC CPU instruction which does not
wait for out-of-order execution (faster, less precise) and the RDTSCP
instruction, which waits for outstanding instructions to retire. RDTSCP is
deemed to have little benefit in the typical InstrStartNode() /
InstrStopNode() use case of EXPLAIN, and can be up to twice as slow. To
separate these use cases, the new macro INSTR_TIME_SET_CURRENT_FAST() is
introduced, which uses RDTSC.
The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed to be
used when precision is more important than performance. When the system timing
clock source is used both of these macros instead utilize the system
APIs (clock_gettime / QueryPerformanceCounter) like before.
Additional users of interval timing, such as track_io_timing and
track_wal_io_timing could also benefit from being converted to use
INSTR_TIME_SET_CURRENT_FAST() but are left for future changes.
Author: Lukas Fittl <lukas@fittl.com>
Author: Andres Freund <andres@anarazel.de>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an earlier version)
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com> (in an earlier version)
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (in an earlier version)
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> (in an earlier version)
Discussion: https://postgr.es/m/20200612232810.f46nbqkdhbutzqdg@alap3.anarazel.de
This adds additional x86 specific CPUID checks for flags needed for
determining whether the Time-Stamp Counter (TSC) is usable on a given system,
as well as a helper function to retrieve the TSC frequency from CPUID.
This is intended for a future patch that will utilize the TSC to lower the
overhead of timing instrumentation.
In passing, always make pg_cpuid_subleaf reset the variables used for its
result, to avoid accidentally using stale results if __get_cpuid_count errors
out and the caller doesn't check for it.
Author: Lukas Fittl <lukas@fittl.com>
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: John Naylor <john.naylor@postgresql.org>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> (in an earlier version)
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
The timing infrastructure (INSTR_* macros) measures time elapsed using
clock_gettime() on POSIX systems, which returns the time as nanoseconds,
and QueryPerformanceCounter() on Windows, which is a specialized timing
clock source that returns a tick counter that needs to be converted to
nanoseconds using the result of QueryPerformanceFrequency().
This conversion currently happens ad-hoc on Windows, e.g. when calling
INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
invocation, despite the frequency being stable after program start,
incurring unnecessary overhead. It also causes a fractured implementation
where macros are defined differently between platforms.
To ease code readability, and prepare for a future change that intends
to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
new pg_ticks_to_ns() / pg_ns_to_ticks() functions that get called from
INSTR_* macros on all platforms.
These functions rely on a separately initialized ticks_per_ns_scaled
value, that represents the conversion ratio. This value is initialized
from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
POSIX systems, which results in the ticks being treated as nanoseconds.
Other architectures always directly return the original ticks.
To support this, pg_initialize_timing() is introduced, and is now
mandatory for both the backend and any frontend programs to call before
utilizing INSTR_* macros.
In passing, fix variable names in comment documenting INSTR_TIME_ADD_NANOSEC().
Author: Lukas Fittl <lukas@fittl.com>
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
OAuth validators can already use custom GUCs to configure behavior
globally. But we currently provide no ability to adjust settings for
individual HBA entries, because the original design focused on a world
where a provider covered a "single audience" of users for one database
cluster. This assumption does not apply to multitenant use cases, where
a single validator may be controlling access for wildly different user
groups.
To improve this use case, add two new API calls for use by validator
callbacks: RegisterOAuthHBAOptions() and GetOAuthHBAOption().
Registering options "foo" and "bar" allows a user to set "validator.foo"
and "validator.bar" in an oauth HBA entry. These options are stringly
typed (syntax validation is solely the responsibility of the defining
module), and names are restricted to a subset of ASCII to avoid tying
our hands with future HBA syntax improvements.
Unfortunately, we can't check the custom option names during a reload of
the configuration, like we do with standard HBA options, without
requiring all validators to be loaded via shared_preload_libraries.
(I consider this to be a nonstarter: most validators should probably use
session_preload_libraries at most, since requiring a full restart just
to update authentication behavior will be unacceptable to many users.)
Instead, the new validator.* options are checked against the registered
list at connection time.
Multiple alternatives were proposed and/or prototyped, including
extending the GUC system to allow per-HBA overrides, joining forces with
recent refactoring work on the reloptions subsystem, and giving the
ability to customize HBA options to all PostgreSQL extensions. I
personally believe per-HBA GUC overrides are the best option, because
several existing GUCs like authentication_timeout and pre_auth_delay
would fit there usefully. But the recent addition of SNI per-host
settings in 4f433025f indicates that a more general solution is needed,
and I expect that to take multiple releases' worth of discussion.
This compromise patch, then, is intentionally designed to be an
architectural dead end: simple to describe, cheap to maintain, and
providing just enough functionality to let validators move forward for
PG19. The hope is that it will be replaced in the future by a solution
that can handle per-host, per-HBA, and other per-context configuration
with the same functionality that GUCs provide today. In the meantime,
the bulk of the code in this patch consists of strict guardrails on the
simple API, to try to ensure that we don't have any reason to regret its
existence during its unknown lifespan.
I owe particular thanks here to Zsolt Parragi, who prototyped several
approaches that guided the final design.
Suggested-by: Zsolt Parragi <zsolt.parragi@percona.com>
Suggested-by: VASUKI M <vasukianand0119@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAN4CZFM3b8u5uNNNsY6XCya257u%2BDofms3su9f11iMCxvCacag%40mail.gmail.com
Add a new GUC max_repack_replication_slots, which lets the user reserve
some additional replication slots for concurrent repack (and only
concurrent repack). With this, the user doesn't have to worry about
changing the max_replication_slots in order to cater for use of
concurrent repack.
(We still use the same pool of bgworkers though, but that's less
commonly a problem than slots.)
Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Discussion: https://postgr.es/m/202604012148.nnnmyxxrr6nh@alvherre.pgsql
When a backend is terminated via pg_terminate_backend() or an external
SIGTERM, the error message now includes the sender's PID and UID as
errdetail, making it easier to identify the source of unexpected
terminations in multi-user environments.
On platforms that support SA_SIGINFO (Linux, FreeBSD, and most modern
Unix systems), the signal handler captures si_pid and si_uid from the
siginfo_t structure. On platforms without SA_SIGINFO, the detail is
simply omitted.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Chao Li <1356863904@qq.com>
Discussion: https://postgr.es/m/CAKZiRmyrOWovZSdixpLd3PGMQXuQL_zw2Ght5XhHCkQ1uDsxjw@mail.gmail.com
Allocates shared bitmap table scan instrumentation for all parallel
scans. Previously, the instrumentation was only allocated for
parallel-aware scans, other bitmap heap scans in the parallel query had
no shared instrumentation and EXPLAIN didn't report exact/lossy pages.
This affected cases like scans on the outside of a parallel join or
queries run with debug_parallel_query=regress.
Fixed by allocating a separate DSM chunk for shared instrumentation and
doing so regardless of parallel-awareness. The instrumentation is
allocated in its own DSM chunk, separate from ParallelBitmapHeapState.
Report an initial patch by me. The approach with a separate DSM was
proposed and implemented by Melanie.
Not backpatched. The issue affects Postgres 18 (since 5a1e6df3b8), but
having multiple DSM chunks is possible only since dd78e69cfc. If we
decide to fix this in backbranches too, it will need to be done in a
less invasive way.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
By default, the logical decoding assumes access to shared catalogs, so
the snapshot builder needs to consider cluster-wide XIDs during startup.
That in turn means that, if any transaction is already running (and has
XID assigned), the snapshot builder needs to wait for its completion, as
it does not know if that transaction performed catalog changes earlier.
A possible problem with this concept is that if REPACK (CONCURRENTLY) is
running in some database, backends running the same command in other
databases get stuck until the first one has committed. Thus only a
single backend in the cluster can run REPACK (CONCURRENTLY) at any time.
Likewise, REPACK (CONCURRENTLY) can block walsenders starting on behalf
of subscriptions throughout the cluster.
This patch adds a new option to logical replication output plugin, to
declare that it does not use shared catalogs (i.e. catalogs that can be
changed by transactions running in other databases in the cluster). In
that case, no snapshot the backend will use during the decoding needs to
contain information about transactions running in other databases. Thus
the snapshot builder only needs to wait for completion of transactions
in the current database.
Currently we only use this option in the REPACK background worker. It
could possibly be used in the plugin for logical replication too,
however that would need thorough analysis of that plugin.
Bump WAL version number, due to a new field in xl_running_xacts.
Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/90475.1775218118@localhost
Remove NULLs from the array first, and use qsort to deduplicate only
the non-NULL items. This simplifies the comparison function. Also
replace qsort_arg() with a templated version so that the comparison
function can be inlined. These changes make ginExtractEntries() a
little faster especially for simple datatypes like integers.
Author: David Geier <geidav.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/6d16b6bd-a1ff-4469-aefb-a1c8274e561a@iki.fi
Commit 5e13b0f24 used a .c file for a file containing a code fragment,
to avoid adding an exception to headerscheck. That turned out to be
too clever, since it meant installation didn't happen by the usual
mechanism. Make it look like a normal header and add the requisite
exception.
Bug: #19450
Reported-by: RekGRpth <rekgrpth@gmail.com>
Discussion: https://postgr.es/m/19450-bb0612c50c6786e5@postgresql.org
This commit changes the post_parse_analyze_hook_type() hook to take a
const JumbleState, to tell external modules that they are not allowed to
touch the JumbleState that has been compiled by the core code. This
fixes a pretty old problem with pg_stat_statements, that had always the
idea of modifying the lengths of the constants stored in the
JumbleState. The previous state could confuse extensions that need to
look at a JumbleState depending on the loading order, if
pg_stat_statements is part of the stack loaded.
Another piece included in this commit is the move of the routine
fill_in_constant_lengths() to queryjumblefuncs.c, to give an option to
extensions to compile the lengths of the constants, if necessary. I was
surprised by the number of external code that carries a copy of this
routine (see the thread for details). Previously, this routine modified
JumbleState. It now copies the set of LocationLens from JumbleState,
and fills the constant lengths for separate use.
pg_stat_statements is updated to use the new ComputeConstantLengths().
JumbleState is now marked with a const in the module, where relevant.
Author: Sami Imseih <samimseih@gmail.com>
Co-authored-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
Previously, parallel index and index-only scans packed the parallel scan
descriptor and shared instrumentation (for EXPLAIN ANALYZE) into a
single DSM allocation. Since scans may be instrumented without being
parallel-aware, and vice versa, using separate DSM chunks -- each with
its own TOC key -- is cleaner. A future commit will extend this pattern
to other scan node types.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
This view contains one row for each table in the current database,
showing the current autovacuum scores for that specific table. It
also shows whether autovacuum would vacuum or analyze the table.
Bumps catversion.
Author: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
When this flag is specified, REPACK no longer acquires access-exclusive
lock while the new copy of the table is being created; instead, it
creates the initial copy under share-update-exclusive lock only (same as
vacuum, etc), and it follows an MVCC snapshot; it sets up a replication
slot starting at that snapshot, and uses a concurrent background worker
to do logical decoding starting at the snapshot to populate a stash of
concurrent data changes. Those changes can then be re-applied to the
new copy of the table just before swapping the relfilenodes.
Applications can continue to access the original copy of the table
normally until just before the swap, which is the only point at which
the access-exclusive lock is needed.
There are some loose ends in this commit:
1. concurrent repack needs its own replication slot in order to apply
logical decoding, which are a scarce resource and easy to run out of.
2. due to the way the historic snapshot is initially set up, only one
REPACK process can be running at any one time on the whole system.
3. there's a danger of deadlocking (and thus abort) due to the lock
upgrade required at the final phase.
These issues will be addressed in upcoming commits.
The design and most of the code are by Antonin Houska, heavily based on
his own pg_squeeze third-party implementation.
Author: Antonin Houska <ah@cybertec.at>
Co-authored-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Noriyoshi Shinoda <noriyoshi.shinoda@hpe.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Discussion: https://postgr.es/m/5186.1706694913@antos
Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql
transformCreateSchemaStmtElements has always believed that it is
supposed to re-order the subcommands of CREATE SCHEMA into a safe
execution order. However, it is nowhere near being capable of doing
that correctly. Nor is there reason to think that it ever will be,
or that that is a well-defined requirement. (The SQL standard does
say that it should be possible to do foreign-key forward references
within CREATE SCHEMA, but it's not clear that the text requires
anything more than that.) Moreover, the problem will get worse as
we add more subcommand types. Let's just drop the whole idea and
execute the commands in the order given, which seems like a much
less astonishment-prone definition anyway. The foreign-key issue
will be handled in a follow-up patch.
This will result in a release-note-worthy incompatibility,
which is that forward references like
CREATE SCHEMA myschema
CREATE VIEW myview AS SELECT * FROM mytable
CREATE TABLE mytable (...);
used to work and no longer will. Considering how many closely
related variants never worked, this isn't much of a loss.
Along the way, pass down a ParseState so that we can provide an
error cursor for "wrong schema name" and related errors, and fix
transformCreateSchemaStmtElements so that it doesn't scribble
on the parsetree passed to it.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us
Previously, autovacuum always disabled parallel vacuum regardless of
the table's index count or configuration. This commit enables
autovacuum workers to use parallel index vacuuming and index cleanup,
using the same parallel vacuum infrastructure as manual VACUUM.
Two new configuration options control the feature. The GUC
autovacuum_max_parallel_workers sets the maximum number of parallel
workers a single autovacuum worker may launch; it defaults to 0,
preserving existing behavior unless explicitly enabled. The per-table
storage parameter autovacuum_parallel_workers provides per-table
limits. A value of 0 disables parallel vacuum for the table, a
positive value caps the worker count (still bounded by the GUC), and
-1 (the default) defers to the GUC.
To handle cases where autovacuum workers receive a SIGHUP and update
their cost-based vacuum delay parameters mid-operation, a new
propagation mechanism is added to vacuumparallel.c. The leader stores
its effective cost parameters in a DSM segment. Parallel vacuum
workers poll for changes in vacuum_delay_point(); if an update is
detected, they apply the new values locally via VacuumUpdateCosts().
A new test module, src/test/modules/test_autovacuum, is added to
verify that parallel autovacuum workers are correctly launched and
that cost-parameter updates are propagated as expected.
The patch was originally proposed by Maxim Orlov, but the
implementation has undergone significant architectural changes
since then during the review process.
Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: zengman <zengman@halodbtech.com>
Discussion: https://postgr.es/m/CACG=ezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw=cw@mail.gmail.com
CLUSTER is no longer the favored way to invoke this functionality, and
the code is about to shift its focus to the REPACK more ambitiously.
Rename the file to avoid leaving an unnecessary historical artifact
around.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/202603271635.owyhm7btgoic@alvherre.pgsql
It would be useful to be able to tell auto_explain to set a custom
EXPLAIN option, but it would be bad if it tried to do so and the
option name or value wasn't valid, because then every query would fail
with a complaint about the EXPLAIN option. So add a guc_check_handler
that auto_explain will be able to use to only try to set option
name/value/type combinations that have been determined to be legal,
and to emit useful messages about ones that aren't.
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com
Previously, this logic was embedded within SplitIdentifierString,
SplitDirectoriesString, and SplitGUCList. Factoring it out saves
a bit of duplicated code, and also makes it available to extensions
that might want to do similar things without necessarily wanting to
do exactly the same thing.
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com
Some compilers didn't like the empty initializer when compiled without
USE_INJECTION_POINTS. Per buildfarm member 'drongo', using Visual
Studio 2019.
Author: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/adNHcBVJO5gIOp1l@paquier.xyz
Previously, one LWLock was used for each lock type, adding complexity
without an observable performance benefit as data is gathered only for
paths involving lock waits, at least currently. This commit replaces
the per-type set of LWLocks with a single LWLock protecting the stats
data of all the lock types, like the stats kinds for SLRU or WAL. A
good chunk of the callbacks get simpler thanks to this change.
The previous approach also had one bug in the flush callback when nowait
was called with "true": a backend iterating over all entries could
successfully flush some entries while skipping others due to contention,
then unconditionally reset the pending data. This would cause some
stats data loss.
Oversight in 4019f725f5.
Reported-by: Tomas Vondra <tomas@vondra.me>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/1af63e6d-16d5-4d5b-9b03-11472ef1adf9@vondra.me
Previously, during shutdown, walsenders always waited until all pending data
was replicated to receivers. This ensures sender and receiver stay in sync
after shutdown, which is important for physical replication switchovers,
but it can significantly delay shutdown. For example, in logical replication,
if apply workers are blocked on locks, walsenders may wait until those locks
are released, preventing shutdown from completing for a long time.
This commit introduces a new GUC, wal_sender_shutdown_timeout,
which specifies the maximum time a walsender waits during shutdown for all
pending data to be replicated. When set, shutdown completes once all data is
replicated or the timeout expires. A value of -1 (the default) disables
the timeout.
This can reduce shutdown time when replication is slow or stalled. However,
if the timeout is reached, the sender and receiver may be left out of sync,
which can be problematic for physical replication switchovers.
Author: Andrey Silitskiy <a.silitskiy@postgrespro.ru>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Ronan Dunklau <ronan@dunklau.fr>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com