Due to the recent changes to use a share-exclusive mode for setting hint bits
and for flushing pages - instead of using share mode as before - a buffer
cannot be dirtied while the flush is ongoing. The reason we needed
JUST_DIRTIED was to handle the case where the buffer was dirtied while IO was
ongoing - which is not possible anymore.
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
On platforms where we can read or write the whole LSN atomically, we do
not need to lock the buffer header to prevent torn LSNs. We can do this
only on platforms with PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY, and when the
pd_lsn field is properly aligned.
For historical reasons the PageXLogRecPtr was defined as a struct with
two uint32 fields. This replaces it with a single uint64 value, to make
the intent clearer. To prevent issues with weak typedefs the value is
still wrapped in a struct.
This also adjusts heapfuncs() in pageinspect, to ensure proper alignment
when reading the LSN from a page on alignment-sensitive hardware.
Idea by Andres Freund. Initial patch by Andreas Karlsson, improved by
Peter Geoghegan. Minor tweaks by me.
Author: Andreas Karlsson <andreas@proxel.se>
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/b6610c3b-3f59-465a-bdbb-8e9259f0abc4@proxel.se
This fixes two bugs in commit 1887d822f1.
First, if we are using the fallback C++ implementation of typeof, then
we need to include the C++ header <type_traits> for
std::remove_reference_t. This header is also likely to be used for
other C++ implementations of type tricks, so we'll put it into the
global includes.
Second, for the case that the C compiler supports typeof in a spelling
that is not "typeof" (for example, __typeof__), then we need to #undef
typeof in the C++ section to avoid warnings about duplicate macro
definitions.
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org
At the moment hint bits can be set with just a share lock on a page (and,
until 45f658dacb, in one case even without any lock). Because of this we need
to copy pages while writing them out, as otherwise the checksum could be
corrupted.
The need to copy the page is problematic to implement AIO writes:
1) Instead of just needing a single buffer for a copied page we need one for
each page that's potentially undergoing I/O
2) To be able to use the "worker" AIO implementation the copied page needs to
reside in shared memory
It also causes problems for using unbuffered/direct-IO, independent of AIO:
Some filesystems, raid implementations, ... do not tolerate the data being
written out to change during the write. E.g. they may compute internal
checksums that can be invalidated by concurrent modifications, leading e.g. to
filesystem errors (as the case with btrfs).
It also just is plain odd to allow modifications of buffers that are just
share locked.
To address these issues, this commit changes the rules so that modifications
to pages are not allowed anymore while holding a share lock. Instead the new
share-exclusive lock (introduced in fcb9c977aa) allows at most one backend to
modify a buffer while other backends have the same page share locked. An
existing share-lock can be upgraded to a share-exclusive lock, if there are no
conflicting locks. For that BufferBeginSetHintBits()/BufferFinishSetHintBits()
and BufferSetHintBits16() have been introduced.
To prevent hint bits from being set while the buffer is being written out,
writing out buffers now requires a share-exclusive lock.
The use of share-exclusive to gate setting hint bits means that from now on
only one backend can set hint bits at a time. To allow multiple backends to
set hint bits would require more complicated locking: For setting hint bits
we'd need to store the count of backends currently setting hint bits and we
would need another lock-level for I/O conflicting with the lock-level to set
hint bits. Given that the share-exclusive lock for setting hint bits is only
held for a short time, that backends would often just set the same hint bits
and that the cost of occasionally not setting hint bits in hotly accessed
pages is fairly low, this seems like an acceptable tradeoff.
The biggest change to adapt to this is in heapam. To avoid performance
regressions for sequential scans that need to set a lot of hint bits, we need
to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are
set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the
new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint
bits on a page have been set. Conversely, to avoid regressions in cases where
we can't set hint bits in bulk (because we're looking only at individual
tuples), use BufferSetHintBits16() when setting hint bits without batching.
Several other places also need to be adapted, but those changes are
comparatively simpler.
After this we do not need to copy buffers to write them out anymore. That
change is done separately however.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
Previously WAL records that froze tuples used OldestXmin as the snapshot
conflict horizon, or the visibility cutoff if the page would become
all-frozen. Both are newer than (or equal to) the newst XID actually
frozen on the page.
Track the newest XID that will be frozen and use that as the snapshot
conflict horizon instead. This yields an older horizon resulting in
fewer query cancellations on standbys.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com
REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single
command. Because this functionality is completely different from
regular VACUUM, having it separate from VACUUM makes it easier for users
to understand; as for CLUSTER, the term is heavily overloaded in the
IT world and even in Postgres itself, so it's good that we can avoid it.
We retain those older commands, but de-emphasize them in the
documentation, in favor of REPACK; the difference between VACUUM FULL
and CLUSTER (namely, the fact that tuples are written in a specific
ordering) is neatly absorbed as two different modes of REPACK.
This allows us to introduce further functionality in the future that
works regardless of whether an ordering is being applied, such as (and
especially) a concurrent mode.
Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/82651.1720540558@antos
Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql
Up until now, the only way for a loadable module to disable the use of a
particular index was to use build_simple_rel_hook (or, previous to
yesterday's commit, get_relation_info_hook) to remove it from the index
list. While that works, it has some disadvantages. First, the index
becomes invisible for all purposes, and can no longer be used for
optimizations such as self-join elimination or left join removal, which
can severely degrade the resulting plan.
Second, if the module attempts to compel the use of a certain index
by removing all other indexes from the index list and disabling
other scan types, but the planner is unable to use the chosen index
for some reason, it will fall back to a sequential scan, because that
is only disabled, whereas the other indexes are, from the planner's
point of view, completely gone. While this situation ideally shouldn't
occur, it's hard for a loadable module to be completely sure whether
the planner will view a certain index as usable for a certain query.
If it isn't, it may be better to fall back to a scan using a disabled
index rather than falling back to an also-disabled sequential scan.
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYS4ZCVAF2jTce%3DbMP0Oq_db_srocR4cZyO0OBp9oUoGg%40mail.gmail.com
For a long time, PostgreSQL has had a get_relation_info_hook which
plugins can use to editorialize on the information that
get_relation_info obtains from the catalogs. However, this hook is
only called for baserels of type RTE_RELATION, and there is
potential utility in a similar call back for other types of
RTEs. This might have had utility even before commit
4020b370f2 added pgs_mask to
RelOptInfo, but it certainly has utility now.
So, move the callback up one level, deleting get_relation_info_hook and
adding build_simple_rel_hook instead. The new callback is called just
slightly later than before and with slightly different arguments, but it
should be fairly straightforward to adjust existing code that currently
uses get_relation_info_hook: the values previously available as
relationObjectId and inhparent are now available via rte->relid and
rte->inh, and calls where rte->rtekind != RTE_RELATION can be ignored if
desired.
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYg8uUWyco7Pb3HYLMBRQoO6Zh9hwgm27V39Pb6Pdf%3Dug%40mail.gmail.com
Previously, the comments stated that there was no purpose to considering
startup cost for partial paths, but this is not the case: it's perfectly
reasonable to want a fast-start path for a plan that involves a LIMIT
(perhaps over an aggregate, so that there is enough data being processed
to justify parallel query but yet we don't want all the result rows).
Accordingly, rewrite add_partial_path and add_partial_path_precheck to
consider startup costs. This also fixes an independent bug in
add_partial_path_precheck: commit e222534679
failed to update it to do anything with the new disabled_nodes field.
That bug fix is formally separate from the rest of this patch and could
be committed separately, but I think it makes more sense to fix both
issues together, because then we can (as this commit does) just make
add_partial_path_precheck do the cost comparisons in the same way as
compare_path_costs_fuzzily, which hopefully reduces the chances of
ending up with something that's still incorrect.
This patch is based on earlier work on this topic by Tomas Vondra,
but I have rewritten a great deal of it.
Co-authored-by: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Tomas Vondra <tomas@vondra.me>
Discussion: http://postgr.es/m/CA+TgmobRufbUSksBoxytGJS1P+mQY4rWctCk-d0iAUO6-k9Wrg@mail.gmail.com
When I (rhaas) wrote the WAL summarizer code, I incorrectly believed
that XLOG_SMGR_TRUNCATE truncates all forks to the same length. In
fact, what other parts of the code do is compute the truncation length
for the FSM and VM forks from the truncation length used for the main
fork. But, because I was confused, I coded the WAL summarizer to set the
limit block for the VM fork to the same value as for the main fork.
(Incremental backup always copies FSM forks in full, so there is no
similar issue in that case.)
Doing that doesn't directly cause any data corruption, as far as I can
see. However, it does create a serious risk of consuming a large amount
of extra disk space, because pg_combinebackup's reconstruct.c believes
that the reconstructed file should always be at least as long as the
limit block value. We might want to be smarter about that at some point
in the future, because it's always safe to omit all-zeroes blocks at the
end of the last segment of a relation, and doing so could save disk
space, but the current algorithm will rarely waste enough disk space to
worry about unless we believe that a relation has been truncated to a
length much longer than its actual length on disk, which is exactly what
happens as a result of the problem mentioned in the previous paragraph.
To fix, create a new visibilitymap helper function and use it to include
the right limit block in the summary files. Incremental backups taken
with existing summary files will still have this issue, but this should
improve the situation going forward.
Diagnosed-by: Oleg Tkachenko <oatkachenko@gmail.com>
Diagnosed-by: Amul Sul <sulamul@gmail.com>
Discussion: http://postgr.es/m/CAAJ_b97PqG89hvPNJ8cGwmk94gJ9KOf_pLsowUyQGZgJY32o9g@mail.gmail.com
Discussion: http://postgr.es/m/6897DAF7-B699-41BF-A6FB-B818FCFFD585%40gmail.com
Backpatch-through: 17
pg_dumpall is missing checks for some conflicting options,
including those passed through to pg_dump. To fix, introduce a
new function that checks whether mutually exclusive options are
set, and use that in pg_dumpall. A similar change could likely be
made for pg_dump and pg_restore, but that is left as a future
exercise.
This is arguably a bug fix, but since this might break existing
scripts, no back-patch for now.
Author: Jian He <jian.universality@gmail.com>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Wang Peng <215722532@qq.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CACJufxFf5%3DwSv2MsuO8iZOvpLZQ1-meAMwhw7JX5gNvWo5PDug%40mail.gmail.com
Until now, substitute_grouped_columns and its predecessor
check_ungrouped_columns intentionally did not cope with references
to GROUP BY expressions (anything more complex than a Var) within
subqueries of the query having GROUP BY. Because they didn't try to
match subexpressions of subqueries to the GROUP BY list, they'd drill
down to raw Vars of the grouping level and then fail with "subquery
uses ungrouped column from outer query". There have been remarkably
few complaints about this deficiency, so nobody ever did anything
about it.
The reason for not wanting to deal with it is that within a subquery,
Vars will have varlevelsup different from zero and will thus not be
equal() to the expressions seen in the outer query. We recognized
this at least as far back as 96ca8ffeb, although I think the comment
I added about it then was just documenting a pre-existing deficiency.
It looks like at the time, the solutions I considered were
(1) write a version of equal() that permits an offset in varlevelsup,
or (2) dynamically apply IncrementVarSublevelsUp at each
subexpression. (1) would require an amount of new code that seems
rather out of proportion to the benefit, while (2) would add an
exponential amount of cost to the matching process. But rethinking
it now, what seems attractive is (3) apply IncrementVarSublevelsUp to
the groupingClause list not the subexpressions, and do so only once
per subquery depth level. Then we can still use plain equal() to
check for matches, and we're not incurring cost proportional to some
power of the subquery's complexity.
This patch continues to use the old logic when the GROUP BY list is
all Vars. We could discard the special comparison logic for that and
always do it the more general way, but that would be a good deal
slower. (Micro-benchmarking just parse analysis suggests it's about
50% slower than the Vars-only path. But we've not heard complaints
about the speed of matching within the main query, so I doubt that
applying the same matching logic within subqueries will be a problem.)
The lack of complaints suggests strongly that this is a very minority
use-case, so I don't want to make the typical case slower to fix it.
While testing that, I was surprised to discover a nearby bug:
GROUPING() within a subquery fails to match GROUP BY Vars that are
join alias Vars. It tries to apply flatten_join_alias_vars to make
such cases work, but that fails to work inside a subquery because
varlevelsup is wrong. Therefore, this patch invents a new entry point
flatten_join_alias_for_parser() that allows specification of a
sublevels_up offset. (It seems cleaner to give the parser its own
entry point rather than abuse the planner's conventions even further.)
While this is pretty clearly a bug fix, I'm hesitant to take the risk
of back-patching, seeing that the existing behavior has stood for so
long with so few complaints. Maybe we can reconsider once this patch
has baked awhile in master.
Reported-by: PALAYRET Jacques <jacques.palayret@meteo.fr>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/531183.1772058731@sss.pgh.pa.us
Allow CREATE SUBSCRIPTION to accept a foreign server using the SERVER
clause instead of a raw connection string using the CONNECTION clause.
* Enables a user with sufficient privileges to create a subscription
using a foreign server by name without specifying the connection
details.
* Integrates with user mappings (and other FDW infrastructure) using
the subscription owner.
* Provides a layer of indirection to manage multiple subscriptions
to the same remote server more easily.
Also add CREATE FOREIGN DATA WRAPPER ... CONNECTION clause to specify
a connection_function. To be eligible for a subscription, the foreign
server's foreign data wrapper must specify a connection_function.
Add connection_function support to postgres_fdw, and bump postgres_fdw
version to 1.3.
Bump catversion.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/61831790a0a937038f78ce09f8dd4cef7de7456a.camel@j-davis.com
wait_event.h itself includes wait_event_types.h, which is a generated
file, so it's nice that we can avoid compiling >10% of the tree just
because that file is regenerated.
To avoid breaking too many third-party modules, we now #include
utils/wait_classes.h in storage/latch.h. Then, the very common case
of doing
WaitLatch(..., PG_WAIT_EXTENSION)
continues to work by including just storage/latch.h. (I didn't try to
determine how many modules would actually break if we don't do this, but
this seems a convenient and low-impact measure.)
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/202602181214.gcmhx2vhlxzp@alvherre.pgsql
Use a different way to write StaticAssertExpr() that does not require
the GCC extension statement expressions.
For C, we put the static_assert into a struct. This appears to be a
common approach.
We still need to keep the fallback implementation to support buggy
MSVC < 19.33.
For C++, we put it into a lambda expression. (The C approach doesn't
work; it's not permitted to define a new type inside sizeof.)
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/5fa3a9f5-eb9a-4408-9baf-403d281f8b10%40eisentraut.org
This commit introduces pg_stat_recovery, that exposes at SQL level the
state of recovery as tracked by XLogRecoveryCtlData in shared memory,
maintained by the startup process. This new view includes the following
fields, that are useful for monitoring purposes on a standby, once it
has reached a consistent state (making the execution of the SQL function
possible):
- Last-successfully replayed WAL record LSN boundaries and its timeline.
- Currently replaying WAL record end LSN and its timeline.
- Current WAL chunk start time.
- Promotion trigger state.
- Timestamp of latest processed commit/abort.
- Recovery pause state.
Some of this data can already be recovered from different system
functions, but not all of it. See pg_get_wal_replay_pause_state or
pg_last_xact_replay_timestamp. This new view offers a stronger
consistency guarantee, by grabbing the recovery state for all fields
through one spinlock acquisition.
The system view relies on a new function, called pg_stat_get_recovery().
Querying this data requires the pg_read_all_stats privilege. The view
returns no rows if the node is not in recovery.
This feature originates from a suggestion I have made while discussion
the addition of a CONNECTING state to the WAL receiver's shared memory
state, because we lacked access to some of the state data. The author
has taken the time to implement it, so thanks for that.
Bump catalog version.
Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com
Discussion: https://postgr.es/m/aW13GJn_RfTJIFCa@paquier.xyz
Up to now, to create such a function, one had to make a pg_proc.dat
entry and then modify it with GRANT/REVOKE commands, which we put in
system_functions.sql. That seems a little ugly though, because it
violates the idea of having a single source of truth about the initial
contents of pg_proc, and it results in leaving dead rows in the
initial contents of pg_proc.
This patch improves matters by allowing aclitemin to work during early
bootstrap, before pg_authid has been loaded. On the same principle
that we use for early access to pg_type details, put a table of known
built-in role names into bootstrap.c, and use that in bootstrap mode.
To create a built-in function with a non-default ACL, one should write
the desired ACL list in its pg_proc.dat entry, using a simplified
version of aclitemout's notation: omit the grantor (if it is the
bootstrap superuser, which it pretty much always should be) and spell
the bootstrap superuser's name as POSTGRES, similarly to the notation
used elsewhere in src/include/catalog. This results in entries like
proacl => '{POSTGRES=X,pg_monitor=X}'
which shows that we've revoked public execute permissions and instead
granted that to pg_monitor.
In addition to fixing up pg_proc.dat entries, I got rid of some
role grants that had been stuck into system_functions.sql,
and instead put them into a new file pg_auth_members.dat;
that seems like a far less random place to put the information.
The correctness of the data changes can be verified by comparing the
initial contents of pg_proc and pg_auth_members before and after.
pg_proc should match exactly, but the OID column of pg_auth_members
will probably be different because those OIDs now get assigned a
little earlier in bootstrap. (I forced a catversion bump out of
caution, but it wasn't really necessary.)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/183292bb-4891-4c96-a3ca-e78b5e0e1358@dunslane.net
The PruneState had members called "all_visible" and "all_frozen" which
reflect not the current state of the page but the state it could be in
once pruning and freezing have been executed. These are then saved in
the PruneFreezeResult so the caller can set the VM accordingly.
Prefix the PruneState members as well as the corresponsding
PruneFreezeResult members with "set_" to clarify that they represent the
proposed state of the all-visible and all-frozen bits for a heap page in
the visibility map, not the current state.
Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
This is similar to the other page accessors in bufpage.h. It improves
readability and avoids long lines.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/BD8B69E7-26D8-4706-9164-597C6AE57812%40gmail.com
XLogRecoveryCtlData is the structure that stores the shared-memory state
of WAL recovery, including information such as promotion requests, the
timeline ID (TLI), and the LSNs of replayed records.
This refactoring is independently useful because it allows code outside
of core to access the recovery state in live. It will be used by an
upcoming patch that introduces a SQL function for querying this
information, that can be accessed on a standby once a consistent state
has been reached. This only moves code around, changing nothing
functionally.
Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com
The default value for default_toast_compression was "pglz". The main
reason for this choice is that this option is always available, pglz
code being embedded in Postgres. However, it is known that LZ4 is more
efficient than pglz: less CPU required, more compression on average. As
of this commit, the default value of default_toast_compression becomes
"lz4", if available. By switching to LZ4 as the default, users should
see natural speedups on TOAST data reads and/or writes.
Support for LZ4 in TOAST compression was added in Postgres v14, or 5
releases ago. This should be long enough to consider this feature as
stable.
While at it, quotes are removed from default_toast_compression in
postgresql.conf.sample. Quotes are not required in this case. The
in-place value replacement done by initdb if the build supports LZ4
would not use them in the postgresql.conf file added to a
freshly-initialized cluster.
Note that this is a version lighter than 7c1849311e, that included a
replacement of --with-lz4 by --without-lz4 in configure builds, forcing
a requirement for LZ4 in all environments. The buildfarm did not like
it, at all. This commit switches default_toast_compression to lz4 as
default only when --with-lz4 is defined, which should keep the buildfarm
at bay while still allowing users to benefit from LZ4 compression in
TOAST as long as the code is compiled with it.
Author: Euler Taveira <euler@eulerto.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://posgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org
This reverts commit 7c1849311e, due to the fact that more than 60% of
the buildfarm members do not have lz4 installed. As we are in the last
commit fest of the development cycle, and that it could take a couple
of weeks to stabilize things, this change is reverted for now.
This commit will be reworked in a lighter version, as
default_toast_compression's default can be changed to "lz4" without the
switch from --with-lz4 to --without-lz4. This approach will keep the
buildfarm at bay, and still allow builds to take advantage of LZ4 in
TOAST by default, as long as the code is compiled with LZ4 support.
A harder requirement based on LZ4 should be achievable at some point,
but it is going to require some work from the buildfarm owners first.
Perhaps this part could be revisited at the beginning of the next
development cycle.
Discussion: https://postgr.es/m/CAOYmi+meTT0NbLbnVqOJD5OKwCtHL86PQ+RZZTrn6umfmHyWaw@mail.gmail.com
Extend CREATE PUBLICATION ... FOR ALL TABLES to support the EXCEPT TABLE
syntax. This allows one or more tables to be excluded. The publisher will
not send the data of excluded tables to the subscriber.
To support this, pg_publication_rel now includes a prexcept column to flag
excluded relations. For partitioned tables, the exclusion is applied at
the root level; specifying a root table excludes all current and future
partitions in that tree.
Follow-up work will implement ALTER PUBLICATION support for managing these
exclusions.
Author: vignesh C <vignesh21@gmail.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
The default value for default_toast_compression was "pglz". The main
reason for this choice is that this option is always available, pglz
code being embedded in Postgres. However, it is known that LZ4 is more
efficient than pglz: less CPU required, more compression on average. As
of this commit, the default value of default_toast_compression becomes
"lz4", if available. By switching to LZ4 as the default, users should
see natural speedups on TOAST data reads and/or writes.
Support for LZ4 in TOAST compression was added in Postgres v14, or 5
releases ago. This should be long enough to consider this feature as
stable.
--with-lz4 is removed, replaced by a --without-lz4 to disable LZ4 in the
builds on an option-basis, following a practice similar to readline or
ICU. References to --with-lz4 are removed from the documentation.
While at it, quotes are removed from default_toast_compression in
postgresql.conf.sample. Quotes are not required in this case. The
in-place value replacement done by initdb if the build supports LZ4
would not use them in the postgresql.conf file added to a
freshly-initialized cluster.
For the reference, a similar switch has been done with ICU in
fcb21b3acd. Some of the changes done in this commit are consistent
with that.
Note: this is going to create some disturbance in the buildfarm, in
environments where lz4 is not installed.
Author: Euler Taveira <euler@eulerto.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://posgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org
Read stream users can now pause lookahead when no blocks are currently
available. After resuming, subsequent read_stream_next_buffer() calls
continue lookahead with the previous lookahead distance.
This is especially useful for read stream users with self-referential
access patterns (where consuming already-read buffers can produce
additional block numbers).
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJLT2JvWLEiBXMbkSSc5so_Y7%3DN%2BS2ce7npjLw8QL3d5w%40mail.gmail.com
If ON_ERROR SET_NULL is specified during COPY FROM, any data type
conversion errors will result in the affected column being set to a
null value. A column's not-null constraints are still enforced, and
attempting to set a null value in such columns will raise a constraint
violation error. This applies to a column whose data type is a domain
with a NOT NULL constraint.
Author: Jian He <jian.universality@gmail.com>
Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com>
Reviewed-by: Yugo NAGATA <nagata@sraoss.co.jp>
Reviewed-by: torikoshia <torikoshia@oss.nttdata.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CAKFQuwawy1e6YR4S%3Dj%2By7pXqg_Dw1WBVrgvf%3DBP3d1_aSfe_%2BQ%40mail.gmail.com
Commit ab355e3a88 changed how the OldestMemberMXactId array is
indexed. It's no longer indexed by synthetic dummyBackendId, but with
ProcNumber. The PGPROC entries for prepared xacts come after auxiliary
processes in the allProcs array, which rendered the calculation for
MaxOldestSlot and the indexes into the array incorrect. (The
OldestVisibleMXactId array is not used for prepared xacts, and thus
never accessed with ProcNumber's greater than MaxBackends, so this
only affects the OldestMemberMXactId array.)
As a result, a prepared xact would store its value past the end of the
OldestMemberMXactId array, overflowing into the OldestVisibleMXactId
array. That could cause a transaction's row lock to appear invisible
to other backends, or other such visibility issues. With a very small
max_connections setting, the store could even go beyond the
OldestVisibleMXactId array, stomping over the first element in the
BufferDescriptor array.
To fix, calculate the array sizes more precisely, and introduce helper
functions to calculate the array indexes correctly.
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/7acc94b0-ea82-4657-b1b0-77842cb7a60c@postgrespro.ru
Backpatch-through: 17
Calling copyObject in C++ without GNU extensions (e.g. when using
-std=c++11 instead of -std=gnu++11) fails with an error like this:
error: use of undeclared identifier 'typeof'; did you mean 'typeid'
This is due to the C compiler used to compile PostgreSQL supporting
typeof, but that function actually not being present in the C++
compiler. This fixes that by explicitely checking for typeof support
in C++, and then either use that or define typeof ourselves as:
std::remove_reference_t<decltype(x)>
According to the paper that led to adding typeof to the C standard,
that's the C++ equivalent of the C typeof:
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2927.htm#existing-decltype
Author: Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/DGPW5WCFY7WY.1IHCDNIVVT300%2540jeltef.nl
We can use either of these to implement a missing explicit_bzero().
explicit_memset() is supported on NetBSD. NetBSD hitherto didn't have
a way to implement explicit_bzero() other than the fallback variant.
memset_explicit() is the C23 standard, so we use it as first
preference. It is currently supported on:
- NetBSD 11
- FreeBSD 15
- glibc 2.43
It doesn't provide additional coverage, but as it's the new standard,
its availability will presumably grow.
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/c4701776-8d99-41da-938d-88528a3adc15%40eisentraut.org
There are no known users of this flag. The last supposed user was
pglesslog, which is the reason why this flag has been introduced in
core, based on an historical search pointing at a8d539f124.
I have mentioned that we may want to remove this flag back in 2018, due
to zero users of it in core. More recently, Noah has pointed out that
this flag is not safe to use: XLP_BKP_REMOVABLE can be set by the WAL
writer in a lock-free fashion with runningBackups > 0, meaning that some
full-page images could be required but not logged, ultimately corrupting
backups.
Bump XLOG_PAGE_MAGIC.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/20250705001628.c3.nmisch@google.com
Discussion: https://postgr.es/m/CAEze2WhiwKSoAvfUggjDeoeY0-rz9cTpfrHcqvBMmJxv-K_5DA@mail.gmail.com
We now maintain an array of booleans that indicate which features were
detected at runtime. When code wants to check for a given feature,
the array is automatically checked if it has been initialized and if
not, a single function checks all features at once.
Move all x86 feature detection to pg_cpu_x86.c, and move the CRC
function choosing logic to the file where the hardware-specific
functions are defined, consistent with more recent hardware-specific
files in src/port.
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CANWCAZbgEUFw7LuYSVeJ=Tj98R5HoOB1Ffeqk3aLvbw5rU5NTw@mail.gmail.com
heapam_scan_analyze_next_tuple() doesn't distinguish between dead and
recently dead tuples when counting them, so it doesn't need OldestXmin.
GetOldestNonRemovableTransactionId() isn't free, so removing it is a
win.
Looking at other table AMs implementing table_scan_analyze_next_tuple(),
we couldn't find one using OldestXmin either, so remove it from the
callback.
Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CALdSSPjvhGXihT_9f-GJabYU%3D_PjrFDUxYaURuTbfLyQM6TErg%40mail.gmail.com
This macro had exactly one user in InstrStartNode, and the caller can
instead use INSTR_TIME_IS_ZERO / INSTR_TIME_SET_CURRENT directly.
This supports a future change that intends to modify the time source being
used in the InstrStartNode case.
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53Pkx1bK1FB71_nBqYmzvSSXnp_MbE0ZDnU+baPJF6Ud2WDA@mail.gmail.com
This was incorrectly named "LT" for "larger than" in e5a5e0a907, but
that is against existing conventions, where "LT" means "less than".
Clarify by using "GT" for "greater than" in macro name, and add a missing
comment at the top of instr_time.h to note the macro's existence.
Reported by: Peter Smith <smithpb2250@gmail.com>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAHut%2BPut94CTpjQsqOJHdHkgJ2ZXq%2BqVSfMEcmDKLiWLW-hPfA%40mail.gmail.com
This commit includes a batch of fixes for various minor typos and
grammar mistakes, that have been proposed to the hackers mailing list
since the beginning of January.
Similar batches are planned on a bi-monthly basis depending on the
amount received, with the next one for the end of April.
There were a couple of comments in parse_relation.c
> Note: properly, lockmode should be declared LOCKMODE not int, but that
> would require importing storage/lock.h into parse_relation.h. Since
> LOCKMODE is typedef'd as int anyway, that seems like overkill.
but actually LOCKMODE has been in storage/lockdefs.h for a while,
which is intentionally a more narrow header. So we can include that
one in parse_relation.h and just use LOCKMODE normally.
An alternative would be to add a duplicate typedef into
parse_relation.h, but that doesn't seem necessary here.
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/4bcd65fb-2497-484c-bb41-83cb435eb64d%40eisentraut.org
The concerns that led us to remove AIX support in commit 0b16bb877
have now been alleviated:
1. IBM has stepped forward to provide support, including buildfarm
animal(s).
2. AIX 7.2 and later seem to be fine with large pg_attribute_aligned
requirements. Since 7.1 is now EOL anyway, we can just cease to
support it.
3. Tossing xlc support overboard seems okay as well. It's a bit
sad to drop one of the few remaining non-gcc-alike compilers, but
working around xlc's bugs and idiosyncrasies doesn't seem justified
by the theoretical portability benefits.
4. Likewise, we can stop supporting 32-bit AIX builds. This is
not so much about whether we could build such executables as that
they're too much of a pain to manage in the field, due to limited
address space available for dynamic library loading.
5. We hit on a way to manage catalog column alignment that doesn't
require continuing developer effort (see commit ecae09725).
Hence, this commit reverts 0b16bb877 and some follow-on commits
such as e6bb491bf, except for not putting back XLC support nor
the changes related to catalog column alignment.
Some other notable changes from the way things were in v16:
Prefer unnamed POSIX semaphores on AIX, rather than the default
choice of SysV semaphores.
Include /opt/freeware/lib in -Wl,-blibpath, even when it is not
mentioned anywhere in LDFLAGS.
Remove platform-specific adjustment of MEMSET_LOOP_LIMIT; maybe
that's still the right thing, but it really ought to be re-tested.
Silence compiler warnings related to getpeereid(), wcstombs_l(),
and PAM conversation procs.
Accept "libpythonXXX.a" as an okay name for the Python shared
library (but only on AIX!).
Author: Aditya Kamath <Aditya.Kamath1@ibm.com>
Author: Srirama Kucherlapati <sriram.rk@in.ibm.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CY5PR11MB63928CC05906F27FB10D74D0FD322@CY5PR11MB6392.namprd11.prod.outlook.com
Because we assume that int64 and double have the same alignment
requirement, AIX's default behavior that alignof(double) = 4 while
alignof(int64) = 8 is a headache. There are two issues:
1. We align both int8 and float8 tuple columns per ALIGNOF_DOUBLE,
which is an ancient choice that can't be undone without breaking
pg_upgrade and creating some subtle SQL-level compatibility issues
too. However, the cost of that is just some marginal inefficiency
in fetching int8 values, which can't be too awful if the platform
architects were willing to pay the same costs for fetching float8s.
So our decision is to leave that alone. This patch makes our
alignment choices the same as they were pre-v17, namely that
ALIGNOF_DOUBLE and ALIGNOF_INT64_T are whatever the compiler prefers
and then MAXIMUM_ALIGNOF is the larger of the two. (On all supported
platforms other than AIX, all three values will be the same.)
2. We need to overlay C structs onto catalog tuples, and int8 fields
in those struct declarations may not be aligned to match this rule.
In the old branches we had some annoying rules about ordering catalog
columns to avoid alignment problems, but nobody wants to resurrect
those. However, there's a better answer: make the compiler construe
those struct declarations the way we need it to by using the pack(N)
pragma. This requires no manual effort to maintain going forward;
we only have to insert the pragma into all the catalog *.h files.
(As the catalogs stand at this writing, nothing actually changes
because we've not moved any affected columns since v16; hence no
catversion bump is required. The point of this is to not have
to worry about the issue going forward.)
We did not have this option when the AIX port was first made. This
patch depends on the C99 feature _Pragma(), as well as the pack(N)
pragma which dates to somewhere around gcc 4.0, and probably doesn't
exist in xlc at all. But now that we've agreed to toss xlc support
out the window, there doesn't seem to be a reason not to go this way.
In passing, I got rid of LONGALIGN[_DOWN] along with the configure
probes for ALIGNOF_LONG. We were not using those anywhere and it
seems highly unlikely that we'd do so in future. Instead supply
INT64ALIGN[_DOWN], which isn't used either but at least could
have a good reason to be used.
Discussion: https://postgr.es/m/1127261.1769649624@sss.pgh.pa.us
This replaces some loops over word-length popcount functions with
calls to pg_popcount(). Since pg_popcount() may use a function
pointer for inputs with sizes >= a Bitmapset word, this produces a
small regression for the common one-word case in bms_num_members().
To deal with that, this commit adds an inlined fast-path for that
case. This fast-path could arguably go in pg_popcount() itself
(with an appropriate alignment check), but that is left for future
study.
Suggested-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CANWCAZY7R%2Biy%2Br9YM_sySNydHzNqUirx1xk0tB3ej5HO62GdgQ%40mail.gmail.com
This commit replaces the implementations of pg_popcount{32,64} with
branchless ones in plain C. While these new implementations do not
make use of more sophisticated population count instructions
available on some CPUs, testing indicates they perform well,
especially now that they are inlined. Newer versions of popular
compilers will automatically replace these with special
instructions if possible, anyway. A follow-up commit will replace
various loops over these functions with calls to pg_popcount(),
leaving us little reason to worry about micro-optimizing them
further.
Since this commit removes the only uses of the popcount builtins,
we can also remove the corresponding configuration checks.
Suggested-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CANWCAZY7R%2Biy%2Br9YM_sySNydHzNqUirx1xk0tB3ej5HO62GdgQ%40mail.gmail.com
conflict.h currently includes utils/timestamp.h despite only requiring
basic timestamp type definitions. This creates unnecessary overhead.
Replace the include with datatype/timestamp.h to provide the necessary
types. This change requires explicitly including utils/timestamp.h in
test_custom_fixed_stats.c, which previously relied on the indirect
inclusion.
Extracted from the larger patch by Andres Freund.
Discussion: https://postgr.es/m/aY-UE-4t7FiYgH3t@alap3.anarazel.de
On common architectures, the PGPROC struct happened to be a multiple
of 64 bytes on PG 18, but it's changed on 'master' since. There was
worry that changing the alignment might hurt performance, due to false
cacheline sharing across elements in the proc array. However, there
was no explicit alignment, so any alignment to cache lines was
accidental. Add explicit alignment to remove worry about false
sharing.
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/3dd6f70c-b94d-4428-8e75-74a7136396be@iki.fi