Commit graph

1725 commits

Author SHA1 Message Date
Noah Misch
9311fcb864 At update of non-LP_NORMAL TID, fail instead of corrupting page header.
The right mix of DDL and VACUUM could corrupt a catalog page header such
that PageIsVerified() durably fails, requiring a restore from backup.
This affects only catalogs that both have a syscache and have DDL code
that uses syscache tuples to construct updates.  One of the test
permutations shows a variant not yet fixed.

This makes !TransactionIdIsValid(TM_FailureData.xmax) possible with
TM_Deleted.  I think core and PGXN are indifferent to that.

Per bug #17821 from Alexander Lakhin.  Back-patch to v13 (all supported
versions).  The test case is v17+, since it uses INJECTION_POINT.

Discussion: https://postgr.es/m/17821-dd8c334263399284@postgresql.org
2025-01-25 11:28:18 -08:00
Tom Lane
be5db08ed3 Ensure that pg_amop/amproc entries depend on their lefttype/righttype.
Usually an entry in pg_amop or pg_amproc does not need a dependency on
its amoplefttype/amoprighttype/amproclefttype/amprocrighttype types,
because there is an indirect dependency via the argument types of its
referenced operator or procedure, or via the opclass it belongs to.
However, for some support procedures in some index AMs, the argument
types of the support procedure might not mention the column data type
at all.  Also, the amop/amproc entry might be treated as "loose" in
the opfamily, in which case it lacks a dependency on any particular
opclass; or it might be a cross-type entry having a reference to a
datatype that is not its opclass' opcintype.

The upshot of all this is that there are cases where a datatype can
be dropped while leaving behind amop/amproc entries that mention it,
because there is no path in pg_depend showing that those entries
depend on that type.  Such entries are harmless in normal activity,
because they won't get used, but they cause problems for maintenance
actions such as dropping the operator family.  They also cause pg_dump
to produce bogus output.  The previous commit put a band-aid on the
DROP OPERATOR FAMILY failure, but a real fix is needed.

To fix, add pg_depend entries showing that a pg_amop/pg_amproc entry
depends on its lefttype/righttype.  To avoid bloating pg_depend too
much, skip this if the referenced operator or function has that type
as an input type.  (I did not bother with considering the possible
indirect dependency via the opclass' opcintype; at least in the
reported case, that wouldn't help anyway.)

Probably, the reason this has escaped notice for so long is that
add-on datatypes and relevant opclasses/opfamilies are usually
packaged as extensions nowadays, so that there's no way to drop
a type without dropping the referencing opclasses/opfamilies too.
Still, in the absence of pg_depend entries there's nothing that
constrains DROP EXTENSION to drop the opfamily entries before the
datatype, so it seems possible for a DROP failure to occur anyway.

The specific case that was reported doesn't fail in v13, because
v13 prefers to attach the support procedure to the opclass not the
opfamily.  But it's surely possible to construct other edge cases
that do fail in v13, so patch that too.

Per report from Yoran Heling.  Back-patch to all supported branches.

Discussion: https://postgr.es/m/Z1MVCOh1hprjK5Sf@gmai021
2024-12-07 15:56:28 -05:00
Noah Misch
370bc77402 Unpin buffer before inplace update waits for an XID to end.
Commit a07e03fd8f changed inplace updates
to wait for heap_update() commands like GRANT TABLE and GRANT DATABASE.
By keeping the pin during that wait, a sequence of autovacuum workers
and an uncommitted GRANT starved one foreground LockBufferForCleanup()
for six minutes, on buildfarm member sarus.  Prevent, at the cost of a
bit of complexity.  Back-patch to v12, like the earlier commit.  That
commit and heap_inplace_lock() have not yet appeared in any release.

Discussion: https://postgr.es/m/20241026184936.ae.nmisch@google.com
2024-10-29 09:39:58 -07:00
Noah Misch
63f0198056 Fix data loss at inplace update after heap_update().
As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the systable_inplace_update_begin()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reported by Smolkin Grigory.  Reviewed by Nitin Motiani, (in earlier
versions) Heikki Linnakangas, and (in earlier versions) Alexander
Lakhin.

Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bG2mgqecErqR5gGGOA@mail.gmail.com
2024-09-24 15:25:22 -07:00
Melanie Plageman
06bf404cd0 Ensure vacuum removes all visibly dead tuples older than OldestXmin
If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than GlobalVisState->maybe_needed,
it will loop infinitely in lazy_scan_prune(), which compares tuples'
visibility information to OldestXmin.

Starting in version 14, which uses GlobalVisState for visibility testing
during pruning, it is possible for GlobalVisState->maybe_needed to
precede OldestXmin if maybe_needed is forced to go backward while vacuum
is running. This can happen if a disconnected standby with a running
transaction older than VacuumCutoffs->OldestXmin reconnects to the
primary after vacuum initially calculates GlobalVisState and OldestXmin.

Fix this by having vacuum always remove tuples older than OldestXmin
during pruning. This is okay because the standby won't replay the tuple
removal until the tuple is removable. Thus, the worst that can happen is
a recovery conflict.

Fixes BUG# 17257

Back-patched in versions 14-17

Author: Melanie Plageman

Reviewed-by: Noah Misch, Peter Geoghegan, Robert Haas, Andres Freund, and Heikki Linnakangas
Discussion: https://postgr.es/m/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag%40mail.gmail.com
2024-07-19 12:11:41 -04:00
Noah Misch
e352ba7b75 Remove comment about xl_heap_inplace "AT END OF STRUCT".
Commit 2c03216d83 moved the tuple data
from there to the buffer-0 data.  Back-patch to v12 (all supported
versions), the plan for the next change to this struct.

Discussion: https://postgr.es/m/20240523000548.58.nmisch@google.com
2024-06-27 19:21:10 -07:00
Tom Lane
06f81fed3c Fix insertion of SP-GiST REDIRECT tuples during REINDEX CONCURRENTLY.
Reconstruction of an SP-GiST index by REINDEX CONCURRENTLY may
insert some REDIRECT tuples.  This will typically happen in
a transaction that lacks an XID, which leads either to assertion
failure in spgFormDeadTuple or to insertion of a REDIRECT tuple
with zero xid.  The latter's not good either, since eventually
VACUUM will apply GlobalVisTestIsRemovableXid() to the zero xid,
resulting in either an assertion failure or a garbage answer.

In practice, since REINDEX CONCURRENTLY locks out index scans
till it's done, it doesn't matter whether it inserts REDIRECTs
or PLACEHOLDERs; and likewise it doesn't matter how soon VACUUM
reduces such a REDIRECT to a PLACEHOLDER.  So in non-assert builds
there's no observable problem here, other than perhaps a little
index bloat.  But it's not behaving as intended.

To fix, remove the failing Assert in spgFormDeadTuple, acknowledging
that we might sometimes insert a zero XID; and guard VACUUM's
GlobalVisTestIsRemovableXid() call with a test for valid XID,
ensuring that we'll reduce such a REDIRECT the first time VACUUM
sees it.  (Versions before v14 use TransactionIdPrecedes here,
which won't fail on zero xid, so they really have no bug at all
in non-assert builds.)

Another solution could be to not create REDIRECTs at all during
REINDEX CONCURRENTLY, making the relevant code paths treat that
case like index build (which likewise knows that no concurrent
index scans can be happening).  That would allow restoring the
Assert in spgFormDeadTuple, but we'd still need the VACUUM change
because redirection tuples with zero xid may be out there already.
But there doesn't seem to be a nice way for spginsert() to tell that
it's being called in REINDEX CONCURRENTLY without some API changes,
so we'll leave that as a possible future improvement.

In HEAD, also rename the SpGistState.myXid field to redirectXid,
which seems less misleading (since it might not in fact be our
transaction's XID) and is certainly less uninformatively generic.

Per bug #18499 from Alexander Lakhin.  Back-patch to all supported
branches.

Discussion: https://postgr.es/m/18499-8a519c280f956480@postgresql.org
2024-06-17 14:30:59 -04:00
Alexander Korotkov
bafad43c37 Fix the parameters order for TableAmRoutine.relation_copy_for_cluster()
Specify OldTable first, NewTable second as used by
table_relation_copy_for_cluster() and as implemented in
heapam_relation_copy_for_cluster().

Backpatch to PostgreSQL 12, where TableAmRoutine was introduced.

Discussion: https://postgr.es/m/ME3P282MB3166860D4911AE82F92DF7C5B63F2%40ME3P282MB3166.AUSP282.PROD.OUTLOOK.COM
Author: Japin Li
Reviewed-by: Pavel Borisov
Backpatch-through: 12
2024-04-03 21:42:38 +03:00
Michael Paquier
7ce65c6f72 Add try_index_open(), conditional variant of index_open()
try_index_open() is able to open an index if its relkind fits, except
that it would return NULL instead of generated an error if the relation
does not exist.  This new routine will be used by an upcoming patch to
make REINDEX on partitioned relations more robust when an index in a
partition tree is dropped.

Extracted from a larger patch by the same author.

Author: Fei Changhong
Discussion: https://postgr.es/m/tencent_6A52106095ACDE55333E3AD33F304C0C3909@qq.com
Backpatch-through: 14
2024-01-18 15:04:31 +09:00
Heikki Linnakangas
9fee3232a1 Fix assertions with RI triggers in heap_update and heap_delete.
If the tuple being updated is not visible to the crosscheck snapshot,
we return TM_Updated but the assertions would not hold in that case.
Move them to before the cross-check.

Fixes bug #17893. Backpatch to all supported versions.

Author: Alexander Lakhin
Backpatch-through: 12
Discussion: https://www.postgresql.org/message-id/17893-35847009eec517b5%40postgresql.org
2023-11-28 11:59:45 +02:00
Peter Geoghegan
3fa81b62e0 Fix btmarkpos/btrestrpos array key wraparound bug.
nbtree's mark/restore processing failed to correctly handle an edge case
involving array key advancement and related search-type scan key state.
Scans with ScalarArrayScalarArrayOpExpr quals requiring mark/restore
processing (for a merge join) could incorrectly conclude that an
affected array/scan key must not have advanced during the time between
marking and restoring the scan's position.

As a result of all this, array key handling within btrestrpos could skip
a required call to _bt_preprocess_keys().  This confusion allowed later
primitive index scans to overlook tuples matching the true current array
keys.  The scan's search-type scan keys would still have spurious values
corresponding to the final array element(s) -- not values matching the
first/now-current array element(s).

To fix, remember that "array key wraparound" has taken place during the
ongoing btrescan in a flag variable stored in the scan's state, and use
that information at the point where btrestrpos decides if another call
to _bt_preprocess_keys is required.

Oversight in commit 70bc5833, which taught nbtree to handle array keys
during mark/restore processing, but missed this subtlety.  That commit
was itself a bug fix for an issue in commit 9e8da0f7, which taught
nbtree to handle ScalarArrayOpExpr quals natively.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkgP3DDRJxw6DgjCxo-cu-DKrvjEv_ArkP2ctBJatDCYg@mail.gmail.com
Backpatch: 11- (all supported branches).
2023-09-28 16:29:35 -07:00
Andres Freund
d37ab378b6 hio: Take number of prior relation extensions into account
The new relation extension logic, introduced in 00d1e02be2, could lead to
slowdowns in some scenarios. E.g., when loading narrow rows into a table using
COPY, the caller of RelationGetBufferForTuple() will only request a small
number of pages. Without concurrency, we just extended using pwritev() in that
case. However, if there is *some* concurrency, we switched between extending
by a small number of pages and a larger number of pages, depending on the
number of waiters for the relation extension logic.  However, some
filesystems, XFS in particular, do not perform well when switching between
extending files using fallocate() and pwritev().

To avoid that issue, remember the number of prior relation extensions in
BulkInsertState and extend more aggressively if there were prior relation
extensions. That not just avoids the aforementioned slowdown, but also leads
to noticeable performance gains in other situations, primarily due to
extending more aggressively when there is no concurrency. I should have done
it this way from the get go.

Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=Kp6mszNGK3bq9yRN6g@mail.gmail.com
Backpatch: 16-, where the new relation extension code was added
2023-08-14 09:54:38 -07:00
Peter Geoghegan
d088ba5a5a nbtree: Allocate new pages in separate function.
Split nbtree's _bt_getbuf function is two: code that read locks or write
locks existing pages remains in _bt_getbuf, while code that deals with
allocating new pages is moved to a new, dedicated function called
_bt_allocbuf.  This simplifies most _bt_getbuf callers, since it is no
longer necessary for them to pass a heaprel argument.  Many of the
changes to nbtree from commit 61b313e4 can be reverted.  This minimizes
the divergence between HEAD/PostgreSQL 16 and earlier release branches.

_bt_allocbuf replaces the previous nbtree idiom of passing P_NEW to
_bt_getbuf.  There are only 3 affected call sites, all of which continue
to pass a heaprel for recovery conflict purposes.  Note that nbtree's
use of P_NEW was superficial; nbtree never actually relied on the P_NEW
code paths in bufmgr.c, so this change is strictly mechanical.

GiST already took the same approach; it has a dedicated function for
allocating new pages called gistNewBuffer().  That factor allowed commit
61b313e4 to make much more targeted changes to GiST.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com
2023-06-10 14:08:25 -07:00
Tom Lane
0245f8db36 Pre-beta mechanical code beautification.
Run pgindent, pgperltidy, and reformat-dat-files.

This set of diffs is a bit larger than typical.  We've updated to
pg_bsd_indent 2.1.2, which properly indents variable declarations that
have multi-line initialization expressions (the continuation lines are
now indented one tab stop).  We've also updated to perltidy version
20230309 and changed some of its settings, which reduces its desire to
add whitespace to lines to make assignments etc. line up.  Going
forward, that should make for fewer random-seeming changes to existing
code.

Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
2023-05-19 17:24:48 -04:00
Tomas Vondra
3581cbdcd6 Fix handling of empty ranges and NULLs in BRIN
BRIN indexes did not properly distinguish between summaries for empty
(no rows) and all-NULL ranges, treating them as essentially the same
thing. Summaries were initialized with allnulls=true, and opclasses
simply reset allnulls to false when processing the first non-NULL value.
This however produces incorrect results if the range starts with a NULL
value (or a sequence of NULL values), in which case we forget the range
contains NULL values when adding the first non-NULL value.

This happens because the allnulls flag is used for two separate
purposes - to mark empty ranges (not representing any rows yet) and
ranges containing only NULL values.

Opclasses don't know which of these cases it is, and so don't know
whether to set hasnulls=true. Setting the flag in both cases would make
it correct, but it would also make BRIN indexes useless for queries with
IS NULL clauses. All ranges start empty (and thus allnulls=true), so all
ranges would end up with either allnulls=true or hasnulls=true.

The severity of the issue is somewhat reduced by the fact that it only
happens when adding values to an existing summary with allnulls=true.
This can happen e.g. for small tables (because a summary for the first
range exists for all BRIN indexes), or for tables with large fraction of
NULL values in the indexed columns.

Bulk summarization (e.g. during CREATE INDEX or automatic summarization)
that processes all values at once is not affected by this issue. In this
case the flags were updated in a slightly different way, not forgetting
the NULL values.

To identify empty ranges we use a new flag, stored in an unused bit in
the BRIN tuple header so the on-disk format remains the same. A matching
flag is added to BrinMemTuple, into a 3B gap after bt_placeholder.
That means there's no risk of ABI breakage, although we don't actually
pass the BrinMemTuple to any public API.

We could also skip storing index tuples for empty summaries, but then
we'd have to always process such ranges - even if there are no rows in
large parts of the table (e.g. after a bulk DELETE), it would still
require reading the pages etc. So we store them, but ignore them when
building the bitmap.

Backpatch to 11. The issue exists since BRIN indexes were introduced in
9.5, but older releases are already EOL.

Backpatch-through: 11
Reviewed-by: Justin Pryzby, Matthias van de Meent, Alvaro Herrera
Discussion: https://postgr.es/m/402430e4-7d9d-6cf1-09ef-464d80afff3b@enterprisedb.com
2023-05-19 01:29:44 +02:00
Michael Paquier
8961cb9a03 Fix typos in comments
The changes done in this commit impact comments with no direct
user-visible changes, with fixes for incorrect function, variable or
structure names.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/e8c38840-596a-83d6-bd8d-cebc51111572@gmail.com
2023-05-02 12:23:08 +09:00
Peter Geoghegan
50547a3fae Fix wal_consistency_checking enhanced desc output.
Recent enhancements to rmgr desc routines that made the output summarize
certain block data (added by commits 7d8219a4 and 1c453cfd) dealt with
records that lack relevant block data (and so have nothing to give a
more detailed summary of) by testing !DecodedBkpBlock.has_image.  As a
result, more detailed descriptions of block data were not output when
wal_consistency_checking was enabled.

This bug affected records with summarizable block data that also
happened to have an FPI that the REDO routine isn't supposed to apply
(FPIs used for consistency checking purposes only).  The presence of
such an FPI was incorrectly taken to indicate the absence of block data.

To fix, test DecodedBkpBlock.has_data, not !DecodedBkpBlock.has_image.
This is the exact condition that we care about, not an inexact proxy.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzm5Sc9cBg1qWV_cEBfLNJCrW9FjS-SoHVt8FLA7Ldn8yg@mail.gmail.com
2023-04-19 10:42:39 -07:00
Peter Geoghegan
06e0652750 Remove useless argument from nbtree dedup function.
_bt_dedup_pass()'s heapRel argument hasn't been needed or used since
commit cf2acaf4dc made deleting any existing LP_DEAD index tuples the
caller's responsibility.
2023-04-18 10:33:15 -07:00
David Rowley
eef231e816 Fix some typos and some incorrectly duplicated words
Author: Justin Pryzby
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/ZD3D1QxoccnN8A1V@telsasoft.com
2023-04-18 14:03:49 +12:00
David Rowley
b4dbf3e924 Fix various typos
This fixes many spelling mistakes in comments, but a few references to
invalid parameter names, function names and option names too in comments
and also some in string constants

Also, fix an #undef that was undefining the incorrect definition

Author: Alexander Lakhin
Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/d5f68d19-c0fc-91a9-118d-7c6a5a3f5fad@gmail.com
2023-04-18 13:23:23 +12:00
Peter Geoghegan
cd7cdc550c Fix incorrect comment about nbtree WAL record.
The nbtree VACUUM WAL record stores its page offset number payload in
blk 0 (just like the closely related nbtree DELETE WAL record).  Commit
ebd551f5 fixed a similar issue with the DELETE WAL record, but missed
this one.
2023-04-17 09:58:18 -07:00
Peter Geoghegan
c03c2eae0a Refine the guidelines for rmgrdesc authors.
Clarify the goals of the recently added guidelines for rmgrdesc authors:
to avoid gratuitous inconsistencies across resource managers, and to
make it reasonably easy to write a reusable custom parser.

Beyond that, the guidelines leave rmgrdesc authors with a significant
amount of leeway.  This even includes the leeway to invent custom
conventions (in cases where it's warranted).

Follow-up to commit 7d8219a4.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkbYuvwYKm-Y-72QEh6SPMQcAo9uONv+mR3bMGcu9E_Cg@mail.gmail.com
2023-04-11 15:26:24 -07:00
Peter Geoghegan
e944063294 Fix xl_heap_lock WAL record field's data type.
Make xl_heap_lock's infobits_set field of type uint8, not int8.  Using
int8 isn't appropriate given that the field just holds status bits.
This fixes an oversight in commit 0ac5ad5134.

In passing rename the nearby TransactionId field to "xmax" to make
things consistency with related records, such as xl_heap_lock_updated.

Deliberately avoid a bump in XLOG_PAGE_MAGIC.  No backpatch, either.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkCd3kOS8b7Rfxw7Mh1_6jvX=Nzo-CWR1VBTiOtVZkWHA@mail.gmail.com
2023-04-11 14:07:54 -07:00
Peter Geoghegan
5d6728e588 Fix nbtree posting list update desc output.
We cannot use the generic array_desc approach with per-tuple nbtree
posting list update metadata because array_desc can only deal with fixed
width elements (e.g., page offset numbers).  Using array_desc led to
incorrect rmgr descriptions for updates from nbtree DELETE/VACUUM WAL
records.

To fix, add specialized code to describe the update metadata as array
elements in desc output.  We now iterate over the update metadata using
an approach that matches related REDO routines.

Also stop showing the updates offset number array separately in nbtree
DELETE/VACUUM desc output.  It's redundant information, since the same
page offset numbers appear in the description of each individual update
element.  Also make some small tweaks to the way that we format arrays
in all desc routines (not just nbtree desc routines) to make arrays a
little less verbose.

Oversight in commit 1c453cfd, which enhanced the nbtree rmgr desc
routines.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkbYuvwYKm-Y-72QEh6SPMQcAo9uONv+mR3bMGcu9E_Cg@mail.gmail.com
2023-04-10 11:15:41 -07:00
Andres Freund
0fdab27ad6 Allow logical decoding on standbys
Unsurprisingly, this requires wal_level = logical to be set on the primary and
standby. The infrastructure added in 26669757b6 ensures that slots are
invalidated if the primary's wal_level is lowered.

Creating a slot on a standby waits for a xl_running_xact record to be
processed. If the primary is idle (and thus not emitting xl_running_xact
records), that can take a while.  To make that faster, this commit also
introduces the pg_log_standby_snapshot() function. By executing it on the
primary, completion of slot creation on the standby can be accelerated.

Note that logical decoding on a standby does not itself enforce that required
catalog rows are not removed. The user has to use physical replication slots +
hot_standby_feedback or other measures to prevent that. If catalog rows
required for a slot are removed, the slot is invalidated.

See 6af1793954 for an overall design of logical decoding on a standby.

Bumps catversion, for the addition of the pg_log_standby_snapshot() function.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de> (in an older version)
Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: FabrÌzio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
2023-04-08 02:20:05 -07:00
Peter Geoghegan
1c453cfd89 Show more detail in nbtree rmgr descriptions.
Show a detailed description of the page offset number arrays that appear
in certain nbtree WAL records.

Also brings nbtree desc routines in line with the guidelines established
by recent commit 7d8219a4.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de
2023-04-07 16:46:23 -07:00
Peter Geoghegan
7d8219a444 Show more detail in heapam rmgr descriptions.
Add helper functions that output arrays in a standard format, and use
the functions inside heapdesc routines.  This allows tools like
pg_walinspect to show a detailed description of the page offset number
arrays for records like PRUNE and VACUUM (unless there was an FPI).

Also document the conventions that desc routines should follow.  Only
the heapdesc routines follow the conventions for now, so they're just
guidelines for the time being.

Based on a suggestion from Andres Freund.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de
2023-04-07 16:08:52 -07:00
Michael Paquier
8fcb32db98 Add more protections in WAL record APIs against overflows
This commit adds a limit to the size of an XLogRecord at 1020MB, based
on a suggestion by Heikki Linnakangas.  This counts for the overhead
needed by the XLogReader when allocating the memory it needs to read a
record in DecodeXLogRecordRequiredSpace(), based on the record size.  An
assertion based on that is added to detect that any additions in the
XLogReader facilities would not cause any overflows.  If that's ever the
case, the upper bound allowed would need to be adjusted.

Before this, it was possible for an external module to create WAL
records large enough to be assembled but not replayable, causing
failures when replaying such WAL records on standbys.  One case
mentioned where this is possible is the in-core function
pg_logical_emit_message() (wrapper for LogLogicalMessage), that allows
to emit WAL records with an arbitrary amount of data potentially higher
than the replay limit of approximately 1GB (limit of a palloc, minus the
overhead needed by a XLogReader).

This commit is a follow-up of ffd1b6b that has added similar protections
for the block-level data.  Here, the checks are extended to the whole
record length, mainrdata_len being extended from uint32 to uint64 with
the routines registering buffer and record data still limited to uint32
to minimize the checks when assembling a record.  All the error messages
related to overflow checks are improved to provide more context about
the error happening.

Author: Matthias van de Meent
Reviewed-by: Andres Freund, Heikki Linnakangas, Michael Paquier
Discussion: https://postgr.es/m/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com
2023-04-07 10:10:17 +09:00
Andres Freund
00d1e02be2 hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.

To address this, use ExtendBufferedRelBy(), introduced in 31966b151e, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:

1) The caller tells RelationGetBufferForTuple() that it will need multiple
   pages. For now that's only used by heap_multi_insert(), see commit FIXME.

2) If there is contention on the extension lock, use the number of waiters for
   the lock as a multiplier for the number of blocks to extend by. This is
   similar to what we already did. Previously we additionally multiplied the
   numbers of waiters by 20, but with the new relation extension
   infrastructure I could not see a benefit in doing so.

Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.

One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.

This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-06 16:53:17 -07:00
Andres Freund
5279e9db8e heapam: Pass number of required pages to RelationGetBufferForTuple()
A future commit will use this information to determine how aggressively to
extend the relation by. In heap_multi_insert() we know accurately how many
pages we need once we need to extend the relation, providing an accurate lower
bound for how much to extend.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-06 16:17:16 -07:00
Peter Geoghegan
e48c817395 Recycle deleted nbtree pages more aggressively.
Commit 61b313e4 made nbtree consistently pass down a heaprel to low
level routines like _bt_getbuf().  Although this was primarily intended
as preparation for logical decoding on standbys, it also made it easy to
correct an old deficiency in how nbtree VACUUM determines whether or not
it's now safe to recycle deleted pages.

Pass the heaprel to GlobalVisTestFor() in nbtree routines that deal with
recycle safety.  nbtree now makes less pessimistic assumptions about
recycle safety within non-catalog relations.  This enhancement
complements the recycling enhancement added by commit 9dd963ae25.

nbtree remains just as pessimistic as ever when it comes to recycle
safety within indexes on catalog relations.  There is no fundamental
reason why we need to treat catalog relations differently, though.  The
behavioral inconsistency is a consequence of the way that nbtree uses
nextXID values to implement what Lanin and Shasha call "the drain
technique".  Note in particular that it has nothing to do with whether
or not index tuples might still be required for an older MVCC snapshot.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkaiDxCje0yPuH=3Uh2p1V_2pFGY==xfbZoZu7Ax_NB8g@mail.gmail.com
2023-04-03 11:31:43 -07:00
Peter Geoghegan
a349b86603 Move heaprel struct field next to index rel field.
Commit 61b313e4 added a heaprel struct member to IndexVacuumInfo, but
placed it last.  Move the heaprel struct member next to the index struct
member to improve the code's readability.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznG=TV6S9d3VA=y0vBHbXwnLs9_LLdiML=aNJuHeriwxg@mail.gmail.com
2023-04-03 11:01:11 -07:00
Alexander Korotkov
2b65bf046d Revert 11470f544e
Discussion: https://postgr.es/m/20230323003003.plgaxjqahjgkuxrk%40awork3.anarazel.de
2023-04-03 16:54:31 +03:00
Andres Freund
6af1793954 Add info in WAL records in preparation for logical slot conflict handling
This commit only implements one prerequisite part for allowing logical
decoding. The commit message contains an explanation of the overall design,
which later commits will refer back to.

Overall design:

1. We want to enable logical decoding on standbys, but replay of WAL
from the primary might remove data that is needed by logical decoding,
causing error(s) on the standby. To prevent those errors, a new replication
conflict scenario needs to be addressed (as much as hot standby does).

2. Our chosen strategy for dealing with this type of replication slot
is to invalidate logical slots for which needed data has been removed.

3. To do this we need the latestRemovedXid for each change, just as we
do for physical replication conflicts, but we also need to know
whether any particular change was to data that logical replication
might access. That way, during WAL replay, we know when there is a risk of
conflict and, if so, if there is a conflict.

4. We can't rely on the standby's relcache entries for this purpose in
any way, because the startup process can't access catalog contents.

5. Therefore every WAL record that potentially removes data from the
index or heap must carry a flag indicating whether or not it is one
that might be accessed during logical decoding.

Why do we need this for logical decoding on standby?

First, let's forget about logical decoding on standby and recall that
on a primary database, any catalog rows that may be needed by a logical
decoding replication slot are not removed.

This is done thanks to the catalog_xmin associated with the logical
replication slot.

But, with logical decoding on standby, in the following cases:

- hot_standby_feedback is off
- hot_standby_feedback is on but there is no a physical slot between
  the primary and the standby. Then, hot_standby_feedback will work,
  but only while the connection is alive (for example a node restart
  would break it)

Then, the primary may delete system catalog rows that could be needed
by the logical decoding on the standby (as it does not know about the
catalog_xmin on the standby).

So, it’s mandatory to identify those rows and invalidate the slots
that may need them if any. Identifying those rows is the purpose of
this commit.

Implementation:

When a WAL replay on standby indicates that a catalog table tuple is
to be deleted by an xid that is greater than a logical slot's
catalog_xmin, then that means the slot's catalog_xmin conflicts with
the xid, and we need to handle the conflict. While subsequent commits
will do the actual conflict handling, this commit adds a new field
isCatalogRel in such WAL records (and a new bit set in the
xl_heap_visible flags field), that is true for catalog tables, so as to
arrange for conflict handling.

The affected WAL records are the ones that already contain the
snapshotConflictHorizon field, namely:

- gistxlogDelete
- gistxlogPageReuse
- xl_hash_vacuum_one_page
- xl_heap_prune
- xl_heap_freeze_page
- xl_heap_visible
- xl_btree_reuse_page
- xl_btree_delete
- spgxlogVacuumRedirect

Due to this new field being added, xl_hash_vacuum_one_page and
gistxlogDelete do now contain the offsets to be deleted as a
FLEXIBLE_ARRAY_MEMBER. This is needed to ensure correct alignment.
It's not needed on the others struct where isCatalogRel has
been added.

This commit just introduces the WAL format changes mentioned above. Handling
the actual conflicts will follow in future commits.

Bumps XLOG_PAGE_MAGIC as the several WAL records are changed.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de> (in an older version)
Author: Amit Khandekar <amitdkhan.pg@gmail.com>  (in an older version)
Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
2023-04-02 12:32:19 -07:00
Andres Freund
61b313e47e Pass down table relation into more index relation functions
This is done in preparation for logical decoding on standby, which needs to
include whether visibility affecting WAL records are about a (user) catalog
table. Which is only known for the table, not the indexes.

It's also nice to be able to pass the heap relation to GlobalVisTestFor() in
vacuumRedirectAndPlaceholder().

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com
2023-04-01 20:18:29 -07:00
Alexander Korotkov
11470f544e Allow locking updated tuples in tuple_update() and tuple_delete()
Currently, in read committed transaction isolation mode (default), we have the
following sequence of actions when tuple_update()/tuple_delete() finds
the tuple updated by concurrent transaction.

1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which
   returns TM_Updated.
2. Lock tuple with tuple_lock().
3. Re-evaluate plan qual (recheck if we still need to update/delete and
   calculate the new tuple for update).
4. Second attempt to update/delete tuple with tuple_update()/tuple_delete().
   This attempt should be successful, since the tuple was previously locked.

This patch eliminates step 2 by taking the lock during first
tuple_update()/tuple_delete() call.  Heap table access method saves some
efforts by checking the updated tuple once instead of twice.  Future
undo-based table access methods, which will start from the latest row version,
can immediately place a lock there.

The code in nodeModifyTable.c is simplified by removing the nested switch/case.

Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
Reviewed-by: Andres Freund, Chris Travers
2023-03-23 00:26:59 +03:00
Tomas Vondra
19d8e2308b Ignore BRIN indexes when checking for HOT updates
When determining whether an index update may be skipped by using HOT, we
can ignore attributes indexed by block summarizing indexes without
references to individual tuples that need to be cleaned up.

A new type TU_UpdateIndexes provides a signal to the executor to
determine which indexes to update - no indexes, all indexes, or only the
summarizing indexes.

This also removes rd_indexattr list, and replaces it with rd_attrsvalid
flag. The list was not used anywhere, and a simple flag is sufficient.

This was originally committed as 5753d4ee32, but then got reverted by
e3fcca0d0d because of correctness issues.

Original patch by Josef Simanek, various fixes and improvements by Tomas
Vondra and me.

Authors: Matthias van de Meent, Josef Simanek, Tomas Vondra
Reviewed-by: Tomas Vondra, Alvaro Herrera
Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a55634cf@enterprisedb.com
Discussion: https://postgr.es/m/CAFp7QwpMRGcDAQumN7onN9HjrJ3u4X3ZRXdGFT0K5G2JWvnbWg%40mail.gmail.com
2023-03-20 11:02:42 +01:00
Robert Haas
ebd551f586 Update some incorrect comments about xlog records.
The comments claim that certain pieces of data are part of the main
WAL record data when in reality they are part of the data for
block 0. Repair.

Bertrand Drouvot, reviewed by Amit Kapila. Originally reported by me.

Discussion: http://postgr.es/m/80db7836-4415-d54a-64c3-66b88b1430e7@gmail.com
2023-03-03 12:52:04 -05:00
Amit Kapila
a6cd1fc692 Change xl_hash_vacuum_one_page.ntuples from int to uint16.
This will create two bytes of padding space in xl_hash_vacuum_one_page which
can be used for future patches. This makes the datatype of
xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which is
advisable as both are used for the same purpose.

Author: Bertrand Drouvot
Reviewed-by: Nathan Bossart
Discussion: https://postgr.es/m/b0e20c40-cb7a-fc1c-c607-2a78dac5021e@gmail.com
2023-02-27 08:32:45 +05:30
David Rowley
9ed50ab349 Remove stray duplicated comment in heapam.h
This is just the same as what's written under the rs_numblocks field.

Reported-by: Melanie Plageman
Discussion: https://postgr.es/m/20230207204127.7vs6krqjqn5farr7@liskov
2023-02-08 16:03:26 +13:00
Michael Paquier
2f6e15ac93 Revert refactoring of restore command code to shell_restore.c
This reverts commits 24c35ec and 57169ad.  PreRestoreCommand() and
PostRestoreCommand() need to be put closer to the system() call calling
a restore_command, as they enable in_restore_command for the startup
process which would in turn trigger an immediate proc_exit() in the
SIGTERM handler.  Perhaps we could get rid of this behavior entirely,
but 24c35ec has made the window where the flag is enabled much larger
than it was, and any Postgres-like actions (palloc, etc.) taken by code
paths while the flag is enabled could lead to more severe issues in the
shutdown processing.

Note that curculio has showed that there are much more problems in this
area, unrelated to this change, actually, hence the issues related to
that had better be addressed first.  Keeping the code of HEAD in line
with the stable branches should make that a bit easier.

Per discussion with Andres Freund and Nathan Bossart.

Discussion: https://postgr.es/m/Y979NR3U5VnWrTwB@paquier.xyz
2023-02-06 08:28:42 +09:00
David Rowley
f9bc34fcb6 Further refactor of heapgettup and heapgettup_pagemode
Backward and forward scans share much of the same page acquisition code.
Here we consolidate that code to reduce some duplication.

Additionally, add a new rs_coffset field to HeapScanDescData to track the
offset of the current tuple.  The new field fits nicely into the padding
between a bool and BlockNumber field and saves having to look at the last
returned tuple to figure out which offset we should be looking at for the
current tuple.

Author: Melanie Plageman
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
2023-02-03 11:48:39 +13:00
David Rowley
e9aaf06328 Remove dead NoMovementScanDirection code
Here remove some dead code from heapgettup() and heapgettup_pagemode()
which was trying to support NoMovementScanDirection scans.  This code can
never be reached as standard_ExecutorRun() never calls ExecutePlan with
NoMovementScanDirection.

Additionally, plans which were scanning an unordered index would use
NoMovementScanDirection rather than ForwardScanDirection.  There was no
real need for this, so here we adjust this so we use ForwardScanDirection
for unordered index scans.  A comment in pathnodes.h claimed that
NoMovementScanDirection was used for PathKey reasons, but if that was
true, it no longer is, per code in build_index_paths().

This does change the non-text format of the EXPLAIN output so that
unordered index scans now have a "Forward" scan direction rather than
"NoMovement".  The text format of EXPLAIN has not changed.

Author: Melanie Plageman
Reviewed-by: Tom Lane, David Rowley
Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
2023-02-01 10:52:41 +13:00
Michael Paquier
9a740f81eb Refactor code in charge of running shell-based recovery commands
The code specific to the execution of archive_cleanup_command,
recovery_end_command and restore_command is moved to a new file named
shell_restore.c.  The code is split into three functions:
- shell_restore(), that attempts the execution of a shell-based
restore_command.
- shell_archive_cleanup(), for archive_cleanup_command.
- shell_recovery_end(), for recovery_end_command.

This introduces no functional changes, with failure patterns and logs
generated in consequence being the same as before (one case actually
generates one less DEBUG2 message "could not restore" when a restore
command succeeds but the follow-up stat() to check the size fails, but
that only matters with a elevel high enough).

This is preparatory work for allowing recovery modules, a facility
similar to archive modules, with callbacks shaped similarly to the
functions introduced here.

Author: Nathan Bossart
Reviewed-by: Andres Freund, Michael Paquier
Discussion: https://postgr.es/m/20221227192449.GA3672473@nathanxps13
2023-01-16 16:31:43 +09:00
Michael Paquier
69fb29d1af Remove function declarations from headers for some undefined functions
The functions whose declarations are removed here have been removed in
the past, but their respective headers forgot the call.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20230110045722.GD9837@telsasoft.com
2023-01-11 11:54:55 +09:00
Peter Eisentraut
d952373a98 New header varatt.h split off from postgres.h
This new header contains all the variable-length data types support
(TOAST support) from postgres.h, which isn't needed by large parts of
the backend code.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/ddcce239-0f29-6e62-4b47-1f8ca742addf%40enterprisedb.com
2023-01-10 05:54:36 +01:00
Peter Geoghegan
79d4bf4eff Delay commit status checks until freezing executes.
pg_xact lookups are relatively expensive.  Move the xmin/xmax commit
status checks from the point that freeze plans are prepared to the point
that they're actually executed.  Otherwise we'll repeat many commit
status checks whenever multiple successive VACUUM operations scan the
same pages and decide against freezing each time, which is a waste of
cycles.

Oversight in commit 1de58df4, which added page-level freezing.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkZpe4K6qMfEt8H4qYJCKc2R7TPvKsBva7jc9w7iGXQSw@mail.gmail.com
2023-01-03 11:22:36 -08:00
Peter Geoghegan
b37a083239 Refine the definition of page-level freezing.
Improve comments added by commit 1de58df4 which describe the
lazy_scan_prune "freeze the page" path.  These newly revised comments
are based on suggestions from Jeff Davis.

In passing, remove nearby visibility_cutoff_xid comments left over from
commit 6daeeb1f.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/ebc857107fe3edd422ef8a65191ca4a8da568b9b.camel@j-davis.com
2023-01-03 10:08:55 -08:00
Bruce Momjian
c8e1ba736b Update copyright for 2023
Backpatch-through: 11
2023-01-02 15:00:37 -05:00
Peter Geoghegan
1de58df4fe Add page-level freezing to VACUUM.
Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages.  Individual XIDs and MXIDs fields from tuple
headers now trigger freezing of whole pages, rather than independently
triggering freezing of each individual tuple header field.

Managing the cost of freezing over time now significantly influences
when and how VACUUM freezes.  The overall amount of WAL written is the
single most important freezing related cost, in general.  Freezing each
page's tuples together in batch allows VACUUM to take full advantage of
the freeze plan WAL deduplication optimization added by commit 9e540599.

Also teach VACUUM to trigger page-level freezing whenever it detects
that heap pruning generated an FPI.  We'll have already written a large
amount of WAL just to do that much, so it's very likely a good idea to
get freezing out of the way for the page early.  This only happens in
cases where it will directly lead to marking the page all-frozen in the
visibility map.

In most cases "freezing a page" removes all XIDs < OldestXmin, and all
MXIDs < OldestMxact.  It doesn't quite work that way in certain rare
cases involving MultiXacts, though.  It is convenient to define "freeze
the page" in a way that gives FreezeMultiXactId the leeway to put off
the work of processing an individual tuple's xmax whenever it happens to
be a MultiXactId that would require an expensive second pass to process
aggressively (allocating a new multi is especially worth avoiding here).
FreezeMultiXactId is eager when processing is cheap (as it usually is),
and lazy in the event of an individual multi that happens to require
expensive second pass processing.  This avoids regressions related to
processing of multis that page-level freezing might otherwise cause.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
2022-12-28 08:50:47 -08:00