Commit 19890a064e added the option to enable two_phase commits via
pg_create_logical_replication_slot but didn't extend the support of same
in replication protocol. However, by mistake, it added the two_phase
variable in CreateReplicationSlotCmd which is required only when we extend
the replication protocol.
Reported-by: Jeff Davis
Author: Ajin Cherian
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/64b9f783c6e125f18f88fbc0c0234e34e71d8639.camel@j-davis.com
The documented intent is for all columns except subconninfo to be
publicly readable. However, this has been overlooked twice.
subsynccommit has never been readable since it was introduced,
nor has the oid column (which is important for joining).
Given the lack of previous complaints, it's not clear that it's
worth doing anything about this in the back branches. But there's
still time to fix it inexpensively for v14.
Per report from Israel Barth (via Euler Taveira).
Patch by Euler Taveira, possibly-vain comment updates by me.
Discussion: https://postgr.es/m/b8f7c17c-0041-46b6-acfe-2d1f5a985ab4@www.fastmail.com
The reference here to different physical column numbers in inherited
UPDATE/DELETE plans is obsolete as of 86dc90056; remove it. Also
rework the text about inheritance cases to make it clearer.
Commit ab596105b increased FirstBootstrapObjectId from 12000 to 13000,
but we've had some push-back about that. It's worrisome to reduce the
daylight between there and FirstNormalObjectId, because the number of
OIDs consumed during initdb for collation objects is hard to predict.
We can improve the situation by abandoning the assumption that these
OIDs must be globally unique. It should be sufficient for them to be
unique per-catalog. (Any code that's unhappy about that is broken
anyway, since no more than per-catalog uniqueness can be guaranteed
once the OID counter wraps around.) With that change, the largest OID
assigned during genbki.pl (starting from a base of 10000) is a bit
under 11000. This allows reverting FirstBootstrapObjectId to 12000
with reasonable confidence that that will be sufficient for many years
to come.
We are not, at this time, abandoning the expectation that
hand-assigned OIDs (below 10000) are globally unique. Someday that'll
likely be necessary, but the need seems years away still.
This is late for v14, but it seems worth doing it now so that
downstream software doesn't have to deal with the consequences of
a change in FirstBootstrapObjectId. In any case, we already
bought into forcing an initdb for beta2, so another catversion
bump won't hurt.
Discussion: https://postgr.es/m/1665197.1622065382@sss.pgh.pa.us
Redefine '\0' (InvalidCompressionMethod) as meaning "if we need to
compress, use the current setting of default_toast_compression".
This allows '\0' to be a suitable default choice regardless of
datatype, greatly simplifying code paths that initialize tupledescs
and the like. It seems like a more user-friendly approach as well,
because now the default compression choice doesn't migrate into table
definitions, meaning that changing default_toast_compression is
usually sufficient to flip an installation's behavior; one needn't
tediously issue per-column ALTER SET COMPRESSION commands.
Along the way, fix a few minor bugs and documentation issues
with the per-column-compression feature. Adopt more robust
APIs for SetIndexStorageProperties and GetAttributeCompression.
Bump catversion because typical contents of attcompression will now
be different. We could get away without doing that, but it seems
better to ensure v14 installations all agree on this. (We already
forced initdb for beta2, anyway.)
Discussion: https://postgr.es/m/626613.1621787110@sss.pgh.pa.us
While decoding the multi-insert WAL we can't clean the toast untill we get
the last insert of that WAL record. Now if we stream the changes before we
get the last change, the memory for toast chunks won't be released and we
expect the txn to have streamed all changes after streaming. This
restriction is mainly to ensure the correctness of streamed transactions
and it doesn't seem worth uplifting such a restriction just to allow this
case because anyway we will stream the transaction once such an insert is
complete.
Previously we were using two different flags (one for toast tuples and
another for speculative inserts) to indicate partial changes. Now instead
we replaced both of them with a single flag to indicate partial changes.
Reported-by: Pavan Deolasee
Author: Dilip Kumar
Reviewed-by: Pavan Deolasee, Amit Kapila
Discussion: https://postgr.es/m/CABOikdN-_858zojYN-2tNcHiVTw-nhxPwoQS4quExeweQfG1Ug@mail.gmail.com
Now that attcompression is just a char, there's a lot of wasted
padding space after it. Move it into the group of char-wide
columns to save a net of 4 bytes per pg_attribute entry. While
we're at it, swap the order of attstorage and attalign to make for
a more logical grouping of these columns.
Also re-order actions in related code to match the new field ordering.
This patch also fixes one outright bug: equalTupleDescs() failed to
compare attcompression. That could, for example, cause relcache
reload to fail to adopt a new value following a change.
Michael Paquier and Tom Lane, per a gripe from Andres Freund.
Discussion: https://postgr.es/m/20210517204803.iyk5wwvwgtjcmc5w@alap3.anarazel.de
COMMIT/ROLLBACK necessarily destroys all snapshots within the session.
The original implementation of intra-procedure transactions just
cavalierly did that, ignoring the fact that this left us executing in
a rather different environment than normal. In particular, it turns
out that handling of toasted datums depends rather critically on there
being an outer ActiveSnapshot: otherwise, when SPI or the core
executor pop whatever snapshot they used and return, it's unsafe to
dereference any toasted datums that may appear in the query result.
It's possible to demonstrate "no known snapshots" and "missing chunk
number N for toast value" errors as a result of this oversight.
Historically this outer snapshot has been held by the Portal code,
and that seems like a good plan to preserve. So add infrastructure
to pquery.c to allow re-establishing the Portal-owned snapshot if it's
not there anymore, and add enough bookkeeping support that we can tell
whether it is or not.
We can't, however, just re-establish the Portal snapshot as part of
COMMIT/ROLLBACK. As in normal transaction start, acquiring the first
snapshot should wait until after SET and LOCK commands. Hence, teach
spi.c about doing this at the right time. (Note that this patch
doesn't fix the problem for any PLs that try to run intra-procedure
transactions without using SPI to execute SQL commands.)
This makes SPI's no_snapshots parameter rather a misnomer, so in HEAD,
rename that to allow_nonatomic.
replication/logical/worker.c also needs some fixes, because it wasn't
careful to hold a snapshot open around AFTER trigger execution.
That code doesn't use a Portal, which I suspect someday we're gonna
have to fix. But for now, just rearrange the order of operations.
This includes back-patching the recent addition of finish_estate()
to centralize the cleanup logic there.
This also back-patches commit 2ecfeda3e into v13, to improve the
test coverage for worker.c (it was that test that exposed that
worker.c's snapshot management is wrong).
Per bug #15990 from Andreas Wicht. Back-patch to v11 where
intra-procedure COMMIT was added.
Discussion: https://postgr.es/m/15990-eee2ac466b11293d@postgresql.org
1) Previously there were both pgstat_send_wal() and pgstat_report_wal()
in order to send WAL activity to the stats collector. With the former being
used by wal writer, the latter by most other processes. They were a bit
redundant and so this commit merges them into pgstat_send_wal() to
simplify the code.
2) Previously WAL global statistics counters were calculated and then
compared with zero-filled buffer in order to determine whether any WAL
activity has happened since the last submission. These calculation and
comparison were not cheap. This was regularly exercised even in read-only
workloads. This commit fixes the issue by making some WAL activity
counters directly be checked to determine if there's WAL activity stats
to send.
3) Previously pgstat_report_stat() did not check if there's WAL activity
stats to send as part of the "Don't expend a clock check if nothing to do"
check at the top. It's probably rare to have pending WAL stats without
also passing one of the other conditions, but for safely this commit
changes pgstat_report_stats() so that it checks also some WAL activity
counters at the top.
This commit also adds the comments about the design of WAL stats.
Reported-by: Andres Freund
Author: Masahiro Ikeda
Reviewed-by: Kyotaro Horiguchi, Atsushi Torikoshi, Andres Freund, Fujii Masao
Discussion: https://postgr.es/m/20210324232224.vrfiij2rxxwqqjjb@alap3.anarazel.de
Allowing only on/off meant that all either all existing configuration
guides would become obsolete if we disabled it by default, or that we
would have to accept a performance loss in the default config if we
enabled it by default. By allowing 'auto' as a middle ground, the
performance cost is only paid by those who enable pg_stat_statements and
similar modules.
I only edited the release notes to comment-out a paragraph that is now
factually wrong; further edits are probably needed to describe the
related change in more detail.
Author: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20210513002623.eugftm4nk2lvvks3@nol
Split up CHECK_FOR_INTERRUPTS() to provide an additional macro
INTERRUPTS_PENDING_CONDITION(), which just tests whether an
interrupt is pending without attempting to service it. This is
useful in situations where the caller knows that interrupts are
blocked, and would like to find out if it's worth the trouble
to unblock them.
Also add INTERRUPTS_CAN_BE_PROCESSED(), which indicates whether
CHECK_FOR_INTERRUPTS() can be relied on to clear the pending interrupt.
This commit doesn't actually add any uses of the new macros,
but a follow-on bug fix will do so. Back-patch to all supported
branches to provide infrastructure for that fix.
Alvaro Herrera and Tom Lane
Discussion: https://postgr.es/m/20210513155351.GA7848@alvherre.pgsql
The worker.c global wrconn is only meant to be used by logical apply/
tablesync workers, but there are other variables with the same name. To
reduce future confusion rename the global from "wrconn" to
"LogRepWorkerWalRcvConn".
While this is just cosmetic, it seems better to backpatch it all the way
back to 10 where this code appeared, to avoid future backpatching
issues.
Author: Peter Smith <smithpb2250@gmail.com>
Discussion: https://postgr.es/m/CAHut+Pu7Jv9L2BOEx_Z0UtJxfDevQSAUW2mJqWU+CtmDrEZVAg@mail.gmail.com
Run renumber_oids.pl to move high-numbered OIDs down, as per pre-beta
tasks specified by RELEASE_CHANGES. For reference, the command was
./renumber_oids.pl --first-mapped-oid 8000 --target-oid 6150
Also "make reformat-dat-files".
The only change worthy of note is that pgindent messed up the formatting
of launcher.c's struct LogicalRepWorkerId, which led me to notice that
that struct wasn't used at all anymore, so I just took it out.
EXPLAIN ANALYZE for an async-capable ForeignScan node associated with
postgres_fdw is done just by using instrumentation for ExecProcNode()
called from the node's callbacks, causing the following problems:
1) If the remote table to scan is empty, the node is incorrectly
considered as "never executed" by the command even if the node is
executed, as ExecProcNode() isn't called from the node's callbacks at
all in that case.
2) The command fails to collect timings for things other than
ExecProcNode() done in the node, such as creating a cursor for the
node's remote query.
To fix these problems, add instrumentation for async-capable nodes, and
modify postgres_fdw accordingly.
My oversight in commit 27e1f1456.
While at it, update a comment for the AsyncRequest struct in execnodes.h
and the documentation for the ForeignAsyncRequest API in fdwhandler.sgml
to match the code in ExecAsyncAppendResponse() in nodeAppend.c, and fix
typos in comments in nodeAppend.c.
Per report from Andrey Lepikhov, though I didn't use his patch.
Reviewed-by: Andrey Lepikhov
Discussion: https://postgr.es/m/2eb662bb-105d-fc20-7412-2f027cc3ca72%40postgrespro.ru
Previously long was used as the data type for some counters in BufferUsage
and WalUsage. But long is only four byte, e.g., on Windows, and it's entirely
possible to wrap a four byte counter. For example, emitting more than
four billion WAL records in one transaction isn't actually particularly rare.
To avoid the overflows of those counters, this commit changes the data type
of them from long to int64.
Suggested-by: Andres Freund
Author: Masahiro Ikeda
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/20201221211650.k7b53tcnadrciqjo@alap3.anarazel.de
Discussion: https://postgr.es/m/af0964ac-7080-1984-dc23-513754987716@oss.nttdata.com
It's unusual to have any resjunk columns in an ON CONFLICT ... UPDATE
list, but it can happen when MULTIEXPR_SUBLINK SubPlans are present.
If it happens, the ON CONFLICT UPDATE code path would end up storing
tuples that include the values of the extra resjunk columns. That's
fairly harmless in the short run, but if new columns are added to
the table then the values would become accessible, possibly leading
to malfunctions if they don't match the datatypes of the new columns.
This had escaped notice through a confluence of missing sanity checks,
including
* There's no cross-check that a tuple presented to heap_insert or
heap_update matches the table rowtype. While it's difficult to
check that fully at reasonable cost, we can easily add assertions
that there aren't too many columns.
* The output-column-assignment cases in execExprInterp.c lacked
any sanity checks on the output column numbers, which seems like
an oversight considering there are plenty of assertion checks on
input column numbers. Add assertions there too.
* We failed to apply nodeModifyTable's ExecCheckPlanOutput() to
the ON CONFLICT UPDATE tlist. That wouldn't have caught this
specific error, since that function is chartered to ignore resjunk
columns; but it sure seems like a bad omission now that we've seen
this bug.
In HEAD, the right way to fix this is to make the processing of
ON CONFLICT UPDATE tlists work the same as regular UPDATE tlists
now do, that is don't add "SET x = x" entries, and use
ExecBuildUpdateProjection to evaluate the tlist and combine it with
old values of the not-set columns. This adds a little complication
to ExecBuildUpdateProjection, but allows removal of a comparable
amount of now-dead code from the planner.
In the back branches, the most expedient solution seems to be to
(a) use an output slot for the ON CONFLICT UPDATE projection that
actually matches the target table, and then (b) invent a variant of
ExecBuildProjectionInfo that can be told to not store values resulting
from resjunk columns, so it doesn't try to store into nonexistent
columns of the output slot. (We can't simply ignore the resjunk columns
altogether; they have to be evaluated for MULTIEXPR_SUBLINK to work.)
This works back to v10. In 9.6, projections work much differently and
we can't cheaply give them such an option. The 9.6 version of this
patch works by inserting a JunkFilter when it's necessary to get rid
of resjunk columns.
In addition, v11 and up have the reverse problem when trying to
perform ON CONFLICT UPDATE on a partitioned table. Through a
further oversight, adjust_partition_tlist() discarded resjunk columns
when re-ordering the ON CONFLICT UPDATE tlist to match a partition.
This accidentally prevented the storing-bogus-tuples problem, but
at the cost that MULTIEXPR_SUBLINK cases didn't work, typically
crashing if more than one row has to be updated. Fix by preserving
resjunk columns in that routine. (I failed to resist the temptation
to add more assertions there too, and to do some minor code
beautification.)
Per report from Andres Freund. Back-patch to all supported branches.
Security: CVE-2021-32028
While we were (mostly) careful about ensuring that the dimensions of
arrays aren't large enough to cause integer overflow, the lower bound
values were generally not checked. This allows situations where
lower_bound + dimension overflows an integer. It seems that that's
harmless so far as array reading is concerned, except that array
elements with subscripts notionally exceeding INT_MAX are inaccessible.
However, it confuses various array-assignment logic, resulting in a
potential for memory stomps.
Fix by adding checks that array lower bounds aren't large enough to
cause lower_bound + dimension to overflow. (Note: this results in
disallowing cases where the last subscript position would be exactly
INT_MAX. In principle we could probably allow that, but there's a lot
of code that computes lower_bound + dimension and would need adjustment.
It seems doubtful that it's worth the trouble/risk to allow it.)
Somewhat independently of that, array_set_element() was careless
about possible overflow when checking the subscript of a fixed-length
array, creating a different route to memory stomps. Fix that too.
Security: CVE-2021-32027
This set of commits has some bugs with known fixes, but at this late
stage in the release cycle it seems best to revert and resubmit next
time, along with some new automated test coverage for this whole area.
Commits reverted:
dc88460c: Doc: Review for "Optionally prefetch referenced data in recovery."
1d257577: Optionally prefetch referenced data in recovery.
f003d9f8: Add circular WAL decoding buffer.
323cbe7c: Remove read_page callback from XLogReader.
Remove the new GUC group WAL_RECOVERY recently added by a55a9847, as the
corresponding section of config.sgml is now reverted.
Discussion: https://postgr.es/m/CAOuzzgrn7iKnFRsB4MHp3UisEQAGgZMbk_ViTN4HV4-Ksq8zCg%40mail.gmail.com
These comments left the impression that USE_VALGRIND isn't really
essential for valgrind testing. But that's wrong, as I learned
the hard way today.
Discussion: https://postgr.es/m/512778.1620588546@sss.pgh.pa.us
It seems that various people have moved GUCs around in the config.sgml
listing without bothering to make the code agree. Ensure that the
config_group codes assigned to GUCs match where they are listed in
config.sgml. Likewise ensure that postgresql.conf.sample lists GUCs
in the same sub-section and same ordering as they appear in config.sgml.
(I've got some doubts about some of these choices, but for the purposes
of this patch, we'll treat config.sgml as gospel.)
Notably, this requires adding a WAL_RECOVERY config_group value,
because 1d257577e didn't. As long as we're renumbering that enum
anyway, let's take out the values corresponding to major groups
that are divided into sub-groups. No GUC should be assigned to the
major group itself, so those values just create a temptation to
do the wrong thing, while adding work for translators.
In passing, adjust the short_desc strings for PRESET_OPTIONS GUCs
to uniformly use the phrasing "Shows XYZ.", removing the impression
some of these strings left that you can set the value.
While some of these errors are old, no back-patch, as changing the
contents of the pg_settings view in stable branches seems more likely
to be seen as a compatibility break than anything helpful.
Bharath Rupireddy, Justin Pryzby, Tom Lane
Discussion: https://postgr.es/m/16997-ff16127f6e0d1390@postgresql.org
Discussion: https://postgr.es/m/20210413123139.GE6091@telsasoft.com
Design problems were discovered in the handling of composite types and
record types that would cause some relevant versions not to be recorded.
Misgivings were also expressed about the use of the pg_depend catalog
for this purpose. We're out of time for this release so we'll revert
and try again.
Commits reverted:
1bf946bd: Doc: Document known problem with Windows collation versions.
cf002008: Remove no-longer-relevant test case.
ef387bed: Fix bogus collation-version-recording logic.
0fb0a050: Hide internal error for pg_collation_actual_version(<bad OID>).
ff942057: Suppress "warning: variable 'collcollate' set but not used".
d50e3b1f: Fix assertion in collation version lookup.
f24b1569: Rethink extraction of collation dependencies.
257836a7: Track collation versions for indexes.
cd6f479e: Add pg_depend.refobjversion.
7d1297df: Remove pg_collation.collversion.
Discussion: https://postgr.es/m/CA%2BhUKGLhj5t1fcjqAu8iD9B3ixJtsTNqyCCD4V0aTO9kAKAjjA%40mail.gmail.com
In d6b8d29419 I (Álvaro) was sloppy about recording whether a
partition descripor does or does not include detached partitions, when
the snapshot checking does not see the pg_inherits row marked detached.
In that case no partition was omitted, yet in the relcache entry we were
saving the partdesc as omitting partitions. Flip that (so we save it as
a partdesc not omitting partitions, which indeed it doesn't), which
hopefully makes the code easier to reason about.
Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqE7GxGU4VdzwZzfiz+Ont5SsopoFkgtrZGEdPqWRL+biA@mail.gmail.com
Currently, replication slot statistics are updated at prepare, commit, and
rollback. Now, if the transaction is interrupted the stats might not get
updated. Fixed this by updating replication statistics after every
stream/spill.
In passing update the docs to change the description of some of the slot
stats.
Author: Vignesh C, Sawada Masahiko
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/20210319185247.ldebgpdaxsowiflw@alap3.anarazel.de
We had a report of confusing server behavior caused by a client bug
that sent junk to the server: the server thought the junk was a
very long message length and waited patiently for data that would
never come. We can reduce the risk of that by being less trusting
about message lengths.
For a long time, libpq has had a heuristic rule that it wouldn't
believe large message size words, except for a small number of
message types that are expected to be (potentially) long. This
provides some defense against loss of message-boundary sync and
other corrupted-data cases. The server does something similar,
except that up to now it only limited the lengths of messages
received during the connection authentication phase. Let's
do the same as in libpq and put restrictions on the allowed
length of all messages, while distinguishing between message
types that are expected to be long and those that aren't.
I used a limit of 10000 bytes for non-long messages. (libpq's
corresponding limit is 30000 bytes, but given the asymmetry of
the FE/BE protocol, there's no good reason why the numbers should
be the same.) Experimentation suggests that this is at least a
factor of 10, maybe a factor of 100, more than we really need;
but plenty of daylight seems desirable to avoid false positives.
In any case we can adjust the limit based on beta-test results.
For long messages, set a limit of MaxAllocSize - 1, which is the
most that we can absorb into the StringInfo buffer that the message
is collected in. This just serves to make sure that a bogus message
size is reported as such, rather than as a confusing gripe about
not being able to enlarge a string buffer.
While at it, make sure that non-mainline code paths (such as
COPY FROM STDIN) are as paranoid as SocketBackend is, and validate
the message type code before believing the message length.
This provides an additional guard against getting stuck on corrupted
input.
Discussion: https://postgr.es/m/2003757.1619373089@sss.pgh.pa.us
Makes partition descriptor acquisition faster during the transient
period in which a partition is in the process of being detached.
This also adds the restriction that only one partition can be in
pending-detach state for a partitioned table.
While at it, return find_inheritance_children() API to what it was
before 71f4c8c6f7, and create a separate
find_inheritance_children_extended() that returns detailed info about
detached partitions.
(This incidentally fixes a bug in 8aba932251 whereby a memory context
holding a transient partdesc is reparented to a NULL PortalContext,
leading to permanent leak of that memory. The fix is to no longer rely
on reparenting contexts to PortalContext. Reported by Amit Langote.)
Per gripe from Amit Langote
Discussion: https://postgr.es/m/CA+HiwqFgpP1LxJZOBYGt9rpvTjXXkg5qG2+Xch2Z1Q7KrqZR1A@mail.gmail.com
Commit 8ff1c94649 allowed TRUNCATE command to truncate foreign tables.
Previously the information about "ONLY" options specified in TRUNCATE
command were passed to the foreign data wrapper. Then postgres_fdw
constructed the TRUNCATE command to issue the remote server and
included "ONLY" options in it based on the passed information.
On the other hand, "ONLY" options specified in SELECT, UPDATE or DELETE
have no effect when accessing or modifying the remote table, i.e.,
are not passed to the foreign data wrapper. So it's inconsistent to
make only TRUNCATE command pass the "ONLY" options to the foreign data
wrapper. Therefore this commit changes the TRUNCATE command so that
it doesn't pass the "ONLY" options to the foreign data wrapper,
for the consistency with other statements. Also this commit changes
postgres_fdw so that it always doesn't include "ONLY" options in
the TRUNCATE command that it constructs.
Author: Fujii Masao
Reviewed-by: Bharath Rupireddy, Kyotaro Horiguchi, Justin Pryzby, Zhihong Yu
Discussion: https://postgr.es/m/551ed8c1-f531-818b-664a-2cecdab99cd8@oss.nttdata.com
Previously, we used to use the array of size max_replication_slots to
store stats for replication slots. But that had two problems in the cases
where a message for dropping a slot gets lost: 1) the stats for the new
slot are not recorded if the array is full and 2) writing beyond the end
of the array if the user reduces the max_replication_slots.
This commit uses HTAB for replication slot statistics, resolving both
problems. Now, pgstat_vacuum_stat() search for all the dead replication
slots in stats hashtable and tell the collector to remove them. To avoid
showing the stats for the already-dropped slots, pg_stat_replication_slots
view searches slot stats by the slot name taken from pg_replication_slots.
Also, we send a message for creating a slot at slot creation, initializing
the stats. This reduces the possibility that the stats are accumulated
into the old slot stats when a message for dropping a slot gets lost.
Reported-by: Andres Freund
Author: Sawada Masahiko, test case by Vignesh C
Reviewed-by: Amit Kapila, Vignesh C, Dilip Kumar
Discussion: https://postgr.es/m/20210319185247.ldebgpdaxsowiflw@alap3.anarazel.de
The Truncate operation acquires an exclusive lock on the target relation
and indexes. It then waits for logical replication of the operation to
finish at commit. Now because we are acquiring the shared lock on the
target index to get index attributes in pgoutput while sending the
changes for the Truncate operation, it leads to a deadlock.
Actually, we don't need to acquire a lock on the target index as we build
the cache entry using a historic snapshot and all the later changes are
absorbed while decoding WAL. So, we wrote a special purpose function for
logical replication to get a bitmap of replica identity attribute numbers
where we get that information without locking the target index.
We decided not to backpatch this as there doesn't seem to be any field
complaint about this issue since it was introduced in commit 5dfd1e5a in
v11.
Reported-by: Haiying Tang
Author: Takamichi Osumi, test case by Li Japin
Reviewed-by: Amit Kapila, Ajin Cherian
Discussion: https://postgr.es/m/OS0PR01MB6113C2499C7DC70EE55ADB82FB759@OS0PR01MB6113.jpnprd01.prod.outlook.com
These functions shouldn't receive null arguments: multirange_constructor0()
doesn't have any arguments while multirange_constructor2() has a single array
argument, which is never null.
But mark them strict anyway for the sake of uniformity.
Also, make checks for null arguments use elog() instead of ereport() as these
errors should normally be never thrown. And adjust corresponding comments.
Catversion is bumped.
Reported-by: Peter Eisentraut
Discussion: https://postgr.es/m/0f783a96-8d67-9e71-996b-f34a7352eeef%40enterprisedb.com
Commit bbe0a81db6 introduced "INCLUDING COMPRESSION" option
in CREATE TABLE command, but previously TableLikeOption in gram.y and
parsenodes.h didn't classify this new option in alphabetical order
with the rest.
Author: Fujii Masao
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/YHerAixOhfR1ryXa@paquier.xyz
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
Since pg_depend can contain duplicate entries, we need to eliminate
those in information schema views that build on pg_depend, using
DISTINCT. Some of the older views already did that correctly, but
some of the more recently added ones didn't. (In some of these views,
it might not be possible to reproduce the issue because of how the
implementation happens to deduplicate dependencies while recording
them, but it seems better to keep this consistent in all cases.)
Commit f003d9f87 left this macro with inadequate (or, one could say,
too much) parenthesization. Which was catastrophic to the correctness
of calls such as "if (!XLogRecHasBlockRef(record, 1)) ...". There
are only a few of those, which perhaps explains why we didn't notice
immediately (with our general weakness of WAL replay testing being
another factor). I found it by debugging intermittent replay failures
like
2021-04-08 14:33:30.191 EDT [29463] PANIC: failed to locate backup block with ID 1
2021-04-08 14:33:30.191 EDT [29463] CONTEXT: WAL redo at 0/95D3438 for SPGist/ADD_NODE: off 1; blkref #0: rel 1663/16384/25998, blk 1
Previously, it was pg_stat_activity.queryid to match the
pg_stat_statements queryid column. This is an adjustment to patch
4f0b0966c8. This also adjusts some of the internal function calls to
match. Catversion bumped.
Reported-by: Álvaro Herrera, Julien Rouhaud
Discussion: https://postgr.es/m/20210408032704.GA7498@alvherre.pgsql
I didn't particularly like this function name, as it fails to
express what's going on. Also, returning the sort expression
alone isn't too helpful --- typically, a caller would also
need some other fields of the EquivalenceMember. But the
sole caller really only needs a bool result, so let's make
it "bool relation_can_be_sorted_early()".
Discussion: https://postgr.es/m/91f3ec99-85a4-fa55-ea74-33f85a5c651f@swarm64.com
An oversight introduced by the incremental-sort patches caused
"could not find pathkey item to sort" errors in some situations
where a sort key involves an aggregate or window function.
The basic problem here is that find_em_expr_usable_for_sorting_rel
isn't properly modeling what prepare_sort_from_pathkeys will do
later. Rather than hoping we can keep those functions in sync,
let's refactor so that they actually share the code for
identifying a suitable sort expression.
With this refactoring, tlist.c's tlist_member_ignore_relabel
is unused. I removed it in HEAD but left it in place in v13,
in case any extensions are using it.
Per report from Luc Vlaming. Back-patch to v13 where the
problem arose.
James Coleman and Tom Lane
Discussion: https://postgr.es/m/91f3ec99-85a4-fa55-ea74-33f85a5c651f@swarm64.com
In the new test after resetting the stats, we were not waiting for the
stats message to be delivered. Also, we need to decode the results for
the new test, otherwise, it will show the old stats.
In passing,
a. Change docs added by commit f5fc2f5b23 as per suggestion by
Justin Pryzby.
b. Bump the PGSTAT_FILE_FORMAT_ID as commit f5fc2f5b23 changes the file
format of stats.
Reported-by: Tom Lane based on buildfarm reports
Author: Vignesh C, Justin Pryzby
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/20210319185247.ldebgpdaxsowiflw@alap3.anarazel.de
Ooops, forgot to s/system_views.sql/system_functions.sql/g
in this part of 767982e36.
No need for an additional catversion bump, I think, since
these strings are gone by the time initdb finishes.
Discussion: https://postgr.es/m/3956760.1618529139@sss.pgh.pa.us
Adopt the new pre-parsed representation for all built-in and
information_schema SQL-language functions, except for a small
number that can't presently be converted because they have
polymorphic arguments.
This eliminates residual hazards around search-path safety of
these functions, and might provide some small performance benefits
by reducing parsing costs. It seems useful also to provide more
test coverage for the SQL-standard-body feature.
Discussion: https://postgr.es/m/3956760.1618529139@sss.pgh.pa.us
This adds the statistics about total transactions count and total
transaction data logically sent to the decoding output plugin from
ReorderBuffer. Users can query the pg_stat_replication_slots view to check
these stats.
Suggested-by: Andres Freund
Author: Vignesh C and Amit Kapila
Reviewed-by: Sawada Masahiko, Amit Kapila
Discussion: https://postgr.es/m/20210319185247.ldebgpdaxsowiflw@alap3.anarazel.de
Commit e717a9a18 changed the longstanding rule that prosrc is NOT NULL
because when a SQL-language function is written in SQL-standard style,
we don't currently have anything useful to put there. This seems a poor
decision though, as it could easily have negative impacts on external
PLs (opening them to crashes they didn't use to have, for instance).
SQL-function-related code can just as easily test "is prosqlbody not
null" as "is prosrc null", so there's no real gain there either.
Hence, revert the NOT NULL marking removal and adjust related logic.
For now, we just put an empty string into prosrc for SQL-standard
functions. Maybe we'll have a better idea later, although the
history of things like pg_attrdef.adsrc suggests that it's not
easy to maintain a string equivalent of a node tree.
This also adds an assertion that queryDesc->sourceText != NULL
to standard_ExecutorStart. We'd been silently relying on that
for awhile, so let's make it less silent.
Also fix some overlooked documentation and test cases.
Discussion: https://postgr.es/m/2197698.1617984583@sss.pgh.pa.us
This will make it consistent with the other usage of slotname in the code.
In the passing, change pgstat_report_replslot signature to use a structure
rather than multiple parameters.
Reported-by: Andres Freund
Author: Vignesh C
Reviewed-by: Sawada Masahiko, Amit Kapila
Discussion: https://postgr.es/m/20210319185247.ldebgpdaxsowiflw@alap3.anarazel.de
Previously, get_cached_rowtype() cached a pointer to a reference-counted
tuple descriptor from the typcache, relying on the ExprContextCallback
mechanism to release the tupdesc refcount when the expression tree
using the tupdesc was destroyed. This worked fine when it was designed,
but the introduction of within-DO-block COMMITs broke it. The refcount
is logged in a transaction-lifespan resource owner, but plpgsql won't
destroy simple expressions made within the DO block (before its first
commit) until the DO block is exited. That results in a warning about
a leaked tupdesc refcount when the COMMIT destroys the original resource
owner, and then an error about the active resource owner not holding a
matching refcount when the expression is destroyed.
To fix, get rid of the need to have a shutdown callback at all, by
instead caching a pointer to the relevant typcache entry. Those
survive for the life of the backend, so we needn't worry about the
pointer becoming stale. (For registered RECORD types, we can still
cache a pointer to the tupdesc, knowing that it won't change for the
life of the backend.) This mechanism has been in use in plpgsql
and expandedrecord.c since commit 4b93f5799, and seems to work well.
This change requires modifying the ExprEvalStep structs used by the
relevant expression step types, which is slightly worrisome for
back-patching. However, there seems no good reason for extensions
to be familiar with the details of these particular sub-structs.
Per report from Rohit Bhogate. Back-patch to v11 where within-DO-block
COMMITs became a thing.
Discussion: https://postgr.es/m/CAAV6ZkQRCVBh8qAY+SZiHnz+U+FqAGBBDaDTjF2yiKa2nJSLKg@mail.gmail.com
We'd previously noted the need for coping with Windows headers
that provide some other definition of macro "ERROR" than elog.h
does. It turns out that R also wants to define ERROR, and
WARNING too. PL/R has been working around this in a hacky way
that broke when we recently changed the numeric value of ERROR.
To let them have a more future-proof solution, provide an
alternate macro PGWARNING for WARNING, and make PGERROR visible
always, not only when #ifdef WIN32.
Discussion: https://postgr.es/m/CADK3HHK6iMChd1yoOqssxBn5Z14Zar8Ztr3G-N_fuG7F8YTP3w@mail.gmail.com
Commit 438fc4a39c prevented the WAL replay from writing
COMMIT_TS_SETTS record. By this change there is no code that
generates COMMIT_TS_SETTS record in PostgreSQL core.
Also we can think that there are no extensions using the record
because we've not received so far any complaints about the issue
that commit 438fc4a39c fixed. Therefore this commit removes
COMMIT_TS_SETTS record and its related code. Even without
this record, the timestamp required for commit timestamp feature
can be acquired from the COMMIT record.
Bump WAL page magic.
Reported-by: lx zou <zoulx1982@163.com>
Author: Fujii Masao
Reviewed-by: Alvaro Herrera
Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
Commit c9c41c7a33 used two different
naming patterns. Standardize on the majority pattern, which was the
only pattern in the last reviewed version of that commit.
Comment fixes are applied on HEAD, and documentation improvements are
applied on back-branches where needed.
Author: Justin Pryzby
Discussion: https://postgr.es/m/20210408164008.GJ6592@telsasoft.com
Backpatch-through: 9.6
This commit introduces new foreign data wrapper API for TRUNCATE.
It extends TRUNCATE command so that it accepts foreign tables as
the targets to truncate and invokes that API. Also it extends postgres_fdw
so that it can issue TRUNCATE command to foreign servers, by adding
new routine for that TRUNCATE API.
The information about options specified in TRUNCATE command, e.g.,
ONLY, CACADE, etc is passed to FDW via API. The list of foreign tables to
truncate is also passed to FDW. FDW truncates the foreign data sources
that the passed foreign tables specify, based on those information.
For example, postgres_fdw constructs TRUNCATE command using them
and issues it to the foreign server.
For performance, TRUNCATE command invokes the FDW routine for
TRUNCATE once per foreign server that foreign tables to truncate belong to.
Author: Kazutaka Onishi, Kohei KaiGai, slightly modified by Fujii Masao
Reviewed-by: Bharath Rupireddy, Michael Paquier, Zhihong Yu, Alvaro Herrera, Stephen Frost, Ashutosh Bapat, Amit Langote, Daniel Gustafsson, Ibrar Ahmed, Fujii Masao
Discussion: https://postgr.es/m/CAOP8fzb_gkReLput7OvOK+8NHgw-RKqNv59vem7=524krQTcWA@mail.gmail.com
Discussion: https://postgr.es/m/CAJuF6cMWDDqU-vn_knZgma+2GMaout68YUgn1uyDnexRhqqM5Q@mail.gmail.com
ScalarArrayOpExprs with "useOr=true" and a set of Consts on the righthand
side have traditionally been evaluated by using a linear search over the
array. When these arrays contain large numbers of elements then this
linear search could become a significant part of execution time.
Here we add a new method of evaluating ScalarArrayOpExpr expressions to
allow them to be evaluated by first building a hash table containing each
element, then on subsequent evaluations, we just probe that hash table to
determine if there is a match.
The planner is in charge of determining when this optimization is possible
and it enables it by setting hashfuncid in the ScalarArrayOpExpr. The
executor will only perform the hash table evaluation when the hashfuncid
is set.
This means that not all cases are optimized. For example CHECK constraints
containing an IN clause won't go through the planner, so won't get the
hashfuncid set. We could maybe do something about that at some later
date. The reason we're not doing it now is from fear that we may slow
down cases where the expression is evaluated only once. Those cases can
be common, for example, a single row INSERT to a table with a CHECK
constraint containing an IN clause.
In the planner, we enable this when there are suitable hash functions for
the ScalarArrayOpExpr's operator and only when there is at least
MIN_ARRAY_SIZE_FOR_HASHED_SAOP elements in the array. The threshold is
currently set to 9.
Author: James Coleman, David Rowley
Reviewed-by: David Rowley, Tomas Vondra, Heikki Linnakangas
Discussion: https://postgr.es/m/CAAaqYe8x62+=wn0zvNKCj55tPpg-JBHzhZFFc6ANovdqFw7-dA@mail.gmail.com
Introduce a new GUC recovery_prefetch, disabled by default. When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Better mechanisms will follow in later work on the I/O subsystem.
The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.
The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (parts)
Reviewed-by: Andres Freund <andres@anarazel.de> (parts)
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (parts)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com>
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
Teach xlogreader.c to decode its output into a circular buffer, to
support optimizations based on looking ahead.
* XLogReadRecord() works as before, consuming records one by one, and
allowing them to be examined via the traditional XLogRecGetXXX()
macros.
* An alternative new interface XLogNextRecord() is added that returns
pointers to DecodedXLogRecord structs that can be examined directly.
* XLogReadAhead() provides a second cursor that lets you see
further ahead, as long as data is available and there is enough space
in the decoding buffer. This returns DecodedXLogRecord pointers to the
caller, but also adds them to a queue of records that will later be
consumed by XLogNextRecord()/XLogReadRecord().
The buffer's size is controlled with wal_decode_buffer_size. The buffer
could potentially be placed into shared memory, for future projects.
Large records that don't fit in the circular buffer are called
"oversized" and allocated separately with palloc().
Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
Previously, the XLogReader module would fetch new input data using a
callback function. Redesign the interface so that it tells the caller
to insert more data with a special return value instead. This API suits
later patches for prefetching, encryption and maybe other future
projects that would otherwise require continually extending the callback
interface.
As incidental cleanup work, move global variables readOff, readLen and
readSegNo inside XlogReaderState.
Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (parts of earlier version)
Reviewed-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
There was some code in gen_prune_steps_from_opexps that needlessly
checked a list was not empty when it clearly had to contain at least one
item. This prompted a further cleanup operation in partprune.c.
Additionally, the previous code could end up adding additional needless
INTERSECT steps. However, those do not appear to be able to cause any
misbehavior.
gen_prune_steps_from_opexps is now no longer in charge of generating
combine pruning steps. Instead, gen_partprune_steps_internal, which
already does some combine step creation has been given the sole
responsibility of generating all combine steps. This means that when
we recursively call gen_partprune_steps_internal, since it always now adds
a combine step when it produces multiple steps, we can just pay attention
to the final step returned.
In passing, do quite a bit of work on the comments to try to more clearly
explain the role of both gen_partprune_steps_internal and
gen_prune_steps_from_opexps. This is fairly complex code so some extra
effort to give any new readers an overview of how things work seems like
a good idea.
Author: Amit Langote
Reported-by: Andy Fan
Reviewed-by: Kyotaro Horiguchi, Andy Fan, Ryan Lambert, David Rowley
Discussion: https://postgr.es/m/CAKU4AWqWoVii+bRTeBQmeVW+PznkdO8DfbwqNsu9Gj4ubt9A6w@mail.gmail.com
This adds a function, pg_wait_for_backend_termination(), and a new
timeout argument to pg_terminate_backend(), which will wait for the
backend to actually terminate (with or without signaling it to do so
depending on which function is called). The default behaviour of
pg_terminate_backend() remains being timeout=0 which does not waiting.
For pg_wait_for_backend_termination() the default wait is 5 seconds.
Author: Bharath Rupireddy
Reviewed-By: Fujii Masao, David Johnston, Muhammad Usama,
Hou Zhijie, Magnus Hagander
Discussion: https://postgr.es/m/CALj2ACUBpunmyhYZw-kXCYs5NM+h6oG_7Df_Tn4mLmmUQifkqA@mail.gmail.com
If you know the ID of a buffer that recently held a block that you would
like to pin, this function can be used check if it's still there. It
can be used to avoid a second lookup in the buffer mapping table after
PrefetchBuffer() reports a cache hit.
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
Previously, autovacuum would completely ignore partitioned tables, which
is not good regarding analyze -- failing to analyze those tables means
poor plans may be chosen. Make autovacuum aware of those tables by
propagating "changes since analyze" counts from the leaf partitions up
the partitioning hierarchy.
This also introduces necessary reloptions support for partitioned tables
(autovacuum_enabled, autovacuum_analyze_scale_factor,
autovacuum_analyze_threshold). It's unclear how best to document this
aspect.
Author: Yuzuko Hosoya <yuzukohosoya@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CAKkQ508_PwVgwJyBY=0Lmkz90j8CmWNPUxgHvCUwGhMrouz6UA@mail.gmail.com
This adds support for writing CREATE FUNCTION and CREATE PROCEDURE
statements for language SQL with a function body that conforms to the
SQL standard and is portable to other implementations.
Instead of the PostgreSQL-specific AS $$ string literal $$ syntax,
this allows writing out the SQL statements making up the body
unquoted, either as a single statement:
CREATE FUNCTION add(a integer, b integer) RETURNS integer
LANGUAGE SQL
RETURN a + b;
or as a block
CREATE PROCEDURE insert_data(a integer, b integer)
LANGUAGE SQL
BEGIN ATOMIC
INSERT INTO tbl VALUES (a);
INSERT INTO tbl VALUES (b);
END;
The function body is parsed at function definition time and stored as
expression nodes in a new pg_proc column prosqlbody. So at run time,
no further parsing is required.
However, this form does not support polymorphic arguments, because
there is no more parse analysis done at call time.
Dependencies between the function and the objects it uses are fully
tracked.
A new RETURN statement is introduced. This can only be used inside
function bodies. Internally, it is treated much like a SELECT
statement.
psql needs some new intelligence to keep track of function body
boundaries so that it doesn't send off statements when it sees
semicolons that are inside a function body.
Tested-by: Jaime Casanova <jcasanov@systemguards.com.ec>
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/1c11f1eb-f00c-43b7-799d-2d44132c02d7@2ndquadrant.com
Add a failsafe mechanism that is triggered by VACUUM when it notices
that the table's relfrozenxid and/or relminmxid are dangerously far in
the past. VACUUM checks the age of the table dynamically, at regular
intervals.
When the failsafe triggers, VACUUM takes extraordinary measures to
finish as quickly as possible so that relfrozenxid and/or relminmxid can
be advanced. VACUUM will stop applying any cost-based delay that may be
in effect. VACUUM will also bypass any further index vacuuming and heap
vacuuming -- it only completes whatever remaining pruning and freezing
is required. Bypassing index/heap vacuuming is enabled by commit
8523492d, which made it possible to dynamically trigger the mechanism
already used within VACUUM when it is run with INDEX_CLEANUP off.
It is expected that the failsafe will almost always trigger within an
autovacuum to prevent wraparound, long after the autovacuum began.
However, the failsafe mechanism can trigger in any VACUUM operation.
Even in a non-aggressive VACUUM, where we're likely to not advance
relfrozenxid, it still seems like a good idea to finish off remaining
pruning and freezing. An aggressive/anti-wraparound VACUUM will be
launched immediately afterwards. Note that the anti-wraparound VACUUM
that follows will itself trigger the failsafe, usually before it even
begins its first (and only) pass over the heap.
The failsafe is controlled by two new GUCs: vacuum_failsafe_age, and
vacuum_multixact_failsafe_age. There are no equivalent reloptions,
since that isn't expected to be useful. The GUCs have rather high
defaults (both default to 1.6 billion), and are expected to generally
only be used to make the failsafe trigger sooner/more frequently.
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmgH3ySGYeC-m-eOBsa2=sDwa292-CFghV4rESYo39FsQ@mail.gmail.com
Use the in-core query id computation for pg_stat_activity,
log_line_prefix, and EXPLAIN VERBOSE.
Similar to other fields in pg_stat_activity, only the queryid from the
top level statements are exposed, and if the backends status isn't
active then the queryid from the last executed statements is displayed.
Add a %Q placeholder to include the queryid in log_line_prefix, which
will also only expose top level statements.
For EXPLAIN VERBOSE, if a query identifier has been computed, either by
enabling compute_query_id or using a third-party module, display it.
Bump catalog version.
Discussion: https://postgr.es/m/20210407125726.tkvjdbw76hxnpwfi@nol
Author: Julien Rouhaud
Reviewed-by: Alvaro Herrera, Nitin Jadhav, Zhihong Yu
Add compute_query_id GUC to control whether a query identifier should be
computed by the core (off by default). It's thefore now possible to
disable core queryid computation and use pg_stat_statements with a
different algorithm to compute the query identifier by using a
third-party module.
To ensure that a single source of query identifier can be used and is
well defined, modules that calculate a query identifier should throw an
error if compute_query_id specified to compute a query id and if a query
idenfitier was already calculated.
Discussion: https://postgr.es/m/20210407125726.tkvjdbw76hxnpwfi@nol
Author: Julien Rouhaud
Reviewed-by: Alvaro Herrera, Nitin Jadhav, Zhihong Yu
Teach VACUUM to truncate the line pointer array of each heap page when a
contiguous group of LP_UNUSED line pointers appear at the end of the
array -- these unused and unreferenced items are excluded. This process
occurs during VACUUM's second pass over the heap, right after LP_DEAD
line pointers on the page (those encountered/pruned during the first
pass) are marked LP_UNUSED.
Truncation avoids line pointer bloat with certain workloads,
particularly those involving continual range DELETEs and bulk INSERTs
against the same table.
Also harden heapam code to check for an out-of-range page offset number
in places where we weren't already doing so.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com
The previous implementation (from 9afffcb833) had an unnecessary check
on the boundaries of the enum which trigtered compile warnings. To clean
it up, move the pre-existing static assert to a central location and
call that.
Reported-By: Erik Rijkers
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/1056399262.13159.1617793249020@webmailclassic.xs4all.nl
Documentation and comments in code and tests have been using the terms
sensitive/insensitive cursor incorrectly relative to the SQL standard.
(Cursor sensitivity is only relevant for changes made in the same
transaction as the cursor, not for concurrent changes in other
sessions.) Moreover, some of the behavior of PostgreSQL is incorrect
according to the SQL standard, confusing the issue further. (WHERE
CURRENT OF changes are not visible in insensitive cursors, but they
should be.)
This change corrects the terminology and removes the claim that
sensitive cursors are supported. It also adds a test case that checks
the insensitive behavior in a "correct" way, using a change command
not using WHERE CURRENT OF. Finally, it adds the ASENSITIVE cursor
option to select the default asensitive behavior, per SQL standard.
There are no changes to cursor behavior in this patch.
Discussion: https://www.postgresql.org/message-id/flat/96ee8b30-9889-9e1b-b053-90e10c050e85%40enterprisedb.com
The "authenticated identity" is the string used by an authentication
method to identify a particular user. In many common cases, this is the
same as the PostgreSQL username, but for some third-party authentication
methods, the identifier in use may be shortened or otherwise translated
(e.g. through pg_ident user mappings) before the server stores it.
To help administrators see who has actually interacted with the system,
this commit adds the capability to store the original identity when
authentication succeeds within the backend's Port, and generates a log
entry when log_connections is enabled. The log entries generated look
something like this (where a local user named "foouser" is connecting to
the database as the database user called "admin"):
LOG: connection received: host=[local]
LOG: connection authenticated: identity="foouser" method=peer (/data/pg_hba.conf:88)
LOG: connection authorized: user=admin database=postgres application_name=psql
Port->authn_id is set according to the authentication method:
bsd: the PostgreSQL username (aka the local username)
cert: the client's Subject DN
gss: the user principal
ident: the remote username
ldap: the final bind DN
pam: the PostgreSQL username (aka PAM username)
password (and all pw-challenge methods): the PostgreSQL username
peer: the peer's pw_name
radius: the PostgreSQL username (aka the RADIUS username)
sspi: either the down-level (SAM-compatible) logon name, if
compat_realm=1, or the User Principal Name if compat_realm=0
The trust auth method does not set an authenticated identity. Neither
does clientcert=verify-full.
Port->authn_id could be used for other purposes, like a superuser-only
extra column in pg_stat_activity, but this is left as future work.
PostgresNode::connect_{ok,fails}() have been modified to let tests check
the backend log files for required or prohibited patterns, using the
new log_like and log_unlike parameters. This uses a method based on a
truncation of the existing server log file, like issues_sql_like().
Tests are added to the ldap, kerberos, authentication and SSL test
suites.
Author: Jacob Champion
Reviewed-by: Stephen Frost, Magnus Hagander, Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/c55788dd1773c521c862e8e0dddb367df51222be.camel@vmware.com
Delay creation of the projections for INSERT and UPDATE tuples
until they're needed. This saves a pretty fair amount of work
when only some of the partitions are actually touched.
The logic associated with identifying junk columns in UPDATE/DELETE
is moved to another loop, allowing removal of one loop over the
target relations; but it didn't actually change at all.
Extracted from a larger patch, which seemed to me to be too messy
to push in one commit.
Amit Langote, reviewed at different times by Heikki Linnakangas and
myself
Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
Arrange to do some things on-demand, rather than immediately during
executor startup, because there's a fair chance of never having to do
them at all:
* Don't open result relations' indexes until needed.
* Don't initialize partition tuple routing, nor the child-to-root
tuple conversion map, until needed.
This wins in UPDATEs on partitioned tables when only some of the
partitions will actually receive updates; with larger partition
counts the savings is quite noticeable. Also, we can remove some
sketchy heuristics in ExecInitModifyTable about whether to set up
tuple routing.
Also, remove execPartition.c's private hash table tracking which
partitions were already opened by the ModifyTable node. Instead
use the hash added to ModifyTable itself by commit 86dc90056.
To allow lazy computation of the conversion maps, we now set
ri_RootResultRelInfo in all child ResultRelInfos. We formerly set it
only in some, not terribly well-defined, cases. This has user-visible
side effects in that now more error messages refer to the root
relation instead of some partition (and provide error data in the
root's column order, too). It looks to me like this is a strict
improvement in consistency, so I don't have a problem with the
output changes visible in this commit.
Extracted from a larger patch, which seemed to me to be too messy
to push in one commit.
Amit Langote, reviewed at different times by Heikki Linnakangas and
myself
Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
Retry the call to heap_prune_page() in rare cases where there is
disagreement between the heap_prune_page() call and the call to
HeapTupleSatisfiesVacuum() that immediately follows. Disagreement is
possible when a concurrently-aborted transaction makes a tuple DEAD
during the tiny window between each step. This was the only case where
a tuple considered DEAD by VACUUM still had storage following pruning.
VACUUM's definition of dead tuples is now uniformly simple and
unambiguous: dead tuples from each page are always LP_DEAD line pointers
that were encountered just after we performed pruning (and just before
we considered freezing remaining items with tuple storage).
Eliminating the tupgone=true special case enables INDEX_CLEANUP=off
style skipping of index vacuuming that takes place based on flexible,
dynamic criteria. The INDEX_CLEANUP=off case had to know about skipping
indexes up-front before now, due to a subtle interaction with the
special case (see commit dd695979) -- this was a special case unto
itself. Now there are no special cases. And so now it won't matter
when or how we decide to skip index vacuuming: it won't affect how
pruning behaves, and it won't be affected by any of the implementation
details of pruning or freezing.
Also remove XLOG_HEAP2_CLEANUP_INFO records. These are no longer
necessary because we now rely entirely on heap pruning taking care of
recovery conflicts. There is no longer any need to generate recovery
conflicts for DEAD tuples that pruning just missed. This also means
that heap vacuuming now uses exactly the same strategy for recovery
conflicts as index vacuuming always has: REDO routines never need to
process a latestRemovedXid from the WAL record, since earlier REDO of
the WAL record from pruning is sufficient in all cases. The generic
XLOG_HEAP2_CLEAN record type is now split into two new record types to
reflect this new division (these are called XLOG_HEAP2_PRUNE and
XLOG_HEAP2_VACUUM).
Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass. A regular exclusive lock is
enough. This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED. No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.
Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning. Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with HOT chains). Index vacuuming and
heap vacuuming are now only concerned with recycling garbage items from
physical data structures that back the logical database.
Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.
Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
At present, if we want to update publications in a subscription, we
can use SET PUBLICATION. However, it requires supplying all
publications that exists and the new publications. If we want to add
new publications, it's inconvenient. The new syntax only supplies the
new publications. When the refresh is true, it only refreshes the new
publications.
Author: Japin Li <japinli@hotmail.com>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/MEYP282MB166939D0D6C480B7FBE7EFFBB6BC0@MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
The previous implementation of EXTRACT mapped internally to
date_part(), which returned type double precision (since it was
implemented long before the numeric type existed). This can lead to
imprecise output in some cases, so returning numeric would be
preferrable. Changing the return type of an existing function is a
bit risky, so instead we do the following: We implement a new set of
functions, which are now called "extract", in parallel to the existing
date_part functions. They work the same way internally but use
numeric instead of float8. The EXTRACT construct is now mapped by the
parser to these new extract functions. That way, dumps of views
etc. from old versions (which would use date_part) continue to work
unchanged, but new uses will map to the new extract functions.
Additionally, the reverse compilation of EXTRACT now reproduces the
original syntax, using the new mechanism introduced in
40c24bfef9.
The following minor changes of behavior result from the new
implementation:
- The column name from an isolated EXTRACT call is now "extract"
instead of "date_part".
- Extract from date now rejects inappropriate field names such as
HOUR. It was previously mapped internally to extract from
timestamp, so it would silently accept everything appropriate for
timestamp.
- Return values when extracting fields with possibly fractional
values, such as second and epoch, now have the full scale that the
value has internally (so, for example, '1.000000' instead of just
'1').
Reported-by: Petr Fedorov <petr.fedorov@phystech.edu>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/42b73d2d-da12-ba9f-570a-420e0cce19d9@phystech.edu
Commit 3e98c0bafb added pg_backend_memory_contexts view to display
the memory contexts of the backend process. However its target process
is limited to the backend that is accessing to the view. So this is
not so convenient when investigating the local memory bloat of other
backend process. To improve this situation, this commit adds
pg_log_backend_memory_contexts() function that requests to log
the memory contexts of the specified backend process.
This information can be also collected by calling
MemoryContextStats(TopMemoryContext) via a debugger. But
this technique cannot be used in some environments because no debugger
is available there. So, pg_log_backend_memory_contexts() allows us to
see the memory contexts of specified backend more easily.
Only superusers are allowed to request to log the memory contexts
because allowing any users to issue this request at an unbounded rate
would cause lots of log messages and which can lead to denial of service.
On receipt of the request, at the next CHECK_FOR_INTERRUPTS(),
the target backend logs its memory contexts at LOG_SERVER_ONLY level,
so that these memory contexts will appear in the server log but not
be sent to the client. It logs one message per memory context.
Because if it buffers all memory contexts into StringInfo to log them
as one message, which may require the buffer to be enlarged very much
and lead to OOM error since there can be a large number of memory
contexts in a backend.
When a backend process is consuming huge memory, logging all its
memory contexts might overrun available disk space. To prevent this,
now this patch limits the number of child contexts to log per parent
to 100. As with MemoryContextStats(), it supposes that practical cases
where the log gets long will typically be huge numbers of siblings
under the same parent context; while the additional debugging value
from seeing details about individual siblings beyond 100 will not be large.
There was another proposed patch to add the function to return
the memory contexts of specified backend as the result sets,
instead of logging them, in the discussion. However that patch is
not included in this commit because it had several issues to address.
Thanks to Tatsuhito Kasahara, Andres Freund, Tom Lane, Tomas Vondra,
Michael Paquier, Kyotaro Horiguchi and Zhihong Yu for the discussion.
Bump catalog version.
Author: Atsushi Torikoshi
Reviewed-by: Kyotaro Horiguchi, Zhihong Yu, Fujii Masao
Discussion: https://postgr.es/m/0271f440ac77f2a4180e0e56ebd944d1@oss.nttdata.com
The output plugin accepts a new parameter (messages) that controls if
logical decoding messages are written into the replication stream. It is
useful for those clients that use pgoutput as an output plugin and needs
to process messages that were written by pg_logical_emit_message().
Although logical streaming replication protocol supports logical
decoding messages now, logical replication does not use this feature yet.
Author: David Pirotte, Euler Taveira
Reviewed-by: Euler Taveira, Andres Freund, Ashutosh Bapat, Amit Kapila
Discussion: https://postgr.es/m/CADK3HHJ-+9SO7KuRLH=9Wa1rAo60Yreq1GFNkH_kd0=CdaWM+A@mail.gmail.com
Instead of using multiple parameters in parse_ouput_parameters function
signature, use the struct PGOutputData that encapsulates all pgoutput
options. It will be useful for future work where we need to add other
options in pgoutput.
Author: Euler Taveira
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CADK3HHJ-+9SO7KuRLH=9Wa1rAo60Yreq1GFNkH_kd0=CdaWM+A@mail.gmail.com
Not much to say here: does what it says on the tin.
We steal a previously-always-zero bit from the nextOffset
field of leaf index tuples in order to track whether there
is a nulls bitmap. Otherwise it works about like included
columns in other index types.
Pavel Borisov, reviewed by Andrey Borodin and Anastasia Lubennikova,
and rather heavily editorialized on by me
Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
Reorganize the state struct used by VACUUM -- group related items
together to make it easier to understand. Also stop relying on stack
variables inside lazy_scan_heap() -- move those into the state struct
instead. Doing things this way simplifies large groups of related
functions whose function signatures had a lot of unnecessary redundancy.
Switch over to using int64 for the struct fields used to count things
that are reported to the user via log_autovacuum and VACUUM VERBOSE
output. We were using double, but that doesn't seem to have any
advantages. Using int64 makes it possible to add assertions that verify
that the first pass over the heap (pruning) encounters precisely the
same number of LP_DEAD items that get deleted from indexes later on, in
the second pass over the heap. These assertions will be added in later
commits.
Finally, adjust the signatures of functions with IndexBulkDeleteResult
pointer arguments in cases where there was ambiguity about whether or
not the argument relates to a single index or all indexes. Functions
now use the idiom that both ambulkdelete() and amvacuumcleanup() have
always used (where appropriate): accept a mutable IndexBulkDeleteResult
pointer argument, and return a result IndexBulkDeleteResult pointer to
caller.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkeOSYwC6KNckbhk2b1aNnWum6Yyn0NKP9D-Hq1LGTDPw@mail.gmail.com
A commonly requested use-case is to have a role who can run an
unfettered pg_dump without having to explicitly GRANT that user access
to all tables, schemas, et al, without that role being a superuser.
This address that by adding a "pg_read_all_data" role which implicitly
gives any member of this role SELECT rights on all tables, views and
sequences, and USAGE rights on all schemas.
As there may be cases where it's also useful to have a role who has
write access to all objects, pg_write_all_data is also introduced and
gives users implicit INSERT, UPDATE and DELETE rights on all tables,
views and sequences.
These roles can not be logged into directly but instead should be
GRANT'd to a role which is able to log in. As noted in the
documentation, if RLS is being used then an administrator may (or may
not) wish to set BYPASSRLS on the login role which these predefined
roles are GRANT'd to.
Reviewed-by: Georgios Kokolatos
Discussion: https://postgr.es/m/20200828003023.GU29590@tamriel.snowman.net
Move the planner-control flags up so that there is more room for parse
options. Some pending patches need some room there, so do this
renumbering separately so that there is less potential for conflicts.
According to the documentation, the attType passed to the opclass
config function (and also relied on by the core code) is the type
of the heap column or expression being indexed. But what was
actually being passed was the type stored for the index column.
This made no difference for user-defined SP-GiST opclasses,
because we weren't allowing the STORAGE clause of CREATE OPCLASS
to be used, so the two types would be the same. But it's silly
not to allow that, seeing that the built-in poly_ops opclass
has a different value for opckeytype than opcintype, and that if you
want to do lossy storage then the types must really be different.
(Thus, user-defined opclasses doing lossy storage had to lie about
what type is in the index.) Hence, remove the restriction, and make
sure that we use the input column type not opckeytype where relevant.
For reasons of backwards compatibility with existing user-defined
opclasses, we can't quite insist that the specified leafType match
the STORAGE clause; instead just add an amvalidate() warning if
they don't match.
Also fix some bugs that would only manifest when trying to return
index entries when attType is different from attLeafType. It's not
too surprising that these have not been reported, because the only
usual reason for such a difference is to store the leaf value
lossily, rendering index-only scans impossible.
Add a src/test/modules module to exercise cases where attType is
different from attLeafType and yet index-only scan is supported.
Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
pgstat_report_wait_start() and pgstat_report_wait_end() required two
conditional branches so far. One to check if MyProc is NULL, the other to
check if pgstat_track_activities is set. As wait events are used around
comparatively lightweight operations, and are inlined (reducing branch
predictor effectiveness), that's not great.
The dependency on MyProc has a second disadvantage: Low-level subsystems, like
storage/file/fd.c, report wait events, but architecturally it is preferable
for them to not depend on inter-process subsystems like proc.h (defining
PGPROC). After this change including pgstat.h (nor obviously its
sub-components like backend_status.h, wait_event.h, ...) does not pull in IPC
related headers anymore.
These goals, efficiency and abstraction, are achieved by having
pgstat_report_wait_start/end() not interact with MyProc, but instead a new
my_wait_event_info variable. At backend startup it points to a local variable,
removing the need to check for MyProc being NULL. During process
initialization my_wait_event_info is redirected to MyProc->wait_event_info. At
shutdown this is reversed. Because wait event reporting now does not need to
know about where the wait event is stored, it does not need to know about
PGPROC anymore.
The removal of the branch for checking pgstat_track_activities is simpler:
Don't check anymore. The cost due to the branch are often higher than the
store - and even if not, pgstat_track_activities is rarely disabled.
The main motivator to commit this work now is that removing the (indirect)
pgproc.h include from pgstat.h simplifies a patch to move statistics reporting
to shared memory (which still has a chance to get into 14).
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20210402194458.2vu324hkk2djq6ce@alap3.anarazel.de
Backend status (supporting pg_stat_activity) and command
progress (supporting pg_stat_progress*) related code is largely
independent from the rest of pgstat.[ch] (supporting views like
pg_stat_all_tables that accumulate data over time). See also
a333476b92.
This commit doesn't rename the function names to make the distinction
from the rest of pgstat_ clearer - that'd be more invasive and not
clearly beneficial. If we were to decide to do such a rename at some
point, it's better done separately from moving the code as well.
Robert's review was of an earlier version.
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/20210316195440.twxmlov24rr2nxrg@alap3.anarazel.de
Similarly to the cryptohash implementations, this refactors the existing
HMAC code into a single set of APIs that can be plugged with any crypto
libraries PostgreSQL is built with (only OpenSSL currently). If there
is no such libraries, a fallback implementation is available. Those new
APIs are designed similarly to the existing cryptohash layer, so there
is no real new design here, with the same logic around buffer bound
checks and memory handling.
HMAC has a dependency on cryptohashes, so all the cryptohash types
supported by cryptohash{_openssl}.c can be used with HMAC. This
refactoring is an advantage mainly for SCRAM, that included its own
implementation of HMAC with SHA256 without relying on the existing
crypto libraries even if PostgreSQL was built with their support.
This code has been tested on Windows and Linux, with and without
OpenSSL, across all the versions supported on HEAD from 1.1.1 down to
1.0.1. I have also checked that the implementations are working fine
using some sample results, a custom extension of my own, and doing
cross-checks across different major versions with SCRAM with the client
and the backend.
Author: Michael Paquier
Reviewed-by: Bruce Momjian
Discussion: https://postgr.es/m/X9m0nkEJEzIPXjeZ@paquier.xyz
The wait event related code is independent from the rest of the
pgstat.[ch] code, of nontrivial size and changes on a regular
basis. Put it into its own set of files.
As there doesn't seem to be a good pre-existing directory for code
like this, add src/backend/utils/activity.
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/20210316195440.twxmlov24rr2nxrg@alap3.anarazel.de
Provide a new GUC check_client_connection_interval that can be used to
check whether the client connection has gone away, while running very
long queries. It is disabled by default.
For now this uses a non-standard Linux extension (also adopted by at
least one other OS). POLLRDHUP is not defined by POSIX, and other OSes
don't have a reliable way to know if a connection was closed without
actually trying to read or write.
In future we might consider trying to send a no-op/heartbeat message
instead, but that could require protocol changes.
Author: Sergey Cherkashin <s.cherkashin@postgrespro.ru>
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Tatsuo Ishii <ishii@sraoss.co.jp>
Reviewed-by: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Maksim Milyutin <milyutinma@gmail.com>
Reviewed-by: Tsunakawa, Takayuki/綱川 貴之 <tsunakawa.takay@fujitsu.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (much earlier version)
Discussion: https://postgr.es/m/77def86b27e41f0efcba411460e929ae%40postgrespro.ru
The caller of pgstat_report_replslot() passes int64 values to the function.
Also the function stores those values in PgStat_Counter (i.e., int64) fields
of PgStat_MsgReplSlot struct. But previously the function used "int" as
the data types of some arguments for those values, which could lead to
the overflow of values.
To avoid this risk, this commit fixes pgstat_report_replslot() to use
PgStat_Counter type for the arguments. Since they are the statistics counters,
PgStat_Counter, the data type used for counters, is used for them
instead of int64.
Reported-by: Vignesh C
Author: Vignesh C
Reviewed-by: Jeevan Ladhe, Fujii Masao
Discussion: https://postgr.es/m/CALDaNm080OpG=ZwOb0i8EyChH5SyHAMFWJCKaKTXmrfvJLbgaA@mail.gmail.com
Here we add a new executor node type named "Result Cache". The planner
can include this node type in the plan to have the executor cache the
results from the inner side of parameterized nested loop joins. This
allows caching of tuples for sets of parameters so that in the event that
the node sees the same parameter values again, it can just return the
cached tuples instead of rescanning the inner side of the join all over
again. Internally, result cache uses a hash table in order to quickly
find tuples that have been previously cached.
For certain data sets, this can significantly improve the performance of
joins. The best cases for using this new node type are for join problems
where a large portion of the tuples from the inner side of the join have
no join partner on the outer side of the join. In such cases, hash join
would have to hash values that are never looked up, thus bloating the hash
table and possibly causing it to multi-batch. Merge joins would have to
skip over all of the unmatched rows. If we use a nested loop join with a
result cache, then we only cache tuples that have at least one join
partner on the outer side of the join. The benefits of using a
parameterized nested loop with a result cache increase when there are
fewer distinct values being looked up and the number of lookups of each
value is large. Also, hash probes to lookup the cache can be much faster
than the hash probe in a hash join as it's common that the result cache's
hash table is much smaller than the hash join's due to result cache only
caching useful tuples rather than all tuples from the inner side of the
join. This variation in hash probe performance is more significant when
the hash join's hash table no longer fits into the CPU's L3 cache, but the
result cache's hash table does. The apparent "random" access of hash
buckets with each hash probe can cause a poor L3 cache hit ratio for large
hash tables. Smaller hash tables generally perform better.
The hash table used for the cache limits itself to not exceeding work_mem
* hash_mem_multiplier in size. We maintain a dlist of keys for this cache
and when we're adding new tuples and realize we've exceeded the memory
budget, we evict cache entries starting with the least recently used ones
until we have enough memory to add the new tuples to the cache.
For parameterized nested loop joins, we now consider using one of these
result cache nodes in between the nested loop node and its inner node. We
determine when this might be useful based on cost, which is primarily
driven off of what the expected cache hit ratio will be. Estimating the
cache hit ratio relies on having good distinct estimates on the nested
loop's parameters.
For now, the planner will only consider using a result cache for
parameterized nested loop joins. This works for both normal joins and
also for LATERAL type joins to subqueries. It is possible to use this new
node for other uses in the future. For example, to cache results from
correlated subqueries. However, that's not done here due to some
difficulties obtaining a distinct estimation on the outer plan to
calculate the estimated cache hit ratio. Currently we plan the inner plan
before planning the outer plan so there is no good way to know if a result
cache would be useful or not since we can't estimate the number of times
the subplan will be called until the outer plan is generated.
The functionality being added here is newly introducing a dependency on
the return value of estimate_num_groups() during the join search.
Previously, during the join search, we only ever needed to perform
selectivity estimations. With this commit, we need to use
estimate_num_groups() in order to estimate what the hit ratio on the
result cache will be. In simple terms, if we expect 10 distinct values
and we expect 1000 outer rows, then we'll estimate the hit ratio to be
99%. Since cache hits are very cheap compared to scanning the underlying
nodes on the inner side of the nested loop join, then this will
significantly reduce the planner's cost for the join. However, it's
fairly easy to see here that things will go bad when estimate_num_groups()
incorrectly returns a value that's significantly lower than the actual
number of distinct values. If this happens then that may cause us to make
use of a nested loop join with a result cache instead of some other join
type, such as a merge or hash join. Our distinct estimations have been
known to be a source of trouble in the past, so the extra reliance on them
here could cause the planner to choose slower plans than it did previous
to having this feature. Distinct estimations are also fairly hard to
estimate accurately when several tables have been joined already or when a
WHERE clause filters out a set of values that are correlated to the
expressions we're estimating the number of distinct value for.
For now, the costing we perform during query planning for result caches
does put quite a bit of faith in the distinct estimations being accurate.
When these are accurate then we should generally see faster execution
times for plans containing a result cache. However, in the real world, we
may find that we need to either change the costings to put less trust in
the distinct estimations being accurate or perhaps even disable this
feature by default. There's always an element of risk when we teach the
query planner to do new tricks that it decides to use that new trick at
the wrong time and causes a regression. Users may opt to get the old
behavior by turning the feature off using the enable_resultcache GUC.
Currently, this is enabled by default. It remains to be seen if we'll
maintain that setting for the release.
Additionally, the name "Result Cache" is the best name I could think of
for this new node at the time I started writing the patch. Nobody seems
to strongly dislike the name. A few people did suggest other names but no
other name seemed to dominate in the brief discussion that there was about
names. Let's allow the beta period to see if the current name pleases
enough people. If there's some consensus on a better name, then we can
change it before the release. Please see the 2nd discussion link below
for the discussion on the "Result Cache" name.
Author: David Rowley
Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu, Hou Zhijie
Tested-By: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
The existing convention in SP-GiST is that any pass-by-value datatype
is stored in Datum representation, i.e. it's of width sizeof(Datum)
even when typlen is less than that. This is okay, or at least it's
too late to change it, for prefix datums and node-label datums in inner
(upper) tuples. But it's problematic for leaf datums, because we'd
prefer those to be stored in Postgres' standard on-disk representation
so that we can easily extend leaf tuples to carry additional "included"
columns.
I believe, however, that we can get away with just up and changing that.
This would be an unacceptable on-disk-format break, but there are two
big mitigating factors:
1. It seems quite unlikely that there are any SP-GiST opclasses out
there that use pass-by-value leaf datatypes. Certainly none of the
ones in core do, nor has codesearch.debian.net heard of any. Given
what SP-GiST is good for, it's hard to conceive of a use-case where
the leaf-level values would be both small and fixed-width. (As an
example, if you wanted to index text values with the leaf level being
just a byte, then every text string would have to be represented
with one level of inner tuple per preceding byte, which would be
horrendously space-inefficient and slow to access. You always want
to use as few inner-tuple levels as possible, leaving as much as
possible in the leaf values.)
2. Even granting that you have such an index, this change only
breaks things on big-endian machines. On little-endian, the high
order bytes of the Datum format will now just appear to be alignment
padding space.
So, change the code to store pass-by-value leaf datums in their
usual on-disk form. Inner-tuple datums are not touched.
This is extracted from a larger patch that intends to add support for
"included" columns. I'm committing it separately for visibility in
our commit logs.
Pavel Borisov and Tom Lane, reviewed by Andrey Borodin
Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
This gives a small performance gain, by reducing the number of calls
to the conversion/verification function, and letting it work with
larger inputs. Also, reorganizing the input pipeline makes it easier
to parallelize the input parsing: after the input has been converted
to the database encoding, the next stage of finding the newlines can
be done in parallel, because there cannot be any newline chars
"embedded" in multi-byte characters in the encodings that we support
as server encodings.
This changes behavior in one corner case: if client and server
encodings are the same single-byte encoding (e.g. latin1), previously
the input would not be checked for zero bytes ('\0'). Any fields
containing zero bytes would be truncated at the zero. But if encoding
conversion was needed, the conversion routine would throw an error on
the zero. After this commit, the input is always checked for zeros.
Reviewed-by: John Naylor
Discussion: https://www.postgresql.org/message-id/e7861509-3960-538a-9025-b75a61188e01%40iki.fi
With the 'noError' argument, you can try to convert a buffer without
knowing the character boundaries beforehand. The functions now need to
return the number of input bytes successfully converted.
This is is a backwards-incompatible change, if you have created a custom
encoding conversion with CREATE CONVERSION. This adds a check to
pg_upgrade for that, refusing the upgrade if there are any user-defined
encoding conversions. Custom conversions are very rare, there are no
commonly used extensions that I know of that uses that feature. No other
objects can depend on conversions, so if you do have one, you can fairly
easily drop it before upgrading, and recreate it after the upgrade with
an updated version.
Add regression tests for built-in encoding conversions. This doesn't cover
every conversion, but it covers all the internal functions in conv.c that
are used to implement the conversions.
Reviewed-by: John Naylor
Discussion: https://www.postgresql.org/message-id/e7861509-3960-538a-9025-b75a61188e01%40iki.fi
This removes "Add Result Cache executor node". It seems that something
weird is going on with the tracking of cache hits and misses as
highlighted by many buildfarm animals. It's not yet clear what the
problem is as other parts of the plan indicate that the cache did work
correctly, it's just the hits and misses that were being reported as 0.
This is especially a bad time to have the buildfarm so broken, so
reverting before too many more animals go red.
Discussion: https://postgr.es/m/CAApHDvq_hydhfovm4=izgWs+C5HqEeRScjMbOgbpC-jRAeK3Yw@mail.gmail.com
Here we add a new executor node type named "Result Cache". The planner
can include this node type in the plan to have the executor cache the
results from the inner side of parameterized nested loop joins. This
allows caching of tuples for sets of parameters so that in the event that
the node sees the same parameter values again, it can just return the
cached tuples instead of rescanning the inner side of the join all over
again. Internally, result cache uses a hash table in order to quickly
find tuples that have been previously cached.
For certain data sets, this can significantly improve the performance of
joins. The best cases for using this new node type are for join problems
where a large portion of the tuples from the inner side of the join have
no join partner on the outer side of the join. In such cases, hash join
would have to hash values that are never looked up, thus bloating the hash
table and possibly causing it to multi-batch. Merge joins would have to
skip over all of the unmatched rows. If we use a nested loop join with a
result cache, then we only cache tuples that have at least one join
partner on the outer side of the join. The benefits of using a
parameterized nested loop with a result cache increase when there are
fewer distinct values being looked up and the number of lookups of each
value is large. Also, hash probes to lookup the cache can be much faster
than the hash probe in a hash join as it's common that the result cache's
hash table is much smaller than the hash join's due to result cache only
caching useful tuples rather than all tuples from the inner side of the
join. This variation in hash probe performance is more significant when
the hash join's hash table no longer fits into the CPU's L3 cache, but the
result cache's hash table does. The apparent "random" access of hash
buckets with each hash probe can cause a poor L3 cache hit ratio for large
hash tables. Smaller hash tables generally perform better.
The hash table used for the cache limits itself to not exceeding work_mem
* hash_mem_multiplier in size. We maintain a dlist of keys for this cache
and when we're adding new tuples and realize we've exceeded the memory
budget, we evict cache entries starting with the least recently used ones
until we have enough memory to add the new tuples to the cache.
For parameterized nested loop joins, we now consider using one of these
result cache nodes in between the nested loop node and its inner node. We
determine when this might be useful based on cost, which is primarily
driven off of what the expected cache hit ratio will be. Estimating the
cache hit ratio relies on having good distinct estimates on the nested
loop's parameters.
For now, the planner will only consider using a result cache for
parameterized nested loop joins. This works for both normal joins and
also for LATERAL type joins to subqueries. It is possible to use this new
node for other uses in the future. For example, to cache results from
correlated subqueries. However, that's not done here due to some
difficulties obtaining a distinct estimation on the outer plan to
calculate the estimated cache hit ratio. Currently we plan the inner plan
before planning the outer plan so there is no good way to know if a result
cache would be useful or not since we can't estimate the number of times
the subplan will be called until the outer plan is generated.
The functionality being added here is newly introducing a dependency on
the return value of estimate_num_groups() during the join search.
Previously, during the join search, we only ever needed to perform
selectivity estimations. With this commit, we need to use
estimate_num_groups() in order to estimate what the hit ratio on the
result cache will be. In simple terms, if we expect 10 distinct values
and we expect 1000 outer rows, then we'll estimate the hit ratio to be
99%. Since cache hits are very cheap compared to scanning the underlying
nodes on the inner side of the nested loop join, then this will
significantly reduce the planner's cost for the join. However, it's
fairly easy to see here that things will go bad when estimate_num_groups()
incorrectly returns a value that's significantly lower than the actual
number of distinct values. If this happens then that may cause us to make
use of a nested loop join with a result cache instead of some other join
type, such as a merge or hash join. Our distinct estimations have been
known to be a source of trouble in the past, so the extra reliance on them
here could cause the planner to choose slower plans than it did previous
to having this feature. Distinct estimations are also fairly hard to
estimate accurately when several tables have been joined already or when a
WHERE clause filters out a set of values that are correlated to the
expressions we're estimating the number of distinct value for.
For now, the costing we perform during query planning for result caches
does put quite a bit of faith in the distinct estimations being accurate.
When these are accurate then we should generally see faster execution
times for plans containing a result cache. However, in the real world, we
may find that we need to either change the costings to put less trust in
the distinct estimations being accurate or perhaps even disable this
feature by default. There's always an element of risk when we teach the
query planner to do new tricks that it decides to use that new trick at
the wrong time and causes a regression. Users may opt to get the old
behavior by turning the feature off using the enable_resultcache GUC.
Currently, this is enabled by default. It remains to be seen if we'll
maintain that setting for the release.
Additionally, the name "Result Cache" is the best name I could think of
for this new node at the time I started writing the patch. Nobody seems
to strongly dislike the name. A few people did suggest other names but no
other name seemed to dominate in the brief discussion that there was about
names. Let's allow the beta period to see if the current name pleases
enough people. If there's some consensus on a better name, then we can
change it before the release. Please see the 2nd discussion link below
for the discussion on the "Result Cache" name.
Author: David Rowley
Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu
Tested-By: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
According to the comments, when an invalid or dropped column oid is passed
to has_column_privilege(), the intention has always been to return NULL.
However, when the caller had table level privilege the invalid/missing
column was never discovered, because table permissions were checked first.
Fix that by introducing extended versions of pg_attribute_acl(check|mask)
and pg_class_acl(check|mask) which take a new argument, is_missing. When
is_missing is NULL, the old behavior is preserved. But when is_missing is
passed by the caller, no ERROR is thrown for dropped or missing
columns/relations, and is_missing is flipped to true. This in turn allows
has_column_privilege to check for column privileges first, providing the
desired semantics.
Not backpatched since it is a user visible behavioral change with no previous
complaints, and the fix is a bit on the invasive side.
Author: Joe Conway
Reviewed-By: Tom Lane
Reported by: Ian Barwick
Discussion: https://postgr.es/m/flat/9b5f4311-157b-4164-7fe7-077b4fe8ed84%40joeconway.com