postgresql/src/test/isolation
Alvaro Herrera 20b6552242 Fix freezing of a dead HOT-updated tuple
Vacuum calls page-level HOT prune to remove dead HOT tuples before doing
liveness checks (HeapTupleSatisfiesVacuum) on the remaining tuples.  But
concurrent transaction commit/abort may turn DEAD some of the HOT tuples
that survived the prune, before HeapTupleSatisfiesVacuum tests them.
This happens to activate the code that decides to freeze the tuple ...
which resuscitates it, duplicating data.

(This is especially bad if there's any unique constraints, because those
are now internally violated due to the duplicate entries, though you
won't know until you try to REINDEX or dump/restore the table.)

One possible fix would be to simply skip doing anything to the tuple,
and hope that the next HOT prune would remove it.  But there is a
problem: if the tuple is older than freeze horizon, this would leave an
unfrozen XID behind, and if no HOT prune happens to clean it up before
the containing pg_clog segment is truncated away, it'd later cause an
error when the XID is looked up.

Fix the problem by having the tuple freezing routines cope with the
situation: don't freeze the tuple (and keep it dead).  In the cases that
the XID is older than the freeze age, set the HEAP_XMAX_COMMITTED flag
so that there is no need to look up the XID in pg_clog later on.

An isolation test is included, authored by Michael Paquier, loosely
based on Daniel Wood's original reproducer.  It only tests one
particular scenario, though, not all the possible ways for this problem
to surface; it be good to have a more reliable way to test this more
fully, but it'd require more work.
In message https://postgr.es/m/20170911140103.5akxptyrwgpc25bw@alvherre.pgsql
I outlined another test case (more closely matching Dan Wood's) that
exposed a few more ways for the problem to occur.

Backpatch all the way back to 9.3, where this problem was introduced by
multixact juggling.  In branches 9.3 and 9.4, this includes a backpatch
of commit e5ff9fefcd50 (of 9.5 era), since the original is not
correctable without matching the coding pattern in 9.5 up.

Reported-by: Daniel Wood
Diagnosed-by: Daniel Wood
Reviewed-by: Yi Wen Wong, Michaël Paquier
Discussion: https://postgr.es/m/E5711E62-8FDF-4DCA-A888-C200BF6B5742@amazon.com
2017-09-28 16:44:01 +02:00
..
expected Fix freezing of a dead HOT-updated tuple 2017-09-28 16:44:01 +02:00
specs Fix freezing of a dead HOT-updated tuple 2017-09-28 16:44:01 +02:00
.gitignore Move isolationtester's is-blocked query into C code for speed. 2017-04-10 10:26:54 -04:00
isolation_main.c Fix inclusions of postgres_fe.h from .h files. 2017-03-08 20:41:06 -05:00
isolation_schedule Fix freezing of a dead HOT-updated tuple 2017-09-28 16:44:01 +02:00
isolationtester.c Phase 3 of pgindent updates. 2017-06-21 15:35:54 -04:00
isolationtester.h Phase 2 of pgindent updates. 2017-06-21 15:19:25 -04:00
Makefile Improve isolation tests infrastructure. 2017-03-14 15:56:17 -07:00
README Modify the isolation tester so that multiple sessions can wait. 2016-02-11 08:36:30 -05:00
specparse.y Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
specscanner.l Remove unnecessary parentheses in return statements 2017-09-05 14:52:55 -04:00

src/test/isolation/README

Isolation tests
===============

This directory contains a set of tests for concurrent behaviors in
PostgreSQL.  These tests require running multiple interacting transactions,
which requires management of multiple concurrent connections, and therefore
can't be tested using the normal pg_regress program.  The name "isolation"
comes from the fact that the original motivation was to test the
serializable isolation level; but tests for other sorts of concurrent
behaviors have been added as well.

To run the tests, you need to have a server running at the default port
expected by libpq.  (You can set PGPORT and so forth in your environment
to control this.)  Then run
    make installcheck
To run just specific test(s), you can do something like
    ./pg_isolation_regress fk-contention fk-deadlock
(look into the specs/ subdirectory to see the available tests).

The prepared-transactions test requires the server's
max_prepared_transactions parameter to be set to at least 3; therefore it
is not run by default.  To include it in the test run, use
    make installcheck-prepared-txns

To define tests with overlapping transactions, we use test specification
files with a custom syntax, which is described in the next section.  To add
a new test, place a spec file in the specs/ subdirectory, add the expected
output in the expected/ subdirectory, and add the test's name to the
isolation_schedule file.

isolationtester is a program that uses libpq to open multiple connections,
and executes a test specified by a spec file. A libpq connection string
specifies the server and database to connect to; defaults derived from
environment variables are used otherwise.

pg_isolation_regress is a tool similar to pg_regress, but instead of using
psql to execute a test, it uses isolationtester.  It accepts all the same
command-line arguments as pg_regress.


Test specification
==================

Each isolation test is defined by a specification file, stored in the specs
subdirectory. A test specification consists of four parts, in this order:

setup { <SQL> }

  The given SQL block is executed once, in one session only, before running
  the test.  Create any test tables or other required objects here.  This
  part is optional.  Multiple setup blocks are allowed if needed; each is
  run separately, in the given order.  (The reason for allowing multiple
  setup blocks is that each block is run as a single PQexec submission,
  and some statements such as VACUUM cannot be combined with others in such
  a block.)

teardown { <SQL> }

  The teardown SQL block is executed once after the test is finished. Use
  this to clean up in preparation for the next permutation, e.g dropping
  any test tables created by setup. This part is optional.

session "<name>"

  There are normally several "session" parts in a spec file. Each
  session is executed in its own connection. A session part consists
  of three parts: setup, teardown and one or more "steps". The per-session
  setup and teardown parts have the same syntax as the per-test setup and
  teardown described above, but they are executed in each session. The
  setup part typically contains a "BEGIN" command to begin a transaction.

  Each step has the syntax

  step "<name>" { <SQL> }

  where <name> is a name identifying this step, and SQL is a SQL statement
  (or statements, separated by semicolons) that is executed in the step.
  Step names must be unique across the whole spec file.

permutation "<step name>" ...

  A permutation line specifies a list of steps that are run in that order.
  Any number of permutation lines can appear.  If no permutation lines are
  given, the test program automatically generates all possible orderings
  of the steps from each session (running the steps of any one session in
  order).  Note that the list of steps in a manually specified
  "permutation" line doesn't actually have to be a permutation of the
  available steps; it could for instance repeat some steps more than once,
  or leave others out.

Lines beginning with a # are considered comments.

For each permutation of the session steps (whether these are manually
specified in the spec file, or automatically generated), the isolation
tester runs the main setup part, then per-session setup parts, then
the selected session steps, then per-session teardown, then the main
teardown script.  Each selected step is sent to the connection associated
with its session.


Support for blocking commands
=============================

Each step may contain commands that block until further action has been taken
(most likely, some other session runs a step that unblocks it or causes a
deadlock).  A test that uses this ability must manually specify valid
permutations, i.e. those that would not expect a blocked session to execute a
command.  If a test fails to follow that rule, isolationtester will cancel it
after 60 seconds.  If the cancel doesn't work, isolationtester will exit
uncleanly after a total of 75 seconds of wait time.  Testing invalid
permutations should be avoided because they can make the isolation tests take
a very long time to run, and they serve no useful testing purpose.

Note that isolationtester recognizes that a command has blocked by looking
to see if it is shown as waiting in the pg_locks view; therefore, only
blocks on heavyweight locks will be detected.