mirror of
https://github.com/postgres/postgres.git
synced 2026-05-28 04:35:45 -04:00
Fix procLatch ownership race in ProcKill()
DisownLatch() was executed after the PGPROC entry of the process terminated is pushed back into a freelist. A newly-forked backend that recycles the slot could call OwnLatch() and PANIC with a "latch already owned by PID", taking down the server. There were two scenarios related to lock groups where this issue could be reached: * A follower pushes the leader's PGPROC back to the freelist while the leader has not yet called DisownLatch() in its own ProcKill(). * A leader outliving all its followers pushes its own PGPROC onto the freelist before reaching DisownLatch(), which would be the most common scenario. This issue is fixed by calling SwitchBackToLocalLatch() and DisownLatch() at an earlier phase of ProcKill(), before any freelist manipulation happens, so that the slot of the backend terminated is never exposed as owning a latch. Note that pgstat_reset_wait_event_storage() is kept at a later stage. An upcoming commit will take advantage of that by introducing a test able to check the original PANIC scenario. Author: Vlad Lesin <vladlesin@gmail.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/d2983796-2603-41b7-a66e-fc8489ddb954@gmail.com Backpatch-through: 14
This commit is contained in:
parent
5631045231
commit
84b9d6bcea
1 changed files with 19 additions and 11 deletions
|
|
@ -963,6 +963,24 @@ ProcKill(int code, Datum arg)
|
|||
/* Cancel any pending condition variable sleep, too */
|
||||
ConditionVariableCancelSleep();
|
||||
|
||||
/*
|
||||
* Reset MyLatch to the process local one and disown the shared latch, so
|
||||
* that signal handlers et al can continue using the latch after the
|
||||
* shared latch isn't ours anymore.
|
||||
*
|
||||
* DisownLatch() must happen before our PGPROC can appear on a freelist: a
|
||||
* newly-forked backend that pops our slot and calls OwnLatch() would
|
||||
* PANIC on a still-owned latch.
|
||||
*
|
||||
* pgstat_reset_wait_event_storage() is intentionally deferred until after
|
||||
* the lock-group block so that wait_event_info remains visible in our
|
||||
* PGPROC slot while we may be observed there. It is safe to defer
|
||||
* because our slot is not yet on any freelist at this point, and useful
|
||||
* for testing purposes.
|
||||
*/
|
||||
SwitchBackToLocalLatch();
|
||||
DisownLatch(&MyProc->procLatch);
|
||||
|
||||
proc = MyProc;
|
||||
procgloballist = proc->procgloballist;
|
||||
|
||||
|
|
@ -1019,21 +1037,11 @@ ProcKill(int code, Datum arg)
|
|||
LWLockRelease(leader_lwlock);
|
||||
}
|
||||
|
||||
/*
|
||||
* Reset MyLatch to the process local one. This is so that signal
|
||||
* handlers et al can continue using the latch after the shared latch
|
||||
* isn't ours anymore.
|
||||
*
|
||||
* Similarly, stop reporting wait events to MyProc->wait_event_info.
|
||||
*
|
||||
* After that clear MyProc and disown the shared latch.
|
||||
*/
|
||||
SwitchBackToLocalLatch();
|
||||
/* See comment above, close to DisownLatch() */
|
||||
pgstat_reset_wait_event_storage();
|
||||
|
||||
MyProc = NULL;
|
||||
MyProcNumber = INVALID_PROC_NUMBER;
|
||||
DisownLatch(&proc->procLatch);
|
||||
|
||||
/* Mark the proc no longer in use */
|
||||
proc->pid = 0;
|
||||
|
|
|
|||
Loading…
Reference in a new issue