postgresql

mirror of https://github.com/postgres/postgres.git synced 2026-03-14 06:32:18 -04:00

Author	SHA1	Message	Date
Tom Lane	bc6b03bb81	On Windows, ensure shared memory handle gets closed if not being used. Postmaster child processes that aren't supposed to be attached to shared memory were not bothering to close the shared memory mapping handle they inherit from the postmaster process. That's mostly harmless, since the handle vanishes anyway when the child process exits -- but the syslogger process, if used, doesn't get killed and restarted during recovery from a backend crash. That meant that Windows doesn't see the shared memory mapping as becoming free, so it doesn't delete it and the postmaster is unable to create a new one, resulting in failure to recover from crashes whenever logging_collector is turned on. Per report from Dmitry Vasilyev. It's a bit astonishing that we'd not figured this out long ago, since it's been broken from the very beginnings of out native Windows support; probably some previously-unexplained trouble reports trace to this. A secondary problem is that on Cygwin (perhaps only in older versions?), exec() may not detach from the shared memory segment after all, in which case these child processes did remain attached to shared memory, posing the risk of an unexpected shared memory clobber if they went off the rails somehow. That may be a long-gone bug, but we can deal with it now if it's still live, by detaching within the infrastructure introduced here to deal with closing the handle. Back-patch to all supported branches. Tom Lane and Amit Kapila	2015-10-13 11:21:33 -04:00
Tom Lane	c355df54e7	Fix s_lock.h PPC assembly code to be compatible with native AIX assembler. On recent AIX it's necessary to configure gcc to use the native assembler (because the GNU assembler hasn't been updated to handle AIX 6+). This caused PG builds to fail with assembler syntax errors, because we'd try to compile s_lock.h's gcc asm fragment for PPC, and that assembly code relied on GNU-style local labels. We can't substitute normal labels because it would fail in any file containing more than one inlined use of tas(). Fortunately, that code is stable enough, and the PPC ISA is simple enough, that it doesn't seem like too much of a maintenance burden to just hand-code the branch offsets, removing the need for any labels. Note that the AIX assembler only accepts "$" for the location counter pseudo-symbol. The usual GNU convention is "."; but it appears that all versions of gas for PPC also accept "$", so in theory this patch will not break any other PPC platforms. This has been reported by a few people, but Steve Underwood gets the credit for being the first to pursue the problem far enough to understand why it was failing. Thanks also to Noah Misch for additional testing.	2015-08-29 16:09:25 -04:00
Tom Lane	9cd3a0fc5a	Accept alternate spellings of __sparcv7 and __sparcv8. Apparently some versions of gcc prefer __sparc_v7__ and __sparc_v8__. Per report from Waldemar Brodkorb.	2015-08-10 17:34:51 -04:00
Tom Lane	81f3d3b7c3	Fix fsync-at-startup code to not treat errors as fatal. Commit `2ce439f337` introduced a rather serious regression, namely that if its scan of the data directory came across any un-fsync-able files, it would fail and thereby prevent database startup. Worse yet, symlinks to such files also caused the problem, which meant that crash restart was guaranteed to fail on certain common installations such as older Debian. After discussion, we agreed that (1) failure to start is worse than any consequence of not fsync'ing is likely to be, therefore treat all errors in this code as nonfatal; (2) we should not chase symlinks other than those that are expected to exist, namely pg_xlog/ and tablespace links under pg_tblspc/. The latter restriction avoids possibly fsync'ing a much larger part of the filesystem than intended, if the user has left random symlinks hanging about in the data directory. This commit takes care of that and also does some code beautification, mainly moving the relevant code into fd.c, which seems a much better place for it than xlog.c, and making sure that the conditional compilation for the pre_sync_fname pass has something to do with whether pg_flush_data works. I also relocated the call site in xlog.c down a few lines; it seems a bit silly to be doing this before ValidateXLOGDirectoryStructure(). The similar logic in initdb.c ought to be made to match this, but that change is noncritical and will be dealt with separately. Back-patch to all active branches, like the prior commit. Abhijit Menon-Sen and Tom Lane	2015-05-28 17:33:03 -04:00
Robert Haas	14de825dee	Recursively fsync() the data directory after a crash. Otherwise, if there's another crash, some writes from after the first crash might make it to disk while writes from before the crash fail to make it to disk. This could lead to data corruption. Back-patch to all supported versions. Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised by me.	2015-05-04 12:27:55 -04:00
Heikki Linnakangas	6ef5d894a4	Fix broken #ifdef for __sparcv8 Rob Rowan. Backpatch to all supported versions, like the patch that added the broken #ifdef.	2015-02-13 23:57:05 +02:00
Andres Freund	c7299d32f6	Backport "Expose fsync_fname as a public API". Backport commit `cc52d5b33f` back to 9.1 to allow backpatching some unlogged table fixes that use fsync_fname.	2014-11-15 01:20:29 +01:00
Heikki Linnakangas	861d3aa43d	Fix race condition between hot standby and restoring a full-page image. There was a window in RestoreBackupBlock where a page would be zeroed out, but not yet locked. If a backend pinned and locked the page in that window, it saw the zeroed page instead of the old page or new page contents, which could lead to missing rows in a result set, or errors. To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins, zeroes, and locks the page, if it's not in the buffer cache already. In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE, to avoid breaking any 3rd party extensions that might use RBM_ZERO. More importantly, this avoids renumbering the other enum values, which would cause even bigger confusion in extensions that use ReadBufferExtended, but haven't been recompiled. Backpatch to all supported versions; this has been racy since hot standby was introduced.	2014-11-13 20:01:55 +02:00
Andres Freund	855cabb6f5	Mark x86's memory barrier inline assembly as clobbering the cpu flags. x86's memory barrier assembly was marked as clobbering "memory" but not "cc" even though 'addl' sets various flags. As it turns out gcc on x86 implicitly assumes "cc" on every inline assembler statement, so it's not a bug. But as that's poorly documented and might get copied to architectures or compilers where that's not the case, it seems better to be precise. Discussion: 20140919100016.GH4277@alap3.anarazel.de To keep the code common, backpatch to 9.2 where explicit memory barriers were introduced.	2014-09-19 17:13:52 +02:00
Andres Freund	bca91236a7	Fix typo in solaris spinlock fix. `07968dbfaa` missed part of the S_UNLOCK define when building for sparcv8+.	2014-09-09 23:37:33 +02:00
Andres Freund	27ef6b6530	Fix spinlock implementation for some !solaris sparc platforms. Some Sparc CPUs can be run in various coherence models, ranging from RMO (relaxed) over PSO (partial) to TSO (total). Solaris has always run CPUs in TSO mode while in userland, but linux didn't use to and the various *BSDs still don't. Unfortunately the sparc TAS/S_UNLOCK were only correct under TSO. Fix that by adding the necessary memory barrier instructions. On sparcv8+, which should be all relevant CPUs, these are treated as NOPs if the current consistency model doesn't require the barriers. Discussion: 20140630222854.GW26930@awork2.anarazel.de Will be backpatched to all released branches once a few buildfarm cycles haven't shown up problems. As I've no access to sparc, this is blindly written.	2014-09-09 23:37:33 +02:00
Bruce Momjian	04e15c69d2	Remove tabs after spaces in C comments This was not changed in HEAD, but will be done later as part of a pgindent run. Future pgindent runs will also do this. Report by Tom Lane Backpatch through all supported branches, but not HEAD	2014-05-06 11:26:28 -04:00
Heikki Linnakangas	4f91af8ca2	Fix dangling smgr_owner pointer when a fake relcache entry is freed. A fake relcache entry can "own" a SmgrRelation object, like a regular relcache entry. But when it was free'd, the owner field in SmgrRelation was not cleared, so it was left pointing to free'd memory. Amazingly this apparently hasn't caused crashes in practice, or we would've heard about it earlier. Andres found this with Valgrind. Report and fix by Andres Freund, with minor modifications by me. Backpatch to all supported versions.	2014-03-07 13:29:24 +02:00
Tom Lane	ebde6c4014	Fix multiple bugs in index page locking during hot-standby WAL replay. In ordinary operation, VACUUM must be careful to take a cleanup lock on each leaf page of a btree index; this ensures that no indexscans could still be "in flight" to heap tuples due to be deleted. (Because of possible index-tuple motion due to concurrent page splits, it's not enough to lock only the pages we're deleting index tuples from.) In Hot Standby, the WAL replay process must likewise lock every leaf page. There were several bugs in the code for that: * The replay scan might come across unused, all-zero pages in the index. While btree_xlog_vacuum itself did the right thing (ie, nothing) with such pages, xlogutils.c supposed that such pages must be corrupt and would throw an error. This accounts for various reports of replication failures with "PANIC: WAL contains references to invalid pages". To fix, add a ReadBufferMode value that instructs XLogReadBufferExtended not to complain when we're doing this. * btree_xlog_vacuum performed the extra locking if standbyState == STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up for hot standby queries until the database has reached consistency, and we don't want to do the extra locking till then either, for fear of reading corrupted pages (which bufmgr.c would complain about). Fix by exporting a new function from xlog.c that will report whether we're actually in hot standby replay mode. * To ensure full coverage of the index in the replay scan, btvacuumscan would emit a dummy WAL record for the last page of the index, if no vacuuming work had been done on that page. However, if the last page of the index is all-zero, that would result in corruption of said page, since the functions called on it weren't prepared to handle that case. There's no need to lock any such pages, so change the logic to target the last normal leaf page instead. The first two of these bugs were diagnosed by Andres Freund, the other one by me. Fixes based on ideas from Heikki Linnakangas and myself. This has been wrong since Hot Standby was introduced, so back-patch to 9.0.	2014-01-14 17:34:51 -05:00
Tom Lane	d32e8387ba	Fix stale-pointer problem in fast-path locking logic. When acquiring a lock in fast-path mode, we must reset the locallock object's lock and proclock fields to NULL. They are not necessarily that way to start with, because the locallock could be left over from a failed lock acquisition attempt earlier in the transaction. Failure to do this led to all sorts of interesting misbehaviors when LockRelease tried to clean up no-longer-related lock and proclock objects in shared memory. Per report from Dan Wood. In passing, modify LockRelease to elog not just Assert if it doesn't find lock and proclock objects for a formerly fast-path lock, matching the code in FastPathGetRelationLockEntry and LockRefindAndRelease. This isn't a bug but it will help in diagnosing any future bugs in this area. Also, modify FastPathTransferRelationLocks and FastPathGetRelationLockEntry to break out of their loops over the fastpath array once they've found the sole matching entry. This was inconsistently done in some search loops and not others. Improve assorted related comments, too. Back-patch to 9.2 where the fast-path mechanism was introduced.	2013-11-27 18:10:03 -05:00
Kevin Grittner	ef388b65a8	Eliminate xmin from hash tag for predicate locks on heap tuples. If a tuple was frozen while its predicate locks mattered, read-write dependencies could be missed, resulting in failure to detect conflicts which could lead to anomalies in committed serializable transactions. This field was added to the tag when we still thought that it was necessary to carry locks forward to a new version of an updated row. That was later proven to be unnecessary, which allowed simplification of the code, but elimination of xmin from the tag was missed at the time. Per report and analysis by Heikki Linnakangas. Backpatch to 9.1.	2013-10-07 14:26:54 -05:00
Alvaro Herrera	30b5b3b917	Update obsolete comment	2013-09-03 16:54:03 -04:00
Peter Eisentraut	95ffbce9a1	Fix cpluspluscheck in checksum code C++ is more picky about comparing signed and unsigned integers.	2013-06-30 19:34:14 -04:00
Simon Riggs	7b2c9d6829	Ensure no xid gaps during Hot Standby startup In some cases with higher numbers of subtransactions it was possible for us to incorrectly initialize subtrans leading to complaints of missing pages. Bug report by Sergey Konoplev Analysis and fix by Andres Freund	2013-06-23 11:09:24 +01:00
Jeff Davis	20723ce801	Add buffer_std flag to MarkBufferDirtyHint(). MarkBufferDirtyHint() writes WAL, and should know if it's got a standard buffer or not. Currently, the only callers where buffer_std is false are related to the FSM. In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive. Back-patch to 9.3.	2013-06-17 08:04:18 -07:00
Tom Lane	f04216341d	Refactor checksumming code to make it easier to use externally. pg_filedump and other external utility programs are likely to want to be able to check Postgres page checksums. To avoid messy duplication of code, move the checksumming functionality into an exported header file, much as we did awhile back for the CRC code. In passing, get rid of an unportable assumption that a static char[] array will be word-aligned, and do some other minor code beautification.	2013-06-13 22:35:56 -04:00
Tom Lane	5c7603c318	Add ARM64 (aarch64) support to s_lock.h. Use the same gcc atomic functions as we do on newer ARM chips. (Basically this is a copy and paste of the __arm__ code block, but omitting the SWPB option since that definitely won't work.) Back-patch to 9.2. The patch would work further back, but we'd also need to update config.guess/config.sub in older branches to make them build out-of-the-box, and there hasn't been demand for it. Mark Salter	2013-06-04 15:42:02 -04:00
Bruce Momjian	9af4159fce	pgindent run for release 9.3 This is the first run of the Perl-based pgindent script. Also update pgindent instructions.	2013-05-29 16:58:43 -04:00
Simon Riggs	443951748c	Record data_checksum_version in control file. The value is not used anywhere in code, but will allow future changes to the checksum version should that become necessary in the future.	2013-04-30 12:27:12 +01:00
Simon Riggs	43e7a66849	Introduce new page checksum algorithm and module. Isolate checksum calculation to its own module, so that bufpage knows little if anything about the details of the calculation. This implementation is a modified FNV-1a hash checksum, details of which are given in the new checksum.c header comments. Basic implementation only, so we fix the output value. Later related commits will add version numbers to pg_control, compiler optimization flags and memory barriers. Ants Aasma, reviewed by Jeff Davis and Simon Riggs	2013-04-29 09:05:27 +01:00
Simon Riggs	96ef3b8ff1	Allow I/O reliability checks using 16-bit checksums Checksums are set immediately prior to flush out of shared buffers and checked when pages are read in again. Hint bit setting will require full page write when block is dirtied, which causes various infrastructure changes. Extensive comments, docs and README. WARNING message thrown if checksum fails on non-all zeroes page; ERROR thrown but can be disabled with ignore_checksum_failure = on. Feature enabled by an initdb option, since transition from option off to option on is long and complex and has not yet been implemented. Default is not to use checksums. Checksum used is WAL CRC-32 truncated to 16-bits. Simon Riggs, Jeff Davis, Greg Smith Wide input and assistance from many community members. Thank you.	2013-03-22 13:54:07 +00:00
Simon Riggs	bb7cc2623f	Remove PageSetTLI and rename pd_tli to pd_checksum Remove use of PageSetTLI() from all page manipulation functions and adjust README to indicate change in the way we make changes to pages. Repurpose those bytes into the pd_checksum field and explain how that works in comments about page header. Refactoring ahead of actual feature patch which would make use of the checksum field, arriving later. Jeff Davis, with comments and doc changes by Simon Riggs Direction suggested by Robert Haas; many others providing review comments.	2013-03-18 13:46:42 +00:00
Tom Lane	d43837d030	Add lock_timeout configuration parameter. This GUC allows limiting the time spent waiting to acquire any one heavyweight lock. In support of this, improve the recently-added timeout infrastructure to permit efficiently enabling or disabling multiple timeouts at once. That reduces the performance hit from turning on lock_timeout, though it's still not zero. Zoltán Böszörményi, reviewed by Tom Lane, Stephen Frost, and Hari Babu	2013-03-16 23:22:57 -04:00
Heikki Linnakangas	3d009e45bd	Add support for piping COPY to/from an external program. This includes backend "COPY TO/FROM PROGRAM '...'" syntax, and corresponding psql \copy syntax. Like with reading/writing files, the backend version is superuser-only, and in the psql version, the program is run in the client. In the passing, the psql \copy STDIN/STDOUT syntax is subtly changed: if you the stdin/stdout is quoted, it's now interpreted as a filename. For example, "\copy foo from 'stdin'" now reads from a file called 'stdin', not from standard input. Before this, there was no way to specify a filename called stdin, stdout, pstdin or pstdout. This creates a new function in pgport, wait_result_to_str(), which can be used to convert the exit status of a process, as returned by wait(3), to a human-readable string. Etsuro Fujita, reviewed by Amit Kapila.	2013-02-27 18:22:31 +02:00
Peter Eisentraut	0cb1fac3b1	Add noreturn attributes to some error reporting functions	2013-02-12 07:13:22 -05:00
Alvaro Herrera	0ac5ad5134	Improve concurrency of foreign key locking This patch introduces two additional lock modes for tuples: "SELECT FOR KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each other, in contrast with already existing "SELECT FOR SHARE" and "SELECT FOR UPDATE". UPDATE commands that do not modify the values stored in the columns that are part of the key of the tuple now grab a SELECT FOR NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently with tuple locks of the FOR KEY SHARE variety. Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this means the concurrency improvement applies to them, which is the whole point of this patch. The added tuple lock semantics require some rejiggering of the multixact module, so that the locking level that each transaction is holding can be stored alongside its Xid. Also, multixacts now need to persist across server restarts and crashes, because they can now represent not only tuple locks, but also tuple updates. This means we need more careful tracking of lifetime of pg_multixact SLRU files; since they now persist longer, we require more infrastructure to figure out when they can be removed. pg_upgrade also needs to be careful to copy pg_multixact files over from the old server to the new, or at least part of multixact.c state, depending on the versions of the old and new servers. Tuple time qualification rules (HeapTupleSatisfies routines) need to be careful not to consider tuples with the "is multi" infomask bit set as being only locked; they might need to look up MultiXact values (i.e. possibly do pg_multixact I/O) to find out the Xid that updated a tuple, whereas they previously were assured to only use information readily available from the tuple header. This is considered acceptable, because the extra I/O would involve cases that would previously cause some commands to block waiting for concurrent transactions to finish. Another important change is the fact that locking tuples that have previously been updated causes the future versions to be marked as locked, too; this is essential for correctness of foreign key checks. This causes additional WAL-logging, also (there was previously a single WAL record for a locked tuple; now there are as many as updated copies of the tuple there exist.) With all this in place, contention related to tuples being checked by foreign key rules should be much reduced. As a bonus, the old behavior that a subtransaction grabbing a stronger tuple lock than the parent (sub)transaction held on a given tuple and later aborting caused the weaker lock to be lost, has been fixed. Many new spec files were added for isolation tester framework, to ensure overall behavior is sane. There's probably room for several more tests. There were several reviewers of this patch; in particular, Noah Misch and Andres Freund spent considerable time in it. Original idea for the patch came from Simon Riggs, after a problem report by Joel Jacobson. Most code is from me, with contributions from Marti Raudsepp, Alexander Shulgin, Noah Misch and Andres Freund. This patch was discussed in several pgsql-hackers threads; the most important start at the following message-ids: AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com 1290721684-sup-3951@alvh.no-ip.org 1294953201-sup-2099@alvh.no-ip.org 1320343602-sup-2290@alvh.no-ip.org 1339690386-sup-8927@alvh.no-ip.org 4FE5FF020200002500048A3D@gw.wicourts.gov 4FEAB90A0200002500048B7D@gw.wicourts.gov	2013-01-23 12:04:59 -03:00
Alvaro Herrera	279628a0a7	Accelerate end-of-transaction dropping of relations When relations are dropped, at end of transaction we need to remove the files and clean the buffer pool of buffers containing pages of those relations. Previously we would scan the buffer pool once per relation to clean up buffers. When there are many relations to drop, the repeated scans make this process slow; so we now instead pass a list of relations to drop and scan the pool once, checking each buffer against the passed list. When the number of relations is larger than a threshold (which as of this patch is being set to 20 relations) we sort the array before starting, and bsearch the array; when it's smaller, we simply scan the array linearly each time, because that's faster. The exact optimal threshold value depends on many factors, but the difference is not likely to be significant enough to justify making it user-settable. This has been measured to be a significant win (a 15x win when dropping 100,000 relations; an extreme case, but reportedly a real one). Author: Tomas Vondra, some tweaks by me Reviewed by: Robert Haas, Shigeru Hanada, Andres Freund, Álvaro Herrera	2013-01-17 16:13:17 -03:00
Heikki Linnakangas	9ee4d06f3f	Make GiST indexes on-disk compatible with 9.2 again. The patch that turned XLogRecPtr into a uint64 inadvertently changed the on-disk format of GiST indexes, because the NSN field in the GiST page opaque is an XLogRecPtr. That breaks pg_upgrade. Revert the format of that field back to the two-field struct that XLogRecPtr was before. This is the same we did to LSNs in the page header to avoid changing on-disk format. Bump catversion, as this invalidates any existing GiST indexes built on 9.3devel.	2013-01-17 16:46:16 +02:00
Bruce Momjian	bd61a623ac	Update copyrights for 2013 Fully update git head, and update back branches in ./COPYRIGHT and legal.sgml files.	2013-01-01 17:15:01 -05:00
Kevin Grittner	b19e4250b4	Fix performance problems with autovacuum truncation in busy workloads. In situations where there are over 8MB of empty pages at the end of a table, the truncation work for trailing empty pages takes longer than deadlock_timeout, and there is frequent access to the table by processes other than autovacuum, there was a problem with the autovacuum worker process being canceled by the deadlock checking code. The truncation work done by autovacuum up that point was lost, and the attempt tried again by a later autovacuum worker. The attempts could continue indefinitely without making progress, consuming resources and blocking other processes for up to deadlock_timeout each time. This patch has the autovacuum worker checking whether it is blocking any other thread at 20ms intervals. If such a condition develops, the autovacuum worker will persist the work it has done so far, release its lock on the table, and sleep in 50ms intervals for up to 5 seconds, hoping to be able to re-acquire the lock and try again. If it is unable to get the lock in that time, it moves on and a worker will try to continue later from the point this one left off. While this patch doesn't change the rules about when and what to truncate, it does cause the truncation to occur sooner, with less blocking, and with the consumption of fewer resources when there is contention for the table's lock. The only user-visible change other than improved performance is that the table size during truncation may change incrementally instead of just once. This problem exists in all supported versions but is infrequently reported, although some reports of performance problems when autovacuum runs might be caused by this. Initial commit is just the master branch, but this should probably be backpatched once the build farm and general developer usage confirm that there are no surprising effects. Jan Wieck	2012-12-11 14:33:08 -06:00
Alvaro Herrera	da07a1e856	Background worker processes Background workers are postmaster subprocesses that run arbitrary user-specified code. They can request shared memory access as well as backend database connections; or they can just use plain libpq frontend database connections. Modules listed in shared_preload_libraries can register background workers in their _PG_init() function; this is early enough that it's not necessary to provide an extra GUC option, because the necessary extra resources can be allocated early on. Modules can install more than one bgworker, if necessary. Care is taken that these extra processes do not interfere with other postmaster tasks: only one such process is started on each ServerLoop iteration. This means a large number of them could be waiting to be started up and postmaster is still able to quickly service external connection requests. Also, shutdown sequence should not be impacted by a worker process that's reasonably well behaved (i.e. promptly responds to termination signals.) The current implementation lets worker processes specify their start time, i.e. at what point in the server startup process they are to be started: right after postmaster start (in which case they mustn't ask for shared memory access), when consistent state has been reached (useful during recovery in a HOT standby server), or when recovery has terminated (i.e. when normal backends are allowed). In case of a bgworker crash, actions to take depend on registration data: if shared memory was requested, then all other connections are taken down (as well as other bgworkers), just like it were a regular backend crashing. The bgworker itself is restarted, too, within a configurable timeframe (which can be configured to be never). More features to add to this framework can be imagined without much effort, and have been discussed, but this seems good enough as a useful unit already. An elementary sample module is supplied. Author: Álvaro Herrera This patch is loosely based on prior patches submitted by KaiGai Kohei, and unsubmitted code by Simon Riggs. Reviewed by: KaiGai Kohei, Markus Wanner, Andres Freund, Heikki Linnakangas, Simon Riggs, Amit Kapila	2012-12-06 17:47:30 -03:00
Simon Riggs	f21bb9cfb5	Refactor inCommit flag into generic delayChkpt flag. Rename PGXACT->inCommit flag into delayChkpt flag, and generalise comments to allow use in other situations, such as the forthcoming potential use in checksum patch. Replace wait loop to look for VXIDs with delayChkpt set. No user visible changes, not behaviour changes at present. Simon Riggs, reviewed and rebased by Jeff Davis	2012-12-03 13:13:53 +00:00
Tom Lane	3114cb60a1	Don't advance checkPoint.nextXid near the end of a checkpoint sequence. This reverts commit `c11130690d` in favor of actually fixing the problem: namely, that we should never have been modifying the checkpoint record's nextXid at this point to begin with. The nextXid should match the state as of the checkpoint's logical WAL position (ie the redo point), not the state as of its physical position. It's especially bogus to advance it in some wal_levels and not others. In any case there is no need for the checkpoint record to carry the same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by LogStandbySnapshot, as any replay operation will already have adopted that value as current. This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug #6291 from Daniel Farina, in that if a checkpoint were in progress at the instant of XID wraparound, the epoch bump would be lost as reported. (And, of course, these days there's at least a 50-50 chance of a checkpoint being in progress at any given instant.) Diagnosed by me and independently by Andres Freund. Back-patch to all branches supporting hot standby.	2012-12-02 15:20:41 -05:00
Simon Riggs	5c11725867	Rearrange storage of data in xl_running_xacts. Previously we stored all xids mixed together. Now we store top-level xids first, followed by all subxids. Also skip logging any subxids if the snapshot is suboverflowed, since there are potentially large numbers of them and they are not useful in that case anyway. Has value in the envisaged design for decoding of WAL. No planned effect on Hot Standby. Andres Freund, reviewed by me	2012-12-02 19:39:37 +00:00
Simon Riggs	f1e57a4ec9	Cleanup VirtualXact at end of Hot Standby.	2012-11-29 21:59:11 +00:00
Heikki Linnakangas	1f67078ea3	Add OpenTransientFile, with automatic cleanup at end-of-xact. Files opened with BasicOpenFile or PathNameOpenFile are not automatically cleaned up on error. That puts unnecessary burden on callers that only want to keep the file open for a short time. There is AllocateFile, but that returns a buffered FILE * stream, which in many cases is not the nicest API to work with. So add function called OpenTransientFile, which returns a unbuffered fd that's cleaned up like the FILE* returned by AllocateFile(). This plugs a few rare fd leaks in error cases: 1. copy_file() - fixed by by using OpenTransientFile instead of BasicOpenFile 2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't use OpenTransientFile here because the fd is supposed to persist over transaction boundaries. 3. lo_import/lo_export - fixed by using OpenTransientFile instead of PathNameOpenFile. In addition to plugging those leaks, this replaces many BasicOpenFile() calls with OpenTransientFile() that were not leaking, because the code meticulously closed the file on error. That wasn't strictly necessary, but IMHO it's good for robustness. The same leaks exist in older versions, but given the rarity of the issues, I'm not backpatching this. Not yet, anyway - it might be good to backpatch later, after this mechanism has had some more testing in master branch.	2012-11-27 10:25:50 +02:00
Tom Lane	3e7fdcffd6	Fix WaitLatch() to return promptly when the requested timeout expires. If the sleep is interrupted by a signal, we must recompute the remaining time to wait; otherwise, a steady stream of non-wait-terminating interrupts could delay return from WaitLatch indefinitely. This has been shown to be a problem for the autovacuum launcher, and there may well be other places now or in the future with similar issues. So we'd better make the function robust, even though this'll add at least one gettimeofday call per wait. Back-patch to 9.2. We might eventually need to fix 9.1 as well, but the code is quite different there, and the usage of WaitLatch in 9.1 is so limited that it's not clearly important to do so. Reported and diagnosed by Jeff Janes, though I rewrote his patch rather heavily.	2012-11-08 20:04:48 -05:00
Tom Lane	ff3f9c8de5	Close un-owned SMgrRelations at transaction end. If an SMgrRelation is not "owned" by a relcache entry, don't allow it to live past transaction end. This design allows the same SMgrRelation to be used for blind writes of multiple blocks during a transaction, but ensures that we don't hold onto such an SMgrRelation indefinitely. Because an SMgrRelation typically corresponds to open file descriptors at the fd.c level, leaving it open when there's no corresponding relcache entry can mean that we prevent the kernel from reclaiming deleted disk space. (While CacheInvalidateSmgr messages usually fix that, there are cases where they're not issued, such as DROP DATABASE. We might want to add some more sinval messaging for that, but I'd be inclined to keep this type of logic anyway, since allowing VFDs to accumulate indefinitely for blind-written relations doesn't seem like a good idea.) This code replaces a previous attempt towards the same goal that proved to be unreliable. Back-patch to 9.1 where the previous patch was added.	2012-10-17 12:38:21 -04:00
Tom Lane	9bacf0e373	Revert "Use "transient" files for blind writes, take 2". This reverts commit `fba105b109`. That approach had problems with the smgr-level state not tracking what we really want to happen, and with the VFD-level state not tracking the smgr-level state very well either. In consequence, it was still possible to hold kernel file descriptors open for long-gone tables (as in recent report from Tore Halset), and yet there were also cases of FDs being closed undesirably soon. A replacement implementation will follow.	2012-10-17 12:37:08 -04:00
Tom Lane	e81e8f9342	Split up process latch initialization for more-fail-soft behavior. In the previous coding, new backend processes would attempt to create their self-pipe during the OwnLatch call in InitProcess. However, pipe creation could fail if the kernel is short of resources; and the system does not recover gracefully from a FATAL error right there, since we have armed the dead-man switch for this process and not yet set up the on_shmem_exit callback that would disarm it. The postmaster then forces an unnecessary database-wide crash and restart, as reported by Sean Chittenden. There are various ways we could rearrange the code to fix this, but the simplest and sanest seems to be to split out creation of the self-pipe into a new function InitializeLatchSupport, which must be called from a place where failure is allowed. For most processes that gets called in InitProcess or InitAuxiliaryProcess, but processes that don't call either but still use latches need their own calls. Back-patch to 9.1, which has only a part of the latch logic that 9.2 and HEAD have, but nonetheless includes this bug.	2012-10-14 22:59:56 -04:00
Tom Lane	7e0cce0265	Remove unnecessary overhead in backend's large-object operations. Do read/write permissions checks at most once per large object descriptor, not once per lo_read or lo_write call as before. The repeated tests were quite useless in the read case since the snapshot-based tests were guaranteed to produce the same answer every time. In the write case, the extra tests could in principle detect revocation of write privileges after a series of writes has started --- but there's a race condition there anyway, since we'd check privileges before performing and certainly before committing the write. So there's no real advantage to checking every single time, and we might as well redefine it as "only check the first time". On the same reasoning, remove the LargeObjectExists checks in inv_write and inv_truncate. We already checked existence when the descriptor was opened, and checking again doesn't provide any real increment of safety that would justify the cost.	2012-10-09 16:38:00 -04:00
Tom Lane	95d035e66d	Autoconfiscate selection of 64-bit int type for 64-bit large object API. Get rid of the fundamentally indefensible assumption that "long long int" exists and is exactly 64 bits wide on every platform Postgres runs on. Instead let the configure script select the type to use for "pg_int64". This is a bit of a pain in the rear since we do not want to pollute client namespace with all the random symbols that pg_config.h defines; instead we have to create a separate generated header file, "pg_config_ext.h". But now that the infrastructure is there, we might have the ability to add some other stuff that's long been wanting in this area.	2012-10-07 21:52:43 -04:00
Tatsuo Ishii	7e2f8ed2b0	Fix compiling errors on Windows platform. Fix wrong usage of INT64CONST macro. Fix lo_hton64 and lo_ntoh64 not to use int32_t and uint32_t.	2012-10-07 23:30:31 +09:00
Tatsuo Ishii	461ef73f09	Add API for 64-bit large object access. Now users can access up to 4TB large objects (standard 8KB BLCKSZ case). For this purpose new libpq API lo_lseek64, lo_tell64 and lo_truncate64 are added. Also corresponding new backend functions lo_lseek64, lo_tell64 and lo_truncate64 are added. inv_api.c is changed to handle 64-bit offsets. Patch contributed by Nozomi Anzai (backend side) and Yugo Nagata (frontend side, docs, regression tests and example program). Reviewed by Kohei Kaigai. Committed by Tatsuo Ishii with minor editings.	2012-10-07 08:36:48 +09:00
Tom Lane	73b796a52c	Improve coding around the fsync request queue. In all branches back to 8.3, this patch fixes a questionable assumption in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue that there are no uninitialized pad bytes in the request queue structs. This would only cause trouble if (a) there were such pad bytes, which could happen in 8.4 and up if the compiler makes enum ForkNumber narrower than 32 bits, but otherwise would require not-currently-planned changes in the widths of other typedefs; and (b) the kernel has not uniformly initialized the contents of shared memory to zeroes. Still, it seems a tad risky, and we can easily remove any risk by pre-zeroing the request array for ourselves. In addition to that, we need to establish a coding rule that struct RelFileNode can't contain any padding bytes, since such structs are copied into the request array verbatim. (There are other places that are assuming this anyway, it turns out.) In 9.1 and up, the risk was a bit larger because we were also effectively assuming that struct RelFileNodeBackend contained no pad bytes, and with fields of different types in there, that would be much easier to break. However, there is no good reason to ever transmit fsync or delete requests for temp files to the bgwriter/checkpointer, so we can revert the request structs to plain RelFileNode, getting rid of the padding risk and saving some marginal number of bytes and cycles in fsync queue manipulation while we are at it. The savings might be more than marginal during deletion of a temp relation, because the old code transmitted an entirely useless but nonetheless expensive-to-process ForgetRelationFsync request to the background process, and also had the background process perform the file deletion even though that can safely be done immediately. In addition, make some cleanup of nearby comments and small improvements to the code in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue.	2012-07-17 16:56:54 -04:00

1 2 3 4 5 ...

835 commits