postgresql

mirror of https://github.com/postgres/postgres.git synced 2026-04-15 22:10:45 -04:00

Author	SHA1	Message	Date
Kevin Grittner	4edb7bd2fd	C comment improvement & typo fix. Thomas Munro	2016-06-02 12:52:41 -05:00
Tom Lane	22b27b4c9e	Avoid useless closely-spaced writes of statistics files. The original intent in the stats collector was that we should not write out stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not make requests at all if they see the existing data is newer than that, and the stats collector is supposed to disregard requests having a cutoff_time older than its most recently written data, so that close-together requests don't result in multiple writes. But the latter part of that got broken in commit `187492b6c2`, so that if two backends concurrently decide the existing stats are too old, the collector would write the data twice. (In principle the collector's logic would still merge requests as long as the second one arrives before we've actually written data ... but since the message collection loop would write data immediately after processing a single inquiry message, that never happened in practice, and in any case the window in which it might work would be much shorter than PGSTAT_STAT_INTERVAL.) To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff time is too old, and doesn't add a request to the queue if so. This means that we do not need DBWriteRequest.request_time, because the decision is taken before making a queue entry. And that means that we don't really need the DBWriteRequest data structure at all; an OID list of database OIDs will serve and allow removal of some rather verbose and crufty code. In passing, improve the comments in this area, which have been rather neglected. Also change backend_read_statsfile so that it's not silently relying on MyDatabaseId to have some particular value in the autovacuum launcher process. It accidentally worked as desired because MyDatabaseId is zero in that process; but that does not seem like a dependency we want, especially with no documentation about it. 
Although this patch is mine, it turns out I'd rediscovered a known bug, for which Tomas Vondra had already submitted a patch that's functionally equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas for reviewing this version. Back-patch to 9.3 where the bug was introduced. Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz> Patch: <4625.1464202586@sss.pgh.pa.us>	2016-05-31 15:55:15 -04:00
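The request-filtering rule described in this commit can be sketched in isolation. This is a minimal illustration, not the collector's actual code: `DemoWriteQueue`, `demo_recv_inquiry`, and the fixed-size array are hypothetical names and simplifications, and timestamps are plain integers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PENDING 16     /* hypothetical capacity for this sketch */

/* Hypothetical stand-in for the list of database OIDs awaiting a write. */
typedef struct DemoWriteQueue
{
    uint32_t pending[DEMO_MAX_PENDING];
    size_t   npending;
} DemoWriteQueue;

/*
 * Decide whether an inquiry should queue a stats-file write.
 * Returns true only if a new entry was added to the queue.
 */
bool
demo_recv_inquiry(DemoWriteQueue *q, uint32_t dboid,
                  int64_t cutoff_time, int64_t last_write_time)
{
    size_t i;

    /* A write for this database is already pending: merge the requests. */
    for (i = 0; i < q->npending; i++)
    {
        if (q->pending[i] == dboid)
            return false;
    }

    /* Data on disk is already at least as new as the requester needs. */
    if (cutoff_time <= last_write_time)
        return false;

    /* Sketch-only limit; a real list would grow instead. */
    if (q->npending >= DEMO_MAX_PENDING)
        return false;

    q->pending[q->npending++] = dboid;
    return true;
}
```

Because the staleness decision is taken before an entry is made, the queue only needs to hold database OIDs, which matches the commit's point that a per-request timestamp is no longer required.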
Andres Freund	87a3023c60	Move memory barrier in UnlockBufHdr to before releasing the lock. This bug appears to have been introduced late in the development of `48354581a4` ("Allow Pin/UnpinBuffer to operate in a lockfree manner."). Found while debugging a bug which turned out to be independent of the commit mentioned above. Backpatch: -	2016-05-30 15:35:53 -07:00
Alvaro Herrera	975ad4e602	Fix PageAddItem BRIN bug. BRIN was relying on the ability to remove a tuple from an index page, then putting another tuple in the same line pointer. But PageAddItem refuses to add a tuple beyond the first free item past the last used item, and in particular, it rejects an attempt to add an item to an empty page anywhere other than the first line pointer. PageAddItem issues a WARNING and indicates to the caller that it failed, which in turn causes the BRIN calling code to issue a PANIC, so the whole sequence looks like this: WARNING: specified item offset is too large PANIC: failed to add BRIN tuple To fix, create a new function PageAddItemExtended which is like PageAddItem except that the two boolean arguments become a flags bitmap; the "overwrite" and "is_heap" boolean flags in PageAddItem become PAI_OVERWRITE and PAI_IS_HEAP flags in the new function, and a new flag PAI_ALLOW_FAR_OFFSET enables the behavior required by BRIN. PageAddItem() retains its original signature, for compatibility with third-party modules (other callers in core code are not modified, either). Also, in the belt-and-suspenders spirit, I added a new sanity check in brinGetTupleForHeapBlock to raise an error if a TID found in the revmap is not marked as live by the page header. This causes it to react with "ERROR: corrupted BRIN index" to the bug at hand, rather than a hard crash. Backpatch to 9.5. Bug reported by Andreas Seltenreich as detected by his handy sqlsmith fuzzer. Discussion: https://www.postgresql.org/message-id/87mvni77jh.fsf@elite.ansel.ydns.eu	2016-05-30 14:47:22 -04:00
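The "booleans become a flags bitmap, old signature kept as a wrapper" pattern from this commit can be illustrated with a toy model. Everything here is hypothetical: the `DEMO_*` flags, both function names, and the page model (an integer count of used line pointers) are invented for the sketch and do not reproduce the real bufpage code.

```c
#include <stdbool.h>

/* Hypothetical flag bits, mirroring the idea of a flags bitmap. */
#define DEMO_OVERWRITE        (1 << 0)
#define DEMO_IS_HEAP          (1 << 1)
#define DEMO_ALLOW_FAR_OFFSET (1 << 2)

/*
 * Toy model: line pointers 1..max_used are in use on the page.
 * Returns the chosen offset on success, or 0 on failure.
 * DEMO_OVERWRITE and DEMO_IS_HEAP are accepted but ignored here.
 */
int
demo_add_item_extended(int max_used, int offset, int flags)
{
    int limit = max_used + 1;   /* first free slot past the last used one */

    if (offset < 1)
        return 0;

    /* Reject far offsets unless the caller explicitly allows them. */
    if (offset > limit && (flags & DEMO_ALLOW_FAR_OFFSET) == 0)
        return 0;

    return offset;
}

/* Old-style entry point: same signature as before, forwards as flags. */
int
demo_add_item(int max_used, int offset, bool overwrite, bool is_heap)
{
    int flags = 0;

    if (overwrite)
        flags |= DEMO_OVERWRITE;
    if (is_heap)
        flags |= DEMO_IS_HEAP;

    return demo_add_item_extended(max_used, offset, flags);
}
```

In this model, existing callers of the two-boolean function see unchanged behavior, while a caller that needs to place an item at a far offset on an empty page opts in through the new flag.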
Tom Lane	9dd4178cec	Be more predictable about reporting "lock timeout" vs "statement timeout". If both timeout indicators are set when we arrive at ProcessInterrupts, we've historically just reported "lock timeout". However, some buildfarm members have been observed to fail isolationtester's timeouts test by reporting "lock timeout" when the statement timeout was expected to fire first. The cause seems to be that the process is allowed to sleep longer than expected (probably due to heavy machine load) so that the lock timeout happens before we reach the point of reporting the error, and then this arbitrary tiebreak rule does the wrong thing. We can improve matters by comparing the scheduled timeout times to decide which error to report. I had originally proposed greatly reducing the 1-second window between the two timeouts in the test cases. On reflection that is a bad idea, at least for the case where the lock timeout is expected to fire first, because that would assume that it takes negligible time to get from statement start to the beginning of the lock wait. Thus, this patch doesn't completely remove the risk of test failures on slow machines. Empirically, however, the case this handles is the one we are seeing in the buildfarm. The explanation may be that the other case requires the scheduler to take the CPU away from a busy process, whereas the case fixed here only requires the scheduler to not give the CPU back right away to a process that has been woken from a multi-second sleep (and, perhaps, has been swapped out meanwhile). Back-patch to 9.3 where the isolationtester timeouts test was added. Discussion: <8693.1464314819@sss.pgh.pa.us>	2016-05-27 10:40:20 -04:00
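The tiebreak described above, comparing the scheduled timeout times when both indicators are set, might look roughly like this. The enum, the function name, and the use of plain integers for times are assumptions made for the sketch; letting the lock timeout win an exact tie is also an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum DemoTimeoutReport
{
    DEMO_REPORT_NONE,
    DEMO_REPORT_LOCK_TIMEOUT,
    DEMO_REPORT_STATEMENT_TIMEOUT
} DemoTimeoutReport;

/*
 * Pick which timeout error to report. lock_due and stmt_due are the
 * times at which each timeout was scheduled to fire.
 */
DemoTimeoutReport
demo_choose_timeout(bool lock_fired, bool stmt_fired,
                    int64_t lock_due, int64_t stmt_due)
{
    if (lock_fired && stmt_fired)
    {
        /* Both are pending: report whichever was scheduled earlier. */
        if (stmt_due < lock_due)
            return DEMO_REPORT_STATEMENT_TIMEOUT;
        return DEMO_REPORT_LOCK_TIMEOUT;
    }
    if (lock_fired)
        return DEMO_REPORT_LOCK_TIMEOUT;
    if (stmt_fired)
        return DEMO_REPORT_STATEMENT_TIMEOUT;
    return DEMO_REPORT_NONE;
}
```

With this rule, a process that oversleeps past both deadlines still reports the timeout that was due first, rather than always reporting the lock timeout.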
Tom Lane	f5e7b2f910	Mark wal_level as PGDLLIMPORT. Per buildfarm, this is needed to allow extensions to use XLogIsNeeded() in Windows builds.	2016-05-24 22:48:47 -04:00
Tom Lane	465e09da63	Add support for more extensive testing of raw_expression_tree_walker(). If RAW_EXPRESSION_COVERAGE_TEST is defined, do a no-op tree walk over every basic DML statement submitted to parse analysis. If we'd had this in place earlier, bug #14153 would have been caught by buildfarm testing. The difficulty is that raw_expression_tree_walker() is only used in limited cases involving CTEs (particularly recursive ones), so it's very easy for an oversight in it to not be noticed during testing of a seemingly-unrelated feature. The type of error we can expect to catch with this is complete omission of a node type from raw_expression_tree_walker(), and perhaps also recursion into a field that doesn't contain a node tree, though that would be an unlikely mistake. It won't catch failure to add new fields that need to be recursed into, unfortunately. I'll go enable this on one or two of my own buildfarm animals once bug #14153 is dealt with. Discussion: <27861.1464040417@sss.pgh.pa.us>	2016-05-23 19:08:26 -04:00
Tom Lane	8a4930e3fa	Fix latent crash in do_text_output_multiline(). do_text_output_multiline() would fail (typically with a null pointer dereference crash) if its input string did not end with a newline. Such cases do not arise in our current sources; but it certainly could happen in future, or in extension code's usage of the function, so we should fix it. To fix, replace "eol += len" with "eol = text + len". While at it, make two cosmetic improvements: mark the input string const, and rename the argument from "text" to "txt" to dodge pgindent strangeness (since "text" is a typedef name). Even though this problem is only latent at present, it seems like a good idea to back-patch the fix, since it's a very simple/safe patch and it's not out of the realm of possibility that we might in future back-patch something that expects sane behavior from do_text_output_multiline(). Per report from Hao Lee. Report: <CAGoxFiFPAGyPAJLcFxTB5cGhTW2yOVBDYeqDugYwV4dEd1L_Ag@mail.gmail.com>	2016-05-23 14:16:40 -04:00
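The failure mode and the "eol = text + len" fix can be seen in a small line-splitting loop. This is a standalone sketch, not the function from the PostgreSQL sources: `demo_split_lines` is a hypothetical name, and it only counts lines and tracks the longest one instead of emitting tuples.

```c
#include <stddef.h>
#include <string.h>

/*
 * Walk a multi-line string one line at a time. Returns the number of
 * lines seen; if longest is not NULL, stores the longest line length.
 */
size_t
demo_split_lines(const char *txt, size_t *longest)
{
    size_t nlines = 0;
    size_t maxlen = 0;

    while (*txt)
    {
        const char *eol = strchr(txt, '\n');
        size_t      len;

        if (eol)
        {
            len = (size_t) (eol - txt);
            eol++;              /* step past the newline */
        }
        else
        {
            /*
             * No trailing newline. eol is NULL here, so "eol += len"
             * would compute a bogus pointer; derive it from txt instead.
             */
            len = strlen(txt);
            eol = txt + len;
        }

        /* A real caller would emit the line [txt, txt + len) here. */
        if (len > maxlen)
            maxlen = len;
        nlines++;

        txt = eol;
    }

    if (longest)
        *longest = maxlen;
    return nlines;
}
```

The `else` branch is the one that matters: it is only reached for input that does not end with a newline, which is why the problem stayed latent.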
Tom Lane	16ea51a263	Pin the built-in index access methods. This was overlooked in commit `473b93287`, which introduced DROP ACCESS METHOD. Although that command is restricted to superusers, we don't want even superusers dropping the built-in methods; "DROP ACCESS METHOD btree" in particular is unrecoverable from. Pin these objects in the same way that other initdb-created objects are pinned. I chose to bump catversion for this fix. That's not absolutely necessary perhaps, but it will ensure that no 9.6 production systems are missing the pin entries.	2016-05-19 14:40:02 -04:00
Tom Lane	8ee29a19d6	Stamp 9.6beta1.	2016-05-09 16:47:49 -04:00
Tom Lane	1a2c17f8e2	Fix pg_upgrade to not fail when new-cluster TOAST rules differ from old. This patch essentially reverts commit `4c6780fd17`, in favor of a much simpler solution for the case where the new cluster would choose to create a TOAST table but the old cluster doesn't have one: just don't create a TOAST table. The existing code failed in at least two different ways if the situation arose: (1) ALTER TABLE RESET didn't grab an exclusive lock, so that the lock sanity check in create_toast_table failed; (2) pg_upgrade did not provide a pg_type OID for the new toast table, so that the crosscheck in TypeCreate failed. While both these problems were introduced by later patches, they show that the hack being used to cause TOAST table creation is overwhelmingly fragile (and untested). I also note that before the TypeCreate crosscheck was added, the code would have resulted in assigning an indeterminate pg_type OID to the toast table, possibly causing a later OID conflict in that catalog; so that it didn't really work even when committed. If we simply don't create a TOAST table, there will only be a problem if the code tries to store a tuple that's wider than a page, and field compression isn't sufficient to get it under a page. Given that the TOAST creation threshold is intended to be about a quarter of a page, it's very hard to believe that cross-version differences in the do-we-need-a-toast- table heuristic could result in an observable problem. So let's just follow the old version's conclusion about whether a TOAST table is needed. (If we ever do change needs_toast_table() so much that this conclusion doesn't apply, we can devise a solution at that time, and hopefully do it in a less klugy way than `4c6780fd17` did.) Back-patch to 9.3, like the previous patch. Discussion: <8110.1462291671@sss.pgh.pa.us>	2016-05-06 22:05:56 -04:00
Kevin Grittner	2cc41acd8f	Fix hash index vs "snapshot too old" problems. Hash indexes are not WAL-logged, and so do not maintain the LSN of index pages. Since the "snapshot too old" feature counts on detecting error conditions using the LSN of a table and all indexes on it, this makes it impossible to safely do early vacuuming on any table with a hash index, so add this to the tests for whether the xid used to vacuum a table can be adjusted based on old_snapshot_threshold. While at it, add a paragraph to the docs for old_snapshot_threshold which specifically mentions this and other aspects of the feature which may otherwise surprise users. Problem reported and patch reviewed by Amit Kapila	2016-05-06 07:47:12 -05:00
Dean Rasheed	93a8c6fd6c	Move and rename fmtReloptionsArray(). Move fmtReloptionsArray() from pg_dump.c to string_utils.c so that it is available to other frontend code. In particular psql's \ev and \sv commands need it to handle view reloptions. Also rename the function to appendReloptionsArray(), which is a more accurate description of what it does. Author: Dean Rasheed Reviewed-by: Peter Eisentraut Discussion: http://www.postgresql.org/message-id/CAEZATCWZjCgKRyM-agE0p8ax15j9uyQoF=qew7D2xB6cF76T8A@mail.gmail.com	2016-05-06 12:45:36 +01:00
Tom Lane	0b9a234432	Rename tsvector delete() to ts_delete(), and filter() to ts_filter(). The similarity of the original names to SQL keywords seems like a bad idea. Rename them before we're stuck with 'em forever. In passing, minor code and docs cleanup. Discussion: <4875.1462210058@sss.pgh.pa.us>	2016-05-05 19:43:32 -04:00
Alvaro Herrera	c1543a81a7	Revert timeline following in replication slots This reverts commits `f07d18b6e9`, `82c83b3372`, `3a3b309041`, and `24c5f1a103`. This feature has shown enough immaturity that it was deemed better to rip it out before rushing some more fixes at the last minute. There are discussions on larger changes in this area for the next release.	2016-05-04 17:32:22 -03:00
Robert Haas	9888b34fdb	Fix more things to be parallel-safe. Conversion functions were previously marked as parallel-unsafe, since that is the default, but in fact they are safe. Parallel-safe functions defined in pg_proc.h and redefined in system_views.sql were ending up as parallel-unsafe because the redeclarations were not marked PARALLEL SAFE. While editing system_views.sql, mark ts_debug() parallel safe also. Andreas Karlsson	2016-05-03 14:36:38 -04:00
Alvaro Herrera	6b60916829	Fix thinko in comment Pointed out by Andres Freund	2016-05-02 16:46:42 -03:00
Alvaro Herrera	234a266066	Fix code comments regarding logical decoding Back in `3b02ea4f07` I added some comments in various places to explain how logical decoding and other things worked. Not all of the changes were welcome, because they were misleading or wrong. This changes them a little bit to make them more accurate. Some other comments are also changed to be more accurate. Also, fix a bunch of typos. Author: Álvaro Herrera, Craig Ringer Andres Freund reviewed some parts of this.	2016-05-02 16:04:29 -03:00
Robert Haas	37d0c2cb1a	Fix parallel safety markings for pg_start_backup. Commit `7117685461` made pg_start_backup parallel-restricted rather than parallel-safe, because it now relies on backend-private state that won't be synchronized with the parallel worker. However, it didn't update pg_proc.h. Separately, Andreas Karlsson observed that system_views.sql neglected to reiterate the parallel-safety markings when redefining various functions, including this one; so add a PARALLEL RESTRICTED declaration there to match the new value in pg_proc.h.	2016-05-02 10:42:34 -04:00
Tom Lane	207d5a656e	Fix mishandling of equivalence-class tests in parameterized plans. Given a three-or-more-way equivalence class, such as X.X = Y.Y = Z.Z, it was possible for the planner to omit one of the quals needed to enforce that all members of the equivalence class are actually equal. This only happened in the case of a parameterized join node for two of the relations, that is a plan tree like
    Nested Loop
      -> Scan X
      -> Nested Loop
           -> Scan Y
           -> Scan Z
                Filter: Z.Z = X.X
The eclass machinery normally expects to apply X.X = Y.Y when those two relations are joined, but in this shape of plan tree they aren't joined until the top node --- and, if the lower nested loop is marked as parameterized by X, the top node will assume that the relevant eclass condition(s) got pushed down into the lower node. On the other hand, the scan of Z assumes that it's only responsible for constraining Z.Z to match any one of the other eclass members. So one or another of the required quals sometimes fell between the cracks, depending on whether consideration of the eclass in get_joinrel_parampathinfo() for the lower nested loop chanced to generate X.X = Y.Y or X.X = Z.Z as the appropriate constraint there. If it generated the latter, it'd erroneously suppose that the Z scan would take care of matters. To fix, force X.X = Y.Y to be generated and applied at that join node when this case occurs. This is extremely hard to hit in practice, because various planner behaviors conspire to mask the problem; starting with the fact that the planner doesn't really like to generate a parameterized plan of the above shape. (It might have been impossible to hit it before we tweaked things to allow this plan shape for star-schema cases.) Many thanks to Alexander Kirkouski for submitting a reproducible test case. The bug can be demonstrated in all branches back to 9.2 where parameterized paths were introduced, so back-patch that far.	2016-04-29 20:19:38 -04:00
Kevin Grittner	7c3e8039f4	Add a few entries to the tail of time mapping, to see old values. Without a few entries beyond old_snapshot_threshold, the lookup would often fail, resulting in the more aggressive pruning or vacuum being skipped often enough to matter. This was very clearly shown by a python test script posted by Ants Aasma, and was likely a factor in an earlier but somewhat less clear-cut test case posted by Jeff Janes. This patch makes no change to the logic, per se -- it just makes the array of mapping entries big enough to make lookup misses based on timing much less likely. An occasional miss is still possible if a thread stalls for more than 10 minutes, but that does not create any problem with correctness of behavior. Besides, if things are so busy that a thread is stalling for more than 10 minutes, it is probably OK to skip the more aggressive cleanup at that particular point in time.	2016-04-29 16:46:08 -05:00
Andrew Dunstan	d34e7b2812	Fix comment whitespace in VS2015 patch per gripe from Michael Paquier.	2016-04-29 14:18:51 -04:00
Magnus Hagander	a03bda323b	Fix typo Author: Thomas Munro	2016-04-29 16:15:07 +02:00
Andrew Dunstan	0fb54de9aa	Support building with Visual Studio 2015. Adjust the way we detect the locale. As a result the minimum Windows version supported by VS2015 and later is Windows Vista. Add some tweaks to remove new compiler warnings. Remove documentation references to the now obsolete msysGit. Michael Paquier, somewhat edited by me, reviewed by Christian Ullrich. Backpatch to 9.5	2016-04-29 08:09:07 -04:00
Tom Lane	23b09e15b9	Adjust DatumGetBool macro, this time for sure. Commit `23a41573c` attempted to fix the DatumGetBool macro to ignore bits in a Datum that are to the left of the actual bool value. But it did that by casting the Datum to bool; and on compilers that use C99 semantics for bool, that ends up being a whole-word test, not a 1-byte test. This seems to be the true explanation for contrib/seg failing in VS2015. To fix, use GET_1_BYTE() explicitly. I think in the previous patch, I'd had some idea of not having to commit to bool being exactly 1 byte wide, but regardless of what the compiler's bool is, boolean columns and Datums are certainly 1 byte wide. The previous fix was (eventually) back-patched into all active versions, so do likewise with this one.	2016-04-28 11:50:58 -04:00
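The difference between casting a whole word to a C99 bool and testing only the low byte can be shown in a few lines. `DemoDatum` and both function names are hypothetical stand-ins; the mask is a plain-C equivalent of a 1-byte test, not the actual macro definition from the PostgreSQL headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a word-sized Datum. */
typedef uintptr_t DemoDatum;

/*
 * With C99 bool semantics, converting to bool yields true if ANY bit
 * of the word is set, so garbage above the low byte leaks through.
 */
bool
demo_bool_by_cast(DemoDatum d)
{
    return (bool) d;
}

/* Look only at the low-order byte, ignoring bits to its left. */
bool
demo_bool_low_byte(DemoDatum d)
{
    return (d & 0xff) != 0;
}
```

For a value such as 0x100, whose low byte is zero but which has a stray bit set above it, the cast-based version returns true while the byte-masked version returns false. That is the kind of disagreement the commit attributes to compilers using C99 semantics for bool.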
Teodor Sigaev	f8467f7da8	Avoid magic constants. Use macros to define the amstrategies/amsupport fields instead of hardcoded values. Author: Nikolay Shaplov, with an addition for contrib/bloom	2016-04-28 16:39:25 +03:00
Teodor Sigaev	e2c79e14d9	Prevent multiple cleanup processes for the pending list in GIN. Previously, ginInsertCleanup could exit early if it detected that someone else was cleaning up the pending list, without waiting for that someone else to finish the job. But in this case vacuum could miss tuples to be deleted. The cleanup process now locks the metapage with a heavyweight LockPage(ExclusiveLock), which guarantees that no other cleanup process runs at the same time. The lock is taken differently depending on the caller of the cleanup process: any vacuum and gin_clean_pending_list() will block until the lock becomes available, while an ordinary insert uses a conditional lock to avoid waiting indefinitely on the lock. Insertion into the pending list doesn't use this lock, so insertion isn't blocked. Also, the patch stops the cleanup process when the tail of the pending list as of the start of cleanup is reached, in order to prevent infinite cleanup in case of massive insertion. But it will stop only for automatic maintenance tasks like autovacuum. The patch introduces a choice of memory limit to use: autovacuum_work_mem, maintenance_work_mem or work_mem, depending on the call path. A patch for previous releases would need to be reworked due to changes in this area between 9.6 and earlier versions. Discovery and diagnostics by Jeff Janes and Tomas Vondra. Patch by me with some ideas from Jeff Janes	2016-04-28 16:21:42 +03:00
Tom Lane	4c804fbdfb	Clean up parsing of synchronous_standby_names GUC variable. Commit `989be0810d` added a flex/bison lexer/parser to interpret synchronous_standby_names. It was done in a pretty crufty way, though, making assorted end-use sites responsible for calling the parser at the right times. That was not only vulnerable to errors of omission, but made it possible that lexer/parser errors occur at very undesirable times, and created memory leakages even if there was no error. Instead, perform the parsing once during check_synchronous_standby_names and let guc.c manage the resulting data. To do that, we have to flatten the parsed representation into a single hunk of malloc'd memory, but that is not very hard. While at it, work a little harder on making useful error reports for parsing problems; the previous code felt that "synchronous_standby_names parser returned 1" was an appropriate user-facing error message. (To be fair, it did also log a syntax error message, but separately from the GUC problem report, which is at best confusing.) It had some outright bugs in the face of invalid input, too. I (tgl) also concluded that we need to restrict unquoted names in synchronous_standby_names to be just SQL identifiers. The previous coding would accept darn near anything, which (1) makes the quoting convention both nearly-unnecessary and formally ambiguous, (2) makes it very hard to understand what is a syntax error and what is a creative interpretation of the input as a standby name, and (3) makes it impossible to further extend the syntax in future without a compatibility break. I presume that we're intending future extensions of the syntax, else this parsing infrastructure is massive overkill, so (3) is an important objection. Since we've taken a compatibility hit for non-identifier names with this change anyway, we might as well lock things down now and insist that users use double quotes for standby names that aren't identifiers. 
Kyotaro Horiguchi and Tom Lane	2016-04-27 17:55:25 -04:00
Robert Haas	360ca27a9b	Remove mergeHyperLogLog. It's buggy. If somebody needs this later, they'll need to put back a non-buggy version of it. Discussion: CAM3SWZT-i6R9JU5YXa8MJUou2_r3LfGJZpQ9tYa1BYxfkj0=cQ@mail.gmail.com Discussion: CAM3SWZRUOLsYoTT83QgdUy9D8ehYWm_nvbrrfcOOzikiRfFY7g@mail.gmail.com Peter Geoghegan	2016-04-27 10:55:32 -04:00
Robert Haas	59eb551279	Fix EXPLAIN VERBOSE output for parallel aggregate. The way that PartialAggregate and FinalizeAggregate plan nodes were displaying output columns before was bogus. Now, FinalizeAggregate produces the same outputs as an Aggregate would have produced, while PartialAggregate produces each of those outputs prefixed by the word PARTIAL. Discussion: 12585.1460737650@sss.pgh.pa.us Patch by me, reviewed by David Rowley.	2016-04-27 07:37:40 -04:00
Andres Freund	c6ff84b06a	Emit invalidations to standby for transactions without xid. So far, when a transaction with pending invalidations, but without an assigned xid, committed, we simply ignored those invalidation messages. That's problematic, because those are actually sent for a reason. Known symptoms of this include that existing sessions on a hot-standby replica sometimes fail to notice new concurrently built indexes and visibility map updates. The solution is to WAL log such invalidations in transactions without an xid. We considered to alternatively force-assign an xid, but that'd be problematic for vacuum, which might be run in systems with few xids. Important: This adds a new WAL record, but as the patch has to be back-patched, we can't bump the WAL page magic. This means that standbys have to be updated before primaries; otherwise "PANIC: standby_redo: unknown op code 32" errors can be encountered. XXX: Reported-By: Васильев Дмитрий, Masahiko Sawada Discussion: CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com	2016-04-26 20:21:54 -07:00
Noah Misch	213c7df033	Impose a full barrier in generic-xlc.h atomics functions. pg_atomic_compare_exchange_*_impl() were providing only the semantics of an acquire barrier. Buildfarm members hornet and mandrill revealed this deficit beginning with commit `008608b9d5`. While we have no report of symptoms in 9.5, we can't rule out the possibility of certain compilers, hardware, or extension code relying on these functions' specified barrier semantics. Back-patch to 9.5, where commit `b64d92f1a5` introduced atomics. Reviewed by Andres Freund.	2016-04-26 21:53:58 -04:00
Tom Lane	125ad539a2	Improve TranslateSocketError() to handle more Windows error codes. The coverage was rather lean for cases that bind() or listen() might return. Add entries for everything that there's a direct equivalent for in the set of Unix errnos that elog.c has heard of.	2016-04-21 16:58:47 -04:00
Tom Lane	e54528155a	Remove dead code in win32.h. There's no longer a need for the MSVC-version-specific code stanza that forcibly redefines errno code symbols, because since commit `73838b52` we're unconditionally redefining them in the stanza before this one anyway. Now it's merely confusing and ugly, so get rid of it; and improve the comment that explains what's going on here. Although this is just cosmetic, back-patch anyway since I'm intending to back-patch some less-cosmetic changes in this same hunk of code.	2016-04-21 16:16:19 -04:00
Tom Lane	14216649f3	PGDLLIMPORT-ify old_snapshot_threshold. Revert commit `7cb1db1d95`, which represented a misunderstanding of the problem (if snapmgr.h weren't already included in bufmgr.h, things wouldn't compile anywhere). Instead install what I think is the real fix.	2016-04-21 14:33:34 -04:00
Robert Haas	36f69faeff	Comment improvements for ForeignPath. It's not necessarily just scanning a base relation any more. Amit Langote and Etsuro Fujita	2016-04-21 13:30:48 -04:00
Kevin Grittner	11e178d0dc	Inline initial comparisons in TestForOldSnapshot() Even with old_snapshot_threshold = -1 (which disables the "snapshot too old" feature), performance regressions were seen at moderate to high concurrency. For example, a one-socket, four-core system running 200 connections at saturation could see up to a 2.3% regression, with larger regressions possible on NUMA machines. By inlining the early (smaller, faster) tests in the TestForOldSnapshot() function, the i7 case dropped to a 0.2% regression, which could easily just be noise, and is clearly an improvement. Further testing will show whether more is needed.	2016-04-21 08:40:08 -05:00
Tom Lane	bde361fef5	Fix memory leak and other bugs in ginPlaceToPage() & subroutines. Commit `36a35c550a` turned the interface between ginPlaceToPage and its subroutines in gindatapage.c and ginentrypage.c into a royal mess: page-update critical sections were started in one place and finished in another place not even in the same file, and the very same subroutine might return having started a critical section or not. Subsequent patches band-aided over some of the problems with this design by making things even messier. One user-visible resulting problem is memory leaks caused by the need for the subroutines to allocate storage that would survive until ginPlaceToPage calls XLogInsert (as reported by Julien Rouhaud). This would not typically be noticeable during retail index updates. It could be visible in a GIN index build, in the form of memory consumption swelling to several times the commanded maintenance_work_mem. Another rather nasty problem is that in the internal-page-splitting code path, we would clear the child page's GIN_INCOMPLETE_SPLIT flag well before entering the critical section that it's supposed to be cleared in; a failure in between would leave the index in a corrupt state. There were also assorted coding-rule violations with little immediate consequence but possible long-term hazards, such as beginning an XLogInsert sequence before entering a critical section, or calling elog(DEBUG) inside a critical section. To fix, redefine the API between ginPlaceToPage() and its subroutines by splitting the subroutines into two parts. The "beginPlaceToPage" subroutine does what can be done outside a critical section, including full computation of the result pages into temporary storage when we're going to split the target page. The "execPlaceToPage" subroutine is called within a critical section established by ginPlaceToPage(), and it handles the actual page update in the non-split code path. 
The critical section, as well as the XLOG insertion call sequence, are both now always started and finished in ginPlaceToPage(). Also, make ginPlaceToPage() create and work in a short-lived memory context to eliminate the leakage problem. (Since a short-lived memory context had been getting created in the most common code path in the subroutines, this shouldn't cause any noticeable performance penalty; we're just moving the overhead up one call level.) In passing, fix a bunch of comments that had gone unmaintained throughout all this klugery. Report: <571276DD.5050303@dalibo.com>	2016-04-20 14:25:15 -04:00
Kevin Grittner	a343e223a5	Revert no-op changes to BufferGetPage() The reverted changes were intended to force a choice of whether any newly-added BufferGetPage() calls needed to be accompanied by a test of the snapshot age, to support the "snapshot too old" feature. Such an accompanying test is needed in about 7% of the cases, where the page is being used as part of a scan rather than positioning for other purposes (such as DML or vacuuming). The additional effort required for back-patching, and the doubt whether the intended benefit would really be there, have indicated it is best just to rely on developers to do the right thing based on comments and existing usage, as we do with many other conventions. This change should have little or no effect on generated executable code. Motivated by the back-patching pain of Tom Lane and Robert Haas	2016-04-20 08:31:19 -05:00
Tom Lane	75c24d0f74	Further reduce the number of semaphores used under --disable-spinlocks. Per discussion, there doesn't seem to be much value in having NUM_SPINLOCK_SEMAPHORES set to 1024: under any scenario where you are running more than a few backends concurrently, you really had better have a real spinlock implementation if you want tolerable performance. And 1024 semaphores is a sizable fraction of the system-wide SysV semaphore limit on many platforms. Therefore, reduce this setting's default value to 128 to make it less likely to cause out-of-semaphores problems.	2016-04-18 13:33:06 -04:00
Robert Haas	5702277ca9	Tweak EXPLAIN for parallel query to show workers launched. The previous display was sort of confusing, because it didn't distinguish between the number of workers that we planned to launch and the number that actually got launched. This has already confused several people, so display both numbers and label them clearly. Julien Rouhaud, reviewed by me.	2016-04-15 11:52:18 -04:00
Tom Lane	6b85d4ba9b	Fix portability problem induced by commit `a6f6b7819`. pg_xlogdump includes bufmgr.h. With a compiler that emits code for static inline functions even when they're unreferenced, that leads to unresolved external references in the new static-inline version of BufferGetPage(). So hide it with #ifndef FRONTEND, as we've done for similar issues elsewhere. Per buildfarm member pademelon.	2016-04-15 10:44:28 -04:00
Andres Freund	4b74c6a40e	Make init_spin_delay() C89 compliant #2. My previous attempt at doing so, in `80abbeba23`, was not sufficient. While that fixed the problem for bufmgr.c and lwlock.c, s_lock.c still has non-constant expressions in the struct initializer, because the file/line/function information comes from the caller of s_lock(). Give up on using a macro, and use a static inline instead. Discussion: 4369.1460435533@sss.pgh.pa.us	2016-04-14 19:26:13 -07:00
Andres Freund	533cd2303a	Remove trailing commas in enums. These aren't valid C89. Found thanks to gcc's -Wc90-c99-compat. These exist in differing places in most supported branches.	2016-04-14 19:25:16 -07:00
Tom Lane	c2dc194bdb	Adjust signature of walrcv_receive hook. Commit `314cbfc5da` redefined the signature of this hook as typedef int (*walrcv_receive_type) (char **buffer, int *wait_fd); But in fact the type of the "wait_fd" variable ought to be pgsocket, which is what WaitLatchOrSocket expects, and which is necessary if we want to be able to assign PGINVALID_SOCKET to it on Windows. So fix that.	2016-04-14 13:49:37 -04:00
Tom Lane	22989a8e34	Fix prototype of pgwin32_bind(). I (tgl) had copied-and-pasted this from pgwin32_accept(), failing to notice that the third parameter should be "int" not "int *". David Rowley	2016-04-14 09:44:21 -04:00
Andres Freund	be65eddd80	Add required database and origin filtering for logical messages. Logical messages, added in `3fe3511d05`, during decoding failed to filter messages emitted in other databases and messages emitted "under" a replication origin the output plugin isn't interested in. Add tests to verify that both types of filtering actually work. While touching message.sql remove hunk obsoleted by `d25379e`. Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because `3fe3511d05` had omitted doing so. `3fe3511d05` additionally didn't bump catversion, but `7a542700d` has done so since. Author: Petr Jelinek Reported-By: Andres Freund Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de	2016-04-13 17:38:54 -07:00
Andres Freund	80abbeba23	Make init_spin_delay() C89 compliant and change stuck spinlock reporting. The current definition of init_spin_delay (introduced recently in `48354581a`) wasn't C89 compliant. It's not legal to refer to non-constant expressions, and the ptr argument was one. This, as reported by Tom, led to a failure on buildfarm animal pademelon. The pointer, especially on systems with ASLR, isn't super helpful anyway, though. So instead of making init_spin_delay into an inline function, make s_lock_stuck() report the function name in addition to file:line and change init_spin_delay() accordingly. While not a direct replacement, the function name is likely more useful anyway (line numbers are often hard to interpret in third party reports). This also fixes what file/line number is reported for waits via s_lock(). As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h. Reported-By: Tom Lane Discussion: 4369.1460435533@sss.pgh.pa.us	2016-04-13 17:00:53 -07:00
Andres Freund	6b93fcd149	Avoid atomic operation in MarkLocalBufferDirty(). The recent patch to make Pin/UnpinBuffer lockfree in the hot path (`48354581a`) accidentally used pg_atomic_fetch_or_u32() in MarkLocalBufferDirty(). Other code operating on local buffers was careful to only use pg_atomic_read/write_u32, which just read/write from memory, to avoid unnecessary overhead. On its own that'd just make MarkLocalBufferDirty() slightly less efficient, but in addition InitLocalBuffers() doesn't call pg_atomic_init_u32() - thus the spinlock fallback for the atomic operations isn't initialized. That in turn caused, as reported by Tom, buildfarm animal gaur to fail. As those errors are actually useful against this type of error, continue to omit - intentionally this time - initialization of the atomic variable. In addition, add an explicit note about only using pg_atomic_read/write on local buffers' state to BufferDesc's description. Reported-By: Tom Lane Discussion: 1881.1460431476@sss.pgh.pa.us	2016-04-13 15:28:29 -07:00
Tom Lane	95ef43c430	Widen amount-to-flush arguments of FileWriteback and callers. It's silly to define these counts as narrower than they might someday need to be. Also, I believe that the BLCKSZ * nflush calculation in mdwriteback was capable of overflowing an int.	2016-04-13 18:12:06 -04:00
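
The overflow concern in the entry above can be illustrated with a short sketch. The names `BLOCK_SIZE` and `writeback_bytes` are hypothetical and the block size is an assumed value, so this is not the mdwriteback code itself. It only shows why a block-count-times-block-size product is safer when computed in a 64-bit type.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed block size for illustration only. */
#define BLOCK_SIZE 8192

/*
 * Number of bytes covered by nblocks blocks.  Casting one operand to
 * int64_t first makes the multiplication itself 64-bit, so the product
 * cannot wrap the way a product of two 32-bit ints could.
 */
static int64_t
writeback_bytes(int64_t nblocks)
{
    assert(nblocks >= 0);
    return (int64_t) BLOCK_SIZE * nblocks;
}
```

With these assumed values, 262144 blocks already give 2^31 bytes, one more than the largest value a 32-bit signed int can hold.
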
Tom Lane	d1b7d4877b	Provide errno-translation wrappers around bind() and listen() on Windows. I've seen one too many "could not bind IPv4 socket: No error" log entries from the Windows buildfarm members. Per previous discussion, this is likely caused by the fact that we're doing nothing to translate WSAGetLastError() to errno. Put in a wrapper layer to do that. If this works as expected, it should get back-patched, but let's see what happens in the buildfarm first. Discussion: <4065.1452450340@sss.pgh.pa.us>	2016-04-12 19:52:21 -04:00
Robert Haas	deb71fa971	Fix costing for parallel aggregation. The original patch kind of ignored the fact that we were doing something different from a costing point of view, but nobody noticed. This patch fixes that oversight. David Rowley	2016-04-12 16:25:55 -04:00
Tom Lane	f1f01de145	Redefine create_upper_paths_hook as being invoked once per upper relation. Per discussion, this gives potential users of the hook more flexibility, because they can build custom Paths that implement only one stage of upper processing atop core-provided Paths for earlier stages.	2016-04-12 15:23:14 -04:00
Tom Lane	5713f03973	Improve API of GenericXLogRegister(). Rename this function to GenericXLogRegisterBuffer() to make it clearer what it does, and leave room for other sorts of "register" actions in future. Also, replace its "bool isNew" argument with an integer flags argument, so as to allow adding more flags in future without an API break. Alexander Korotkov, adjusted slightly by me	2016-04-12 11:42:06 -04:00
Kevin Grittner	a6f6b78196	Use static inline function for BufferGetPage() I was initially concerned that some of the hundreds of references to BufferGetPage() where the literal BGP_NO_SNAPSHOT_TEST was passed might not optimize as well as a macro, leading to some hard-to-find performance regressions in corner cases. Inspection of disassembled code has shown identical code at all inspected locations, and the size difference doesn't amount to even one byte per such call. So make it readable. Per gripes from Álvaro Herrera and Tom Lane	2016-04-11 16:47:50 -05:00
Andres Freund	008608b9d5	Avoid the use of a separate spinlock to protect a LWLock's wait queue. Previously we used a spinlock, in addition to the atomically manipulated ->state field, to protect the wait queue. But it's pretty simple to instead perform the locking using a flag in state. Due to `6150a1b0` BufferDescs, on platforms (like PPC) with > 1 byte spinlocks, increased their size above 64 bytes. As 64 bytes are the size we pad allocated BufferDescs to, this can increase false sharing; causing performance problems in turn. Together with the previous commit this reduces the size to <= 64 bytes on all common platforms. Author: Andres Freund Discussion: CAA4eK1+ZeB8PMwwktf+3bRS0Pt4Ux6Rs6Aom0uip8c6shJWmyg@mail.gmail.com 20160327121858.zrmrjegmji2ymnvr@alap3.anarazel.de	2016-04-10 20:12:32 -07:00
Andres Freund	48354581a4	Allow Pin/UnpinBuffer to operate in a lockfree manner. Pinning/Unpinning a buffer is a very frequent operation; especially in read-mostly cache resident workloads. Benchmarking shows that in various scenarios the spinlock protecting a buffer header's state becomes a significant bottleneck. The problem can be reproduced with pgbench -S on larger machines, but can be considerably worse for queries which touch the same buffers over and over at a high frequency (e.g. nested loops over a small inner table). To allow atomic operations to be used, cram BufferDesc's flags, usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable; that allows manipulating them together using 32bit compare-and-swap operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could be lifted by using a 64bit field, but it's not a realistic configuration atm). As not all operations can easily be implemented in a lockfree manner, implement the previous buf_hdr_lock via a flag bit in the atomic variable. That way we can continue to lock the header in places where it's needed, but can get away without acquiring it in the more frequent hot-paths. There's some additional operations which can be done without the lock, but aren't in this patch; but the most important places are covered. As bufmgr.c now essentially re-implements spinlocks, abstract the delay logic from s_lock.c into something more generic. It already has two users, and more are coming up; there's a follow-up patch for lwlock.c at least. This patch is based on a proof-of-concept written by me, which Alexander Korotkov made into a fully working patch; the committed version is again revised by me. Benchmarking and testing has, amongst others, been provided by Dilip Kumar, Alexander Korotkov, Robert Haas. On a large x86 system improvements for readonly pgbench, with a high client count, of a factor of 8 have been observed. Author: Alexander Korotkov and Andres Freund Discussion: 2400449.GjM57CE0Yg@dinodell	2016-04-10 20:12:32 -07:00
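
The idea of packing a reference count and a lock flag into one 32-bit atomic word can be sketched with C11 atomics. This is a minimal illustration, not the bufmgr.c code: the type `PackedHeader`, the function names, and the exact bit positions are assumptions made for the example, and the usage count and other flags are left out. The only detail taken from the entry above is the 18-bit reference count implied by the MAX_BACKENDS limit of 2^18-1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative layout: low 18 bits = reference count, top bit = header lock. */
#define REFCOUNT_BITS  18
#define REFCOUNT_MASK  ((UINT32_C(1) << REFCOUNT_BITS) - 1)
#define REFCOUNT_ONE   UINT32_C(1)
#define HDR_LOCKED     (UINT32_C(1) << 31)

typedef struct PackedHeader
{
    _Atomic uint32_t state;
} PackedHeader;

/* Extract the reference count from the packed word. */
static uint32_t
header_refcount(PackedHeader *hdr)
{
    return atomic_load(&hdr->state) & REFCOUNT_MASK;
}

/*
 * Increment the reference count without taking the header lock.  The
 * compare-and-swap only succeeds if nobody changed the word in between;
 * while the lock bit is set we simply re-read and retry.
 */
static void
pin_lockfree(PackedHeader *hdr)
{
    uint32_t old = atomic_load(&hdr->state);

    for (;;)
    {
        if (old & HDR_LOCKED)
        {
            old = atomic_load(&hdr->state);
            continue;
        }
        if (atomic_compare_exchange_weak(&hdr->state, &old, old + REFCOUNT_ONE))
            break;
        /* On failure, 'old' now holds the current value; loop again. */
    }
}

/* Drop one reference; a single atomic subtraction is enough here. */
static void
unpin_lockfree(PackedHeader *hdr)
{
    assert((atomic_load(&hdr->state) & REFCOUNT_MASK) > 0);
    atomic_fetch_sub(&hdr->state, REFCOUNT_ONE);
}
```

A real implementation would back off while the lock bit is set rather than spin tightly, which is where the generic delay logic mentioned in the entry would come in.
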
Tom Lane	08e785436f	Get rid of GenericXLogUnregister(). This routine is unsafe as implemented, because it invalidates the page image pointers returned by previous GenericXLogRegister() calls. Rather than complicate the API or the implementation to avoid that, let's just get rid of it; the use-case for having it seems much too thin to justify a lot of work here. While at it, do some wordsmithing on the SGML docs for generic WAL.	2016-04-09 16:39:30 -04:00
Kevin Grittner	381200be4b	Fix typo in C comment.	2016-04-09 09:07:42 -05:00
Kevin Grittner	56dffb5a73	Turn special page pointer validation to static inline function Inclusion of multiple macros inside another macro was pushing MSVC past its size limit. Reported by buildfarm.	2016-04-09 08:17:22 -05:00
Alvaro Herrera	c09b18f21c	Support \crosstabview in psql \crosstabview is a completely different way to display results from a query: instead of a vertical display of rows, the data values are placed in a grid where the column and row headers come from the data itself, similar to a spreadsheet. The sort order of the horizontal header can be specified by using another column in the query, and the vertical header determines its ordering from the order in which they appear in the query. This only allows displaying a single value in each cell. If more than one value correspond to the same cell, an error is thrown. Merging of values can be done in the query itself, if necessary. This may be revisited in the future. Author: Daniel Verité Reviewed-by: Pavel Stehule, Dean Rasheed	2016-04-08 20:23:18 -03:00
Andres Freund	c1ddd2361f	Expose more out/readfuncs support functions. Previously `bcac23d` exposed a subset of support functions, namely the ones Kaigai found useful. In 20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de I mentioned that there's some functions missing to use the facility in an external project. To avoid having to add functions piecemeal, add all the functions which are used to define READ_* and WRITE_* macros; users of the extensible node functionality are likely to need these. Additionally expose outDatum(), which doesn't have its own WRITE_ macro, as it needs information from the embedding struct. Discussion: 20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de	2016-04-08 14:26:36 -07:00
Stephen Frost	7a542700df	Create default roles This creates an initial set of default roles which administrators may use to grant access to, historically, superuser-only functions. Using these roles instead of granting superuser access reduces the number of superuser roles required for a system. Documentation for each of the default roles has been added to user-manag.sgml. Bump catversion to 201604082, as we had a commit that bumped it to 201604081 and another that set it back to 201604071... Reviews by José Luis Tallón and Robert Haas	2016-04-08 16:56:27 -04:00
Stephen Frost	293007898d	Reserve the "pg_" namespace for roles This will prevent users from creating roles which begin with "pg_" and will check for those roles before allowing an upgrade using pg_upgrade. This will allow for default roles to be provided at initdb time. Reviews by José Luis Tallón and Robert Haas	2016-04-08 16:56:27 -04:00
Kevin Grittner	848ef42bb8	Add the "snapshot too old" feature This feature is controlled by a new old_snapshot_threshold GUC. A value of -1 disables the feature, and that is the default. The value of 0 is just intended for testing. Above that it is the number of minutes a snapshot can reach before pruning and vacuum are allowed to remove dead tuples which the snapshot would otherwise protect. The xmin associated with a transaction ID does still protect dead tuples. A connection which is using an "old" snapshot does not get an error unless it accesses a page modified recently enough that it might not be able to produce accurate results. This is similar to the Oracle feature, and we use the same SQLSTATE and error message for compatibility.	2016-04-08 14:36:30 -05:00
Kevin Grittner	8b65cf4c5e	Modify BufferGetPage() to prepare for "snapshot too old" feature This patch is a no-op patch which is intended to reduce the chances of failures of omission once the functional part of the "snapshot too old" patch goes in. It adds parameters for snapshot, relation, and an enum to specify whether the snapshot age check needs to be done for the page at this point. This initial patch passes NULL for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the third. The follow-on patch will change the places where the test needs to be made.	2016-04-08 14:30:10 -05:00
Teodor Sigaev	8b99edefca	Revert CREATE INDEX ... INCLUDING ... It's not ready yet, revert two commits `690c543550` - unstable test output `386e3d7609` - patch itself	2016-04-08 21:52:13 +03:00
Magnus Hagander	35e2e357cb	Add authentication parameters compat_realm and upn_username for SSPI These parameters are available for SSPI authentication only, to make it possible to make it behave more like "normal gssapi", while making it possible to maintain compatibility. compat_realm is on by default, but can be turned off to make the authentication use the full Kerberos realm instead of the NetBIOS name. upn_username is off by default, and can be turned on to return the user's Kerberos UPN rather than the SAM-compatible name (a user in Active Directory can have both a legacy SAM-compatible username and a new Kerberos one. Normally they are the same, but not always) Author: Christian Ullrich Reviewed by: Robbie Harwood, Alvaro Herrera, me	2016-04-08 20:28:38 +02:00
Tom Lane	34c33a1f00	Add BSD authentication method. Create a "bsd" auth method that works the same as "password" so far as clients are concerned, but calls the BSD Authentication service to check the password. This is currently only available on OpenBSD. Marisa Emerson, reviewed by Thomas Munro	2016-04-08 13:52:06 -04:00
Robert Haas	af025eed53	Add combine functions for various floating-point aggregates. This allows parallel aggregation to use them. It may seem surprising that we use float8_combine for both float4_accum and float8_accum transition functions, but that's because those functions differ only in the type of the non-transition-state argument. Haribabu Kommi, reviewed by David Rowley and Tomas Vondra	2016-04-08 13:47:06 -04:00
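
The role of a combine function in the entry above can be sketched as follows. This is an illustration under assumptions, not the float8_combine source: the struct `FloatAggState`, its three fields, and the function names are hypothetical. It shows why one combine routine can serve two transition functions that differ only in the type of the incoming value: both feed the same state, and combining two partial states is a field-by-field addition.

```c
#include <assert.h>

/* Hypothetical transition state: count, sum of values, sum of squares. */
typedef struct FloatAggState
{
    double n;
    double sum_x;
    double sum_x2;
} FloatAggState;

/* Transition step for a double-precision input value. */
static void
accum_float8(FloatAggState *state, double x)
{
    assert(state != 0);
    state->n += 1.0;
    state->sum_x += x;
    state->sum_x2 += x * x;
}

/* Transition step for a single-precision input; same state, wider math. */
static void
accum_float4(FloatAggState *state, float x)
{
    accum_float8(state, (double) x);
}

/* Merge two partial states, e.g. from two parallel workers. */
static FloatAggState
combine_states(FloatAggState a, FloatAggState b)
{
    FloatAggState out;

    out.n = a.n + b.n;
    out.sum_x = a.sum_x + b.sum_x;
    out.sum_x2 = a.sum_x2 + b.sum_x2;
    return out;
}
```

Each worker would run the transition step over its own rows, and the leader would merge the resulting states before computing the final value.
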
Teodor Sigaev	1ec4c7c055	Restore original tsquery operation numbering. As noticed by Tom Lane changing operation's number in commit `bb140506df` causes on-disk format incompatibility. Revert to previous numbering, that is reason to add special array to store priorities of operation. Also it reverts order of tsquery to previous. Author: Dmitry Ivanov	2016-04-08 20:11:30 +03:00
Teodor Sigaev	386e3d7609	CREATE INDEX ... INCLUDING (column[, ...]) Now indexes (but only B-tree for now) can contain "extra" column(s) which doesn't participate in index structure, they are just stored in leaf tuples. It allows to use index only scan by using single index instead of two or more indexes. Author: Anastasia Lubennikova with minor editorializing by me Reviewers: David Rowley, Peter Geoghegan, Jeff Janes	2016-04-08 19:45:59 +03:00
Robert Haas	25fe8b5f1a	Add a 'parallel_degree' reloption. The code that estimates what parallel degree should be used for the scan of a relation is currently rather stupid, so add a parallel_degree reloption that can be used to override the planner's rather limited judgement. Julien Rouhaud, reviewed by David Rowley, James Sewell, Amit Kapila, and me. Some further hacking by me.	2016-04-08 11:14:56 -04:00
Peter Eisentraut	2f1d2b7a75	Set PAM_RHOST item for PAM authentication The PAM_RHOST item is set to the remote IP address or host name and can be used by PAM modules. A pg_hba.conf option is provided to choose between IP address and resolved host name. From: Grzegorz Sampolski <grzsmp@gmail.com> Reviewed-by: Haribabu Kommi <kommi.haribabu@gmail.com>	2016-04-08 10:48:44 -04:00
Teodor Sigaev	4e55b3f033	Rename comparePos() to compareWordEntryPos() Rename comparePos() to compareWordEntryPos() to prevent export of too generic name. Per gripe from Tom Lane.	2016-04-08 12:04:15 +03:00
Robert Haas	0711803775	Use quicksort, not replacement selection, for external sorting. We still use replacement selection for the first run of the sort only and only when the number of tuples is relatively small. Otherwise, the first run, and subsequent runs in all cases, are produced using quicksort. This tends to be faster except perhaps for very small amounts of working memory. Peter Geoghegan, reviewed by Tomas Vondra, Jeff Janes, Mithun Cy, Greg Stark, and me.	2016-04-08 02:36:26 -04:00
Robert Haas	719c84c1be	Extend relations multiple blocks at a time to improve scalability. Contention on the relation extension lock can become quite fierce when multiple processes are inserting data into the same relation at the same time at a high rate. Experimentation shows the extending the relation multiple blocks at a time improves scalability. Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.	2016-04-08 02:04:46 -04:00
Simon Riggs	137805f89a	Use Foreign Key relationships to infer multi-column join selectivity In cases where joins use multiple columns we currently assess each join separately causing gross mis-estimates for join cardinality. This patch adds use of FK information for the first time into the planner. When FKs are present and we have multi-column join information, plan estimates will be drastically improved. Cases with multiple FKs are handled, though partial matches are ignored currently. Net effect is substantial performance improvements for joins in many common cases. Additional planning time is isolated to cases that are currently performing poorly, measured at 0.08 - 0.15 ms. Please watch for planner performance regressions; circumstances seem unlikely but the law of unintended consequences may apply somewhen. Additional complex tests welcome to prove this before release. Tests can be performed using SET enable_fkey_estimates = on | off using scripts provided during Hackers discussions, message id: 552335D9.3090707@2ndquadrant.com Authors: Tomas Vondra and David Rowley Reviewed and tested by Simon Riggs, adding comments only	2016-04-08 02:51:09 +01:00
Teodor Sigaev	bb140506df	Phrase full text search. Patch introduces new text search operator (<-> or <DISTANCE>) into tsquery. On-disk and binary in/out format of tsquery are backward compatible. It has two side effects: - the sort order of tsquery changes, so users who have a btree index over tsquery should reindex it - fewer parentheses appear in tsquery output, so tsquery becomes more readable Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov Reviewers: Alexander Korotkov, Artur Zakirov	2016-04-07 18:44:18 +03:00
Simon Riggs	015e88942a	Load FK defs into relcache for use by planner Fastpath ignores this if no triggers defined. Author: Tomas Vondra, with fastpath and comments added by me Reviewers: David Rowley, Simon Riggs	2016-04-07 12:08:33 +01:00
Stephen Frost	29dd1504a1	Bump catversion for pg_dump dump catalog ACL patches Pointed out by Tom.	2016-04-06 23:04:48 -04:00
Stephen Frost	23f34fa4ba	In pg_dump, include pg_catalog and extension ACLs, if changed Now that all of the infrastructure exists, add in the ability to dump out the ACLs of the objects inside of pg_catalog or the ACLs for objects which are members of extensions, but only if they have been changed from their original values. The original values are tracked in pg_init_privs. When pg_dump'ing 9.6-and-above databases, we will dump out the ACLs for all objects in pg_catalog and the ACLs for all extension members, where the ACL has been changed from the original value which was set during either initdb or CREATE EXTENSION. This should not change dumps against pre-9.6 databases. Reviews by Alexander Korotkov, Jose Luis Tallon	2016-04-06 21:45:32 -04:00
Stephen Frost	6c268df127	Add new catalog called pg_init_privs This new catalog holds the privileges which the system was initialized with at initdb time, along with any permissions set by extensions at CREATE EXTENSION time. This allows pg_dump (and any other similar use-cases) to detect when the privileges set on initdb-created or extension-created objects have been changed from what they were set to at initdb/extension-creation time and handle those changes appropriately. Reviews by Alexander Korotkov, Jose Luis Tallon	2016-04-06 21:45:32 -04:00
Teodor Sigaev	0b62fd036e	Add jsonb_insert It inserts a new value into an jsonb array at arbitrary position or a new key to jsonb object. Author: Dmitry Dolgov Reviewers: Petr Jelinek, Vitaly Burovoy, Andrew Dunstan	2016-04-06 19:25:00 +03:00
Tom Lane	de94e2af18	Run pgindent on a batch of (mostly-planner-related) source files. Getting annoyed at the amount of unrelated chatter I get from pgindent'ing Rowley's unique-joins patch. Re-indent all the files it touches.	2016-04-06 11:34:02 -04:00
Simon Riggs	3fe3511d05	Generic Messages for Logical Decoding API and mechanism to allow generic messages to be inserted into WAL that are intended to be read by logical decoding plugins. This commit adds an optional new callback to the logical decoding API. Messages are either text or bytea. Messages can be transactional, or not, and are identified by a prefix to allow multiple concurrent decoding plugins. (Not to be confused with Generic WAL records, which are intended to allow crash recovery of extensible objects.) Author: Petr Jelinek and Andres Freund Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs Discussion: 5685F999.6010202@2ndquadrant.com	2016-04-06 10:05:41 +01:00
Fujii Masao	989be0810d	Support multiple synchronous standby servers. Previously synchronous replication offered only the ability to confirm that all changes made by a transaction had been transferred to at most one synchronous standby server. This commit extends synchronous replication so that it supports multiple synchronous standby servers. It enables users to consider one or more standby servers as synchronous, and increase the level of transaction durability by ensuring that transaction commits wait for replies from all of those synchronous standbys. Multiple synchronous standby servers are configured in synchronous_standby_names which is extended to support new syntax of 'num_sync ( standby_name [ , ... ] )', where num_sync specifies the number of synchronous standbys that transaction commits need to wait for replies from and standby_name is the name of a standby server. The syntax of 'standby_name [ , ... ]' which was used in 9.5 or before is also still supported. It's the same as new syntax with num_sync=1. This commit doesn't include "quorum commit" feature which was discussed in pgsql-hackers. Synchronous standbys are chosen based on their priorities. synchronous_standby_names determines the priority of each standby for being chosen as a synchronous standby. The standbys whose names appear earlier in the list are given higher priority and will be considered as synchronous. Other standby servers appearing later in this list represent potential synchronous standbys. The regression test for multiple synchronous standbys is not included in this commit. It should come later. Authors: Sawada Masahiko, Beena Emerson, Michael Paquier, Fujii Masao Reviewed-By: Kyotaro Horiguchi, Amit Kapila, Robert Haas, Simon Riggs, Amit Langote, Thomas Munro, Sameer Thakur, Suraj Kharage, Abhijit Menon-Sen, Rajeev Rastogi Many thanks to the various individuals who were involved in discussing and developing this feature.	2016-04-06 17:18:25 +09:00
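
The priority rule described in the entry above can be sketched in C. This is a simplified illustration, not the server's selection code: `choose_sync_standbys` and its parameters are hypothetical, and the standbys are assumed to be given in the order their names appear in synchronous_standby_names. The first num_sync connected standbys in that order are treated as synchronous; connected standbys further down the list remain potential ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * connected[i] says whether the i-th listed standby is currently connected.
 * On return, is_sync[i] is true for the standbys chosen as synchronous.
 * The return value is how many were chosen, which can be less than
 * num_sync when too few standbys are connected.
 */
static int
choose_sync_standbys(const bool *connected, size_t nstandbys,
                     int num_sync, bool *is_sync)
{
    int    chosen = 0;
    size_t i;

    assert(connected != NULL && is_sync != NULL);

    for (i = 0; i < nstandbys; i++)
    {
        /* Earlier list position means higher priority. */
        is_sync[i] = connected[i] && chosen < num_sync;
        if (is_sync[i])
            chosen++;
    }
    return chosen;
}
```

For a list of four standbys with num_sync set to 2 and the second one disconnected, the first and third are chosen and the fourth stays a potential synchronous standby.
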
Alvaro Herrera	f2fcad27d5	Support ALTER THING .. DEPENDS ON EXTENSION This introduces a new dependency type which marks an object as depending on an extension, such that if the extension is dropped, the object automatically goes away; and also, if the database is dumped, the object is included in the dump output. Currently the grammar supports this for indexes, triggers, materialized views and functions only, although the utility code is generic so adding support for more object types is a matter of touching the parser rules only. Author: Abhijit Menon-Sen Reviewed-by: Alexander Korotkov, Álvaro Herrera Discussion: http://www.postgresql.org/message-id/20160115062649.GA5068@toroid.org	2016-04-05 18:38:54 -03:00
Robert Haas	41ea0c2376	Fix parallel-safety code for parallel aggregation. has_parallel_hazard() was ignoring the proparallel markings for aggregates, which is no good. Fix that. There was no way to mark an aggregate as actually being parallel-safe, either, so add a PARALLEL option to CREATE AGGREGATE. Patch by me, reviewed by David Rowley.	2016-04-05 16:06:15 -04:00
Robert Haas	11c8669c0c	Add parallel query support functions for assorted aggregates. This lets us use parallel aggregate for a variety of useful cases that didn't work before, like sum(int8), sum(numeric), several versions of avg(), and various other functions. Add some regression tests, as well, testing the general sanity of these and future catalog entries. David Rowley, reviewed by Tomas Vondra, with a few further changes by me.	2016-04-05 14:32:53 -04:00
Magnus Hagander	7117685461	Implement backup API functions for non-exclusive backups Previously non-exclusive backups had to be done using the replication protocol and pg_basebackup. With this commit it's now possible to make them using pg_start_backup/pg_stop_backup as well, as long as the backup program can maintain a persistent connection to the database. Doing this, backup_label and tablespace_map are returned as results from pg_stop_backup() instead of being written to the data directory. This makes the server safe from a crash during an ongoing backup, which can be a problem with exclusive backups. The old syntax of the functions remain and work exactly as before, but since the new syntax is safer this should eventually be deprecated and removed. Only reference documentation is included. The main section on backup still needs to be rewritten to cover this, but since that is already scheduled for a separate large rewrite, it's not included in this patch. Reviewed by David Steele and Amit Kapila	2016-04-05 20:03:49 +02:00
Tom Lane	66229ac004	Introduce a LOG_SERVER_ONLY ereport level, which is never sent to client. This elevel is useful for logging audit messages and similar information that should not be passed to the client. It's equivalent to LOG in terms of decisions about logging priority in the postmaster log, but messages with this elevel will never be sent to the client. In the current implementation, it's just an alias for the longstanding COMMERROR elevel (or more accurately, we've made COMMERROR an alias for this). At some point it might be interesting to allow a LOG_ONLY flag to be attached to any elevel, but that would be considerably more complicated, and it's not clear there's enough use-cases to justify the extra work. For now, let's just take the easy 90% solution. David Steele, reviewed by Fabien Coelho, Petr Jelínek, and myself	2016-04-04 12:32:42 -04:00
Teodor Sigaev	65578341af	Add Generic WAL interface This interface is designed to give access to WAL to extensions which could implement a new access method, for example. Previously it was impossible because restoring from custom WAL would need to access the system catalog to find a custom redo function. This patch provides a generic way to describe changes to pages with the standard layout. Bump XLOG_PAGE_MAGIC because of new record type. Author: Alexander Korotkov with the help of Petr Jelinek, Markus Nullmeier and minor editorization by me Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby, Michael Paquier	2016-04-01 12:21:48 +03:00
Tom Lane	f9aefcb91f	Support using index-only scans with partial indexes in more cases. Previously, the planner would reject an index-only scan if any restriction clause for its table used a column not available from the index, even if that restriction clause would later be dropped from the plan entirely because it's implied by the index's predicate. This is a fairly common situation for partial indexes because predicates using columns not included in the index are often the most useful kind of predicate, and we have to duplicate (or at least imply) the predicate in the WHERE clause in order to get the index to be considered at all. So index-only scans were essentially unavailable with such partial indexes. To fix, we have to do detection of implied-by-predicate clauses much earlier in the planner. This patch puts it in check_index_predicates (nee check_partial_indexes), meaning it gets done for every partial index, whereas we previously only considered this issue at createplan time, so that the work was only done for an index actually selected for use. That could result in a noticeable planning slowdown for queries against tables with many partial indexes. However, testing suggested that there isn't really a significant cost, especially not with reasonable numbers of partial indexes. We do get a small additional benefit, which is that cost_index is more accurate since it correctly discounts the evaluation cost of clauses that will be removed. We can also avoid considering such clauses as potential indexquals, which saves useless matching cycles in the case where the predicate columns aren't in the index, and prevents generating bogus plans that double-count the clause's selectivity when the columns are in the index. Tomas Vondra and Kyotaro Horiguchi, reviewed by Kevin Grittner and Konstantin Knizhnik, and whacked around a little by me	2016-03-31 14:49:10 -04:00
Alvaro Herrera	3dd0792ae0	Blind attempt at fixing Win32 issue on `24c5f1a103` As best as I can tell, MyReplicationSlot needs to be PGDLLIMPORT in order for the new test_slot_timelines test module to compile. Per buildfarm	2016-03-30 23:12:20 -03:00
Alvaro Herrera	24c5f1a103	Enable logical slots to follow timeline switches When decoding from a logical slot, it's necessary for xlog reading to be able to read xlog from historical (i.e. not current) timelines; otherwise, decoding fails after failover, because the archives are in the historical timeline. This is required to make "failover logical slots" possible; it currently has no other use, although theoretically it could be used by an extension that creates a slot on a standby and continues to replay from the slot when the standby is promoted. This commit includes a module in src/test/modules with functions to manipulate the slots (which is not otherwise possible in SQL code) in order to enable testing, and a new test in src/test/recovery to ensure that the behavior is as expected. Author: Craig Ringer Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek	2016-03-30 20:07:05 -03:00
Alvaro Herrera	3b02ea4f07	XLogReader general code cleanup Some minor tweaks and comment additions, for cleanliness sake and to avoid having the upcoming timeline-following patch be polluted with unrelated cleanup. Extracted from a larger patch by Craig Ringer, reviewed by Andres Freund, with some additions by myself.	2016-03-30 18:56:13 -03:00
Tom Lane	50861cd683	Improve portability of I/O behavior for the geometric types. Formerly, the geometric I/O routines such as box_in and point_out relied directly on strtod() and sprintf() for conversion of the float8 component values of their data types. However, the behavior of those functions is pretty platform-dependent, especially for edge-case values such as infinities and NaNs. This was exposed by commit `acdf2a8b37`, which added test cases involving boxes with infinity endpoints, and immediately failed on Windows and AIX buildfarm members. We solved these problems years ago in the main float8in and float8out functions, so let's fix it by making the geometric types use that code instead of depending directly on the platform-supplied functions. To do this, refactor the float8in code so that it can be used to parse just part of a string, and as a convenience make the guts of float8out usable without going through DirectFunctionCall. While at it, get rid of geo_ops.c's fairly shaky assumptions about the maximum output string length for a double, by having it build results in StringInfo buffers instead of fixed-length strings. In passing, convert all the "invalid input syntax for type foo" messages in this area of the code into "invalid input syntax for type %s" to reduce the number of distinct translatable strings, per recent discussion. We would have needed a fair number of the latter anyway for code-sharing reasons, so we might as well just go whole hog. Note: this patch is by no means intended to guarantee that the geometric types uniformly behave sanely for infinity or NaN component values. But any bugs we have in that line were there all along, they were just harder to reach in a platform-independent way.	2016-03-30 17:25:03 -04:00
Teodor Sigaev	2d02a856e8	Bump catalog version, forgotten in `acdf2a8b37`	2016-03-30 18:56:21 +03:00
Teodor Sigaev	acdf2a8b37	Introduce SP-GiST operator class over box. The patch implements a quad-tree over boxes. The naive approach of a 2D quad tree does not work for non-point objects because splitting space at a node is not efficient. The idea of the patch is to treat 2D boxes as 4D points, so that objects do not overlap (in 4D space). Performance tests reveal that this technique is especially beneficial with heavily overlapping objects, so-called "spaghetti data". Author: Alexander Lebedev with editorization by Emre Hasegeli and me	2016-03-30 18:42:36 +03:00
Teodor Sigaev	ccd6eb49a4	Introduce traversalValue for SP-GiST scan. During a scan it would sometimes be very helpful to know some information about the parent node or all ancestor nodes. Right now reconstructedValue could be used, but that is not a proper use of it (range opclass uses that). traversalValue is an arbitrary piece of memory in a separate MemoryContext, while reconstructedValue should have the same type as the indexed column. Subsequent patches for range opclass and quad4d tree will use it. Author: Alexander Lebedev, Teodor Sigaev	2016-03-30 18:29:28 +03:00
Robert Haas	314cbfc5da	Add new replication mode synchronous_commit = 'remote_apply'. In this mode, the master waits for the transaction to be applied on the remote side, not just written to disk. That means that you can count on a transaction started on the standby to see all commits previously acknowledged by the master. To make this work, the standby sends a reply after replaying each commit record generated with synchronous_commit >= 'remote_apply'. This introduces a small inefficiency: the extra replies will be sent even by standbys that aren't the current synchronous standby. But previously-existing synchronous_commit levels make no attempt at all to optimize which replies are sent based on what the primary cares about, so this is no worse, and at least avoids any extra replies for people not using the feature at all. Thomas Munro, reviewed by Michael Paquier and by me. Some additional tweaks by me.	2016-03-29 21:29:49 -04:00
Tom Lane	e511d878f3	Allow to_timestamp(float8) to convert float infinity to timestamp infinity. With the original SQL-function implementation, such cases failed because we don't support infinite intervals. Converting the function to C lets us bypass the interval representation, which should be a bit faster as well as more flexible. Vitaly Burovoy, reviewed by Anastasia Lubennikova	2016-03-29 17:09:29 -04:00
Robert Haas	5fe5a2cee9	Allow aggregate transition states to be serialized and deserialized. This is necessary infrastructure for supporting parallel aggregation for aggregates whose transition type is "internal". Such values can't be passed between cooperating processes, because they are just pointers. David Rowley, reviewed by Tomas Vondra and by me.	2016-03-29 15:04:05 -04:00
Robert Haas	f9143d102f	Rework custom scans to work more like the new extensible node stuff. Per discussion, the new extensible node framework is thought to be better designed than the custom path/scan/scanstate stuff we added in PostgreSQL 9.5. Rework the latter to be more like the former. This is not backward-compatible, but we generally don't promise that for C APIs, and there probably aren't many people using this yet anyway. KaiGai Kohei, reviewed by Petr Jelinek and me. Some further cosmetic changes by me.	2016-03-29 11:28:04 -04:00
Robert Haas	5d4171d1c7	Don't require a user mapping for FDWs to work. Commit `fbe5a3fb73` accidentally changed this behavior; put things back the way they were, and add some regression tests. Report by Andres Freund; patch by Ashutosh Bapat, with a bit of kibitzing by me.	2016-03-28 21:50:28 -04:00
Robert Haas	868628e4fd	On all Windows platforms, not just Cygwin, use _timezone and _tzname. Up until now, we've been using timezone and tzname, but Visual Studio 2015 (for which we wish to add support) no longer declares those symbols. All versions since Visual Studio 2003 apparently support the underscore-equipped names, and we don't support anything older than Visual Studio 2005, so this should work OK everywhere. But let's see what the buildfarm thinks. Michael Paquier, reviewed by Petr Jelinek	2016-03-28 20:59:25 -04:00
Andres Freund	1a7a43672b	Don't use !! but != 0/NULL to force boolean evaluation. I introduced several uses of !! to force bit arithmetic to be boolean, but per discussion the project prefers != 0/NULL. Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com	2016-03-27 18:10:19 +02:00
Andres Freund	af4472bcb8	Change various GinIs macros to return 0/1. Returning the direct result of bit arithmetic, in a macro intended to be used in a boolean manner, can be problematic if the return value is stored in a variable of type 'bool'. If bool is implemented using C99's _Bool, that can lead to comparison failures if the variable is then compared again with the expression (see ginStepRight() for an example that fails), as _Bool forces the result to be 0/1. That happens in some configurations of newer MSVC compilers. It's also problematic when storing the result of such an expression in a narrower type. Several gin macros have been declared in that style since gin's initial commit in `8a3631f8d8`. There's a lot more macros like this, but this is the only one causing regression test failures; and I don't want to commit and backpatch a larger patch with lots of conflicts just before the next set of minor releases. Discussion: 20150811154237.GD17575@awork2.anarazel.de Backpatch: All supported branches	2016-03-27 17:46:48 +02:00
Tom Lane	c94959d411	Fix DROP OPERATOR to reset oprcom/oprnegate links to the dropped operator. This avoids leaving dangling links in pg_operator; which while fairly harmless are also unsightly. While we're at it, simplify OperatorUpd, which went through heap_modify_tuple for no very good reason considering it had already made a tuple copy it could just scribble on. Roma Sokolov, reviewed by Tomas Vondra, additional hacking by Robert Haas and myself.	2016-03-25 12:33:16 -04:00
Tom Lane	c1156411ad	Move psql's psqlscan.l into src/fe_utils. This completes (at least for now) the project of getting rid of ad-hoc linkages among the src/bin/ subdirectories. Everything they share is now in src/fe_utils/ and is included from a static library at link time. A side benefit is that we can restore the FLEX_NO_BACKUP check for psqlscanslash.l. We might need to think of another way to do that check if we ever need to build two lexers with that property in the same source directory, but there's no foreseeable reason to need that.	2016-03-24 20:28:47 -04:00
Tom Lane	d65bea26a8	Move psql's print.c and mbprint.c into src/fe_utils. Just turning the crank ...	2016-03-24 18:27:28 -04:00
Tom Lane	588d963b00	Create src/fe_utils/, and move stuff into there from pg_dump's dumputils. Per discussion, we want to create a static library and put the stuff into it that until now has been shared across src/bin/ directories by ad-hoc methods like symlinking a source file. This commit creates the library and populates it with a couple of files that contain the widely-useful portions of pg_dump's dumputils.c file. dumputils.c survives, because it has some stuff that didn't seem appropriate for fe_utils, but it's significantly smaller and is no longer referenced from any other directory. Follow-on patches will move more stuff into fe_utils. The Mkvcbuild.pm hacking here is just a best guess; we'll see how the buildfarm likes it.	2016-03-24 15:55:57 -04:00
Tom Lane	c2d1eea9e7	Avoid PGDLLIMPORT for simple local references in frontend programs. I was wondering if this would be an issue, and buildfarm member frogmouth says it is.	2016-03-23 23:26:44 -04:00
Alvaro Herrera	473b932870	Support CREATE ACCESS METHOD This enables external code to create access methods. This is useful so that extensions can add their own access methods which can be formally tracked for dependencies, so that DROP operates correctly. Also, having explicit support makes pg_dump work correctly. Currently only index AMs are supported, but we expect different types to be added in the future. Authors: Alexander Korotkov, Petr Jelínek Reviewed-By: Teodor Sigaev, Petr Jelínek, Jim Nasby Commitfest-URL: https://commitfest.postgresql.org/9/353/ Discussion: https://www.postgresql.org/message-id/CAPpHfdsXwZmojm6Dx+TJnpYk27kT4o7Ri6X_4OSWcByu1Rm+VA@mail.gmail.com	2016-03-23 23:01:35 -03:00
Tom Lane	2c6af4f442	Move keywords.c/kwlookup.c into src/common/. Now that we have src/common/ for code shared between frontend and backend, we can get rid of (most of) the klugy ways that the keyword table and keyword lookup code were formerly shared between different uses. This is a first step towards a more general plan of getting rid of special-purpose kluges for sharing code in src/bin/. I chose to merge kwlookup.c back into keywords.c, as it once was, and always has been so far as keywords.h is concerned. We could have kept them separate, but there is noplace that uses ScanKeywordLookup without also wanting access to the backend's keyword list, so there seems little point. ecpg is still a bit weird, but at least now the trickiness is documented. I think that the MSVC build script should require no adjustments beyond what's done here ... but we'll soon find out.	2016-03-23 20:22:08 -04:00
Robert Haas	e06a38965b	Support parallel aggregation. Parallel workers can now partially aggregate the data and pass the transition values back to the leader, which can combine the partial results to produce the final answer. David Rowley, based on earlier work by Haribabu Kommi. Reviewed by Álvaro Herrera, Tomas Vondra, Amit Kapila, James Sewell, and me.	2016-03-21 09:30:18 -04:00
Andres Freund	7fa0064092	Properly declare FeBeWaitSet. Surprising that this worked on a number of systems. Reported by buildfarm member longfin.	2016-03-21 12:58:18 +01:00
Andres Freund	98a64d0bd7	Introduce WaitEventSet API. Commit `ac1d794` ("Make idle backends exit if the postmaster dies.") introduced a regression on, at least, large linux systems. Constantly adding the same postmaster_alive_fds to the OSs internal datastructures for implementing poll/select can cause significant contention; leading to a performance regression of nearly 3x in one example. This can be avoided by using e.g. linux' epoll, which avoids having to add/remove file descriptors to the wait datastructures at a high rate. Unfortunately the current latch interface makes it hard to allocate any persistent per-backend resources. Replace, with a backward compatibility layer, WaitLatchOrSocket with a new WaitEventSet API. Users can allocate such a Set across multiple calls, and add more than one file-descriptor to wait on. The latter has been added because there's upcoming postgres features where that will be helpful. In addition to the previously existing poll(2), select(2), WaitForMultipleObjects() implementations also provide an epoll_wait(2) based implementation to address the aforementioned performance problem. Epoll is only available on linux, but that is the most likely OS for machines large enough (four sockets) to reproduce the problem. To actually address the aforementioned regression, create and use a long-lived WaitEventSet for FE/BE communication. There are additional places that would benefit from a long-lived set, but that's a task for another day. Thanks to Amit Kapila, who helped make the windows code I blindly wrote actually work. Reported-By: Dmitry Vasilyev Discussion: CAB-SwXZh44_2ybvS5Z67p_CDz=XFn4hNAD=CnMEF+QqkXwFrGg@mail.gmail.com 20160114143931.GG10941@awork2.anarazel.de	2016-03-21 12:22:54 +01:00
Andres Freund	72e2d21c12	Combine win32 and unix latch implementations. Previously latches for windows and unix had been implemented in different files. A later patch introduces an expanded wait infrastructure; keeping the implementation separate would introduce too much duplication. This basically just moves the functions, without too much change. The reason to keep this separate is that it allows blame to continue working a little less badly; and to make review a tiny bit easier. Discussion: 20160114143931.GG10941@awork2.anarazel.de	2016-03-21 11:03:26 +01:00
Peter Eisentraut	b555ed8102	Merge wal_level "archive" and "hot_standby" into new name "replica" The distinction between "archive" and "hot_standby" existed only because at the time "hot_standby" was added, there was some uncertainty about stability. This is now a long time ago. We would like to move forward with simplifying the replication configuration, but this distinction is in the way, because a primary server cannot tell (without asking a standby or predicting the future) which one of these would be the appropriate level. Pick a new name for the combined setting to make it clearer that it covers all (non-logical) backup and replication uses. The old values are still accepted but are converted internally. Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reviewed-by: David Steele <david@pgmasters.net>	2016-03-18 23:56:03 +01:00
Andres Freund	fad0f9d8c9	Remove unused, and dangerous, TestLatch() macro. The macro has not seen any in-tree use since latches had been introduced in `2746e5f`, in 2010.	2016-03-18 11:46:42 -07:00
Robert Haas	0bf3ae88af	Directly modify foreign tables. postgres_fdw can now send an UPDATE or DELETE statement directly to the foreign server in simple cases, rather than sending a SELECT FOR UPDATE statement and then updating or deleting rows one-by-one. Etsuro Fujita, reviewed by Rushabh Lathia, Shigeru Hanada, Kyotaro Horiguchi, Albe Laurenz, Thom Brown, and me.	2016-03-18 13:55:52 -04:00
Teodor Sigaev	3187d6de0e	Introduce parse_ident() SQL-layer function to split qualified identifier into array parts. Author: Pavel Stehule with minor editorization by me and Jim Nasby	2016-03-18 18:16:14 +03:00
Teodor Sigaev	f4ceed6ceb	Improve support of Hunspell: - allow to use non-ASCII characters as affix flag. Non-numeric affix flags now are stored as string instead of numeric value of character. - allow to use 0 as affix flag in numeric encoded affixes. That adds support for Arabic, Hungarian, Turkish and Brazilian Portuguese languages. Author: Artur Zakirov with heavy editorization by me	2016-03-17 17:23:38 +03:00
Peter Eisentraut	fc201dfd95	Add syslog_split_messages parameter Reviewed-by: Andreas Karlsson <andreas@proxel.se>	2016-03-16 23:21:44 -04:00
Peter Eisentraut	f4c454e9ba	Add syslog_sequence_numbers parameter Reviewed-by: Andreas Karlsson <andreas@proxel.se>	2016-03-16 23:21:44 -04:00
Tom Lane	a70e13a39e	Be more careful about out-of-range dates and timestamps. Tighten the semantics of boundary-case timestamptz so that we allow timestamps >= '4714-11-24 00:00+00 BC' and < 'ENDYEAR-01-01 00:00+00 AD' exactly, no more and no less, but it is allowed to enter timestamps within that range using non-GMT timezone offsets (which could make the nominal date 4714-11-23 BC or ENDYEAR-01-01 AD). This eliminates dump/reload failure conditions for timestamps near the endpoints. To do this, separate checking of the inputs for date2j() from the final range check, and allow the Julian date code to handle a range slightly wider than the nominal range of the datatypes. Also add a bunch of checks to detect out-of-range dates and timestamps that formerly could be returned by operations such as date-plus-integer. All C-level functions that return date, timestamp, or timestamptz should now be proof against returning a value that doesn't pass IS_VALID_DATE() or IS_VALID_TIMESTAMP(). Vitaly Burovoy, reviewed by Anastasia Lubennikova, and substantially whacked around by me	2016-03-16 19:09:28 -04:00
Robert Haas	c6dda1f48e	Add idle_in_transaction_session_timeout. Vik Fearing, reviewed by Stéphane Schildknecht and me, and revised slightly by me.	2016-03-16 11:30:45 -04:00
Robert Haas	3aff33aa68	Fix typos. Oskari Saarenmaa	2016-03-15 18:06:11 -04:00
Robert Haas	c16dc1aca5	Add simple VACUUM progress reporting. There's a lot more that could be done here yet - in particular, this reports only very coarse-grained information about the index vacuuming phase - but even as it stands, the new pg_stat_progress_vacuum can tell you quite a bit about what a long-running vacuum is actually doing. Amit Langote and Robert Haas, based on earlier work by Vinayak Pokale and Rahila Syed.	2016-03-15 13:32:56 -04:00
Tom Lane	0e9b89986b	Cope if platform declares mbstowcs_l(), but not locale_t, in <xlocale.h>. Previously, we included <xlocale.h> only if necessary to get the definition of type locale_t. According to notes in PGAC_TYPE_LOCALE_T, this is important because on some versions of glibc that file supplies an incompatible declaration of locale_t. (This info may be obsolete, because on my RHEL6 box that seems to be the only definition of locale_t; but there may still be glibc's in the wild for which it's a live concern.) It turns out though that on FreeBSD and maybe other BSDen, you can get locale_t from stdlib.h or locale.h but mbstowcs_l() and friends only from <xlocale.h>. This was leaving us compiling calls to mbstowcs_l() and friends with no visible prototype, which causes a warning and could possibly cause actual trouble, since it's not declared to return int. Hence, adjust the configure checks so that we'll include <xlocale.h> either if it's necessary to get type locale_t or if it's necessary to get a declaration of mbstowcs_l(). Report and patch by Aleksander Alekseev, somewhat whacked around by me. Back-patch to all supported branches, since we have been using mbstowcs_l() since 9.1.	2016-03-15 13:19:57 -04:00
Tom Lane	101fd9349e	Add a GetForeignUpperPaths callback function for FDWs. This is basically like the just-added create_upper_paths_hook, but control is funneled only to the FDW responsible for all the baserels of the current query; so providing such a callback is much less likely to add useless overhead than using the hook function is. The documentation is a bit sketchy. We'll likely want to improve it, and/or adjust the call conventions, when we get some experience with actually using this callback. Hopefully somebody will find time to experiment with it before 9.6 feature freeze.	2016-03-14 20:04:48 -04:00
Peter Eisentraut	be6de4c121	Add missing include for self-containment	2016-03-14 19:56:33 -04:00
Tom Lane	5864d6a4b6	Provide a planner hook at a suitable place for creating upper-rel Paths. In the initial revision of the upper-planner pathification work, the only available way for an FDW or custom-scan provider to inject Paths representing post-scan-join processing was to insert them during scan-level GetForeignPaths or similar processing. While that's not impossible, it'd require quite a lot of duplicative processing to look forward and see if the extension would be capable of implementing the whole query. To improve matters for custom-scan providers, provide a hook function at the point where the core code is about to start filling in upperrel Paths. At this point Paths are available for the whole scan/join tree, which should reduce the amount of redundant effort considerably. (An alternative design that was suggested was to provide a separate hook for each post-scan-join processing step, but that seems messy and not clearly more useful.) Following our time-honored tradition, there's no documentation for this hook outside the source code. As-is, this hook is only meant for custom scan providers, which we can't assume very much about. A followon patch will implement an FDW callback to let FDWs do the same thing in a somewhat more structured fashion.	2016-03-14 19:23:29 -04:00
Tom Lane	28048cbaa2	Allow callers of create_foreignscan_path to specify nondefault PathTarget. Although the default choice of rel->reltarget should typically be sufficient for scan or join paths, it's not at all sufficient for the purposes PathTargets were invented for; in particular not for upper-relation Paths. So break API compatibility by adding a PathTarget argument to create_foreignscan_path(). To ease updating of existing code, accept a NULL value of the argument as selecting rel->reltarget.	2016-03-14 17:31:28 -04:00
Tom Lane	307c78852f	Rethink representation of PathTargets. In commit `19a541143a` I did not make PathTarget a subtype of Node, and embedded a RelOptInfo's reltarget directly into it rather than having a separately-allocated Node. In hindsight that was misguided micro-optimization, enabled by the fact that at that point we didn't have any Paths with custom PathTargets. Now that PathTarget processing has been fleshed out some more, it's easier to see that it's better to have PathTarget as an independent Node type, even if it does cost us one more palloc to create a RelOptInfo. So change it while we still can. This commit just changes the representation, without doing anything more interesting than that.	2016-03-14 16:59:59 -04:00
Robert Haas	6be84eeb8d	Update more comments for `96198d94cb`. Etsuro Fujita, reviewed (though not completely endorsed) by Ashutosh Bapat, and slightly expanded by me.	2016-03-14 14:29:12 -04:00
Tom Lane	2da7549987	pg_stat_get_progress_info() should be marked STRICT. I didn't bother with a catversion bump. Report and patch by Thomas Munro	2016-03-14 12:51:55 -04:00
Tom Lane	23a27b039d	Widen query numbers-of-tuples-processed counters to uint64. This patch widens SPI_processed, EState's es_processed field, PortalData's portalPos field, FuncCallContext's call_cntr and max_calls fields, ExecutorRun's count argument, PortalRunFetch's result, and the max number of rows in a SPITupleTable to uint64, and deals with (I hope) all the ensuing fallout. Some of these values were declared uint32 before, and others "long". I also removed PortalData's posOverflow field, since that logic seems pretty useless given that portalPos is now always 64 bits. The user-visible results are that command tags for SELECT etc will correctly report tuple counts larger than 4G, as will plpgsql's GET DIAGNOSTICS ... ROW_COUNT command. Queries processing more tuples than that are still not exactly the norm, but they're becoming more common. Most values associated with FETCH/MOVE distances, such as PortalRun's count argument and the count argument of most SPI functions that have one, remain declared as "long". It's not clear whether it would be worth promoting those to int64; but it would definitely be a large dollop of additional API churn on top of this, and it would only help 32-bit platforms which seem relatively less likely to see any benefit. Andreas Scherbaum, reviewed by Christian Ullrich, additional hacking by me	2016-03-12 16:05:29 -05:00
Tom Lane	570be1f73f	Re-export a few of createplan.c's make_xxx() functions. CitusDB is using these and don't wish to redesign their code right now. I am not on board with this being a good idea, or a good precedent, but I lack the energy to fight about it.	2016-03-12 12:12:59 -05:00
Teodor Sigaev	a9eb6c83ef	Bump catalog version missed in `6943a946c7`	2016-03-11 19:31:04 +03:00
Teodor Sigaev	6943a946c7	Tsvector editing functions. Adds several tsvector editing functions: convert tsvector to/from text array, set weight for given lexemes, delete lexeme(s), unnest, filter lexemes with given weights. Author: Stas Kelvich with some editorization by me. Reviewers: Tomas Vondram, Teodor Sigaev	2016-03-11 19:22:36 +03:00
Tom Lane	49635d7b3e	Minor additional refactoring of planner.c's PathTarget handling. Teach make_group_input_target() and make_window_input_target() to work entirely with the PathTarget representation of tlists, rather than constructing a tlist and immediately deconstructing it into PathTarget format. In itself this only saves a few palloc's; the bigger picture is that it opens the door for sharing cost_qual_eval work across all of planner.c's constructions of PathTargets. I'll come back to that later. In support of this, flesh out tlist.c's infrastructure for PathTargets a bit more.	2016-03-11 10:24:55 -05:00
Simon Riggs	73e7e49da3	Allow emit_log_hook to see original message text emit_log_hook could only see the translated text, making it harder to identify which message was being sent. Pass original text to allow the exact message to be identified, whichever language is used for logging. Discussion: 20160216.184755.59721141.horiguchi.kyotaro@lab.ntt.co.jp Author: Kyotaro Horiguchi	2016-03-11 09:53:06 +00:00
Andres Freund	9cd00c457e	Checkpoint sorting and balancing. Up to now checkpoints were written in the order they're in the BufferDescriptors. That's nearly random in a lot of cases, which performs badly on rotating media, but even on SSDs it causes slowdowns. To avoid that, sort checkpoints before writing them out. We currently sort by tablespace, relfilenode, fork and block number. One of the major reasons that previously wasn't done, was fear of imbalance between tablespaces. To address that balance writes between tablespaces. The other prime concern was that the relatively large allocation to sort the buffers in might fail, preventing checkpoints from happening. Thus pre-allocate the required memory in shared memory, at server startup. This particularly makes it more efficient to have checkpoint flushing enabled, because that'll often result in a lot of writes that can be coalesced into one flush. Discussion: alpine.DEB.2.10.1506011320000.28433@sto Author: Fabien Coelho and Andres Freund	2016-03-10 17:05:09 -08:00
Andres Freund	428b1d6b29	Allow to trigger kernel writeback after a configurable number of writes. Currently writes to the main data files of postgres all go through the OS page cache. This means that some operating systems can end up collecting a large number of dirty buffers in their respective page caches. When these dirty buffers are flushed to storage rapidly, be it because of fsync(), timeouts, or dirty ratios, latency for other reads and writes can increase massively. This is the primary reason for regular massive stalls observed in real world scenarios and artificial benchmarks; on rotating disks stalls on the order of hundreds of seconds have been observed. On linux it is possible to control this by reducing the global dirty limits significantly, reducing the above problem. But global configuration is rather problematic because it'll affect other applications; also PostgreSQL itself doesn't always generally want this behavior, e.g. for temporary files it's undesirable. Several operating systems allow some control over the kernel page cache. Linux has sync_file_range(2), several posix systems have msync(2) and posix_fadvise(2). sync_file_range(2) is preferable because it requires no special setup, whereas msync() requires the to-be-flushed range to be mmap'ed. For the purpose of flushing dirty data posix_fadvise(2) is the worst alternative, as flushing dirty data is just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages from the page cache. Thus the feature is enabled by default only on linux, but can be enabled on all systems that have any of the above APIs. While desirable and likely possible this patch does not contain an implementation for windows. With the infrastructure added, writes made via checkpointer, bgwriter and normal user backends can be flushed after a configurable number of writes. 
Each of these sources of writes is controlled by a separate GUC, checkpointer_flush_after, bgwriter_flush_after and backend_flush_after respectively; they're separate because the number of flushes that are good are separate, and because the performance considerations of controlled flushing for each of these are different. A later patch will add checkpoint sorting - after that flushes from the checkpoint will almost always be desirable. Bgwriter flushes are most of the time going to be random, which are slow on lots of storage hardware. Flushing in backends works well if the storage and bgwriter can keep up, but if not it can have negative consequences. This patch is likely to have negative performance consequences without checkpoint sorting, but unfortunately so has sorting without flush control. Discussion: alpine.DEB.2.10.1506011320000.28433@sto Author: Fabien Coelho and Andres Freund	2016-03-10 17:04:34 -08:00
Tom Lane	c82c92b111	Give pull_var_clause() reject/recurse/return behavior for WindowFuncs too. All along, this function should have treated WindowFuncs in a manner similar to Aggrefs, ie with an option whether or not to recurse into them. By not considering the case, it was always recursing, which is OK for most callers (although I suspect that the case in prepare_sort_from_pathkeys might represent a bug). But now we need return-without-recursing behavior as well. There are also more than a few callers that should never see a WindowFunc, and now we'll get some error checking on that.	2016-03-10 16:23:52 -05:00
Tom Lane	364a9f47ab	Refactor pull_var_clause's API to make it less tedious to extend. In commit `1d97c19a0f` and later `c1d9579dd8`, we extended pull_var_clause's API by adding enum-type arguments. That's sort of a pain to maintain, though, because it means every time we add a new behavior we must touch every last one of the call sites, even if there's a reasonable default behavior that most of them could use. Let's switch over to using a bitmask of flags, instead; that seems more maintainable and might save a nanosecond or two as well. This commit changes no behavior in itself, though I'm going to follow it up with one that does add a new behavior. In passing, remove flatten_tlist(), which has not been used since 9.1 and would otherwise need the same API changes. Removing these enums means that optimizer/tlist.h no longer needs to depend on optimizer/var.h. Changing that caused a number of C files to need addition of #include "optimizer/var.h" (probably we can thank old runs of pgrminclude for that); but on balance it seems like a good change anyway.	2016-03-10 15:53:07 -05:00
Simon Riggs	37c54863cf	Rework wait for AccessExclusiveLocks on Hot Standby. Earlier version committed in 9.0 caused spurious waits in some cases. New infrastructure for lock waits in 9.3 used to correct and improve this. Jeff Janes, based upon a proposal by Simon Riggs, who also reviewed. Additional review comments from Amit Kapila.	2016-03-10 19:26:24 +00:00
Robert Haas	53be0b1add	Provide much better wait information in pg_stat_activity. When a process is waiting for a heavyweight lock, we will now indicate the type of heavyweight lock for which it is waiting. Also, you can now see when a process is waiting for a lightweight lock - in which case we will indicate the individual lock name or the tranche, as appropriate - or for a buffer pin. Amit Kapila, Ildus Kurbangaliev, reviewed by me. Lots of helpful discussion and suggestions by many others, including Alexander Korotkov, Vladimir Borodin, and many others.	2016-03-10 12:44:09 -05:00
Andres Freund	606e0f9841	Introduce durable_rename() and durable_link_or_rename(). Renaming a file using rename(2) is not guaranteed to be durable in face of crashes; especially on filesystems like xfs and ext4 when mounted with data=writeback. To be certain that a rename() atomically replaces the previous file contents in the face of crashes and different filesystems, one has to fsync the old filename, rename the file, fsync the new filename, fsync the containing directory. This sequence is not generally adhered to currently; which exposes us to data loss risks. To avoid having to repeat this arduous sequence, introduce durable_rename(), which wraps all that. Also add durable_link_or_rename(). Several places use link() (with a fallback to rename()) to rename a file, trying to avoid replacing the target file out of paranoia. Some of those rename sequences need to be durable as well. There seems little reason to extend several copies of the same logic, so centralize the link() callers. This commit does not yet make use of the new functions; they're used in a followup commit. Author: Michael Paquier, Andres Freund Discussion: 56583BDD.9060302@2ndquadrant.com Backpatch: All supported branches	2016-03-09 18:53:53 -08:00
Robert Haas	b6fb6471f6	Add a generic command progress reporting facility. Using this facility, any utility command can report the target relation upon which it is operating, if there is one, and up to 10 64-bit counters; the intent of this is that users should be able to figure out what a utility command is doing without having to resort to ugly hacks like attaching strace to a backend. As a demonstration, this adds very crude reporting to lazy vacuum; we just report the target relation and nothing else. A forthcoming patch will make VACUUM report a bunch of additional data that will make this much more interesting. But this gets the basic framework in place. Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao, and Masanori Oyama.	2016-03-09 12:08:58 -05:00
Tom Lane	51c0f63e4d	Improve handling of pathtargets in planner.c. Refactor so that the internal APIs in planner.c deal in PathTargets not targetlists, and establish a more regular structure for deriving the targets needed for successive steps. There is more that could be done here; calculating the eval costs of each successive target independently is both inefficient and wrong in detail, since we won't actually recompute values available from the input node's tlist. But it's no worse than what happened before the pathification rewrite. In any case this seems like a good starting point for considering how to handle Konstantin Knizhnik's function-evaluation-postponement patch.	2016-03-09 01:12:16 -05:00
Tom Lane	9e8b99420f	Improve handling of group-column indexes in GroupingSetsPath. Instead of having planner.c compute a groupColIdx array and store it in GroupingSetsPaths, make create_groupingsets_plan() find the grouping columns by searching in the child plan node's tlist. Although that's probably a bit slower for create_groupingsets_plan(), it's more like the way every other plan node type does this, and it provides positive confirmation that we know which child output columns we're supposed to be grouping on. (Indeed, looking at this now, I'm not at all sure that it wasn't broken before, because create_groupingsets_plan() isn't demanding an exact tlist match from its child node.) Also, this allows substantial simplification in planner.c, because it no longer needs to compute the groupColIdx array at all; no other cases were using it. I'd intended to put off this refactoring until later (like 9.7), but in view of the likely bug fix and the need to rationalize planner.c's tlist handling so we can do something sane with Konstantin Knizhnik's function-evaluation-postponement patch, I think it can't wait.	2016-03-08 22:32:11 -05:00
Tom Lane	8c314b9853	Finish refactoring make_foo() functions in createplan.c. This patch removes some redundant cost calculations that I left for later cleanup in commit `3fc6e2d7f5`. There's now a uniform policy that the make_foo() convenience functions don't do any cost calculations. Most of their callers copy costs from the source Path node, and for those that don't, the calculation in the make_foo() function wasn't necessarily right anyhow. (make_result() was particularly a mess, as it was serving multiple callers using cost calcs designed for only the first one or two that had ever existed.) Aside from saving a few cycles, this ensures that what EXPLAIN prints matches the costs we used for planning purposes. It does not change any planner decisions, since the decisions are already made.	2016-03-08 16:28:34 -05:00
Robert Haas	070140ee48	Add some functions to fd.c for the convenience of extensions. For example, if you want to perform an ioctl() on a file descriptor opened through the fd.c routines, there's no way to do that without being able to get at the underlying fd. KaiGai Kohei	2016-03-08 10:09:50 -05:00
Robert Haas	77a1d1e798	Department of second thoughts: remove PD_ALL_FROZEN. Commit `a892234f83` added a second bit per page to the visibility map, which still seems like a good idea, but it also added a second page-level bit alongside PD_ALL_VISIBLE to track whether the visibility map bit was set. That no longer seems like a clever plan, because we don't really need that bit for anything. We always clear both bits when the page is modified anyway. Patch by me, reviewed by Kyotaro Horiguchi and Masahiko Sawada.	2016-03-08 08:46:48 -05:00
Tom Lane	3fc6e2d7f5	Make the upper part of the planner work by generating and comparing Paths. I've been saying we needed to do this for more than five years, and here it finally is. This patch removes the ever-growing tangle of spaghetti logic that grouping_planner() used to use to try to identify the best plan for post-scan/join query steps. Now, there is (nearly) independent consideration of each execution step, and entirely separate construction of Paths to represent each of the possible ways to do that step. We choose the best Path or set of Paths using the same add_path() logic that's been used inside query_planner() for years. In addition, this patch removes the old restriction that subquery_planner() could return only a single Plan. It now returns a RelOptInfo containing a set of Paths, just as query_planner() does, and the parent query level can use each of those Paths as the basis of a SubqueryScanPath at its level. This allows finding some optimizations that we missed before, wherein a subquery was capable of returning presorted data and thereby avoiding a sort in the parent level, making the overall cost cheaper even though delivering sorted output was not the cheapest plan for the subquery in isolation. (A couple of regression test outputs change in consequence of that. However, there is very little change in visible planner behavior overall, because the point of this patch is not to get immediate planning benefits but to create the infrastructure for future improvements.) There is a great deal left to do here. This patch unblocks a lot of planner work that was basically impractical in the old code structure, such as allowing FDWs to implement remote aggregation, or rewriting plan_set_operations() to allow consideration of multiple implementation orders for set operations. (The latter will likely require a full rewrite of plan_set_operations(); what I've done here is only to fix it to return Paths not Plans.) 
I have also left unfinished some localized refactoring in createplan.c and planner.c, because it was not necessary to get this patch to a working state. Thanks to Robert Haas, David Rowley, and Amit Kapila for review.	2016-03-07 15:58:22 -05:00
Andres Freund	c8f621c43a	logical decoding: Fix handling of large old tuples with replica identity full. When decoding the old version of an UPDATE or DELETE change, and if that tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or failed in more subtle ways in non-assert builds. Normally individual tuples aren't bigger than MaxHeapTupleSize, with big datums toasted. But that's not the case for the old version of a tuple for logical decoding; the replica identity is logged as one piece. With the default replica identity btree limits that to small tuples, but that's not the case for FULL. Change the tuple buffer infrastructure to separately allocate over-large tuples, instead of always going through the slab cache. This unfortunately requires changing the ReorderBufferTupleBuf definition, we need to store the allocated size someplace. To avoid requiring output plugins to recompile, don't store HeapTupleHeaderData directly after HeapTupleData, but point to it via t_data; that leaves room for the allocated size. As there's no reason for an output plugin to look at ReorderBufferTupleBuf->t_data.header, remove the field. It was just a minor convenience having it directly accessible. Reported-By: Adam Dratwiński Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com	2016-03-05 18:02:20 -08:00
Andres Freund	d9e903f3cb	logical decoding: Tell reorderbuffer about all xids. Logical decoding's reorderbuffer keeps transactions in an LSN ordered list for efficiency. To make that efficiently possible, upper-level xids are forced to be logged before nested subtransaction xids. That only works though if these records are all looked at: Unfortunately we didn't do so for e.g. row level locks, which are otherwise uninteresting for logical decoding. This could lead to errors like: "ERROR: subxact logged without previous toplevel record". It's not sufficient to just look at row locking records, the xid could appear first due to a lot of other types of records (which will trigger the transaction to be marked logged with MarkCurrentTransactionIdLoggedIfAny). So invent infrastructure to tell reorderbuffer about xids seen, when they'd otherwise not pass through reorderbuffer.c. Reported-By: Jarred Ward Bug: #13844 Discussion: 20160105033249.1087.66040@wrigleys.postgresql.org Backpatch: 9.4, where logical decoding was added	2016-03-05 18:02:20 -08:00
Joe Conway	dc7d70ea05	Expose control file data via SQL accessible functions. Add four new SQL accessible functions: pg_control_system(), pg_control_checkpoint(), pg_control_recovery(), and pg_control_init() which expose a subset of the control file data. Along the way move the code to read and validate the control file to src/common, where it can be shared by the new backend functions and the original pg_controldata frontend program. Patch by me, significant input, testing, and review by Michael Paquier.	2016-03-05 11:10:19 -08:00
Teodor Sigaev	d78a7d9c7f	Improve support of Hunspell in ispell dictionary. Now it's possible to load recent version of Hunspell for several languages. To handle these dictionaries Hunspell patch adds support for: * FLAG long - sets the double extended ASCII character flag type * FLAG num - sets the decimal number flag type (from 1 to 65535) * AF parameter - alias for flag's set Also it moves test dictionaries into separate directory. Author: Artur Zakirov with editorization by me	2016-03-04 20:08:47 +03:00
Simon Riggs	c7111d11b1	Revert buggy optimization of index scans. `606c0123d6` attempted to reduce cost of index scans using > and < strategies, though got that completely wrong in a few complex cases. Revert whole patch until we find a safe optimization.	2016-03-03 09:53:43 +00:00
Tom Lane	eb43e851d6	Create stub functions to support pg_upgrade of old contrib/tsearch2. Commits `9ff60273e3` and `dbe2328959` adjusted the declarations of some core functions referenced by contrib/tsearch2's install script, forgetting that in a pg_upgrade situation, we'll be trying to restore operator class definitions that reference the old signatures. We've hit this problem before; solve it in the same way as before, namely by installing stub functions that have the expected signature and just invoke the correct function. Per report from Jeff Janes. (Someday we ought to stop supporting contrib/tsearch2, but I'm not sure today is that day.)	2016-03-02 17:37:54 -05:00
Robert Haas	a892234f83	Change the format of the VM fork to add a second bit per page. The new bit indicates whether every tuple on the page is already frozen. It is cleared only when the all-visible bit is cleared, and it can be set only when we vacuum a page and find that every tuple on that page is both visible to every transaction and in no need of any future vacuuming. A future commit will use this new bit to optimize away full-table scans that would otherwise be triggered by XID wraparound considerations. A page which is merely all-visible must still be scanned in that case, but a page which is all-frozen need not be. This commit does not attempt that optimization, although that optimization is the goal here. It seems better to get the basic infrastructure in place first. Per discussion, it's very desirable for pg_upgrade to automatically migrate existing VM forks from the old format to the new format. That, too, will be handled in a follow-on patch. Masahiko Sawada, reviewed by Kyotaro Horiguchi, Fujii Masao, Amit Kapila, Simon Riggs, Andres Freund, and others, and substantially revised by me.	2016-03-01 21:49:41 -05:00
Robert Haas	35746bc348	Add new FDW API to test for parallel-safety. This is basically a bug fix; the old code assumes that a ForeignScan is always parallel-safe, but for postgres_fdw, for example, this is definitely false. It should be true for file_fdw, though, since a worker can read a file from the filesystem just as well as any other backend process. Original patch by Thomas Munro. Documentation, and changes to the comments, by me.	2016-02-26 16:14:46 +05:30
Alvaro Herrera	343f709c06	Fix typos. Backpatch to: 9.4	2016-02-25 20:50:20 -03:00
Tom Lane	52f5d578d6	Create a function to reliably identify which sessions block which others. This patch introduces "pg_blocking_pids(int) returns int[]", which returns the PIDs of any sessions that are blocking the session with the given PID. Historically people have obtained such information using a self-join on the pg_locks view, but it's unreasonably tedious to do it that way with any modicum of correctness, and the addition of parallel queries has pretty much broken that approach altogether. (Given some more columns in the view than there are today, you could imagine handling parallel-query cases with a 4-way join; but ugh.) The new function has the following behaviors that are painful or impossible to get right via pg_locks: 1. Correctly understands which lock modes block which other ones. 2. In soft-block situations (two processes both waiting for conflicting lock modes), only the one that's in front in the wait queue is reported to block the other. 3. In parallel-query cases, reports all sessions blocking any member of the given PID's lock group, and reports a session by naming its leader process's PID, which will be the pg_backend_pid() value visible to clients. The motivation for doing this right now is mostly to fix the isolation tests. Commit `38f8bdcac4` lobotomized isolationtester's is-it-waiting query by removing its ability to recognize nonconflicting lock modes, as a crude workaround for the inability to handle soft-block situations properly. But even without the lock mode tests, the old query was excessively slow, particularly in CLOBBER_CACHE_ALWAYS builds; some of our buildfarm animals fail the new deadlock-hard test because the deadlock timeout elapses before they can probe the waiting status of all eight sessions. Replacing the pg_locks self-join with use of pg_blocking_pids() is not only much more correct, but a lot faster: I measure it at about 9X faster in a typical dev build with Asserts, and 3X faster in CLOBBER_CACHE_ALWAYS builds. 
That should provide enough headroom for the slower CLOBBER_CACHE_ALWAYS animals to pass the test, without having to lengthen deadlock_timeout yet more and thus slow down the test for everyone else.	2016-02-22 14:31:43 -05:00
Tom Lane	73bf8715aa	Remove redundant PGPROC.lockGroupLeaderIdentifier field. We don't really need this field, because it's either zero or redundant with PGPROC.pid. The use of zero to mark "not a group leader" is not necessary since we can just as well test whether lockGroupLeader is NULL. This does not save very much, either as to code or data, but the simplification seems worthwhile anyway.	2016-02-22 11:20:35 -05:00
Tom Lane	c7a1c5a6b6	Cosmetic improvements in new config_info code. Coverity griped about use of unchecked strcpy() into a local variable. There's unlikely to be any actual bug there, since no caller would be passing a path longer than MAXPGPATH, but nonetheless use of strlcpy() seems preferable. While at it, get rid of unmaintainable separation between list of field names and list of field values in favor of initializing them in parallel. And we might as well declare get_configdata()'s path argument as const char *, even though no current caller needs that.	2016-02-21 11:38:24 -05:00
Robert Haas	d91a4a6c85	Cosmetic improvements to group locking. Reflow text in lock manager README so that it fits within 80 columns. Correct some mistakes. Expand the README to explain not only why group locking exists but also the data structures that support it. Improve comments related to group locking in several files. Change the name of a macro argument for improved clarity. Most of these problems were reported by Tom Lane, but I found a few of them myself. Robert Haas and Tom Lane	2016-02-21 15:42:02 +05:30
Dean Rasheed	53874c5228	Add pg_size_bytes() to parse human-readable size strings. This will parse strings in the format produced by pg_size_pretty() and return sizes in bytes. This allows queries to be written with clauses like "pg_total_relation_size(oid) > pg_size_bytes('10 GB')". Author: Pavel Stehule with various improvements by Vitaly Burovoy Discussion: http://www.postgresql.org/message-id/CAFj8pRD-tGoDKnxdYgECzA4On01_uRqPrwF-8LdkSE-6bDHp0w@mail.gmail.com Reviewed-by: Vitaly Burovoy, Oleksandr Shulgin, Kyotaro Horiguchi, Michael Paquier and Robert Haas	2016-02-20 09:57:27 +00:00
Noah Misch	5882ca6686	Call xlc __isync() after, not before, associated compare-and-swap. Architecture reference material specifies this order, and s_lock.h inline assembly agrees. The former order failed to provide mutual exclusion to lwlock.c and perhaps to other clients. The two xlc buildfarm members, hornet and mandrill, have failed sixteen times with duplicate key errors involving pg_class_oid_index or pg_type_oid_index. Back-patch to 9.5, where commit `b64d92f1a5` introduced atomics. Reviewed by Andres Freund and Tom Lane.	2016-02-19 22:47:50 -05:00
Tom Lane	19a541143a	Add an explicit representation of the output targetlist to Paths. Up to now, there's been an assumption that all Paths for a given relation compute the same output column set (targetlist). However, there are good reasons to remove that assumption. For example, an indexscan on an expression index might be able to return the value of an expensive function "for free". While we have the ability to generate such a plan today in simple cases, we don't have a way to model that it's cheaper than a plan that computes the function from scratch, nor a way to create such a plan in join cases (where the function computation would normally happen at the topmost join node). Also, we need this so that we can have Paths representing post-scan/join steps, where the targetlist may well change from one step to the next. Therefore, invent a "struct PathTarget" representing the columns we expect a plan step to emit. It's convenient to include the output tuple width and tlist evaluation cost in this struct, and there will likely be additional fields in future. While Path nodes that actually do have custom outputs will need their own PathTargets, it will still be true that most Paths for a given relation will compute the same tlist. To reduce the overhead added by this patch, keep a "default PathTarget" in RelOptInfo, and allow Paths that compute that column set to just point to their parent RelOptInfo's reltarget. (In the patch as committed, actually every Path is like that, since we do not yet have any cases of custom PathTargets.) I took this opportunity to provide some more-honest costing of PlaceHolderVar evaluation. Up to now, the assumption that "scan/join reltargetlists have cost zero" was applied not only to Vars, where it's reasonable, but also PlaceHolderVars where it isn't. 
Now, we add the eval cost of a PlaceHolderVar's expression to the first plan level where it can be computed, by including it in the PathTarget cost field and adding that to the cost estimates for Paths. This isn't perfect yet but it's much better than before, and there is a way forward to improve it more. This costing change affects the join order chosen for a couple of the regression tests, changing expected row ordering.	2016-02-18 20:02:03 -05:00
Joe Conway	a5c43b8869	Add new system view, pg_config. Move and refactor the underlying code for the pg_config client application to src/common in support of sharing it with a new system information SRF called pg_config() which makes the same information available via SQL. Additionally wrap the SRF with a new system view, also called pg_config. Patch by me with extensive input and review by Michael Paquier and additional review by Alvaro Herrera.	2016-02-17 09:12:06 -08:00
Robert Haas	f1f5ec1efa	Reuse abbreviated keys in ordered [set] aggregates. When processing ordered aggregates following a sort that could make use of the abbreviated key optimization, only call the equality operator to compare successive pairs of tuples when their abbreviated keys were not equal. Peter Geoghegan, reviewed by Andreas Karlsson and by me.	2016-02-17 15:40:00 +05:30
Joe Conway	851636bfda	Move DATA entry to correct position. In commit `7b4bfc87` the DATA and DESCR entries for the new row_security_active() function were inadvertently put after the PROVOLATILE defines, rather than before as they should have been placed. Move them up where they belong. Backpatch to 9.5 where the new entries were introduced.	2016-02-15 16:38:47 -08:00
Andres Freund	7975c5e0a9	Allow the WAL writer to flush WAL at a reduced rate. Commit `4de82f7d7` increased the WAL flush rate, mainly to increase the likelihood that hint bits can be set quickly. More quickly set hint bits can reduce contention around the clog et al. But unfortunately the increased flush rate can have a significant negative performance impact, I have measured up to a factor of ~4. The reason for this slowdown is that if there are independent writes to the underlying devices, for example because shared buffers is a lot smaller than the hot data set, or because a checkpoint is ongoing, the fdatasync() calls force cache flushes to be emitted to the storage. This is achieved by flushing WAL only if the last flush was longer than wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC) unflushed blocks are pending. Based on some tests the default for wal_writer_flush_after is 1MB, which seems to work well both on SSD and rotational media. To avoid negative performance impact due to `4de82f7d7` an earlier commit (`db76b1e`) made SetHintBits() more likely to succeed; preventing performance regressions in the pgbench tests I performed. Discussion: 20160118163908.GW10941@awork2.anarazel.de	2016-02-16 00:56:34 +01:00
Tom Lane	8c95ae81fa	Suppress compiler warnings about useless comparison of unsigned to zero. Reportedly, some compilers warn about tests like "c < 0" if c is unsigned, and hence complain about the character range checks I added in commit `3bb3f42f37`. This is a bit of a pain since the regex library doesn't really want to assume that chr is unsigned. However, since any such reconfiguration would involve manual edits of regcustom.h anyway, we can put it on the shoulders of whoever wants to do that to adjust this new range-checking macro correctly. Per gripes from Coverity and Andres.	2016-02-15 17:12:16 -05:00
Joe Conway	cfafd8bead	Correct Copyright year from 2015 to 2016. Looks like this patch went in after Copyright messages were updated for 2016 and it missed the boat. Fixed.	2016-02-15 13:19:35 -08:00
Noah Misch	9449c4b1ec	Replace broken link in comment.	2016-02-15 02:35:52 -05:00
Robert Haas	bcac23de73	Introduce extensible node types. An extensible node is always tagged T_Extensible, but the extnodename field identifies it more specifically; it may also include arbitrary private data. Extensible nodes can be copied, tested for equality, serialized, and deserialized, but the core system doesn't know anything about them otherwise. Some extensions may find it useful to include these nodes in fdw_private or custom_private lists in lieu of arm-wrestling their data into a format that the core code can understand. Along the way, so as not to burden the authors of such extensible node types too much, expose the functions for writing serialized tokens, and for serializing and deserializing bitmapsets. KaiGai Kohei, per a design suggested by me. Reviewed by Andres Freund and by me, and further edited by me.	2016-02-12 09:38:11 -05:00
Tom Lane	d4c3a156cb	Remove GROUP BY columns that are functionally dependent on other columns. If a GROUP BY clause includes all columns of a non-deferred primary key, as well as other columns of the same relation, those other columns are redundant and can be dropped from the grouping; the pkey is enough to ensure that each row of the table corresponds to a separate group. Getting rid of the excess columns will reduce the cost of the sorting or hashing needed to implement GROUP BY, and can indeed remove the need for a sort step altogether. This seems worth testing for since many query authors are not aware of the GROUP-BY-primary-key exception to the rule about queries not being allowed to reference non-grouped-by columns in their targetlists or HAVING clauses. Thus, redundant GROUP BY items are not uncommon. Also, we can make the test pretty cheap in most queries where it won't help by not looking up a rel's primary key until we've found that at least two of its columns are in GROUP BY. David Rowley, reviewed by Julien Rouhaud	2016-02-11 17:34:59 -05:00
Tom Lane	72eee410d4	Move pg_constraint.h function declarations to new file pg_constraint_fn.h. A pending patch requires exporting a function returning Bitmapset from catalog/pg_constraint.c. As things stand, that would mean including nodes/bitmapset.h in pg_constraint.h, which might be hazardous for the client-side includability of that header. It's not entirely clear whether any client-side code needs to include pg_constraint.h, but it seems prudent to assume that there is some such code somewhere. Therefore, split off the function definitions into a new file pg_constraint_fn.h, similarly to what we've done for some other catalog header files.	2016-02-11 15:51:28 -05:00
Robert Haas	c319991bca	Use separate lwlock tranches for buffer, lock, and predicate lock managers. This finishes the work - spread across many commits over the last several months - of putting each type of lock other than the named individual locks into a separate tranche. Amit Kapila	2016-02-11 14:07:33 -05:00
Robert Haas	a455878d99	Rename PGPROC fields related to group XID clearing again. Commit `0e141c0fbb` introduced a new facility to reduce ProcArrayLock contention by clearing several XIDs from the ProcArray under a single lock acquisition. The names initially chosen were deemed not to be very good choices, so commit `4aec49899e` renamed them. But now it seems like we still didn't get it right. A pending patch wants to add similar infrastructure for batching CLOG updates, so the names need to be clear enough to allow a new set of structure members with a related purpose. Amit Kapila	2016-02-11 08:55:24 -05:00
Tom Lane	c5e9b77127	Revert "Temporarily make pg_ctl and server shutdown a whole lot chattier." This reverts commit `3971f64843` and a couple of followon debugging commits; I think we've learned what we can from them.	2016-02-10 16:01:04 -05:00
Robert Haas	79a7ff0fe5	Code cleanup in the wake of recent LWLock refactoring. As of commit `c1772ad922`, there's no longer any way of requesting additional LWLocks in the main tranche, so we don't need NumLWLocks() or LWLockAssign() any more. Also, some of the allocation counters that we had previously aren't needed any more either. Amit Kapila	2016-02-10 09:58:09 -05:00
Tom Lane	3971f64843	Temporarily make pg_ctl and server shutdown a whole lot chattier. This is a quick hack, due to be reverted when its purpose has been served, to try to gather information about why some of the buildfarm critters regularly fail with "postmaster does not shut down" complaints. Maybe they are just really overloaded, but maybe something else is going on. Hence, instrument pg_ctl to print the current time when it starts waiting for postmaster shutdown and when it gives up, and add a lot of logging of the current time in the server's checkpoint and shutdown code paths. No attempt has been made to make this pretty. I'm not even totally sure if it will build on Windows, but we'll soon find out.	2016-02-08 18:43:11 -05:00
Tom Lane	3bb3f42f37	Fix some regex issues with out-of-range characters and large char ranges. Previously, our regex code defined CHR_MAX as 0xfffffffe, which is a bad choice because it is outside the range of type "celt" (int32). Characters approaching that limit could lead to infinite loops in logic such as "for (c = a; c <= b; c++)" where c is of type celt but the range bounds are chr. Such loops will work safely only if CHR_MAX+1 is representable in celt, since c must advance to beyond b before the loop will exit. Fortunately, there seems no reason not to restrict CHR_MAX to 0x7ffffffe. It's highly unlikely that Unicode will ever assign codes that high, and none of our other backend encodings need characters beyond that either. In addition to modifying the macro, we have to explicitly enforce character range restrictions on the values of \u, \U, and \x escape sequences, else the limit is trivially bypassed. Also, the code for expanding case-independent character ranges in bracket expressions had a potential integer overflow in its calculation of the number of characters it could generate, which could lead to allocating too small a character vector and then overwriting memory. An attacker with the ability to supply arbitrary regex patterns could easily cause transient DOS via server crashes, and the possibility for privilege escalation has not been ruled out. Quite aside from the integer-overflow problem, the range expansion code was unnecessarily inefficient in that it always produced a result consisting of individual characters, abandoning the knowledge that we had a range to start with. If the input range is large, this requires excessive memory. Change it so that the original range is reported as-is, and then we add on any case-equivalent characters that are outside that range. With this approach, we can bound the number of individual characters allowed without sacrificing much. 
This patch allows at most 100000 individual characters, which I believe to be more than the number of case pairs existing in Unicode, so that the restriction will never be hit in practice. It's still possible for range() to take a while given a large character code range, so also add statement-cancel detection to its loop. The downstream function dovec() also lacked cancel detection, and could take a long time given a large output from range(). Per fuzz testing by Greg Stark. Back-patch to all supported branches. Security: CVE-2016-0773	2016-02-08 10:25:40 -05:00
Robert Haas	d89f06f048	Fix parallel-safety markings for pg_upgrade functions. These establish backend-local state which will not be copied to parallel workers, so they must be marked parallel-restricted, not parallel-safe.	2016-02-07 11:45:21 -05:00
Robert Haas	7c944bd903	Introduce a new GUC force_parallel_mode for testing purposes. When force_parallel_mode = true, we enable the parallel mode restrictions for all queries for which this is believed to be safe. For the subset of those queries believed to be safe to run entirely within a worker, we spin up a worker and run the query there instead of running it in the original process. When force_parallel_mode = regress, make additional changes to allow the regression tests to run cleanly even though parallel workers have been injected under the hood. Taken together, this facilitates both better user testing and better regression testing of the parallelism code. Robert Haas, with help from Amit Kapila and Rushabh Lathia.	2016-02-07 11:41:33 -05:00
Robert Haas	a1c1af2a1f	Introduce group locking to prevent parallel processes from deadlocking. For locking purposes, we now regard heavyweight locks as mutually non-conflicting between cooperating parallel processes. There are some possible pitfalls to this approach that are not to be taken lightly, but it works OK for now and can be changed later if we find a better approach. Without this, it's very easy for parallel queries to silently self-deadlock if the user backend holds strong relation locks. Robert Haas, with help from Amit Kapila. Thanks to Noah Misch and Andres Freund for extensive discussion of possible issues with this approach.	2016-02-07 10:16:13 -05:00
Tom Lane	aa2387e2fd	Improve speed of timestamp/time/date output functions. It seems that sprintf(), at least in glibc's version, is unreasonably slow compared to hand-rolled code for printing integers. Replacing most uses of sprintf() in the datetime.c output functions with special-purpose code turns out to give more than a 2X speedup in COPY of a table with a single timestamp column; which is pretty impressive considering all the other logic in that code path. David Rowley and Andres Freund, reviewed by Peter Geoghegan and myself	2016-02-06 23:11:28 -05:00
Robert Haas	78bea62ab0	Fix typo. Amit Kapila	2016-02-05 07:56:59 -05:00
Tom Lane	6819514fca	Add num_nulls() and num_nonnulls() to count NULL arguments. An example use-case is "CHECK(num_nonnulls(a,b,c) = 1)" to assert that exactly one of a,b,c isn't NULL. The functions are variadic, so they can also be pressed into service to count the number of null or nonnull elements in an array. Marko Tiikkaja, reviewed by Pavel Stehule	2016-02-04 23:03:37 -05:00
Robert Haas	a104a017fc	Add some additional core functions to support join pushdown for FDWs. GetExistingLocalJoinPath() is useful for handling EvalPlanQual rechecks properly, and GetUserMappingById() is needed to make sure you're using the right credentials. Shigeru Hanada, Etsuro Fujita, Ashutosh Bapat, Robert Haas	2016-02-04 17:05:09 -05:00
Robert Haas	c1772ad922	Change the way that LWLocks for extensions are allocated. The previous RequestAddinLWLocks() method had several disadvantages. First, the locks would be in the main tranche; we've recently decided that it's useful for LWLocks used for separate purposes to have separate tranche IDs. Second, there wasn't any correlation between what code called RequestAddinLWLocks() and what code called LWLockAssign(); when multiple modules are in use, it could become quite difficult to troubleshoot problems where LWLockAssign() ran out of locks. To fix, create a concept of named LWLock tranches which can be used either by extension or by core code. Amit Kapila and Robert Haas	2016-02-04 16:43:04 -05:00
Robert Haas	b47b4dbf68	Extend sortsupport for text to more opclasses. Have varlena.c expose an interface that allows the char(n), bytea, and bpchar types to piggyback on a now-generalized SortSupport for text. This pushes a little more knowledge of the bpchar/char(n) type into varlena.c than might be preferred, but that seems like the approach that creates least friction. Also speed things up for index builds that use text_pattern_ops or varchar_pattern_ops. This patch does quite a bit of renaming, but it seems likely to be worth it, so as to avoid future confusion about the fact that this code is now more generally used than the old names might have suggested. Peter Geoghegan, reviewed by Álvaro Herrera and Andreas Karlsson, with small tweaks by me.	2016-02-03 14:29:53 -05:00
Robert Haas	69d34408e5	Allow parallel custom and foreign scans. This patch doesn't put the new infrastructure to use anywhere, and indeed it's not clear how it could ever be used for something like postgres_fdw which has to send an SQL query and wait for a reply, but there might be FDWs or custom scan providers that are CPU-bound, so let's give them a way to join club parallel. KaiGai Kohei, reviewed by me.	2016-02-03 12:49:46 -05:00
Robert Haas	f2305d40ec	Remove CustomPath's TextOutCustomPath method. You can't really do anything useful with this in the form it currently exists; among other problems, there's no way to reread whatever information might be produced when the path is output. Work is underway to replace this with a more useful and more general system of extensible nodes, but let's start by getting rid of this bit. Extracted from a larger patch by KaiGai Kohei.	2016-02-03 10:38:50 -05:00
Peter Eisentraut	7d17e683fc	Add support for systemd service notifications Insert sd_notify() calls at server start and stop for integration with systemd. This allows the use of systemd service units of type "notify", which greatly simplifies the systemd configuration. Reviewed-by: Pavel Stěhule <pavel.stehule@gmail.com>	2016-02-02 21:04:29 -05:00
Tom Lane	2ad83fff22	Remove unnecessary "implementation of FOO operator" DESCR() entries. Apparently at least one committer hasn't gotten the word that these do not need to be maintained by hand, since initdb will create them automatically. Noted while fixing bug #13905. No catversion bump since the post-initdb state is exactly the same either way. I don't see a need for back-patch, either.	2016-02-02 11:52:27 -05:00
Tom Lane	a4627e8fd4	Fix pg_description entries for jsonb_to_record() and jsonb_to_recordset(). All the other jsonb function descriptions refer to the arguments as being "jsonb", but these two said "json". Make it consistent. Per bug #13905 from Petru Florin Mihancea. No catversion bump --- we can't force one in the back branches, and this isn't very critical anyway.	2016-02-02 11:39:50 -05:00
Robert Haas	7191ce8bea	Make all built-in lwlock tranche IDs fixed. This makes the values more stable, which seems like a good thing for anybody who needs to look at them. Alexander Korotkov and Amit Kapila	2016-02-02 06:45:55 -05:00
Robert Haas	2251179e6a	Migrate replication slot I/O locks into a separate tranche. This is following in a long train of similar changes and for the same reasons - see `b319356f0e` and `fe702a7b3f` inter alia. Author: Amit Kapila Reviewed-by: Alexander Korotkov, Robert Haas	2016-01-29 09:45:38 -05:00
Robert Haas	b319356f0e	Migrate PGPROC's backendLock into PGPROC itself, using a new tranche. Previously, each PGPROC's backendLock was part of the main tranche, and the PGPROC just contained a pointer. Now, the actual LWLock is part of the PGPROC. As with previous, similar patches, this makes it significantly easier to identify these lwlocks in LWLOCK_STATS or Trace_lwlocks output and improves modularity. Author: Ildus Kurbangaliev Reviewed-by: Amit Kapila, Robert Haas	2016-01-29 08:14:28 -05:00
Robert Haas	fbe5a3fb73	Only try to push down foreign joins if the user mapping OIDs match. Previously, the foreign join pushdown infrastructure left the question of security entirely up to individual FDWs, but it would be easy for a foreign data wrapper to inadvertently open up subtle security holes that way. So, make it the core code's job to determine which user mapping OID is relevant, and don't attempt join pushdown unless it's the same for all relevant relations. Per a suggestion from Tom Lane. Shigeru Hanada and Ashutosh Bapat, reviewed by Etsuro Fujita and KaiGai Kohei, with some further changes by me.	2016-01-28 14:05:36 -05:00
Robert Haas	96198d94cb	Avoid multiple foreign server connections when all use same user mapping. Previously, postgres_fdw's connection cache was keyed by user OID and server OID, but this can lead to multiple connections when it's not really necessary. In particular, if all relevant users are mapped to the public user mapping, then their connection options are certainly the same, so one connection can be used for all of them. While we're cleaning things up here, drop the "server" argument to GetConnection(), which isn't really needed. This saves a few cycles because callers no longer have to look this up; the function itself does, but only when establishing a new connection, not when reusing an existing one. Ashutosh Bapat, with a few small changes by me.	2016-01-28 12:05:19 -05:00
Fujii Masao	7f46eaf035	Add gin_clean_pending_list function to clean up GIN pending list This function cleans up the pending list of the GIN index by moving entries in it to the main GIN data structure in bulk. It returns the number of pages cleaned up from the pending list. This function is useful, for example, when the pending list needs to be cleaned up quickly to improve the performance of the search using GIN index. VACUUM can do the same thing, too, but it may take days to run on a large table. Jeff Janes, reviewed by Julien Rouhaud, Jaime Casanova, Alvaro Herrera and me. Discussion: CAMkU=1x8zFkpfnozXyt40zmR3Ub_kHu58LtRmwHUKRgQss7=iQ@mail.gmail.com	2016-01-28 12:57:52 +09:00
Fujii Masao	e09507a272	Fix volatility marking of pg_size_pretty function pg_size_pretty function should be marked immutable rather than volatile because it always returns the same result given the same argument. Pavel Stehule	2016-01-27 11:13:31 +09:00
Tom Lane	e1bd684a34	Add trigonometric functions that work in degrees. The implementations go to some lengths to deliver exact results for values where an exact result can be expected, such as sind(30) = 0.5 exactly. Dean Rasheed, reviewed by Michael Paquier	2016-01-22 15:46:22 -05:00
Tom Lane	a396144ac0	Remove new coupling between NAMEDATALEN and MAX_LEVENSHTEIN_STRLEN. Commit `e529cd4ffa` introduced an Assert requiring NAMEDATALEN to be less than MAX_LEVENSHTEIN_STRLEN, which has been 255 for a long time. Since up to that instant we had always allowed NAMEDATALEN to be substantially more than that, this was ill-advised. It's debatable whether we need MAX_LEVENSHTEIN_STRLEN at all (versus putting a CHECK_FOR_INTERRUPTS into the loop), or whether it has to be so tight; but this patch takes the narrower approach of just not applying the MAX_LEVENSHTEIN_STRLEN limit to calls from the parser. Trusting the parser for this seems reasonable, first because the strings are limited to NAMEDATALEN which is unlikely to be hugely more than 256, and second because the maximum distance is tightly constrained by MAX_FUZZY_DISTANCE (though we'd forgotten to make use of that limit in one place). That means the cost is not really O(mn) but more like O(max(m,n)). Relaxing the limit for user-supplied calls is left for future research; given the lack of complaints to date, it doesn't seem very high priority. In passing, fix confusion between lengths-in-bytes and lengths-in-chars in comments and error messages. Per gripe from Kevin Day; solution suggested by Robert Haas. Back-patch to 9.5 where the unwanted restriction was introduced.	2016-01-22 11:53:06 -05:00
Tom Lane	be44ed27b8	Improve index AMs' opclass validation procedures. The amvalidate functions added in commit `65c5fcd353` were on the crude side. Improve them in a few ways: * Perform signature checking for operators and support functions. * Apply more thorough checks for missing operators and functions, where possible. * Instead of reporting problems as ERRORs, report most problems as INFO messages and make the amvalidate function return FALSE. This allows more than one problem to be discovered per run. * Report object names rather than OIDs, and work a bit harder on making the messages understandable. Also, remove a few more opr_sanity regression test queries that are now superseded by the amvalidate checks.	2016-01-21 19:47:15 -05:00
Tom Lane	b99551832e	Add defenses against putting expanded objects into Const nodes. Putting a reference to an expanded-format value into a Const node would be a bad idea for a couple of reasons. It'd be possible for the supposedly immutable Const to change value, if something modified the referenced variable ... in fact, if the Const's reference were R/W, any function that has the Const as argument might itself change it at runtime. Also, because datumIsEqual() is pretty simplistic, the Const might fail to compare equal to other Consts that it should compare equal to, notably including copies of itself. This could lead to unexpected planner behavior, such as "could not find pathkey item to sort" errors or inferior plans. I have not been able to find any way to get an expanded value into a Const within the existing core code; but Paul Ramsey was able to trigger the problem by writing a datatype input function that returns an expanded value. The best fix seems to be to establish a rule that varlena values being placed into Const nodes should be passed through pg_detoast_datum(). That will do nothing (and cost little) in normal cases, but it will flatten expanded values and thereby avoid the above problems. Also, it will convert short-header or compressed values into canonical format, which will avoid possible unexpected lack-of-equality issues for those cases too. And it provides a last-ditch defense against putting a toasted value into a Const, which we already knew was dangerous, cf commit `2b0c86b665`. (In the light of this discussion, I'm no longer sure that that commit provided 100% protection against such cases, but this fix should do it.) 
The test added in commit `65c3d05e18` to catch datatype input functions with unstable results would fail for functions that returned expanded values; but it seems a bit uncharitable to deem a result unstable just because it's expressed in expanded form, so revise the coding so that we check for bitwise equality only after applying pg_detoast_datum(). That's a sufficient condition anyway given the new rule about detoasting when forming a Const. Back-patch to 9.5 where the expanded-object facility was added. It's possible that this should go back further; but in the absence of clear evidence that there's any live bug in older branches, I'll refrain for now.	2016-01-21 12:56:08 -05:00
Fujii Masao	38710a374e	Remove unused argument from ginInsertCleanup() It's an oversight in commit `dc943ad`.	2016-01-22 01:22:56 +09:00
Simon Riggs	c80b31d557	Refactor headers to split out standby defs Jeff Janes	2016-01-20 18:51:34 -08:00
Simon Riggs	978b2f65aa	Speedup 2PC by skipping two phase state files in normal path 2PC state info is written only to WAL at PREPARE, then read back from WAL at COMMIT PREPARED/ABORT PREPARED. Prepared transactions that live past one bufmgr checkpoint cycle will be written to disk in the same form as previously. Crash recovery path is not altered. Measured performance gains of 50-100% for short 2PC transactions by completely avoiding writing files and fsyncing. Other optimizations still available, further patches in related areas expected. Stas Kelvich and heavily edited by Simon Riggs Based upon earlier ideas and patches by Michael Paquier and Heikki Linnakangas, a concrete example of how Postgres-XC has fed back ideas into PostgreSQL. Reviewed by Michael Paquier, Jeff Janes and Andres Freund Performance testing by Jesper Pedersen	2016-01-20 18:40:44 -08:00
Simon Riggs	422a55a687	Refactor to create generic WAL page read callback Previously we didn’t have a generic WAL page read callback function, surprisingly. Logical decoding has logical_read_local_xlog_page(), which was actually generic, so move that to xlogfunc.c and rename to read_local_xlog_page(). Maintain logical_read_local_xlog_page() so existing callers still work. As requested by Michael Paquier, Alvaro Herrera and Andres Freund	2016-01-20 17:18:58 -08:00
Robert Haas	45be99f8cd	Support parallel joins, and make related improvements. The core innovation of this patch is the introduction of the concept of a partial path; that is, a path which if executed in parallel will generate a subset of the output rows in each process. Gathering a partial path produces an ordinary (complete) path. This allows us to generate paths for parallel joins by joining a partial path for one side (which at the baserel level is currently always a Partial Seq Scan) to an ordinary path on the other side. This is subject to various restrictions at present, especially that this strategy seems unlikely to be sensible for merge joins, so only nested loops and hash joins paths are generated. This also allows an Append node to be pushed below a Gather node in the case of a partitioned table. Testing revealed that early versions of this patch made poor decisions in some cases, which turned out to be caused by the fact that the original cost model for Parallel Seq Scan wasn't very good. So this patch tries to make some modest improvements in that area. There is much more to be done in the area of generating good parallel plans in all cases, but this seems like a useful step forward. Patch by me, reviewed by Dilip Kumar and Amit Kapila.	2016-01-20 14:40:26 -05:00
Robert Haas	a7de3dc5c3	Support multi-stage aggregation. Aggregate nodes now have two new modes: a "partial" mode where they output the unfinalized transition state, and a "finalize" mode where they accept unfinalized transition states rather than individual values as input. These new modes are not used anywhere yet, but they will be necessary for parallel aggregation. The infrastructure also figures to be useful for cases where we want to aggregate local data and remote data via the FDW interface, and want to bring back partial aggregates from the remote side that can then be combined with locally generated partial aggregates to produce the final value. It may also be useful even when neither FDWs nor parallelism are in play, as explained in the comments in nodeAgg.c. David Rowley and Simon Riggs, reviewed by KaiGai Kohei, Heikki Linnakangas, Haribabu Kommi, and me.	2016-01-20 13:46:50 -05:00
Tom Lane	dbe2328959	Fix assorted inconsistencies in GIN opclass support function declarations. GIN had some minor issues too, mostly using "internal" where something else would be more appropriate. I went with the same approach as in `9ff60273e3`, namely preferring the opclass' indexed datatype for arguments that receive an operator RHS value, even if that's not necessarily what they really are. Again, this is with an eye to having a uniform rule for ginvalidate() to check support function signatures.	2016-01-19 22:32:22 -05:00
Alvaro Herrera	948c97958b	Add two HyperLogLog functions New functions initHyperLogLogError() and freeHyperLogLog() simplify using this module from elsewhere. Author: Tomáš Vondra Review: Peter Geoghegan	2016-01-19 17:40:15 -03:00
Tom Lane	9ff60273e3	Fix assorted inconsistencies in GiST opclass support function declarations. The conventions specified by the GiST SGML documentation were widely ignored. For example, the strategy-number argument for "consistent" and "distance" functions is specified to be a smallint, but most of the built-in support functions declared it as an integer, and for that matter the core code passed it using Int32GetDatum not Int16GetDatum. None of that makes any real difference at runtime, but it's quite confusing for newcomers to the code, and it makes it very hard to write an amvalidate() function that checks support function signatures. So let's try to instill some consistency here. Another similar issue is that the "query" argument is not of a single well-defined type, but could have different types depending on the strategy (corresponding to search operators with different righthand-side argument types). Some of the functions threw up their hands and declared the query argument as being of "internal" type, which surely isn't right ("any" would have been more appropriate); but the majority position seemed to be to declare it as being of the indexed data type, corresponding to a search operator with both input types the same. So I've specified a convention that that's what to do always. Also, the result of the "union" support function actually must be of the index's storage type, but the documentation suggested declaring it to return "internal", and some of the functions followed that. Standardize on telling the truth, instead. Similarly, standardize on declaring the "same" function's inputs as being of the storage type, not "internal". Also, somebody had forgotten to add the "recheck" argument to both the documentation of the "distance" support function and all of their SQL declarations, even though the C code was happily using that argument. Clean that up too. 
Fix up some other omissions in the docs too, such as documenting that union's second input argument is vestigial. So far as the errors in core function declarations go, we can just fix pg_proc.h and bump catversion. Adjusting the erroneous declarations in contrib modules is more debatable: in principle any change in those scripts should involve an extension version bump, which is a pain. However, since these changes are purely cosmetic and make no functional difference, I think we can get away without doing that.	2016-01-19 12:04:36 -05:00
Tom Lane	65c5fcd353	Restructure index access method API to hide most of it at the C level. This patch reduces pg_am to just two columns, a name and a handler function. All the data formerly obtained from pg_am is now provided in a C struct returned by the handler function. This is similar to the designs we've adopted for FDWs and tablesample methods. There are multiple advantages. For one, the index AM's support functions are now simple C functions, making them faster to call and much less error-prone, since the C compiler can now check function signatures. For another, this will make it far more practical to define index access methods in installable extensions. A disadvantage is that SQL-level code can no longer see attributes of index AMs; in particular, some of the crosschecks in the opr_sanity regression test are no longer possible from SQL. We've addressed that by adding a facility for the index AM to perform such checks instead. (Much more could be done in that line, but for now we're content if the amvalidate functions more or less replace what opr_sanity used to do.) We might also want to expose some sort of reporting functionality, but this patch doesn't do that. Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily editorialized on by me.	2016-01-17 19:36:59 -05:00
Tom Lane	8d290c8ec6	Re-pgindent a few files. In preparation for landing index AM interface changes.	2016-01-17 19:13:18 -05:00
Magnus Hagander	cf7dfbf2d6	Fix minor typo in comment Tatsuro Yamada	2016-01-15 10:24:37 +01:00
Simon Riggs	e63bb4549a	Add new user fn pg_current_xlog_flush_location() Tomas Vondra, reviewed by Michael Paquier and Amit Kapila Minor edits by me	2016-01-12 07:54:52 +00:00
Tom Lane	26d538dc93	Clean up some lack-of-STRICT issues in the core code, too. A scan for missed proisstrict markings in the core code turned up these functions: brin_summarize_new_values pg_stat_reset_single_table_counters pg_stat_reset_single_function_counters pg_create_logical_replication_slot pg_create_physical_replication_slot pg_drop_replication_slot The first three of these take OID, so a null argument will normally look like a zero to them, resulting in "ERROR: could not open relation with OID 0" for brin_summarize_new_values, and no action for the pg_stat_reset_XXX functions. The other three will dump core on a null argument, though this is mitigated by the fact that they won't do so until after checking that the caller is superuser or has rolreplication privilege. In addition, the pg_logical_slot_get/peek[_binary]_changes family was intentionally marked nonstrict, but failed to make nullness checks on all the arguments; so again a null-pointer-dereference crash is possible but only for superusers and rolreplication users. Add the missing ARGISNULL checks to the latter functions, and mark the former functions as strict in pg_proc. Make that change in the back branches too, even though we can't force initdb there, just so that installations initdb'd in future won't have the issue. Since none of these bugs rise to the level of security issues (and indeed the pg_stat_reset_XXX functions hardly misbehave at all), it seems sufficient to do this. In addition, fix some order-of-operations oddities in the slot_get_changes family, mostly cosmetic, but not the part that moves the function's last few operations into the PG_TRY block. As it stood, there was significant risk for an error to exit without clearing historical information from the system caches. The slot_get_changes bugs go back to 9.4 where that code was introduced. Back-patch appropriate subsets of the pg_proc changes into all active branches, as well.	2016-01-09 16:58:32 -05:00
Simon Riggs	687f2cd7a0	Avoid pin scan for replay of XLOG_BTREE_VACUUM Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require complex interlocking that matched the requirements on the master. This required an O(N) operation that became a significant problem with large indexes, causing replication delays of seconds or in some cases minutes while the XLOG_BTREE_VACUUM was replayed. This commit skips the “pin scan” that was previously required, by observing in detail when and how it is safe to do so, with full documentation. The pin scan is skipped only in replay; the VACUUM code path on master is not touched here. The current commit still performs the pin scan for toast indexes, though this can also be avoided if we recheck scans on toast indexes. Later patch will address this. No tests included. Manual tests using an additional patch to view WAL records and their timing have shown the change in WAL records and their handling has successfully reduced replication delay.	2016-01-09 10:10:08 +00:00
Magnus Hagander	2650486ebc	Fix typo in comment Tatsuro Yamada	2016-01-08 08:54:40 +01:00
Alvaro Herrera	b1a9bad9e7	pgstat: add WAL receiver status view & SRF This new view provides insight into the state of a running WAL receiver in a HOT standby node. The information returned includes the PID of the WAL receiver process, its status (stopped, starting, streaming, etc), start LSN and TLI, last received LSN and TLI, timestamp of last message send and receipt, latest end-of-WAL LSN and time, and the name of the slot (if any). Access to the detailed data is only granted to superusers; others only get the PID. Author: Michael Paquier Reviewer: Haribabu Kommi	2016-01-07 16:21:19 -03:00
Alvaro Herrera	a967613911	Windows: Make pg_ctl reliably detect service status pg_ctl is using isatty() to verify whether the process is running in a terminal, and if not it sends its output to Windows' Event Log ... which does the wrong thing when the output has been redirected to a pipe, as reported in bug #13592. To fix, make pg_ctl use the code we already have to detect service-ness: in the master branch, move src/backend/port/win32/security.c to src/port (with suitable tweaks so that it runs properly in backend and frontend environments); pg_ctl already has access to pgport so it Just Works. In older branches, that's likely to cause trouble, so instead duplicate the required code in pg_ctl.c. Author: Michael Paquier Bug report and diagnosis: Egon Kocjan Backpatch: all supported branches	2016-01-07 11:59:08 -03:00
Alvaro Herrera	abb1733922	Add scale(numeric) Author: Marko Tiikkaja	2016-01-05 19:02:13 -03:00
Tom Lane	ea0d494dae	Make the to_reg*() functions accept text not cstring. Using cstring as the input type was a poor decision, because that's not really a full-fledged type. In particular, it lacks implicit coercions from text or varchar, meaning that usages like to_regproc('foo'||'bar') wouldn't work; basically the only case that did work without explicit casting was a simple literal constant argument. The lack of field complaints about this suggests that hardly anyone is using these functions, so hopefully fixing it won't cause much of a compatibility problem. They've only been there since 9.4, anyway. Petr Korobeinikov	2016-01-05 13:02:43 -05:00
Alvaro Herrera	efa318bcfa	Make pg_shseclabel available in early backend startup While the in-core authentication mechanism doesn't need to access pg_shseclabel at all, it's reasonable to think that an authentication hook will want to look at the label for the role logging in, or for rows in other catalogs used during the authentication phase of startup. Catalog version bumped, because this changes the "is nailed" status for pg_shseclabel. Author: Adam Brightwell	2016-01-05 14:50:53 -03:00
Bruce Momjian	ee94300446	Update copyright for 2016 Backpatch certain files through 9.1	2016-01-02 13:33:40 -05:00
Tom Lane	0dab5ef39b	Fix ALTER OPERATOR to update dependencies properly. Fix an oversight in commit `321eed5f0f`: replacing an operator's selectivity functions needs to result in a corresponding update in pg_depend. We have a function that can handle that, but it was not called by AlterOperator(). To fix this without enlarging pg_operator.h's #include list beyond what clients can safely include, split off the function definitions into a new file pg_operator_fn.h, similarly to what we've done for some other catalog header files. It's not entirely clear whether any client-side code needs to include pg_operator.h, but it seems prudent to assume that there is some such code somewhere.	2015-12-31 17:37:31 -05:00
Joe Conway	241448b23a	Rename (new|old)estCommitTs to (new|old)estCommitTsXid The variables newestCommitTs and oldestCommitTs sound as if they are timestamps, but in fact they are the transaction Ids that correspond to the newest and oldest timestamps rather than the actual timestamps. Rename these variables to reflect that they are actually xids: to wit newestCommitTsXid and oldestCommitTsXid respectively. Also modify related code in a similar fashion, particularly the user facing output emitted by pg_controldata and pg_resetxlog. Complaint and patch by me, review by Tom Lane and Alvaro Herrera. Backpatch to 9.5 where these variables were first introduced.	2015-12-28 12:34:11 -08:00
Tom Lane	6efbded6e4	Allow omitting one or both boundaries in an array slice specifier. Omitted boundaries represent the upper or lower limit of the corresponding array subscript. This allows simpler specification of many common use-cases. (Revised version of commit `9246af6799`) YUriy Zhuravlev	2015-12-22 21:05:29 -05:00
Robert Haas	0ba3f3bc65	Comment improvements for abbreviated keys. Peter Geoghegan and Robert Haas	2015-12-22 13:57:18 -05:00
Robert Haas	ccd8f97922	postgres_fdw: Consider requesting sorted data so we can do a merge join. When use_remote_estimate is enabled, consider adding ORDER BY to the query we are sending to the remote server so that we can use that ordered data for a merge join. Commit `f18c944b61` arranges to push down the query pathkeys, which seems like the case most likely to be a win, but testing shows this can sometimes win, too. For a regular table, we know which indexes are present and therefore test whether the ordering provided by each such index is useful. Here, we take the opposite approach: guess what orderings would be useful if they could be generated cheaply, and then ask the remote side what those will cost. Ashutosh Bapat, with very substantial cosmetic revisions by me. Also reviewed by Rushabh Lathia.	2015-12-22 13:46:40 -05:00
Teodor Sigaev	bbbd807097	Revert `9246af6799` because I miss too much. Patch is returned to commitfest process.	2015-12-18 21:35:22 +03:00
Teodor Sigaev	9246af6799	Allow to omit boundaries in array subscript Allow to omit lower or upper or both boundaries in array subscript for selecting slice of array. Author: YUriy Zhuravlev	2015-12-18 15:18:58 +03:00
Tom Lane	66d947b9d3	Adjust behavior of single-user -j mode for better initdb error reporting. Previously, -j caused the entire input file to be read in and executed as a single command string. That's undesirable, not least because any error causes the entire file to be regurgitated as the "failing query". Some experimentation suggests a better rule: end the command string when we see a semicolon immediately followed by two newlines, ie, an empty line after a query. This serves nicely to break up the existing examples such as information_schema.sql and system_views.sql. A limitation is that it's no longer possible to write such a sequence within a string literal or multiline comment in a file meant to be read with -j; but there are no instances of such a problem within the data currently used by initdb. (If someone does make such a mistake in future, it'll be obvious because they'll get an unterminated-literal or unterminated-comment syntax error.) Other than that, there shouldn't be any negative consequences; you're not forced to end statements that way, it's just a better idea in most cases. In passing, remove src/include/tcop/tcopdebug.h, which is dead code because it's not included anywhere, and hasn't been for more than ten years. One of the debug-support symbols it purported to describe has been unreferenced for at least the same amount of time, and the other is removed by this commit on the grounds that it was useless: forcing -j mode all the time would have broken initdb. The lack of complaints about that, or about the missing inclusion, shows that no one has tried to use TCOP_DONTUSENEWLINE in many years.	2015-12-17 19:34:15 -05:00
Alvaro Herrera	756e7b4c9d	Rework internals of changing a type's ownership This is necessary so that REASSIGN OWNED does the right thing with composite types, to wit, that it also alters ownership of the type's pg_class entry -- previously, the pg_class entry remained owned by the original user, which caused later other failures such as the new owner's inability to use ALTER TYPE to rename an attribute of the affected composite. Also, if the original owner is later dropped, the pg_class entry becomes owned by a nonexistent user which is bogus. To fix, create a new routine AlterTypeOwner_oid which knows whether to pass the request to ATExecChangeOwner or deal with it directly, and use that in shdepReassignOwner rather than calling AlterTypeOwnerInternal directly. AlterTypeOwnerInternal is now simpler in that it only modifies the pg_type entry and recurses to handle a possible array type; higher-level tasks are handled by either AlterTypeOwner directly or AlterTypeOwner_oid. I took the opportunity to add a few more objects to the test rig for REASSIGN OWNED, so that more cases are exercised. Additional ones could be added for superuser-only-ownable objects (such as FDWs and event triggers) but I didn't want to push my luck by adding a new superuser to the tests on a backpatchable bug fix. Per bug #13666 reported by Chris Pacejo. Backpatch to 9.5. (I would back-patch this all the way back, except that it doesn't apply cleanly in 9.4 and earlier because `59367fdf9` wasn't backpatched. If we decide that we need this in earlier branches too, we should backpatch both.)	2015-12-17 14:25:41 -03:00
Tom Lane	2ec477dc81	Cope with Readline's failure to track SIGWINCH events outside of input. It emerges that libreadline doesn't notice terminal window size change events unless they occur while collecting input. This is easy to stumble over if you resize the window while using a pager to look at query output, but it can be demonstrated without any pager involvement. The symptom is that queries exceeding one line are misdisplayed during subsequent input cycles, because libreadline has the wrong idea of the screen dimensions. The safest, simplest way to fix this is to call rl_reset_screen_size() just before calling readline(). That causes an extra ioctl(TIOCGWINSZ) for every command; but since it only happens when reading from a tty, the performance impact should be negligible. A more valid objection is that this still leaves a tiny window during entry to readline() wherein delivery of SIGWINCH will be missed; but the practical consequences of that are probably negligible. In any case, there doesn't seem to be any good way to avoid the race, since readline exposes no functions that seem safe to call from a generic signal handler --- rl_reset_screen_size() certainly isn't. It turns out that we also need an explicit rl_initialize() call, else rl_reset_screen_size() dumps core when called before the first readline() call. rl_reset_screen_size() is not present in old versions of libreadline, so we need a configure test for that. (rl_initialize() is present at least back to readline 4.0, so we won't bother with a test for it.) We would need a configure test anyway since libedit's emulation of libreadline doesn't currently include such a function. Fortunately, libedit seems not to have any corresponding bug. Merlin Moncure, adjusted a bit by me	2015-12-16 16:59:35 -05:00
Robert Haas	6150a1b08a	Move buffer I/O and content LWLocks out of the main tranche. Move the content lock directly into the BufferDesc, so that locking and pinning a buffer touches only one cache line rather than two. Adjust the definition of BufferDesc slightly so that this doesn't make the BufferDesc any larger than one cache line (at least on platforms where a spinlock is only 1 or 2 bytes). We can't fit the I/O locks into the BufferDesc and stay within one cache line, so move those to a completely separate tranche. This leaves a relatively limited number of LWLocks in the main tranche, so increase the padding of those remaining locks to a full cache line, rather than allowing adjacent locks to share a cache line, hopefully reducing false sharing. Performance testing shows that these changes make little difference on laptop-class machines, but help significantly on larger servers, especially those with more than 2 sockets. Andres Freund, originally based on an earlier patch by Simon Riggs. Review and cosmetic adjustments (including heavy rewriting of the comments) by me.	2015-12-15 13:32:54 -05:00
Robert Haas	3fed417452	Provide a way to predefine LWLock tranche IDs. It's a bit cumbersome to use LWLockNewTrancheId(), because the returned value needs to be shared between backends so that each backend can call LWLockRegisterTranche() with the correct ID. So, for built-in tranches, use a hard-coded value instead. This is motivated by an upcoming patch adding further built-in tranches. Andres Freund and Robert Haas	2015-12-15 11:48:19 -05:00

7436 commits