postgresql/src/common
John Naylor 911588a3f8 Add fast path for validating UTF-8 text
Our previous validator used a traditional algorithm that performed
comparison and branching one byte at a time. It's useful in that
we always know exactly how many bytes we have validated, but that
precision comes at a cost. Input validation can show up prominently
in profiles of COPY FROM, and future improvements to COPY FROM such
as parallelism or faster line parsing will put more pressure on input
validation. Hence, add fast paths for both ASCII and multibyte UTF-8:

Use bitwise operations to check 16 bytes at a time for ASCII. If
that fails, use a "shift-based" DFA on those bytes to handle the
general case, including multibyte. These paths are relatively free
of branches and thus robust against all kinds of byte patterns. With
these algorithms, UTF-8 validation is several times faster, depending
on platform and the input byte distribution.
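
The two fast paths described above can be sketched roughly as follows.
This is a minimal illustration, not the code added to wchar.c: every
name in it (is_ascii16, next_state, build_table, transitions,
utf8_is_valid) is hypothetical, the table is built at run time into
64-bit entries with 6 bits per state purely for readability, and it
returns only a yes/no answer rather than a count of verified bytes.
It also accepts NUL bytes, which a real validator may need to reject.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch; states are bit offsets into a 64-bit table entry. */
enum { ERR = 0, END = 6, CS1 = 12, CS2 = 18, CS3 = 24,
       P3A = 30, P3B = 36, P4A = 42, P4B = 48 };

static uint64_t transitions[256];

/* Conventional branchy transition function, used only to build the table. */
static int
next_state(int st, unsigned b)
{
    bool cont = (b >= 0x80 && b <= 0xBF);   /* generic continuation byte */

    switch (st)
    {
        case END:
            if (b <= 0x7F) return END;                  /* ASCII */
            if (b >= 0xC2 && b <= 0xDF) return CS1;     /* 2-byte lead */
            if (b == 0xE0) return P3A;                  /* no overlong 3-byte */
            if (b == 0xED) return P3B;                  /* no surrogates */
            if (b >= 0xE1 && b <= 0xEF) return CS2;     /* other 3-byte leads */
            if (b == 0xF0) return P4A;                  /* no overlong 4-byte */
            if (b >= 0xF1 && b <= 0xF3) return CS3;
            if (b == 0xF4) return P4B;                  /* at most U+10FFFF */
            return ERR;
        case CS1: return cont ? END : ERR;
        case CS2: return cont ? CS1 : ERR;
        case CS3: return cont ? CS2 : ERR;
        case P3A: return (b >= 0xA0 && b <= 0xBF) ? CS1 : ERR;
        case P3B: return (b >= 0x80 && b <= 0x9F) ? CS1 : ERR;
        case P4A: return (b >= 0x90 && b <= 0xBF) ? CS2 : ERR;
        case P4B: return (b >= 0x80 && b <= 0x8F) ? CS2 : ERR;
        default:  return ERR;                           /* ERR is sticky */
    }
}

/* For each byte, pack the next state for every current state into one word. */
static void
build_table(void)
{
    for (unsigned b = 0; b < 256; b++)
    {
        transitions[b] = 0;
        for (int st = ERR; st <= P4B; st += 6)
            transitions[b] |= (uint64_t) next_state(st, b) << st;
    }
}

/* True if none of the 16 bytes at s has its high bit set. */
static bool
is_ascii16(const unsigned char *s)
{
    uint64_t lo, hi;

    memcpy(&lo, s, 8);
    memcpy(&hi, s + 8, 8);
    return ((lo | hi) & UINT64_C(0x8080808080808080)) == 0;
}

static bool
utf8_is_valid(const unsigned char *s, size_t len)
{
    static bool built = false;
    uint64_t    state = END;

    if (!built)
    {
        build_table();
        built = true;
    }

    while (len >= 16)
    {
        /* The ASCII shortcut is only safe on a character boundary. */
        if (!(state == END && is_ascii16(s)))
            for (int i = 0; i < 16; i++)    /* one shift and mask per byte */
                state = (transitions[s[i]] >> state) & 63;
        s += 16;
        len -= 16;
    }
    while (len-- > 0)
        state = (transitions[*s++] >> state) & 63;

    return state == END;    /* must finish on a character boundary */
}
```

The DFA step is a table load, a shift, and a mask, with no branch that
depends on the byte value. In this sketch the error state has offset
zero, so once it is reached every later step stays there and the
result only needs to be checked at the end.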

The previous coding in pg_utf8_verifystr() is retained for short
strings and for when the fast path returns an error.

Review, performance testing, and additional hacking by: Heikki
Linakangas, Vladimir Sitnikov, Amit Khandekar, Thomas Munro, and
Greg Stark

Discussion:
https://www.postgresql.org/message-id/CAFBsxsEV_SzH%2BOLyCiyon%3DiwggSyMh_eF6A3LU2tiWf3Cy2ZQg%40mail.gmail.com
2021-12-20 10:07:29 -04:00
unicode Extend collection of Unicode combining characters to beyond the BMP 2021-08-26 13:07:34 -04:00
.gitignore Replace the data structure used for keyword lookup. 2019-01-06 17:02:57 -05:00
archive.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
base64.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
checksum_helper.c Add result size as argument of pg_cryptohash_final() for overflow checks 2021-02-15 10:18:34 +09:00
config_info.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
controldata_utils.c Centralize timestamp computation of control file on updates 2021-11-29 13:36:13 +09:00
cryptohash.c Add result size as argument of pg_cryptohash_final() for overflow checks 2021-02-15 10:18:34 +09:00
cryptohash_openssl.c Add result size as argument of pg_cryptohash_final() for overflow checks 2021-02-15 10:18:34 +09:00
d2s.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
d2s_full_table.h Update copyright for 2021 2021-01-02 13:06:25 -05:00
d2s_intrinsics.h Update copyright for 2021 2021-01-02 13:06:25 -05:00
digit_table.h Change floating-point output format for improved performance. 2019-02-13 15:20:33 +00:00
encnames.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
exec.c Factor out system call names from error messages 2021-04-23 14:21:37 +02:00
f2s.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
fe_memutils.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
file_perm.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
file_utils.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
hashfn.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
hmac.c Fix memory leak in pg_hmac 2021-10-01 22:47:05 +02:00
hmac_openssl.c Adjust locations which have an incorrect copyright year 2021-06-04 12:19:50 +12:00
ip.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
jsonapi.c Simplify error handing of jsonapi.c for the frontend 2021-07-02 09:35:12 +09:00
keywords.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
kwlookup.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
link-canary.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
logging.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
Makefile Replace random(), pg_erand48(), etc with a better PRNG API and algorithm. 2021-11-28 21:33:07 -05:00
md5.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
md5_common.c Add result size as argument of pg_cryptohash_final() for overflow checks 2021-02-15 10:18:34 +09:00
md5_int.h Update copyright for 2021 2021-01-02 13:06:25 -05:00
pg_get_line.c Provide a variant of simple_prompt() that can be interrupted by ^C. 2021-11-17 19:09:54 -05:00
pg_lzcompress.c Fix typos and grammar in code comments 2021-09-27 14:21:28 +09:00
pg_prng.c Replace random(), pg_erand48(), etc with a better PRNG API and algorithm. 2021-11-28 21:33:07 -05:00
pgfnames.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
protocol_openssl.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
psprintf.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
relpath.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
restricted_token.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
rmtree.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
ryu_common.h Update copyright for 2021 2021-01-02 13:06:25 -05:00
saslprep.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
scram-common.c Refactor HMAC implementations 2021-04-03 17:30:49 +09:00
sha1.c Adjust locations which have an incorrect copyright year 2021-06-04 12:19:50 +12:00
sha1_int.h Adjust locations which have an incorrect copyright year 2021-06-04 12:19:50 +12:00
sha2.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
sha2_int.h Update copyright for 2021 2021-01-02 13:06:25 -05:00
sprompt.c Allow psql's other uses of simple_prompt() to be interrupted by ^C. 2021-11-19 12:11:46 -05:00
string.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
stringinfo.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
unicode_norm.c Fix buffer overrun in unicode string normalization with empty input 2021-11-11 15:00:59 +09:00
username.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
wait_error.c Update copyright for 2021 2021-01-02 13:06:25 -05:00
wchar.c Add fast path for validating UTF-8 text 2021-12-20 10:07:29 -04:00