Commit graph

301 commits

Author SHA1 Message Date
Gleb Smirnoff
c4804b6b0b Unbreak TFO, that was broken with 8d5719aa74. These two assignments
are unneccessary and used to be there before TFO as an invariant.  With
TFO and after 8d5719aa74 the "so" value is still needed.

Reported & tested by:	tuexen
Fixes:	8d5719aa74
2021-06-22 16:03:44 -07:00
Michael Tuexen
9e644c2300 tcp: add support for TCP over UDP
Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D29469
2021-04-18 16:16:42 +02:00
Gleb Smirnoff
cb8d7c44d6 tcp_syncache: add net.inet.tcp.syncache.see_other sysctl
A security feature from c06f087ccb appeared to be a huge bottleneck
under SYN flood. To mitigate that add a sysctl that would make
syncache(4) globally visible, ignoring UID/GID, jail(2) and mac(4)
checks. When turned on, we won't need to call crhold() on the listening
socket credential for every incoming SYN packet.

Reviewed by:	bz
2021-04-15 15:26:48 -07:00
Gleb Smirnoff
8d5719aa74 syncache: simplify syncache_add() KPI to return struct socket pointer
directly, not overwriting the listen socket pointer argument.
Not a functional change.
2021-04-12 08:27:40 -07:00
Gleb Smirnoff
08d9c92027 tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets
When packet is a SYN packet, we don't need to modify any existing PCB.
Normally SYN arrives on a listening socket, we either create a syncache
entry or generate syncookie, but we don't modify anything with the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.

Sidenote: when SYN arrives on a synchronized connection, we still
don't need write access to PCB to send a challenge ACK or just to
drop. There is only one exclusion - tcptw recycling. However,
existing entanglement of tcp_input + stacks doesn't allow to make
this change small. Consider this patch as first approach to the problem.

Reviewed by:	rrs
Differential revision:	https://reviews.freebsd.org/D29576
2021-04-12 08:25:31 -07:00
Richard Scheffenegger
2593f858d7 A TCP server has to take into consideration, if TCP_NOOPT is preventing
the negotiation of TCP features. This affects most TCP options but
adherance to RFC7323 with the timestamp option will prevent a session
from getting established.

PR:	253576
Reviewed By:	tuexen, #transport
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28652
2021-02-25 19:12:20 +01:00
Michael Tuexen
d2b3ceddcc tcp: add sysctl to tolerate TCP segments missing timestamps
When timestamp support has been negotiated, TCP segements received
without a timestamp should be discarded. However, there are broken
TCP implementations (for example, stacks used by Omniswitch 63xx and
64xx models), which send TCP segments without timestamps although
they negotiated timestamp support.
This patch adds a sysctl variable which tolerates such TCP segments
and allows to interoperate with broken stacks.

Reviewed by:		jtl@, rscheff@
Differential Revision:	https://reviews.freebsd.org/D28142
Sponsored by:		Netflix, Inc.
PR:			252449
MFC after:		1 week
2021-01-14 19:28:25 +01:00
Michael Tuexen
75fcd27ac2 Fix two occurences of a typo in a comment introduced in r367530.
Reported by:		lstewart@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D27148
2020-11-23 10:13:56 +00:00
Michael Tuexen
283c76c7c3 RFC 7323 specifies that:
* TCP segments without timestamps should be dropped when support for
  the timestamp option has been negotiated.
* TCP segments with timestamps should be processed normally if support
  for the timestamp option has not been negotiated.
This patch enforces the above.

PR:			250499
Reviewed by:		gnn, rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc
Differential Revision:	https://reviews.freebsd.org/D27148
2020-11-09 21:49:40 +00:00
Mateusz Guzik
662c13053f net: clean up empty lines in .c and .h files 2020-09-01 21:19:14 +00:00
Michael Tuexen
cf8a49ab6e Fix the following issues related to the TCP SYN-cache:
* Let the accepted TCP/IPv4 socket inherit the configured TTL and
  TOS value.
* Let the accepted TCP/IPv6 socket inherit the configured Hop Limit.
* Use the configured Hop Limit and Traffic Class when sending
  IPv6 packets.

Reviewed by:		rrs, lutz_donnerhacke.de
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D25909
2020-08-10 20:24:48 +00:00
Michael Tuexen
1bea15e601 Improve the ECN negotiation when the TCP SYN-cache is used by making
sure that
* ECN is disabled if the client sends an non-ECN-setup SYN segment.
* ECN is disabled is the ECN-setup SYN-ACK segment is retransmitted more
  than net.inet.tcp.ecn.maxretries times.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D26008
2020-08-08 19:39:38 +00:00
Michael Tuexen
9c04fdfd34 When using automatically generated flow labels and using TCP SYN
cookies, use the same flow label for the segments sent during the
handshake and after the handshake.
This fixes a bug by making sure that sc_flowlabel is always stored in
network byte order.

Reviewed by:		bz@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D23957
2020-03-04 16:41:25 +00:00
Michael Tuexen
6605e5791f Don't send an uninitilised traffic class in the IPv6 header, when
sending a TCP segment from the TCP SYN cache (like a SYN-ACK).
This fix initialises it to zero. This is correct for the ECN bits,
but is does not honor the DSCP what an application might have set via
the IPPROTO_IPV6 level socket options IPV6_TCLASS. That will be
fixed separately.

Reviewed by:		Richard Scheffenegger
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D23900
2020-03-04 12:22:53 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Randall Stewart
481be5de9d White space cleanup -- remove trailing tab's or spaces
from any line.

Sponsored by:	Netflix Inc.
2020-02-12 13:31:36 +00:00
Randall Stewart
596ae436ef This small fix makes it so we properly follow
the RFC and only enable ECN when both the
CWR and ECT bits our set within the SYN packet.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D23645
2020-02-12 13:04:19 +00:00
Gleb Smirnoff
b955545386 Make ip6_output() and ip_output() require network epoch.
All callers that before may called into these functions
without network epoch now must enter it.
2020-01-22 05:51:22 +00:00
Gleb Smirnoff
bab98355f9 Add some documenting NET_EPOCH_ASSERTs. 2020-01-22 02:37:47 +00:00
Michael Tuexen
fe1274ee39 Fix race when accepting TCP connections.
When expanding a SYN-cache entry to a socket/inp a two step approach was
taken:
1) The local address was filled in, then the inp was added to the hash
   table.
2) The remote address was filled in and the inp was relocated in the
   hash table.
Before the epoch changes, a write lock was held when this happens and
the code looking up entries was holding a corresponding read lock.
Since the read lock is gone away after the introduction of the
epochs, the half populated inp was found during lookup.
This resulted in processing TCP segments in the context of the wrong
TCP connection.
This patch changes the above procedure in a way that the inp is fully
populated before inserted into the hash table.

Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@
mailing list and for testing the patch!

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D22971
2020-01-12 17:52:32 +00:00
Michael Tuexen
3cf38784e2 Move all ECN related flags from the flags to the flags2 field.
This allows adding more ECN related flags in the future.
No functional change intended.

Submitted by:		Richard Scheffenegger
Reviewed by:		rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22497
2019-12-01 21:01:33 +00:00
Michael Tuexen
fa49a96419 In order for the TCP Handshake to support ECN++, and further ECN-related
improvements, the ECN bits need to be exposed to the TCP SYNcache.
This change is a minimal modification to the function headers, without any
functional change intended.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22436
2019-12-01 18:05:02 +00:00
Gleb Smirnoff
032677ceb5 Now that there is no R/W lock on PCB list the pcblist sysctls
handlers can be greatly simplified.  All the previous double
cycling and complex locking was added to avoid these functions
holding global PCB locks for extended period of time, preventing
addition of new entries.
2019-11-07 21:27:32 +00:00
Gleb Smirnoff
1a49612526 Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().
Remove few outdated comments and extraneous assertions.  No
functional change here.
2019-11-07 00:08:34 +00:00
Jonathan T. Looney
0b18fb0798 Add new functionality to switch to using cookies exclusively when we the
syn cache overflows. Whether this is due to an attack or due to the system
having more legitimate connections than the syn cache can hold, this
situation can quickly impact performance.

To make the system perform better during these periods, the code will now
switch to exclusively using cookies until the syn cache stops overflowing.
In order for this to occur, the system must be configured to use the syn
cache with syn cookie fallback. If syn cookies are completely disabled,
this change should have no functional impact.

When the system is exclusively using syn cookies (either due to
configuration or the overflow detection enabled by this change), the
code will now skip acquiring a lock on the syn cache bucket. Additionally,
the code will now skip lookups in several places (such as when the system
receives a RST in response to a SYN|ACK frame).

Reviewed by:	rrs, gallatin (previous version)
Discussed with:	tuexen
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:18:57 +00:00
Jonathan T. Looney
0bee4d631a Access the syncache secret directly from the V_tcp_syncache variable,
rather than indirectly through the backpointer to the tcp_syncache
structure stored in the hashtable bucket.

This also allows us to remove the requirement in syncookie_generate()
and syncookie_lookup() that the syncache hashtable bucket must be
locked.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:06:46 +00:00
Jonathan T. Looney
867e98f8ee Remove the unused sch parameter to the syncache_respond() function. The
use of this parameter was removed in r313330. This commit now removes
passing this now-unused parameter.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:02:34 +00:00
Andrew Gallatin
d2e6258258 Avoid unneeded call to arc4random() in syncache_add()
Don't call arc4random() unconditionally to initialize sc_iss, and
then when syncookies are enabled, just overwrite it with the
return value from from syncookie_generate(). Instead, only call
arc4random() to initialize sc_iss when syncookies are not
enabled.

Note that on a system under a syn flood attack, arc4random()
becomes quite expensive, and the chacha_poly crypto that it calls
is one of the more expensive things happening on the
system. Removing this unneeded arc4random() call reduces CPU from
about 40% to about 35% in my test scenario (Broadwell Xeon, 6Mpps
syn flood attack).

Reviewed by:	rrs, tuxen, bz
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21591
2019-09-11 18:48:26 +00:00
Michael Tuexen
bc35229fad When an ACK segment as the third message of the three way handshake is
received and support for time stamps was negotiated in the SYN/SYNACK
exchange, perform the PAWS check and only expand the syn cache entry if
the check is passed.
Without this check, endpoints may get stuck on the incomplete queue.

Reviewed by:		jtl@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D20374
2019-05-26 17:18:14 +00:00
Andrew Gallatin
50575ce11c Track TCP connection's NUMA domain in the inpcb
Drivers can now pass up numa domain information via the
mbuf numa domain field.  This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.

Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.

Reviewed by:	markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20028
2019-04-25 15:37:28 +00:00
Michael Tuexen
0999766ddf Add sysctl variable net.inet.tcp.rexmit_initial for setting RTO.Initial
used by TCP.

Reviewed by:		rrs@, 0mp@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19355
2019-03-23 21:36:59 +00:00
Michael Tuexen
3b853844d7 Reduce the TCP initial retransmission timeout from 3 seconds to
1 second as allowed by RFC 6298.

Reviewed by:		kbowling@, Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18941
2019-02-20 18:03:43 +00:00
Michael Tuexen
c6dcb64b18 Use exponential backoff for retransmitting SYN segments as specified
in the TCP RFCs.

Reviewed by:		rrs@, Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18974
2019-02-20 17:56:38 +00:00
Michael Tuexen
989321df11 Get the arithmetic right...
MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-01-24 16:47:18 +00:00
Michael Tuexen
42395cbe31 Kill a trailing whitespace character...
MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-01-24 16:43:13 +00:00
Michael Tuexen
34bb795ba1 Update a comment to reflect the current reality.
SYN-cache entries live for abaut 12 seconds, not 45, when default
setting are used.

MFC after:		1 week
Sponsored by:		Netflix, Inc.
2019-01-24 16:40:14 +00:00
Michael Tuexen
6999f6975c Remove debug code which slipped in accidently.
MFC after:		4 weeks
X-MFC with:		r339989
Sponsored by:		Netflix, Inc.
2018-11-01 11:41:40 +00:00
Michael Tuexen
099ab39f44 Improve a comment to refer to the actual sections in the TCP
specification for the comparisons made.
Thanks to lstewart@ for the suggestion.

MFC after:		4 weeks
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D17595
2018-11-01 11:35:28 +00:00
Michael Tuexen
93899d10b4 The handling of RST segments in the SYN-RCVD state exists in the
code paths. Both are not consistent and the one on the syn cache code
does not conform to the relevant specifications (Page 69 of RFC 793
and Section 4.2 of RFC 5961).

This patch fixes this:
* The sequence numbers checks are fixed as specified on
  page Page 69 RFC 793.
* The sysctl variable net.inet.tcp.insecure_rst is now honoured
  and the behaviour as specified in Section 4.2 of RFC 5961.

Approved by:		re (gjb@)
Reviewed by:		bz@, glebius@, rrs@,
Differential Revision:	https://reviews.freebsd.org/D17595
Sponsored by:		Netflix, Inc.
2018-10-18 19:21:18 +00:00
Michael Tuexen
078a49a077 Remove the unused parameter 'locked' from the function
syncache_respond(). There is no functional change. The
parameter became unused in r313330, but wasn't removed.

Approved by:		re (kib@)
MFC after:		1 month
Sponsored by:		Netflix, Inc.
2018-09-23 16:37:32 +00:00
Michael Tuexen
7d4dcc36a8 Fix the inheritance of IPv6 level socket options on TCP sockets.
This was broken for IPv6 listening socket, which are not IPV6_ONLY,
and the accepted TCP connection was using IPv4.

Reviewed by:		bz@, rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16792
2018-08-21 14:07:36 +00:00
Michael Tuexen
8e02b4e00c Don't expose the uptime via the TCP timestamps.
The TCP client side or the TCP server side when not using SYN-cookies
used the uptime as the TCP timestamp value. This patch uses in all
cases an offset, which is the result of a keyed hash function taking
the source and destination addresses and port numbers into account.
The keyed hash function is the same a used for the initial TSN.

Reviewed by:		rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16636
2018-08-19 14:56:10 +00:00
Michael Tuexen
6138da62a9 Add missing send/recv dtrace probes for TCP.
These missing probe are mostly in the syncache and timewait code.

Reviewed by:		markj@, rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16369
2018-07-30 20:13:38 +00:00
Andrew Turner
5f901c92a8 Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by:	bz
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16147
2018-07-24 16:35:52 +00:00
Michael Tuexen
43b223f42e When retransmitting TCP SYN-ACK segments with the TCP timestamp option
enabled use an updated timestamp instead of reusing the one used in
the initial TCP SYN-ACK segment.

This patch ensures that an updated timestamp is used when sending the
SYN-ACK from the syncache code. It was already done if the
SYN-ACK was retransmitted from the generic code.

This makes the behaviour consistent and also conformant with
the TCP specification.

Reviewed by:		jtl@, Jason Eggleston
MFC after:		1 month
Sponsored by:		Neflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D15634
2018-06-15 12:28:43 +00:00
Michael Tuexen
c14f9fe5ef Limit the retransmission timer for SYN-ACKs by TCPTV_REXMTMAX.
Use the same logic to handle the SYN-ACK retransmission when sent from
the syn cache code as when sent from the main code.

MFC after:	3 days
Sponsored by:	Netflix, Inc.
2018-06-01 21:24:27 +00:00
Michael Tuexen
badef00d58 Ensure net.inet.tcp.syncache.rexmtlimit is limited by TCP_MAXRXTSHIFT.
If the sysctl variable is set to a value larger than TCP_MAXRXTSHIFT+1,
the array tcp_syn_backoff[] is accessed out of bounds.

Discussed with: jtl@
MFC after:	3 days
Sponsored by:	Netflix, Inc.
2018-06-01 19:58:19 +00:00
Randall Stewart
3ee9c3c4eb This commit brings in the TCP high precision timer system (tcp_hpts).
It is the forerunner/foundational work of bringing in both Rack and BBR
which use hpts for pacing out packets. The feature is optional and requires
the TCPHPTS option to be enabled before the feature will be active. TCP
modules that use it must assure that the base component is compile in
the kernel in which they are loaded.

MFC after:	Never
Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D15020
2018-04-19 13:37:59 +00:00
Michael Tuexen
1574b1e41e Set the inp_vflag consistently for accepted TCP/IPv6 connections when
net.inet6.ip6.v6only=0.

Without this patch, the inp_vflag would have INP_IPV4 and the
INP_IPV6 flags for accepted TCP/IPv6 connections if the sysctl
variable net.inet6.ip6.v6only is 0. This resulted in netstat
to report the source and destination addresses as IPv4 addresses,
even they are IPv6 addresses.

PR:			226421
Reviewed by:		bz, hiren, kib
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D13514
2018-03-16 15:26:07 +00:00
Patrick Kelsey
18a7530938 Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option.
The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.

This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.

Reviewed by:	hiren
MFC after:	1 month
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14048
2018-02-26 03:03:41 +00:00