Commit graph

7674 commits

Author SHA1 Message Date
Randall Stewart
2ad584c555 tcp: Inconsistent use of hpts_calling flag
Gleb has noticed there were some inconsistency's in the way the inp_hpts_calls flag was being used. One
such inconsistency results in a bug when we can't allocate enough sendmap entries to entertain a call to
rack_output().. basically a timer won't get started like it should. Also in cleaning this up I find that the
"no_output" side of input needs to be adjusted to make sure we don't try to re-pace too quickly outside
the hpts assurance of 250useconds.

Another thing here is we end up with duplicate calls to tcp_output() which we should not. If packets go
from hpts for processing the input side of tcp will call the output side of tcp on the last packet if it is needed.
This means that when that occurs a second call to tcp_output would be made that is not needed and if pacing
is going on may be harmful.

Lets fix all this and explicitly state the contract that hpts is making with transports that care about the
flag.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39653
2023-04-17 17:10:26 -04:00
Randall Stewart
37229fed38 tcp: Blackbox logging and tcp accounting together can cause a crash.
If you currently turn BB logging on and in combination have TCP Accounting on we can get a
crash where we have no NULL check and we run out of memory. Also lets make sure we
don't do a divide by 0 in calculating any BB ratios.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39622
2023-04-17 13:52:00 -04:00
Gleb Smirnoff
3232b1f4a9 tcp: fix build
The recent 25685b7537 came in conflict with a540cdca31.  Remove the
code that cleans up the old style input queue.  Note that two lines
below we assert that the new style input queue is empty.  The TCP
stacks that use the queue are supposed to flush it in their
tfb_tcp_fb_fini method.
2023-04-17 10:24:20 -07:00
Gleb Smirnoff
a540cdca31 tcp_hpts: use queue(9) STAILQ for the input queue
Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D39574
2023-04-17 09:07:23 -07:00
Randall Stewart
3cc7b66732 tcp: stack unloading crash in rack and bbr
Its possible to induce a crash in either rack or bbr. This would be done
if the rack stack were say the default and bbr was being used by a connection.
If the bbr stack is then unloaded and it was active, we will trigger a MPASS assert
in tcp_hpts since the new stack (default rack) would start a timer, and the old stack
(bbr) would have the inp already in hpts.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39576
2023-04-14 15:42:23 -04:00
Randall Stewart
9903bf34f0 tcp: rack pacing has some caveats that need to be obeyed when LRO is missing
n further non-LRO testing I found a case where rack is supposed to be waking up but
it is not now. In this special case it sets the flag rc_ack_can_sendout_data. When that is
set we should not prohibit output.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39565
2023-04-14 09:33:36 -04:00
Randall Stewart
25685b7537 TCP: Misc cleanups of tcp_subr.c
In going through all the TCP stacks I have found we have a few little bugs and niggles that need to
be cleaned up in tcp_subr.c including the following:

a) Set tcp_restoral_thresh to 450 (45%) not 550. This is a better proven value in my testing.
b) Lets track when we try to do pacing but fail via a counter for connections that do pace.
c) If a switch away from the default stack occurs and it fails we need to make sure the time
   scale is in the right mode (just in case the other stack changed it but then failed).
d) Use the TP_RXTCUR() macro when starting the TT_REXMT timer.
e) When we end a default flow lets log that in BBlogs as well as cleanup any t_acktime (disable).
f) When we respond with a RST lets make sure to update the log_end_status properly.
g) When starting a new pcb lets assure that all LRO features are off.
h) When discarding a connection lets make sure that any t_in_pkt's that might be there are freed properly.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39501
2023-04-13 09:29:05 -04:00
Michael Tuexen
2ba2849c82 tcp: fix typo in comment
Reported by:	cc
MFC after:	1 week
Sponsored by:	Netflix, Inc.
2023-04-12 18:08:21 +02:00
Michael Tuexen
c687f21add tcp: make net.inet.tcp.functions_default vnet specific
Reviewed by:		cc, rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D39516
2023-04-12 18:04:27 +02:00
Randall Stewart
1073f41657 tcp_lro: When processing compressed acks lets support the new early wake feature for rack.
During compressed ack and mbuf queuing we determine if we need to wake up. A
new function was added that is optional to the tfb so that the stack itself can also
be asked if a wakeup should happen. This helps compensate for late hpts calls.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39502
2023-04-12 11:35:14 -04:00
Michael Tuexen
73c48d9d8f tcp: fix deregistering stacks when vnets are used
This fixes a bug where stacks could not be deregistered when
end points in the non-default vnet are using it.

Reviewed by:		glebius, zlei
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D39514
2023-04-12 10:52:53 +02:00
Richard Scheffenegger
2169f71277 tcp: use IPV6_FLOWLABEL_LEN
Avoid magic numbers when handling the IPv6 flow ID for
DSCP and ECN fields and use the named variable instead.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D39503
2023-04-11 18:53:51 +02:00
Randall Stewart
a2b33c9a7a tcp: Rack - in the absence of LRO fixed rate pacing (loopback or interfaces with no LRO) does not work correctly.
Rack is capable of fixed rate or dynamic rate pacing. Both of these can get mixed up when
LRO is not available. This is because LRO will hold off waking up the tcp connection to
processing the inbound packets until the pacing timer is up. Without LRO the pacing only
sort-of works. Sometimes we pace correctly, other times not so much.

This set of changes will make it so pacing works properly in the absence of LRO.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39494
2023-04-10 16:33:56 -04:00
John Baldwin
e222461790 rack: mask and tclass are only used for INET6.
This fixes the LINT-NOINET6 build.
2023-04-10 12:21:03 -07:00
John Baldwin
4b6228906f libalias: Mark set but unused variables as unused.
This function is clearly a stub, but it seems better to leave the stub
bits in place than to remove the function entirely.

Differential Revision:	https://reviews.freebsd.org/D39355
2023-04-10 10:35:29 -07:00
Gleb Smirnoff
66fbc19fbd tcp: pass tcpcb in the tfb_tcp_ctloutput() method instead of inpcb
Just matches rest of the KPI.

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D39435
2023-04-07 12:18:10 -07:00
Gleb Smirnoff
35bc0bcc51 tcp: reduce argument list to functions that pass a segment
The socket argument is superfluous, as a tcpcb always has one and
only one socket.

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D39434
2023-04-07 12:18:06 -07:00
Gleb Smirnoff
de4368dd84 tcp: retire tfb_tcp_hpts_do_segment()
Isn't in use anymore.  Correct comments that mention it.

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D39433
2023-04-07 12:18:02 -07:00
Randall Stewart
945f9a7cc9 tcp: misc cleanup of options for rack as well as socket option logging.
Both BBR and Rack have the ability to log socket options, which is currently disabled. Rack
has an experimental SaD (Sack Attack Detection) algorithm that should be made available. Also
there is a t_maxpeak_rate that needs to be removed (its un-used).

Reviewed by: tuexen, cc
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39427
2023-04-07 10:15:29 -04:00
Mateusz Guzik
c67eb393fa tcp_hpts: plug a compiler warn
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-04-05 14:32:13 +00:00
Gleb Smirnoff
84b42df834 rack: fix build on powerpc 2023-04-04 16:35:36 -07:00
Randall Stewart
030434acaf Update rack to the latest code used at NF.
There have been many changes to rack over the last couple of years, including:
     a) Ability when switching stacks to have one stack query another.
     b) Internal use of micro-second timers instead of ticks.
     c) Many changes to pacing in forms of
        1) Improvements to Dynamic Goodput Pacing (DGP)
        2) Improvements to fixed rate paciing
        3) A new feature called hybrid pacing where the requestor can
           get a combination of DGP and fixed rate pacing with deadlines
           for delivery that can dynamically speed things up.
     d) All kinds of bugs found during extensive testing and use of the
        rack stack for streaming video and in fact all data transferred
        by NF

Reviewed by: glebius, gallatin, tuexen
Sponsored By: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D39402
2023-04-04 16:05:46 -04:00
Gleb Smirnoff
2ff8187efd tcp_hpts: remove dead code tcp_drop_in_pkts()
Should have gone in f971e79139.
2023-04-04 12:55:27 -07:00
Randall Stewart
73ee5756de Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features.
So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).

Once all this lands I will then update rack to begin using all these new features.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210
2023-04-01 01:46:38 -04:00
Kristof Provost
28921c4f7d carp: allow commands to use interface name rather than index
Get/set commands can now choose to provide the interface name rather
than the interface index. This allows userspace to avoid a call to
if_nametoindex().

Suggested by:	melifaro
Reviewed by:	melifaro
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D39359
2023-03-31 11:29:58 +02:00
Richard Scheffenegger
f858eb916f tcp: send SACK rescue retransmission also mid-stream
Previously, SACK rescue retransmissions would only happen
on a loss recovery at the tail end of the send buffer.

This extends the mechanism such that partial ACKs without SACK
mid-stream also trigger a rescue retransmission to try avoid
an otherwise unavoidable retransmission timeout.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D39274
2023-03-28 04:47:01 +02:00
Gleb Smirnoff
78e6c3aacc tcp: update error counter when dropping a packet due to bad source
Use the same counter that ip_input()/ip6_input() use for bad destination
address.  For IPv6 this is already heavily abused ip6s_badscope, which
needs to be split into several separate error counters.

Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D39234
2023-03-27 18:37:15 -07:00
Kristof Provost
ccff2078af carp: fix source MAC
When we're not in unicast mode we need to change the source MAC address.
The check for this was wrong, because IN_MULTICAST() assumes host
endianness and the address in sc_carpaddr is in network endianness.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-03-28 01:18:18 +02:00
Alexander V. Chernikov
19e43c163c netlink: add netlink KPI to the kernel by default
This change does the following:

Base Netlink KPIs (ability to register the family, parse and/or
 write a Netlink message) are always present in the kernel. Specifically,
* Implementation of genetlink family/group registration/removal,
  some base accessors (netlink_generic_kpi.c, 260 LoC) are compiled in
  unconditionally.
* Basic TLV parser functions (netlink_message_parser.c, 507 LoC) are
  compiled in unconditionally.
* Glue functions (netlink<>rtsock), malloc/core sysctl definitions
 (netlink_glue.c, 259 LoC) are compiled in unconditionally.
* The rest of the KPI _functions_ are defined in the netlink_glue.c,
 but their implementation calls a pointer to either the stub function
 or the actual function, depending on whether the module is loaded or not.

This approach allows to have only 1k LoC out of ~3.7k LoC (current
 sys/netlink implementation) in the kernel, which will not grow further.
It also allows for the generic netlink kernel customers to load
 successfully without requiring Netlink module and operate correctly
 once Netlink module is loaded.

Reviewed by:	imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D39269
2023-03-27 13:55:44 +00:00
Andrew Gallatin
abba58766f LRO: Add missing checks for invalid IP addresses
LRO bypasses normal ip_input()/tcp_input() and lacks several checks
that are present in the normal path.  Without these checks, it
is possible to trigger assertions added in b0ccf53f24

Reviewed by: glebius, rrs
Sponsored by: Netflix
2023-03-25 11:56:02 -04:00
Kristof Provost
511a6d5ed3 carp: use if_name()
Reported by:	melifaro
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-03-20 14:37:10 +01:00
Kristof Provost
137818006d carp: support unicast
Allow users to configure the address to send carp messages to. This
allows carp to be used in unicast mode, which is useful in certain
virtual configurations (e.g. AWS, VMWare ESXi, ...)

Reviewed by:	melifaro
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D38940
2023-03-20 14:37:09 +01:00
Kristof Provost
40e0435964 carp: add netlink interface
Allow carp configuration information to be supplied and retrieved via
netlink.

Reviewed by:	melifaro
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D39048
2023-03-20 10:52:27 +01:00
Michael Tuexen
48345048cd sctp: fix typo in assignment 2023-03-18 23:58:50 +01:00
Michael Tuexen
8ed1e2c880 sctp: enforce Kahn's rule during the handshake
Don't take RTT measurements on packets containing INIT or COOKIE-ECHO
chunks, when they were retransmitted.

MFC after:	1 week
2023-03-16 17:40:40 +01:00
Randall Stewart
69c7c81190 Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities.
The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831
2023-03-16 11:43:16 -04:00
Zhenlei Huang
49cad3daf2 carp: carp_master_down_locked() requires net epoch
Reviewed by:	kp
Fixes:		1d126e9b94 carp: Widen epoch coverage
MFC after:	1 day
Differential Revision:	https://reviews.freebsd.org/D39113
2023-03-16 18:07:03 +08:00
Michael Tuexen
c91ae48a25 sctp: don't do RTT measurements with cookies
When receiving a cookie, the receiver does not know whether the
peer retransmitted the COOKIE-ECHO chunk or not. Therefore, don't
do an RTT measurement. It might be much too long.
To overcome this limitation, one could do at least two things:
1. Bundle the INIT-ACK chunk with a HEARTBEAT chunk for doing the
   RTT measurement. But this is not allowed.
2. Add a flag to the COOKIE-ECHO chunk, which indicates that it
   is the initial transmission, and not a retransmission. But
   this requires an RFC.

MFC after:	1 week
2023-03-16 10:45:13 +01:00
Michael Tuexen
cee09bda03 sctp: allow disabling of SCTP_ACCEPT_ZERO_CHECKSUM socket option 2023-03-15 22:55:23 +01:00
Michael Tuexen
6026b45aab sctp: improve negotiation of zero checksum feature
Enforce consistency between announcing 0-cksum support and actually
using it in the association. The value from the inp when the
INIT ACK is sent must be used, not the one from the inp when the
cookie is received.
2023-03-15 22:29:52 +01:00
Mina Galić
0b0ae2e4cd jail: convert several functions from int to bool
these functions exclusively return (0) and (1), so convert them to bool

We also convert some networking related jail functions from int to bool
some of which were returning an error that was never used.

Differential Revision: https://reviews.freebsd.org/D29659
Reviewed by: imp, jamie (earlier version)
Pull Request: https://github.com/freebsd/freebsd-src/pull/663
2023-03-14 21:05:33 -06:00
Mark Johnston
aa71d6b4a2 netinet: Disallow unspecified addresses in ICMP-embedded packets
Reported by:	glebius
Reported by:	syzbot+981c528ccb5c5534dffc@syzkaller.appspotmail.com
Reviewed by:	tuexen, glebius
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D38936
2023-03-13 10:45:56 -04:00
Michael Tuexen
4a2b92d99f sctp: initial implementation of draft-tuexen-tsvwg-sctp-zero-checksum 2023-03-10 01:45:46 +01:00
Mark Johnston
713264f6b8 netinet: Tighten checks for unspecified source addresses
The assertions added in commit b0ccf53f24 ("inpcb: Assert against
wildcard addrs in in_pcblookup_hash_locked()") revealed that protocol
layers may pass the unspecified address to in_pcblookup().

Add some checks to filter out such packets before we attempt an inpcb
lookup:
- Disallow the use of an unspecified source address in in_pcbladdr() and
  in6_pcbladdr().
- Disallow IP packets with an unspecified destination address.
- Disallow TCP packets with an unspecified source address, and add an
  assertion to verify the comment claiming that the case of an
  unspecified destination address is handled by the IP layer.

Reported by:	syzbot+9ca890fb84e984e82df2@syzkaller.appspotmail.com
Reported by:	syzbot+ae873c71d3c71d5f41cb@syzkaller.appspotmail.com
Reported by:	syzbot+e3e689aba1d442905067@syzkaller.appspotmail.com
Reviewed by:	glebius, melifaro
MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D38570
2023-03-06 15:06:00 -05:00
Fidaullah Noonari
290f7f4a09 in_mcat.c: change multicast not member condition
If there is no source filter entry => block if that's SSM ("exclude"
mode per RFC 3678 clause 3).  If there is an entry => check its action &
block if the action is "exclude".

It would be nice if the test case in this PR were converted into an ATF
test case, but not blocking on that.

Reviewed by: imp, melifaro
Pull Request: https://github.com/freebsd/freebsd-src/pull/601
2023-03-03 22:25:17 -07:00
Gleb Smirnoff
7fc82fd1f8 ipfw: garbage collect ip_fw_chk_ptr
It is a relict left from the old times when ipfw(4) was hooked
into IP stack directly, without pfil(9).
2023-03-03 10:30:15 -08:00
Mark Johnston
317fa5169d netinet: Remove the IP(V6)_RSS_LISTEN_BUCKET socket option
It has no effect, and an exp-run revealed that it is not in use.

PR:		261398 (exp-run)
Reviewed by:	mjg, glebius
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D38822
2023-02-28 15:57:21 -05:00
Richard Scheffenegger
399a5655e6 tcp: Make TCP PCAP buffer properly configurable.
Reviewed By:		tuexen, cc, #transport
MFC after:		3 days
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D38824
2023-02-28 20:12:11 +01:00
Mark Johnston
3aff4ccdd7 netinet: Remove IP(V6)_BINDMULTI
This option was added in commit 0a100a6f1e but was never completed.
In particular, there is no logic to map flowids to different listening
sockets, so it accomplishes basically the same thing as SO_REUSEPORT.
Meanwhile, we've since added SO_REUSEPORT_LB, which at least tries to
balance among listening sockets using a hash of the 4-tuple and some
optional NUMA policy.

The option was never documented or completed, and an exp-run revealed
nothing using it in the ports tree.  Moreover, it complicates the
already very complicated in_pcbbind_setup(), and the checking in
in_pcbbind_check_bindmulti() is insufficient.  So, let's remove it.

PR:		261398 (exp-run)
Reviewed by:	glebius
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D38574
2023-02-27 10:03:11 -05:00
Alfonso
2f201df1f8 Change hw_tls to a bool
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/512
2023-02-25 09:59:11 -07:00