Commit graph

7583 commits

Author SHA1 Message Date
Gordon Bergling
fa7de6dcb9 ip_gre: Fix a common typo in source code comments
- s/addres/address/

MFC after:	3 days
2023-01-19 14:13:02 +01:00
Gordon Bergling
73e994a998 extra_tcp_stacks: Fix a common typo in source code comments
- s/orginal/original/

MFC after:	3 days
2023-01-19 14:11:00 +01:00
Hans Petter Selasky
e0d8add4af tcp_lro: Fix for undefined behaviour.
Make sure the size of the raw[] array in the lro_address union is
correctly set at compile time, so that static code analysis tools
do not report undefined behaviour.

MFC after:	1 week
Sponsored by:	NVIDIA Networking
2023-01-13 11:18:19 +01:00
Gordon Bergling
432a398d86 tcp_rack(4): Fix a typo in a source code comment
- s/postion/position/

MFC after:	3 days
2023-01-11 12:02:25 +01:00
Gordon Bergling
d68f154205 tcp_hpts: Fix a typo in a source code comment
- s/subract/subtract/

MFC after:	3 days
2023-01-11 11:33:29 +01:00
Andrew Gallatin
8ea4182995 tcp: Build RACK and BBR stacks as a part of LINT
When RACK and BBR were added to the kernel, they were put
behind 'WITH_EXTRA_TCP_STACKS=1'.   Unfortunately that was
never added to any NOTES file, so RACK & BBR were not compiled
with the various LINT-NOINET, LINT-NOINET6, and LINT-NOIP kernels.
This lead to the stacks sometimes being broken.

This change:

- Fixes RACK so that it compiles with the various LINT-NO* kernels
- Adds WITH_EXTRA_TCP_STACKS=1 to all NOTES kernels so that
   RACK and BBR are compile tested regularly

Sponsored by: Netflix
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D37903
2023-01-10 16:16:43 -05:00
Gleb Smirnoff
aab8c844b9 tcp/ipfw: fix "ipfw fwd localaddr,port"
The ipfw(4) feature of forwarding to local address without modifying
a packet was broken.  The first lookup needs always be a non-wildcard
one, cause its goal is to find an already existing socket.  Otherwise
a local wildcard listener with the same port number may match resulting
in the connection being forwared to wrong port.

Reported by:	Pavel Polyakov <bsd kobyla.org>
Fixes:		d88eb4654f
2023-01-05 14:34:50 -08:00
Randall Stewart
26bdd35c39 rack and bbr not loading if TCP_RATELIMIT is not configured.
So it turns out that rack and bbr still will not load without TCP_RATELIMIT. This needs
to be fixed and lets also at the same time bring tcp_ratelimit up to date where we allow
the transports to set a divisor (though still having a default path with the default
divisor of 1000) for setting the burst size.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D37954
2023-01-05 11:59:52 -05:00
Cheng Cui
57cc27a332 BBLog: improve sysctl variables
Correct the format in sysctl net.inet.tcp.bb.disable_all and
sysctl net.inet.tcp.bb.log_auto_all.
Correct the format and the description in
net.inet.tcp.bb.log_auto_mode.

Reviewed by:		rscheff, tuexen
MFC after:		1 week
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37776
2022-12-24 22:10:31 +01:00
Randall Stewart
2e2a1c3139 Opps take out a stray left behind printf that was
for debugging.. Sorry.
2022-12-14 16:11:39 -05:00
Randall Stewart
e2e088ae86 Rack cannot be loaded without cc_newreno compiled into the kernel.
Right now rack will fail to load due to its hack in accessing symbol names
in cc_newreno. This was fine when newreno was always compiled into the
kernel but now ... not so much. Instead lets fix up rack to use the socket
option queries to get the information it wants and set the parameters. We
also fix the CC parameter so they are always settable.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D37622
2022-12-14 15:37:48 -05:00
Andrew Gallatin
c4a4b2633d allocate inpcb aligned to cachelines
The inpcb struct is one of the most heavily utilized in the kernel
on a busy network server.  By aligning it to a cacheline
boundary, we can ensure that closely related fields in the inpcb
and tcbcb can be predictably located on the same cacheline.  rrs
has already done a lot of this work to put related fields on the
same line for the tcbcb.

In combination with a forthcoming patch to align the start of the tcpcb,
we see a roughly 3% reduction in CPU use on a busy web server serving
traffic over roughly 50,000 TCP connections.

Reviewed by: glebius, markj, tuexen
Differential Revision: https://reviews.freebsd.org/D37687
Sponsored by: Netflix
2022-12-14 14:19:35 -05:00
Gleb Smirnoff
eaabc93764 tcp: retire TCPDEBUG
This subsystem is superseded by modern debugging facilities,
e.g. DTrace probes and TCP black box logging.

We intentionally leave SO_DEBUG in place, as many utilities may
set it on a socket.  Also the tcp::debug DTrace probes look at
this flag on a socket.

Reviewed by:		gnn, tuexen
Discussed with:		rscheff, rrs, jtl
Differential revision:	https://reviews.freebsd.org/D37694
2022-12-14 09:54:06 -08:00
Mateusz Guzik
e6fc01f6be tcp: whack the stale declaration of rack_timer_stop
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-12-14 08:48:52 +00:00
Gleb Smirnoff
1b5895c624 tcp: remove a 4.4BSD relic
The actual code to modify this counter was disabled in 2c37256e5a
and later removed in d0390e0570.
2022-12-13 20:21:45 -08:00
Gleb Smirnoff
5050df3f4a tcp: fix counter leak for SYN_RCVD state when syncache_socket() fails
The SYN_RCVD state count is tricky here due to default code path and TFO
being so different.  In the default case the count is incremented when a
syncache entry is added to the the database in syncache_insert().  Later
when connection transitions from syncache entry to a socket in
syncache_expand(), this counter is inherited by the tcpcb.  If socket or
tcpcb allocation failed in syncache_socket() failed the syncache_expand()
is responsible for decrement.  In the TFO case the syncache entry is not
inserted into database and count of SYN_RCVD is first incremented in the
syncache_tfo_expand() after successful socket allocation.  Thus, inside
syncache_socket() we can't tell whether we need to decrement in a case of
a failure or not.  The caller is responsible for this book keeping.

Fixes:	07285bb4c2
Differential revision:	https://reviews.freebsd.org/D37610
2022-12-13 19:31:05 -08:00
Gleb Smirnoff
1aed3b3430 udp: add protocol method declarations to udp_var.h
They are shared between UDP over IPv4 and over IPv6.  To prevent all
possible kernel build failures wrap them in #ifdef _SYS_PROTOSW_H_.
Prompted by feedback from jhb@ and jrtc27@ on c93db4abf4.
2022-12-07 11:51:49 -08:00
Gleb Smirnoff
32920f038a udp: inline udp_output() into udp_send() 2022-12-07 11:51:48 -08:00
Gleb Smirnoff
483fe96511 udp: embed inpcb into udpcb
See similar change to TCP e68b379244 for more context.  For UDP the
change is much simplier, though.
2022-12-07 11:51:42 -08:00
Gleb Smirnoff
0c0d8a4f7e udp: rearrange declarations in udp_var.h into user and _KERNEL halves
Bring everything that belongs to _KERNEL into single block.  Move
sub-includes to its beginning.
2022-12-07 09:55:38 -08:00
Gleb Smirnoff
294a609fc0 udp: destroy UDP and UDP-Lite inpcbinfos in single SYSUNINIT
They are created in a single SYSINIT, there is no reason to destroy
them in separate functions.
2022-12-07 09:55:38 -08:00
Gleb Smirnoff
446ccdd08e tcp: use single locked callout per tcpcb for the TCP timers
Use only one callout structure per tcpcb that is responsible for handling
all five TCP timeouts.  Use locked version of callout, of course. The
callout function tcp_timer_enter() chooses soonest timer and executes it
with lock held.  Unless the timer reports that the tcpcb has been freed,
the callout is rescheduled for next soonest timer, if there is any.

With single callout per tcpcb on connection teardown we should be able
to fully stop the callout and immediately free it, avoiding use of
callout_async_drain().  There is one gotcha here: callout_stop() can
actually touch our memory when a rare race condition happens.  See
comment above tcp_timer_stop().  Synchronous stop of the callout makes
tcp_discardcb() the single entry point for tcpcb destructor, merging the
tcp_freecb() to the end of the function.

While here, also remove lots of lingering checks in the beginning of
TCP timer functions.  With a locked callout they are unnecessary.

While here, clean unused parts of timer KPI for the pluggable TCP stacks.

While here, remove TCPDEBUG from tcp_timer.c, as this allows for more
simplification of TCP timers.  The TCPDEBUG is scheduled for removal.

Move the DTrace probes in timers to the beginning of a function, where
a tcpcb is always existing.

Discussed with:		rrs, tuexen, rscheff	(the TCP part of the diff)
Reviewed by:		hselasky, kib, mav	(the callout part)
Differential revision:	https://reviews.freebsd.org/D37321
2022-12-07 09:00:48 -08:00
Gleb Smirnoff
918fa4227d tcp: remove tcp_timer_suspend()
It was a temporary code added together with RACK to fight against
TCP timer races.
2022-12-07 09:00:48 -08:00
Gleb Smirnoff
e68b379244 tcp: embed inpcb into tcpcb
For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.

The most import one is the inpcb itself.  With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa.  Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that.  However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort.  The new intotcpcb() macro is ready for such move.

The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.

One interesting side effect is that now TCP data is allocated from
SMR-protected zone.  Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.

Large part of the change was done with sed script:

s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g

Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.

Differential revision:	https://reviews.freebsd.org/D37127
2022-12-07 09:00:48 -08:00
Gleb Smirnoff
0aa120d52f inpcb: allow to provide protocol specific pcb size
The protocol specific structure shall start with inpcb.

Differential revision:	https://reviews.freebsd.org/D37126
2022-12-02 14:10:55 -08:00
John Baldwin
d00c20882f udp[6]_multi_input: Don't unlock freed inp.
If udp[6]_append() returns non-zero, it is because the inp has gone
away (inpcbrele_rlocked returned 1 after running the tunnel function).

Reviewed by:	ae
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D37511
2022-11-30 14:38:51 -08:00
Michael Tuexen
bd4f986644 tcp: remove unused t_rttbest
No functional change intended.

Reviewed by:		rscheff@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D37401
2022-11-16 11:22:13 +01:00
Michael Tuexen
9a71437621 libalias: improve handling of invalid SCTP packets
In case of a paritial chunk only pretend the result is OK if
the packet is not the last fragment and there is a valid association.

PR:		267476
MFC after:	3 days
2022-11-15 21:05:02 +01:00
Richard Scheffenegger
1a70101a87 tcp: account sent/received IP ECN markings independently
Have tcpstats (netstat -s) differentiate between received and sent
ECN-marked packets. Also account for IP ECN bits (on TCP packets)
even when the tcp session has not negotiated ECN support.

Event:			IETF 115 Hackathon
Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37314
2022-11-10 11:35:35 +01:00
Richard Scheffenegger
0b00b80149 ipfw: Have NAT steal the TH_RES1 bit, instead of the TH_AE bit
The NAT module use of the tcphdr.th_x2 field now collides with the
use of this TCP header flag as AccECN (AE) bit. Use the topmost
bit instead to allow negotiation of AccECN across a NAT device.

Event:			IETF 115 Hackathon
Reviewed By:		#transport, tuexen
MFC after:		3 days
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37300
2022-11-09 11:19:19 +01:00
Gleb Smirnoff
b40ae8c9fe tcp: fix build without INVARIANTS and VIMAGE
Lines from upcoming changes crept in and broke certain builds.

Fixes:	9eb0e8326d
2022-11-08 12:34:45 -08:00
Gleb Smirnoff
326f455625 tcp: forward declare struct tcpcb in the TCP logging header
This allows to include tcp_log_buf.h without including tcp_var.h.
2022-11-08 10:32:29 -08:00
Gleb Smirnoff
73bebcc5bd inpcb: remove TCP includes, all TCP specific code was moved 2022-11-08 10:24:40 -08:00
Gleb Smirnoff
8840ae2288 tcp: don't store VNET in every tcpcb, take it from the inpcbinfo
Reviewed by:		rscheff
Differential revision:	https://reviews.freebsd.org/D37125
2022-11-08 10:24:40 -08:00
Gleb Smirnoff
ab0ef9455f hpts: move inp initialization from the generic inpcb code to TCP
Differential revision:	https://reviews.freebsd.org/D37124
2022-11-08 10:24:40 -08:00
Gleb Smirnoff
9eb0e8326d tcp: provide macros to access inpcb and socket from a tcpcb
There should be no functional changes with this commit.

Reviewed by:		rscheff
Differential revision:	https://reviews.freebsd.org/D37123
2022-11-08 10:24:40 -08:00
Gleb Smirnoff
f71cb9f748 tcp: inp_socket is valid through the lifetime of a TCP inpcb
The inp_socket is cleared only in in_pcbdetach(), which for TCP is
always accompanied with inp_pcbfree().  An inpcb that went through
in_pcbfree() shall never be returned by any kind of pcb lookup.

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D37062
2022-11-08 10:24:39 -08:00
Gleb Smirnoff
ada90cb978 tcp: remove INP_DROPPED check from notify functions
These functions tcp_notify(), tcp_drop_syn_sent() and tcp_mtudisc()
are called from tcp*_ctlinput*() right after successfull
in_pcblookup*().  They shall never get a pcb that is dropped.
2022-11-08 10:24:39 -08:00
Gleb Smirnoff
f567d55f51 inpcb: don't return INP_DROPPED entries from pcb lookups
The in_pcbdrop() KPI, which is used solely by TCP, allows to remove a
pcb from hash list and mark it as dropped.  The comment suggests that
such pcb won't be returned by lookups.  Indeed, every call to
in_pcblookup*() is accompanied by a check for INP_DROPPED.  Do what
comment suggests: never return such pcbs and remove unnecessary checks.

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D37061
2022-11-08 10:24:39 -08:00
Richard Scheffenegger
dc9daa04fb tcp: allow packets to be marked as ECT1 instead of ECT0
This adds the capability for a modular congestion control
to select which variant of ECN-capable-transport it wants to use
when sending out elegible segments. As an initial CC to utilize
this, DCTCP was selected.

Event:			IETF 115 Hackathon
Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D24869
2022-11-08 18:36:38 +01:00
Gordon Bergling
bcf8fb7f03 tcp_bbr(4): Fix a typo in a source code comment
- s/retranmitted/retransmitted/

MFC after:	3 days
2022-11-08 14:59:56 +01:00
Michael Tuexen
126f8248cc Unbreak builds having SCTP support compiled in
Including sctp_var.h requires INET to be defined if IPv4 support
is needed.
2022-11-07 08:50:51 +01:00
Michael Tuexen
f83db6441a sctp: minor changes due to upstreaming of Glebs recent changes 2022-11-06 23:06:40 +01:00
Richard Scheffenegger
37bf391d3c tcp: make tcp_packets_this_ack() only visible in kernel scope 2022-11-06 13:51:57 +01:00
Richard Scheffenegger
004bb636ca tcp: Move sysctl OIDs related to ECN to tcp_ecn.c
Keep all ECN related code in (mostly) one place.

No functional change.

Event:			IETF 115 Hackathon
Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37285
2022-11-06 12:38:42 +01:00
Richard Scheffenegger
b1258b7643 tcp: add conservative d.cep accounting algorithm
Accurate ECN asks to conservatively estimate, when the
ACE counter may have wrapped due to a single ACK covering a larger
number of segments. This is described in Annex A.2 of the
accurate-ecn draft.

Event:			IETF 115 Hackathon
Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37281
2022-11-06 12:05:22 +01:00
Richard Scheffenegger
22c81cc516 tcp: add AccECN CE packet counters to tcpinfo
Provide diagnostics information around AccECN into
the tcpinfo struct.

Event:			IETF 115 Hackathon
Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37280
2022-11-06 11:56:02 +01:00
Richard Scheffenegger
3708c3d370 tcp: reserve tcp_info counters for AccECN
Marking all new fields unused (__xxx).

No functional change.

Reviewed By:		tuexen, rrs, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37016
2022-11-04 10:18:53 +01:00
Mark Johnston
d93ec8cb13 inpcb: Allow SO_REUSEPORT_LB to be used in jails
Currently SO_REUSEPORT_LB silently does nothing when set by a jailed
process.  It is trivial to support this option in VNET jails, but it's
also useful in traditional jails.

This patch enables LB groups in jails with the following semantics:
- all PCBs in a group must belong to the same jail,
- PCB lookup prefers jailed groups to non-jailed groups

This is a straightforward extension of the semantics used for individual
listening sockets.  One pre-existing quirk of the lbgroup implementation
is that non-jailed lbgroups are searched before jailed listening
sockets; that is preserved with this change.

Discussed with:	glebius
MFC after:	1 month
Sponsored by:	Modirum MDPay
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D37029
2022-11-02 13:46:24 -04:00
Mark Johnston
a152dd8634 inpcb: Remove a PCB from its LB group upon a subsequent error
If a memory allocation failure causes bind to fail, we should take the
inpcb back out of its LB group since it's not prepared to handle
connections.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Modirum MDPay
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D37027
2022-11-02 13:46:24 -04:00