Make sure the size of the raw[] array in the lro_address union is
correctly set at compile time, so that static code analysis tools
do not report undefined behaviour.
MFC after: 1 week
Sponsored by: NVIDIA Networking
When RACK and BBR were added to the kernel, they were put
behind 'WITH_EXTRA_TCP_STACKS=1'. Unfortunately that was
never added to any NOTES file, so RACK & BBR were not compiled
with the various LINT-NOINET, LINT-NOINET6, and LINT-NOIP kernels.
This lead to the stacks sometimes being broken.
This change:
- Fixes RACK so that it compiles with the various LINT-NO* kernels
- Adds WITH_EXTRA_TCP_STACKS=1 to all NOTES kernels so that
RACK and BBR are compile tested regularly
Sponsored by: Netflix
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D37903
The ipfw(4) feature of forwarding to local address without modifying
a packet was broken. The first lookup needs always be a non-wildcard
one, cause its goal is to find an already existing socket. Otherwise
a local wildcard listener with the same port number may match resulting
in the connection being forwared to wrong port.
Reported by: Pavel Polyakov <bsd kobyla.org>
Fixes: d88eb4654f
So it turns out that rack and bbr still will not load without TCP_RATELIMIT. This needs
to be fixed and lets also at the same time bring tcp_ratelimit up to date where we allow
the transports to set a divisor (though still having a default path with the default
divisor of 1000) for setting the burst size.
Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D37954
Correct the format in sysctl net.inet.tcp.bb.disable_all and
sysctl net.inet.tcp.bb.log_auto_all.
Correct the format and the description in
net.inet.tcp.bb.log_auto_mode.
Reviewed by: rscheff, tuexen
MFC after: 1 week
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37776
Right now rack will fail to load due to its hack in accessing symbol names
in cc_newreno. This was fine when newreno was always compiled into the
kernel but now ... not so much. Instead lets fix up rack to use the socket
option queries to get the information it wants and set the parameters. We
also fix the CC parameter so they are always settable.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D37622
The inpcb struct is one of the most heavily utilized in the kernel
on a busy network server. By aligning it to a cacheline
boundary, we can ensure that closely related fields in the inpcb
and tcbcb can be predictably located on the same cacheline. rrs
has already done a lot of this work to put related fields on the
same line for the tcbcb.
In combination with a forthcoming patch to align the start of the tcpcb,
we see a roughly 3% reduction in CPU use on a busy web server serving
traffic over roughly 50,000 TCP connections.
Reviewed by: glebius, markj, tuexen
Differential Revision: https://reviews.freebsd.org/D37687
Sponsored by: Netflix
This subsystem is superseded by modern debugging facilities,
e.g. DTrace probes and TCP black box logging.
We intentionally leave SO_DEBUG in place, as many utilities may
set it on a socket. Also the tcp::debug DTrace probes look at
this flag on a socket.
Reviewed by: gnn, tuexen
Discussed with: rscheff, rrs, jtl
Differential revision: https://reviews.freebsd.org/D37694
The SYN_RCVD state count is tricky here due to default code path and TFO
being so different. In the default case the count is incremented when a
syncache entry is added to the the database in syncache_insert(). Later
when connection transitions from syncache entry to a socket in
syncache_expand(), this counter is inherited by the tcpcb. If socket or
tcpcb allocation failed in syncache_socket() failed the syncache_expand()
is responsible for decrement. In the TFO case the syncache entry is not
inserted into database and count of SYN_RCVD is first incremented in the
syncache_tfo_expand() after successful socket allocation. Thus, inside
syncache_socket() we can't tell whether we need to decrement in a case of
a failure or not. The caller is responsible for this book keeping.
Fixes: 07285bb4c2
Differential revision: https://reviews.freebsd.org/D37610
They are shared between UDP over IPv4 and over IPv6. To prevent all
possible kernel build failures wrap them in #ifdef _SYS_PROTOSW_H_.
Prompted by feedback from jhb@ and jrtc27@ on c93db4abf4.
Use only one callout structure per tcpcb that is responsible for handling
all five TCP timeouts. Use locked version of callout, of course. The
callout function tcp_timer_enter() chooses soonest timer and executes it
with lock held. Unless the timer reports that the tcpcb has been freed,
the callout is rescheduled for next soonest timer, if there is any.
With single callout per tcpcb on connection teardown we should be able
to fully stop the callout and immediately free it, avoiding use of
callout_async_drain(). There is one gotcha here: callout_stop() can
actually touch our memory when a rare race condition happens. See
comment above tcp_timer_stop(). Synchronous stop of the callout makes
tcp_discardcb() the single entry point for tcpcb destructor, merging the
tcp_freecb() to the end of the function.
While here, also remove lots of lingering checks in the beginning of
TCP timer functions. With a locked callout they are unnecessary.
While here, clean unused parts of timer KPI for the pluggable TCP stacks.
While here, remove TCPDEBUG from tcp_timer.c, as this allows for more
simplification of TCP timers. The TCPDEBUG is scheduled for removal.
Move the DTrace probes in timers to the beginning of a function, where
a tcpcb is always existing.
Discussed with: rrs, tuexen, rscheff (the TCP part of the diff)
Reviewed by: hselasky, kib, mav (the callout part)
Differential revision: https://reviews.freebsd.org/D37321
For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.
The most import one is the inpcb itself. With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.
The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.
One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.
Large part of the change was done with sed script:
s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g
Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.
Differential revision: https://reviews.freebsd.org/D37127
If udp[6]_append() returns non-zero, it is because the inp has gone
away (inpcbrele_rlocked returned 1 after running the tunnel function).
Reviewed by: ae
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37511
In case of a paritial chunk only pretend the result is OK if
the packet is not the last fragment and there is a valid association.
PR: 267476
MFC after: 3 days
Have tcpstats (netstat -s) differentiate between received and sent
ECN-marked packets. Also account for IP ECN bits (on TCP packets)
even when the tcp session has not negotiated ECN support.
Event: IETF 115 Hackathon
Reviewed By: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37314
The NAT module use of the tcphdr.th_x2 field now collides with the
use of this TCP header flag as AccECN (AE) bit. Use the topmost
bit instead to allow negotiation of AccECN across a NAT device.
Event: IETF 115 Hackathon
Reviewed By: #transport, tuexen
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37300
The inp_socket is cleared only in in_pcbdetach(), which for TCP is
always accompanied with inp_pcbfree(). An inpcb that went through
in_pcbfree() shall never be returned by any kind of pcb lookup.
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D37062
These functions tcp_notify(), tcp_drop_syn_sent() and tcp_mtudisc()
are called from tcp*_ctlinput*() right after successfull
in_pcblookup*(). They shall never get a pcb that is dropped.
The in_pcbdrop() KPI, which is used solely by TCP, allows to remove a
pcb from hash list and mark it as dropped. The comment suggests that
such pcb won't be returned by lookups. Indeed, every call to
in_pcblookup*() is accompanied by a check for INP_DROPPED. Do what
comment suggests: never return such pcbs and remove unnecessary checks.
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D37061
This adds the capability for a modular congestion control
to select which variant of ECN-capable-transport it wants to use
when sending out elegible segments. As an initial CC to utilize
this, DCTCP was selected.
Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24869
Keep all ECN related code in (mostly) one place.
No functional change.
Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37285
Accurate ECN asks to conservatively estimate, when the
ACE counter may have wrapped due to a single ACK covering a larger
number of segments. This is described in Annex A.2 of the
accurate-ecn draft.
Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37281
Provide diagnostics information around AccECN into
the tcpinfo struct.
Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37280
Currently SO_REUSEPORT_LB silently does nothing when set by a jailed
process. It is trivial to support this option in VNET jails, but it's
also useful in traditional jails.
This patch enables LB groups in jails with the following semantics:
- all PCBs in a group must belong to the same jail,
- PCB lookup prefers jailed groups to non-jailed groups
This is a straightforward extension of the semantics used for individual
listening sockets. One pre-existing quirk of the lbgroup implementation
is that non-jailed lbgroups are searched before jailed listening
sockets; that is preserved with this change.
Discussed with: glebius
MFC after: 1 month
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37029
If a memory allocation failure causes bind to fail, we should take the
inpcb back out of its LB group since it's not prepared to handle
connections.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37027