Just like we already do for IPv6 set the PFIL_FWD flag when we're forwarding
IPv4 traffic. This allows firewalls to make more precise decisions.
Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D48824
pfil hooks (i.e. firewalls) may pass, modify or free the mbuf passed
to them. (E.g. when rejecting a packet, or when gathering up packets
for reassembly).
If the hook returns PFIL_PASS the mbuf must still be present. Assert
this in pfil_mem_common() and ensure that ipfilter follows this
convention. pf and ipfw already did.
Similarly, if the hook returns PFIL_DROPPED or PFIL_CONSUMED the mbuf
must have been freed (or now be owned by the firewall for further
processing, like packet scheduling or reassembly).
This allows us to remove a few extraneous NULL checks.
Suggested by: tuexen
Reviewed by: tuexen, zlei
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D43617
This removes the if_output calls in the pf(4) code that escape further
processing by defering the forwarding execution to the network stack
using on/off style sysctls for both IPv4 and IPv6.
Also see: https://reviews.freebsd.org/D8877
This commit also includes the original refactoring changes
This change allows the kernel to operate with the default netisr cpu-affinity settings while having RSS compiled in. Normally, RSS changes quite a bit of the behaviour of the kernel dispatch service - this change allows for reducing impact on incompatible hardware while preserving the option to boost throughput speeds based on packet flow CPU affinity.
Make sure to compile the following options in the kernel:
options RSS
As well as setting the following sysctls:
net.inet.rss.enabled: 1
net.isr.bindthreads: 1
net.isr.maxthreads: -1 (automatically sets it to the number of CPUs)
And optionally (to force a 1:1 mapping between CPUs and buckets):
net.inet.rss.bits: 3 (for 8 CPUs)
net.inet.rss.bits: 2 (for 4 CPUs)
etc.
Set pin_default_swi to 0 by default in the RSS case.
This patch provides UDP encapsulation of ESP packets over IPv6.
Ports the IPv4 code to IPv6 and adds support for IPv6 in udpencap.c
As required by the RFC and unlike in IPv4 encapsulation,
UDP checksums are calculated.
Co-authored-by: Aurelien Cazuc <aurelien.cazuc.external@stormshield.eu>
Sponsored-by: Stormshield
Sponsored-by: Wiktel
Sponsored-by: Klara, Inc.
Fix KASSERT in 80044c78 causing build failures
Move the KASSERT to where struct ip6_hdr is populated
Fixes: 80044c785c
Reported-by: bapt
Reviewed-by: markj
Sponsored-by: Klara, Inc.
Based on a patch originally found in m0n0wall, expanded
to IPv6 and aligned with FreeBSD's IP input path.
The limit may not be correctly accounted for on the WAN
interface due to dummynet counting the packet again even
though it was already processed.
The problem here is that there's no proper way to reinject
the packet at the point where it was previously removed
from so we make the assumption that ip input was already
done (including pfil) and more or less directly move to
packet output processing.
While here move the passin label up to take the extra check
but avoiding a second label. Also remove the spurious tag
read for forward check since we don't use it and we should
really trust the mbuf flag.
Do not clear the SCTP_ADDR_IFA_UNUSEABLE flag, if it was set due
to the address being deprecated. Also don't declare tentative
addresses as unusable.
While there, cleanup the code.
PR: 230242
(cherry picked from commit 9639de2a6f)
Only call sctp_gather_internal_ifa_flags() for IPv6 addresses and
also compile this code only, when IPv6 is supported.
This fixes the compilation of IPv4 only kernels.
Reported by: bz@
Fixes: 6ab4b0c0df ("sctp: initilize local address flags correctly")
(cherry picked from commit 99c58ad021)
add a new sysctl, net.link.bridge.member_ifaddrs, which defaults to 1.
if it is set to 1, bridge behaviour is unchanged.
if it is set to 0:
- an interface which has AF_INET6 or AF_INET addresses assigned cannot
be added to a bridge.
- an interface in a bridge cannot have an AF_INET6 or AF_INET address
assigned to it.
- the bridge will no longer consider the lladdrs on bridge members to be
local addresses, i.e. frames sent to member lladdrs will not be
processed by the host.
update bridge.4 to document this behaviour, as well as the existing
recommendation that IP addresses should not be configured on bridge
members anyway, even if it currently partially works.
in testing, setting this to 0 on a bridge with 50 member interfaces
improved throughput by 22% (4.61Gb/s -> 5.67Gb/s) across two member
epairs due to eliding the bridge member list walk in GRAB_OUR_PACKETS.
Reviewed by: kp, des
Approved by: des (mentor)
Differential Revision: https://reviews.freebsd.org/D49995
(cherry picked from commit 0a1294f6c6)
PR: 286539
MFC after: 3 days
Approved by: re (cperciva)
(cherry picked from commit 75d173a848)
(cherry picked from commit 02dde7c43fe76a5dcdc170de1c2740a31629e106)
When doing a limited retransmit, allow up to 2 * MSS - 1 if the
Nagle algorithm has been disabled.
PR: 282605
Approved by: re (cperciva)
Reviewed by: cc, Peter Lei
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D49922
(cherry picked from commit 934caaec3a)
(cherry picked from commit 0906658c3409996b26518e67df48c01052ef934c)
Clear the black box logging containing union rather than the u_bbr
structure for clarity and consistency. Currently u_bbr, u_raw, and
u64_raw are the same size.
No functional change intended.
Reviewed by: tuexen
Sponsored by: Netflix, Inc.
(cherry picked from commit 382af4d38b)
beta and beta_ecn were stored using a variable of type struct newreno
in struct rack_control. Later, struct newreno was extended and now
contains several more fields.
This results in a memory inefficiency and also in copying around
uninitialized memory.
This patch fixes this by storing beta and beta_ecn individually in
struct rack_control.
Please note that the newreno_flags field was only stored and never
used. Therefore, this is not stored anymore in struct rack_control.
No functional change intended.
CID: 1523796
Reviewed by: rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D49578
(cherry picked from commit f6deb9ea0a)
Remove the macros COPY_STAT and COPY_STAT_T, since they do not improve
the readability of the code. No functional change intended.
Thanks to glebius@ for suggesting the change.
Reviewed by: glebius, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D49670
(cherry picked from commit 46fc12741e)
In general we are working towards making public headers self-contained.
cdefs.h is included for __packed; just assume that types.h includes
cdefs.h as that's a very common assumption.
PR: 285924
Reviewed by: emaste
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D49735
(cherry picked from commit 31d3a94bdd)
struct tcp_log_rack is not used, therefore remove it.
Reviewed by: Peter Lei
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D49669
(cherry picked from commit b1c62081fe)
The sendfile black box logging struct is much smaller than the
encompassing stack specific logging union. Be sure to clear the
trailing unused memory when logging.
Reviewed by: tuexen
Sponsored by: Netflix, Inc.
(cherry picked from commit 3bd1e85fc1)
Initialize the fields in the tcp_log_buffer in the sequence they
appear in the structure and add the initialization of tlb_flex1,
tlb_flex2, and _pad[].
Reviewed by: rrs, Peter Lei
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D49652
(cherry picked from commit 94acddd2ad)
These routines were all assuming that the sysctl handler has some new
value, but this is not the case. SYSCTL_IN() returns 0 in this
scenario, so they were all operating on an uninitialized address. This
is mostly harmless, but trips KMSAN checks, so let's fix them.
Reviewed by: zlei, rrs, glebius
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D49348
(cherry picked from commit 3ff865c6a7)
If timestamps are enabled, the actions performed by a retransmission
timeout were rolled back, when they should not.
It is needed to make sure the incoming segment advances SND.UNA.
To do this, remove the incorrect upfront check and extend the check in
the fast path to handle also the case of timestamps.
PR: 282605
Reviewed by: cc, rscheff, Peter Lei
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D49414
(cherry picked from commit fbcf3b74e8)
The SUS doesn't mention this error code as a possible one [1]. The FreeBSD
manual page specifies a possible ECONNRESET for close(2):
[ECONNRESET] The underlying object was a stream socket that was
shut down by the peer before all pending data was
delivered.
In the past it had been EINVAL (see 21367f630d), and this EINVAL was
added as a safety measure in 623dce13c6. After conversion to
ECONNRESET it had been documented in the manual page in 78e3a7fdd5, but
I bet wasn't ever tested to actually be ever returned, cause the
tcp-testsuite[2] didn't exist back then. So documentation is incorrect
since 2006, if my bet wins. Anyway, in the modern FreeBSD the condition
described above doesn't end up with ECONNRESET error code from close(2).
The error condition is reported via SO_ERROR socket option, though. This
can be checked using the tcp-testsuite, temporarily disabling the
getsockopt(SO_ERROR) lines using sed command [3]. Most of these
getsockopt(2)s are followed by '+0.00 close(3) = 0', which will confirm
that close(2) doesn't return ECONNRESET even on a socket that has the
error stored, neither it is returned in the case described in the manual
page. The latter case is covered by multiple tests residing in tcp-
testsuite/state-event-engine/rcv-rst-*.
However, the deleted block of code could be entered in a race condition
between close(2) and processing of incoming packet, when connection had
already been half-closed with shutdown(SHUT_WR) and sits in TCPS_LAST_ACK.
This was reported in the bug 146845. With the block deleted, we will
continue into tcp_disconnect() which has proper handling of INP_DROPPED.
The race explanation follows. The connection is in TCPS_LAST_ACK. The
network input thread acquires the tcpcb lock first, sets INP_DROPPED,
acquires the socket lock in soisdisconnected() and clears SS_ISCONNECTED.
Meanwhile, the syscall thread goes through sodisconnect() which checks for
SS_ISCONNECTED locklessly(!). The check passes and the thread blocks on
the tcpcb lock in tcp_usr_disconnect(). Once input thread releases the
lock, the syscall thread observes INP_DROPPED and returns ECONNRESET.
- Thread 1: tcp_do_segment()->tcp_close()->in_pcbdrop(),soisdisconnected()
- Thread 2: sys_close()...->soclose()->sodisconnect()->tcp_usr_disconnect()
Note that the lockless operation in sodisconnect() isn't correct, but
enforcing the socket lock there will not fix the problem.
[1] https://pubs.opengroup.org/onlinepubs/9799919799/
[2] https://github.com/freebsd-net/tcp-testsuite
[3] sed -i "" -Ee '/\+0\.00 getsockopt\(3, SOL_SOCKET, SO_ERROR, \[ECONNRESET\]/d' $(grep -lr ECONNRESET tcp-testsuite)
PR: 146845
Reviewed by: tuexen, rrs, imp
Differential Revision: https://reviews.freebsd.org/D48148
(cherry picked from commit 053a988497)
One variable that became critical to correctly calculate
the cwnd during limited transmit was not properly reverted
on detection of spurious timeouts.
PR: 282605
Reviewed By: cc, tuexen, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D48652
(cherry picked from commit 6f6c07813b)
The function in_localip() was changed to return bool but the comment was
left unchanged.
Fixes: c8ee75f231 Use network epoch to protect local IPv4 addresses hash
MFC after: 3 days
(cherry picked from commit a5e380e51c)
It's only needed for in_pcb.c and in6_pcb.c, so can go to the private
header.
No functional change intended.
Reported by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
(cherry picked from commit ca94f92c23)
As with net.inet.{tcp,udp}.bind_all_fibs, this causes raw sockets to
accept only packets from the same FIB.
Reviewed by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48707
(cherry picked from commit 4009a98fe8)
In particular, we store a FIB number in both struct socket and in struct
inpcb. When updating the FIB number with setsockopt(SO_SETFIB), make
the update atomic. This is required to support the new bind_all_fibs
mode, since in that mode changing the FIB of a bound socket is not
permitted.
This requires a bit more code, but avoids a layering violation in
sosetopt(), where we hard-code the list of protocol families that
implement SO_SETFIB.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48666
(cherry picked from commit caccbaef8e)
Introduce the net.inet.udp.bind_all_fibs tunable, set to 1 by default
for compatibility with current behaviour. When set to 0, all received
datagrams will be dropped unless an inpcb bound to the same FIB exists.
No functional change intended, as the new behaviour is not enabled by
default.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48664
(cherry picked from commit 08e638c089)
Introduce the net.inet.tcp.bind_all_fibs tunable, set to 1 by default
for compatibility with current behaviour. When set to 0, all TCP
listening sockets are private to their FIB. Inbound connection requests
will only succeed if a matching inpcb is bound to the same FIB as the
request.
No functional change intended, as the new behaviour is not enabled by
default.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48663
(cherry picked from commit 5dc99e9bb9)
Allow protocol layers to look up an inpcb belonging to a particular FIB.
This is indicated by setting INPLOOKUP_FIB; if it is set, the FIB to be
used is obtained from the specificed mbuf or ifnet.
No functional change intended.
Reviewed by: glebius, melifaro
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48662
(cherry picked from commit da806e8db6)
Add a flag, INPBIND_FIB, which means that the inpcb is local to its FIB
number. When this flag is specified, duplicate bindings are permitted,
so long as each FIB contains at most one inpcb bound to the same
address/port. If an inpcb is bound with this flag, it'll have the
INP_BOUNDFIB flag set.
No functional change intended.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48661
(cherry picked from commit bbd0084baf)
This is to enable a mode where duplicate inpcb bindings are permitted,
and we want to look up an inpcb with a particular FIB. Thus, add a
"fib" parameter to in_pcblookup() and related functions, and plumb it
through.
A fib value of RT_ALL_FIBS indicates that the lookup should ignore FIB
numbers when searching. Otherwise, it should refer to a valid FIB
number, and the returned inpcb should belong to the specific FIB. For
now, just add the fib parameter where needed, as there are several
layers to plumb through.
No functional change intended.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48660
(cherry picked from commit 9a4131629b)
in_pcblookup_hash_wild_* looks up unconnected inpcbs, so there is no
point in passing the foreign address and port, and indeed those
parameters are not used. So, remove them.
No functional change intended.
MFC after: 1 week
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D47385
(cherry picked from commit 21d7ac8c79)