pfil hooks (i.e. firewalls) may pass, modify or free the mbuf passed
to them (e.g. when rejecting a packet, or when gathering up packets
for reassembly).
If the hook returns PFIL_PASS the mbuf must still be present. Assert
this in pfil_mem_common() and ensure that ipfilter follows this
convention. pf and ipfw already did.
Similarly, if the hook returns PFIL_DROPPED or PFIL_CONSUMED the mbuf
must have been freed (or now be owned by the firewall for further
processing, like packet scheduling or reassembly).
This allows us to remove a few extraneous NULL checks.
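The ownership rule above can be sketched in userland C. This is only an illustration: the types, the verdict names and run_hook() are hypothetical stand-ins, not the pfil(9) API, and having the hook clear the caller's pointer on drop is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real pfil(9) API. */
enum hook_verdict { HOOK_PASS, HOOK_DROPPED, HOOK_CONSUMED };

struct pkt {
	int len;
};

typedef enum hook_verdict (*hook_fn)(struct pkt **);

/* Run one hook and enforce the ownership contract described above. */
static enum hook_verdict
run_hook(hook_fn hook, struct pkt **pp)
{
	enum hook_verdict v = hook(pp);

	if (v == HOOK_PASS)
		assert(*pp != NULL);	/* caller still owns a live packet */
	else
		assert(*pp == NULL);	/* freed, or now owned by the hook */
	return (v);
}

/* Example hook: drops empty packets, passes everything else. */
static enum hook_verdict
drop_empty(struct pkt **pp)
{
	if ((*pp)->len == 0) {
		free(*pp);
		*pp = NULL;	/* assumption: hook clears the pointer */
		return (HOOK_DROPPED);
	}
	return (HOOK_PASS);
}
```

With the contract asserted in one place, a caller of run_hook() only needs to look at the verdict and does not need a separate NULL check on the packet.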
Suggested by: tuexen
Reviewed by: tuexen, zlei
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D43617
This patch provides UDP encapsulation of ESP packets over IPv6.
Ports the IPv4 code to IPv6 and adds support for IPv6 in udpencap.c
As required by the RFC and unlike in IPv4 encapsulation,
UDP checksums are calculated.
Co-authored-by: Aurelien Cazuc <aurelien.cazuc.external@stormshield.eu>
Sponsored-by: Stormshield
Sponsored-by: Wiktel
Sponsored-by: Klara, Inc.
Fix KASSERT in 80044c78 causing build failures
Move the KASSERT to where struct ip6_hdr is populated
Fixes: 80044c785cb040a2cf73779d23f9e1e81a00c6c3
Reported-by: bapt
Reviewed-by: markj
Sponsored-by: Klara, Inc.
This commit also includes the original refactoring changes
This change allows the kernel to operate with the default netisr cpu-affinity settings while having RSS compiled in. Normally, RSS changes quite a bit of the behaviour of the kernel dispatch service - this change allows for reducing impact on incompatible hardware while preserving the option to boost throughput speeds based on packet flow CPU affinity.
Make sure to compile the following options in the kernel:
options RSS
As well as setting the following sysctls:
net.inet.rss.enabled: 1
net.isr.bindthreads: 1
net.isr.maxthreads: -1 (automatically sets it to the number of CPUs)
And optionally (to force a 1:1 mapping between CPUs and buckets):
net.inet.rss.bits: 3 (for 8 CPUs)
net.inet.rss.bits: 2 (for 4 CPUs)
etc.
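For the optional 1:1 mapping, the value of net.inet.rss.bits follows from the CPU count. A minimal sketch, assuming one bucket per CPU; rss_bits_for_cpus() is a hypothetical helper, not a kernel function:

```c
/*
 * Smallest number of bits b such that (1 << b) >= ncpus, capped at 31.
 * The mapping is exactly 1:1 only when ncpus is a power of two.
 */
static unsigned int
rss_bits_for_cpus(unsigned int ncpus)
{
	unsigned int bits = 0;

	while (bits < 31 && (1u << bits) < ncpus)
		bits++;
	return (bits);
}
```

For 8 CPUs this yields 3 and for 4 CPUs it yields 2, matching the values listed above.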
Set pin_default_swi to 0 by default in the RSS case.
This removes the if_output calls in the pf(4) code that escape further
processing by deferring the forwarding execution to the network stack
using on/off style sysctls for both IPv4 and IPv6.
Also see: https://reviews.freebsd.org/D8877
Based on a patch originally found in m0n0wall, expanded
to IPv6 and aligned with FreeBSD's IP input path.
The limit may not be correctly accounted for on the WAN
interface due to dummynet counting the packet again even
though it was already processed.
The problem here is that there's no proper way to reinject
the packet at the point where it was previously removed
from so we make the assumption that ip input was already
done (including pfil) and more or less directly move to
packet output processing.
While here, move the passin label up to cover the extra check while
avoiding a second label. Also remove the spurious tag read for the
forward check, since we don't use it and we should really trust the
mbuf flag.
in6_mapped_sockaddr() and in6_mapped_peeraddr() both define a local
variable named 'inp', but in the non-INET case, this variable is set
and never used, causing a compiler error:
/src/freebsd/src/lf/sys/netinet6/in6_pcb.c:547:16: error:
variable 'inp' set but not used [-Werror,-Wunused-but-set-variable]
547 | struct inpcb *inp;
| ^
/src/freebsd/src/lf/sys/netinet6/in6_pcb.c:573:16: error:
variable 'inp' set but not used [-Werror,-Wunused-but-set-variable]
573 | struct inpcb *inp;
Fix this by guarding all the INET-specific logic, including the variable
definition, behind #ifdef INET.
While here, tweak formatting in in6_mapped_peeraddr() so both functions
are the same.
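The shape of the fix can be sketched as follows. The struct and function names are hypothetical stand-ins rather than the code in in6_pcb.c; only the placement of the declaration inside the #ifdef is the point.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct inpcb; only the pattern matters. */
struct pcb {
	bool v4mapped;
};

/*
 * Returns true if the address was handled by the IPv4-mapped path.
 * Build with -DINET to compile that path in.
 */
static bool
mapped_addr(void *so_pcb)
{
#ifdef INET
	struct pcb *inp;	/* declared only where it is used */

	inp = so_pcb;
	if (inp->v4mapped)
		return (true);
#else
	(void)so_pcb;		/* parameter unused without INET */
#endif
	return (false);
}
```

Without INET the variable does not exist at all, so -Wunused-but-set-variable has nothing to report.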
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/1155
(cherry picked from commit 042fb58d009e7efc5b334b68fffbef9b1f620ec8)
(cherry picked from commit f30c2d86c3)
Approved-by: re (cperciva)
The only element of in6_addr that is specified in RFC 3493 or
in POSIX.1-2017 is s6_addr, implemented via a #define to a union
member. However, FreeBSD and other BSD systems have additional
definitions for the other union members, s6_addr{8,16,32} which
are defined for the kernel and loader. Some Linux applications
also use them, and they seem to be allowed by the RFC and POSIX.
Remove the current ifdefs, exposing the additional fields to user
level, and replace with #if __BSD_VISIBLE. Add an explanatory
comment expanding on the previous "nonstandard" comment.
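To illustrate why applications reach for the wider members, here is a sketch using a hypothetical mirror of the union. struct addr6 and its member names are assumptions made for the example and are not the system header definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the in6_addr layout; not the system header. */
struct addr6 {
	union {
		uint8_t  u8[16];
		uint16_t u16[8];
		uint32_t u32[4];
	} u;
};

/* All-zero test using the 32-bit view: four words instead of 16 bytes. */
static bool
addr6_is_unspecified(const struct addr6 *a)
{
	return ((a->u.u32[0] | a->u.u32[1] | a->u.u32[2] | a->u.u32[3]) == 0);
}

/* Link-local (fe80::/10) test using only the portable byte view. */
static bool
addr6_is_linklocal(const struct addr6 *a)
{
	return (a->u.u8[0] == 0xfe && (a->u.u8[1] & 0xc0) == 0x80);
}
```

Code that has to stay strictly within RFC 3493 and POSIX can still be written against the byte view alone, as the second function shows.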
Reviewed by: bz
Differential Revision: https://reviews.freebsd.org/D44979
Approved by: re (cperciva)
(cherry picked from commit eb3dbf2dbe22ed6d4df54aebbf23f5b555a21cf1)
(cherry picked from commit a5a2e963f9)
Don't report a BACKUP CARP address as local. These two functions are used
only by source address validation for input packets, controlled by sysctls
net.inet.ip.source_address_validation and
net.inet6.ip6.source_address_validation. For this purpose we definitely
want to treat BACKUP addresses as non local.
This change is conservative and doesn't modify compat in_localip() and
in6_localip(). They are used more widely than the FIB-aware versions.
Such a change would modify the notion of the ipfw(4) 'me' keyword. There might
be other consequences as in_localip() is used by various tunneling
protocols.
PR: 277349
(cherry picked from commit 56f7860087eec14b4a65310b70bd704e79e1b48c)
This patch allows the IPPROTO_UDPLITE-level socket options
UDPLITE_SEND_CSCOV and UDPLITE_RECV_CSCOV to be used on
AF_INET6 sockets in addition to AF_INET sockets.
Reviewed by: ae, rscheff
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D42430
(cherry picked from commit 03c3a70abe5e9fa259b954de78ae69229fa9c99f)
First, merge in_pcbdetach() with in_pcbfree(). The comment for
in_pcbdetach() was no longer correct. Then, make sure we remove
the inpcb from the hash before we commit any destructive actions
on it. There are a couple of functions that rely on the hash lock,
skipping SMR + inpcb lock, to look up an inpcb. Although there are
no known functions that similarly rely on the global inpcb list
lock, also do list removal before destructive actions.
PR: 273890
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D43122
(cherry picked from commit a13039e2709277b1c3b159e694cc909a5e044151)
The code which removes a fragment queue from the per-VNET hash table was
duplicated three times. Factor it out into a function. No functional
change intended.
Reviewed by: kp, bz
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D43228
(cherry picked from commit 0736a38072b52204289c669770a34d0b801a8a7e)
When an application listens on an IPv6 TCP socket, due to an ipfw
forwarding tag it may handle connections for addresses that do not
belong to the jail or even the current host (transparent proxy).
Syncache code can successfully handle the TCP handshake for such connections.
When syncache finally accepts the connection it uses in6_pcbconnect() to
properly initialize the new connection info.
For IPv4 this scenario just works, but for IPv6 it fails when the
local address doesn't belong to the jail. This check occurs when
in6_pcbladdr() applies the IPv6 SAS algorithm.
We need IPv6 SAS when we are the connection initiator, but in the above
case the connection is already established and both source and destination
addresses are known.
Use the unused argument to notify in6_pcbconnect() when we don't need
source address selection. This will fix `ipfw fwd` to a jailed IPv6
address.
When we are the connection initiator, we still use the IPv6 SAS algorithm and
apply all related restrictions.
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D41685
(cherry picked from commit 0bf5377b6b9642acc85355062b921a07604b7c04)
The following sysctl variables are actually loader tunables. Add sysctl
flag CTLFLAG_TUN to them so that `sysctl -T` will report them correctly.
1. net.inet6.ip6.auto_linklocal
2. net.inet6.ip6.accept_rtadv
3. net.inet6.ip6.no_radr
No functional change intended.
Reviewed by: glebius
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D41928
(cherry picked from commit 03dac3e37993801dab4418087bfedacce0526e66)
Since f71cb9f748 a socket stays connected with its inpcb throughout the
latter's lifetime, so there is no reason to complicate things and copy
these flags.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D41198
The code added in c89c8a1029 in order
to compensate possible misalignment caused by prepending the IP4/6
header with an EtherIP one got broken at some point by a rewrite of
gif(4). For better or worse, 8018ac153f
relaxed the alignment of struct ip from 32 bit to 16 bit, though. As
a result, a 16 bit offset of the IPv4 header induced by the addition
of the 16 bit EtherIP one no longer is a problem in the first place.
The alignment of struct ip6_hdr currently is even only 8 bit, making
it even less problematic with regards to possible misalignment.
Thus, remove the code for handling misalignment in in{,6}_gif_output()
altogether again.
While at it, replace the 3 bcopy(9) calls in gif(4) with memcpy(9) as
there's no need to handle overlap here.
The mac_ipacl policy module enables fine-grained control over IP address
configuration within VNET jails from the base system.
It allows the root user to define rules governing IP addresses for
jails and their interfaces using the sysctl interface.
Requested by: multiple
Sponsored by: Google, Inc. (GSoC 2019)
MFC after: 2 months
Reviewed by: bz, dch (both earlier versions)
Differential Revision: https://reviews.freebsd.org/D20967
Resolve a race condition where we'd lose the Solicited-node multicast
group subscription if we assigned the same IPv6 address twice.
PR: 233683
Reviewed by: ae
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D41124
Ipsec needs access to packet headers to determine if a policy is
applicable. It seems that typically IP headers are mapped, but the code
arguably needs to check this before blindly accessing them. Then,
operations like m_unshare() and m_makespace() are not yet ready for
unmapped mbufs.
Ensure that the packet is mapped before calling into IPSEC_OUTPUT().
PR: 272616
Reviewed by: jhb, markj
Sponsored by: NVidia networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41112
Pre-declare struct ucred, to fix build issues on the MINIMAL config:
In file included from /usr/src/sys/netpfil/pf/pfsync_nv.c:40:
/usr/src/sys/netinet6/ip6_var.h:384:31: error: declaration of 'struct ucred' will not be visible outside of this function [-Werror,-Wvisibility]
struct ip6_pktopts *, struct ucred *, int);
^
/usr/src/sys/netinet6/ip6_var.h:408:28: error: declaration of 'struct ucred' will not be visible outside of this function [-Werror,-Wvisibility]
struct inpcb *, struct ucred *, int, struct in6_addr *, int *);
^
2 errors generated.
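The pattern behind the fix, sketched outside the kernel. cred_is_set() is a hypothetical prototype standing in for the ones in ip6_var.h:

```c
#include <stddef.h>

/*
 * Forward declaration: gives 'struct ucred' file scope so the prototype
 * below does not declare a new type local to its parameter list.
 */
struct ucred;

/* Hypothetical prototype standing in for the ones in ip6_var.h. */
static int cred_is_set(const struct ucred *cred);

static int
cred_is_set(const struct ucred *cred)
{
	/* An incomplete type can be pointed to, but not dereferenced. */
	return (cred != NULL);
}
```

The forward declaration is enough because the header only passes pointers around and never needs the structure's layout.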
This ensures that in6_cksum_partial() can be applied to unmapped mbufs,
which can happen at least when icmp6_reflect() quotes a packet.
The basic idea is to restructure in6_cksum_partial() to operate on one
mbuf at a time. If the buffer length is odd or unaligned, an extra
residual byte may be returned, to be incorporated into the checksum when
processing the next buffer.
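The residual-byte idea can be sketched in portable C. This is not the in6_cksum_partial() code: struct buf is a hypothetical stand-in for an mbuf chain, the sketch reads bytes one at a time (so alignment never matters), and it omits the IPv6 pseudo-header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer chain; stands in for an mbuf chain. */
struct buf {
	const uint8_t *data;
	size_t len;
	const struct buf *next;
};

struct cksum_state {
	uint64_t sum;		/* running sum of 16-bit big-endian words */
	int have_residual;	/* a high-order byte is waiting for its pair */
	uint8_t residual;
};

/* Add one buffer to the running sum, carrying any odd byte over. */
static void
cksum_one(struct cksum_state *st, const uint8_t *p, size_t len)
{
	if (len > 0 && st->have_residual) {
		/* Pair the saved byte with the first byte of this buffer. */
		st->sum += ((uint32_t)st->residual << 8) | *p;
		st->have_residual = 0;
		p++;
		len--;
	}
	for (; len >= 2; p += 2, len -= 2)
		st->sum += ((uint32_t)p[0] << 8) | p[1];
	if (len == 1) {
		st->residual = *p;
		st->have_residual = 1;
	}
}

/* One's-complement checksum over the whole chain. */
static uint16_t
cksum_chain(const struct buf *b)
{
	struct cksum_state st = { 0, 0, 0 };

	for (; b != NULL; b = b->next)
		cksum_one(&st, b->data, b->len);
	if (st.have_residual)
		st.sum += (uint64_t)st.residual << 8;	/* pad with zero */
	while (st.sum >> 16)
		st.sum = (st.sum & 0xffff) + (st.sum >> 16);
	return ((uint16_t)~st.sum);
}
```

Because the odd byte is carried in the state rather than read from the next buffer in advance, each buffer can be mapped, summed and unmapped on its own.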
PR: 268400
Reviewed by: cy
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D40598
Having it configurable adds more flexibility, especially
for the systems with low amount of memory.
Additionally, it allows speeding up frag6/ tests execution.
Reviewed by: kp, markj, bz
Differential Revision: https://reviews.freebsd.org/D35755
MFC after: 2 weeks
Some context on the current IPv6 interface setup & address management:
There are two data paths for IPv6 initialisation in the context of assigning
LL addresses:
1) Userland explicitly requests IFF_UP for the interface w/o any addresses.
if_up() then calls in6_if_up(), which calls in6_ifattach().
The latter sets up some initial ND/IN6 state and disables IPv6 for the
interface if it’s not loopback. If the interface is loopback, then it
adds ::1/128 and LL addresses via in6_ifattach_loopback().
Then, devd notification is generated (if the VNET is the default one),
which triggers rc.network ifconfig_up(), causing ifdisabled to be removed
via SIOCSIFINFO_IN6 from ifconfig. The kernel SIOCSIFINFO_IN6 handler
calls in6_if_up() once again and it assigns the interface link-local address.
2) Userland adds IPv4 or IPv6 address to the interface. SIOCAIFADDR[_IN6]
kernel handler calls IPv4/IPv6 protocol handler to add the address.
Both then call if_ioctl() with SIOCSIFADDR. Ethernet/loopback ioctl handlers
silently set IFF_UP for the interface. Finally, if.c:ifioctl() wrapper code
compares old and new interface flags and, if IFF_UP is added, it explicitly
calls in6_if_up(), which adds link-local address if either the original
address is IPv6 or the interface is loopback.
In the latter case, “formal” interface-up notifications are missing.
The kernel does not trigger the event handler, does not call the carp hook
and does not provide any userland notification.
This diff unifies the event handling in both scenarios, providing the
necessary notifications to the kernel and userland.
Reviewed By: kp
Differential Revision: https://reviews.freebsd.org/D40332
MFC after: 2 weeks
Redirect rules use PFIL_IN and PFIL_OUT events to allow packet filter
rules to change the destination address and port for a connection.
Typically, the rule triggers on an input event when a packet is received
by a router and the destination address and/or port is changed to
implement the redirect. When a reply packet on this connection is output
to the network, the rule triggers again, reversing the modification.
When the connection is initiated on the same host as the packet filter,
it is initially output via lo0 which queues it for input processing.
This causes an input event on the lo0 interface, allowing redirect
processing to rewrite the destination and create state for the
connection. However, when the reply is received, no corresponding output
event is generated; instead, the packet is delivered to the higher level
protocol (e.g. tcp or udp) without reversing the redirect, the reply is
not matched to the connection and the packet is dropped (for tcp, a
connection reset is also sent).
This commit fixes the problem by adding a second packet filter call in
the input path. The second call happens right before the handoff to
higher level processing and provides the missing output event to allow
the redirect's reply processing to perform its rewrite. This extra
processing is disabled by default and can be enabled using pfilctl:
pfilctl link -o pf:default-out inet-local
pfilctl link -o pf:default-out6 inet6-local
PR: 268717
Reviewed-by: kp, melifaro
MFC-after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D40256
When looking up a listening socket, the SMR-protected lookup routine may
return a jailed socket with no local address. This happens when using
classic jails with more than one IP address; in a single-IP classic
jail, a bound socket's local address is always rewritten to be that of
the jail.
After commit 7b92493ab1, the lookup path failed to check whether the
jail corresponding to a matched wildcard socket actually owns the
address, and would return the match regardless. Restore the omitted
checks.
Fixes: 7b92493ab1 ("inpcb: Avoid inp_cred dereferences in SMR-protected lookup")
Reported by: peter
Reviewed by: bz
Differential Revision: https://reviews.freebsd.org/D40268
The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.
Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix
RFC 4620 is an experimental RFC that can be used to request information
about a host, including:
- the fully-qualified or single-component name
- some set of the Responder's IPv6 unicast addresses
- some set of the Responder's IPv4 unicast addresses
This is not something that should be made available by default.
PR: 257709
Submitted by: ruben@verweg.com
Reviewed by: melifaro
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39778
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets. To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.
Changing SMR to periodically recycle garbage is not trivial. Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred. This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.
Commit 220d892129 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there. This patch goes further to handle
lookups of unconnected sockets. Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics. This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.
In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.
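The ordering above can be sketched as an ordered insert into a singly linked chain. struct pcb, pcb_rank() and chain_insert() are hypothetical; in particular, letting jail status take precedence over address specificity is an assumption of this sketch, not a statement about in_pcbinhash().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified PCB; not the kernel's struct inpcb. */
struct pcb {
	bool jailed;
	bool laddr_specified;
	struct pcb *next;
};

/* Lower rank sorts earlier in the chain. */
static int
pcb_rank(const struct pcb *p)
{
	return ((p->jailed ? 0 : 2) + (p->laddr_specified ? 0 : 1));
}

/* Insert 'n' before the first entry that ranks strictly after it. */
static void
chain_insert(struct pcb **head, struct pcb *n)
{
	struct pcb **pp;

	for (pp = head; *pp != NULL; pp = &(*pp)->next)
		if (pcb_rank(*pp) > pcb_rank(n))
			break;
	n->next = *pp;
	*pp = n;
}
```

With the chain kept in this order, a lookup that walks from the head and returns the first match already prefers jailed and more specific sockets, so it never has to compare candidates.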
Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups. When a match is found, we try to lock the inpcb and
re-validate its connection info. In the common case, this works well
and we can simply return the inpcb. If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR. In
particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38572
These functions will get some additional callers in future revisions.
No functional change intended.
Discussed with: glebius
Tested by: glebius
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38571
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worst case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
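The table choice can be sketched as below. The types are hypothetical and IPv4-only; the real in_pcbinhash() operates on struct inpcb and also handles IPv6.

```c
#include <stdint.h>

/* Hypothetical, IPv4-only stand-in for the PCB database. */
enum pcb_table { TABLE_UNCONNECTED, TABLE_CONNECTED };

struct pcb {
	uint32_t faddr;		/* foreign address; 0 means unspecified */
};

/* A PCB with a foreign address is connected; otherwise it is only bound. */
static enum pcb_table
pcb_table_for(const struct pcb *p)
{
	return (p->faddr != 0 ? TABLE_CONNECTED : TABLE_UNCONNECTED);
}
```

Since the decision depends only on the foreign address, a PCB stays in the same table, with the same linkage, until it is disconnected.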
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp iterators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
This is a total hack/bare minimum which follows inet4.
Otherwise 2 threads removing the same address can easily crash.
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D39317
ip6_input() and ip6_destroy() both directly reference ifnet members.
This file was missed in 3d0d5b21
Fixes: 3d0d5b21 ("IfAPI: Explicitly include <net/if_private.h>...")
Sponsored by: Juniper Networks, Inc.
Re-introduce PFIL_FWD, because pf's pf_refragment6() needs to know if
we're ip6_forward()-ing or ip6_output()-ing.
ip6_forward() relies on m->m_pkthdr.rcvif, at least for link-local
traffic (for in6_get_unicast_scopeid()). rcvif is not set for locally
generated traffic (e.g. from icmp6_reflect()), so we need to call the
correct output function.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D39061