When the bridge is moved to a different vnet we must remove all of its
member interfaces (and span interfaces), because we don't know if those
will be moved along with it. We don't want to hold references to
interfaces not in our vnet.
Reviewed by: donner@
MFC after: 1 week
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D28859
(cherry picked from commit 38c0951386)
VLAN devices have type IFT_L2VLAN, so the STP code mistakenly believed
they couldn't be used for STP. That's not the case, so add the
ITF_L2VLAN to the check.
Reviewed by: donner@
MFC after: 1 week
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D28857
(cherry picked from commit 711ed156b9)
iflib_rxeof() was counting everything twice. This was introduced when
pfil hooks were added to the iflib receive path. We want to count rx
packets/bytes before the pfil hooks are executed, so remove the counter
adjustments that are executed after.
PR: 253583
Reviewed by: gallatin, erj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28900
(cherry picked from commit b6999635b1)
In commit 38bfc6dee3 we added an IFDI_DETACH() call to
iflib_pseudo_deregister() since it looked like it was missing. One is
present in the error-handling path of iflib_pseudo_register(). However,
the detach actually comes from the DEVICE_DETACH() method for the
above-mentioned device_t, so now we're calling IFDI_DETACH() twice when
destroying a pseudo interface.
Fix the problem by not calling IFDI_DETACH() from the device detach
routine. This way we can ensure that iflib de-initialization always
happens in a consistent order. It also ensures that you can't do silly
things like "devctl detach <pseudo ifnet>", which would previously
detach the driver without tearing down the corresponding ifnet.
PR: 253541
Reviewed by: erj
Fixes: 38bfc6dee3
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28774
(cherry picked from commit 0f9544d03e)
iflib_init_locked() assumes that iflib_stop() has been called, however,
it is not called for media changes.
iflib_if_init_locked() calls stop then init, so fixes the problem.
PR: 253473
Sponsored by: Juniper Networks, Inc., Klara, Inc.
(cherry picked from commit 922cf8ac43)
Widen the ifnet_detach_sxlock to cover the entire vnet sysuninit code.
This ensures that we can't end up having the vnet_sysuninit free the UDP
pcb while the detach code is running and trying to purge the UDP pcb.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28530
(cherry picked from commit 6d2a10d96f)
Memory and PCI resources are freed with no particular order. This could
cause use-after-frees when detaching following a failed attach. For
instance, iflib_tx_structures_free() frees ctx->ifc_txqs[] but
iflib_tqg_detach() attempts to access this array. Similarly, adapter
queues gets freed by IFDI_QUEUES_FREE() but IFDI_DETACH() attempts to
access adapter queues to free PCI resources.
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D27634
(cherry picked from commit 38bfc6dee3)
The case of adding interface route by specifying interface
address as the gateway was missed during code refactoring.
Re-add it back by copying non-AF_LINK gateway data when RTF_GATEWAY
is not set.
Reviewed by: donner
(cherry picked from commit 8170a7d438)
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address. Then, this was extended to array of
such pointers. Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change. The new name should be generic enough to avoid further
renaming.
(cherry-picked from commit 3f43ada98c)
Currently, if the immutable algorithm like bsearch or radix_lockless
receives rtable update notification, it schedules algorithm rebuild.
This rebuild is executed by the callout after ~50 milliseconds.
It is possible that a script adding an interface address and than route
with the gateway bound to that address will fail. It can happen due
to the fact that fib is not updated by the time the route addition
request arrives.
Fix this by allowing synchronous algorithm rebuilds based on certain
conditions. By default, these conditions assume:
1) less than net.route.algo.fib_sync_limit=100 routes
2) routes without gateway.
* Move algo instance build entirely under rib WLOCK.
Rib lock is only used for control plane (except radix algo, but there
are no rebuilds).
* Add rib_walk_ext_locked() function to allow RIB iteration with
rib lock already held.
* Fix rare potential callout use-after-free for fds by binding fd
callout to the relevant rib rmlock. In that case, callout_stop()
under rib WLOCK guarantees no callout will be executed afterwards.
to support subscriptions during RIB modifications.
Add new subscriptions to the beginning of the lists instead of
the end. This fixes the situation when new subscription is created
int the callback for the existing subscription, leading to the
subscription notification handler pick it.
* Move per-prefix debug lines under LOG_DEBUG2
* Create fib instance counter to distingush log messages between
instances
* Add more messages on rebuild reason.
MFC after: 3 days
The initial plan was to remove rib_lookup_info() before
FreeBSD 13. As several customers are still remaining,
fix rib_lookup_info() for the multipath use case.
D26436 introduced support for stacked vlans that changed the way vlans
are configured. In particular, this change broke setups that have
same-number vlans as subinterfaces.
Vlan support was initially created assuming "vlanX" semantics. In this paradigm,
automatic number assignment supported by cloning (ifconfig vlan create) was a
natural fit.
When "ifaceX.Y" support was added, allowing to have the same vlan number on
multiple devices, cloning code became more complex, as the is no
unified "vlan" namespace anymore. Such interfaces got the first spare
index from "vlan" cloner. This, in turn, led to the following problem:
ifconfig ix0.333 create -> index 1
ifconfig ix0.444 create -> index 2
ifconfig vlan2 create -> allocation failure
This change fixes such allocations by using cloning indexes only for
"vlanX" interfaces.
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D27505
ROUTE_MPATH was added to the GENERIC kernel in r368648.
According to the plan in D27428, it was enabled with `net.route.multipath` sysctl set to 0.
Given enough time has passed, this change enables route multipath by default.
The goal is to ship FreeBSD 13 with multipath turned on.
Reviewed By: donner, olivier
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D28423
The NS_MOREFRAG flag can be set in a netmap slot to represent a
multi-fragment packet. Only the last fragment of a packet does
not have the flag set. On TX rings, the flag may be set by the
userspace application. The kernel will look at the flag and use it
to properly set up the NIC TX descriptors.
On RX rings, the kernel may set the flag if the packet received
was split across multiple netmap buffers. The userspace application
should look at the flag to know when the packet is complete.
Submitted by: rajesh1.kumar_amd.com
Reviewed by: vmaffione
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27799
(cherry picked from commit aceaccab65)
rxd_frag_to_sd() have pf_rv parameter as NULL with the current
code. This patch fixes the NULL pointer dereference in that
case thus avoiding a possible panic.
Submitted by: rajesh1.kumar at amd.com
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D28115
it gets unused.
Currently if_purgeifaddrs() uses in6_purgeaddr() to remove IPv6
ifaddrs. in6_purgeaddr() does not trrigger prefix removal if
number of linked ifas goes to 0, as this is a low-level function.
As a result, if_purgeifaddrs() purges all IPv4/IPv6 addresses but
keeps corresponding IPv6 prefixes.
Fix this by creating higher-level wrapper which handles unused
prefix usecase and use it in if_purgeifaddrs().
Differential revision: https://reviews.freebsd.org/D28128
rtinit[1]() is a function used to add or remove interface address prefix routes,
similar to ifa_maintain_loopback_route().
It was intended to be family-agnostic. There is a problem with this approach
in reality.
1) IPv6 code does not use it for the ifa routes. There is a separate layer,
nd6_prelist_(), providing interface for maintaining interface routes. Its part,
responsible for the actual route table interaction, mimics rtenty() code.
2) rtinit tries to combine multiple actions in the same function: constructing
proper route attributes and handling iterations over multiple fibs, for the
non-zero net.add_addr_allfibs use case. It notably increases the code complexity.
3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag
for p2p connections, host routes and p2p routes are handled in the same way.
Additionally, mapping IFA flags to RTF flags makes the interface pretty messy.
It make rtinit() to clash with ifa_mainain_loopback_route() for IPV4 interface
aliases.
4) rtinit() is the last customer passing non-masked prefixes to rib_action(),
complicating rib_action() implementation.
5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive"
ifa messages in certain corner cases.
To address all these points, the following has been done:
* rtinit() has been split into multiple functions:
- Route attribute construction were moved to the per-address-family functions,
dealing with (2), (3) and (4).
- funnction providing net.add_addr_allfibs handling and route rtsock notificaions
is the new routing table inteface.
- rtsock ifa notificaion has been moved out as well. resulting set of funcion are only
responsible for the actual route notifications.
Side effects:
* /32 alias does not result in interface routes (/32 route and "host" route)
* RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses
Differential revision: https://reviews.freebsd.org/D28186
Removed code iterates over if_addrhead and tries to remove
routes for each ifa.
This is exactly the thing that if_purgeaddrs() do, and
if_purgeaddr() is already called in the end.
Reviewed by: glebius
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D28106
In e86bddea9f sys/netpfil/pf/pf.h grew a
declaration of pf_get_ruleset_number. Now delete the old declaration
from sys/net/pfvar.h.
Reviewed by: kp
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D28081
Restore the hwofs functionality temporarily disabled by
7ba6ecf216 to prevent issues with iflib.
This patch brings the necessary changes to iflib to
enable howfs to allow interface restarts without
disrupting netmap applications actively using its
rings.
After this change, it becomes possible for multiple
non-cooperating netmap applications to use non-overlapping
subsets of the available netmap rings without clashing
with each other.
PR: 252453
MFC after: 1 week
The iflib_queues_alloc() allocates isc_nrxqs iflib_dma_info structs
for each rxqset, and links each struct to a different free list.
As a result, it must be isc_nrxqs >= isc_nfl (plus the completion
queue, if present).
Add an assertion to make this constraint explicit.
MFC after: 2 weeks
Since 1d238b07d5, krings are disabled before
a reinit cycle triggered by iflib_netmap_register.
However, this operation is actually necessary also for
any interface reinit triggered by other causes (i.e.,
ifconfig commands).
We achieve this goal by moving the krings enable/disable
operation inside iflib_stop() and iflib_init_locked().
Once here, this change also removes some redundant operations
from iflib_netmap_register(), that are already performed by
iflib_stop().
PR: 252453
MFC after: 1 week
When netmap_fl_refill() is called at initialization time (e.g.,
during netmap_iflib_register()), nic_i must be 0, since the
free list is reinitialized. At the end of the refill cycle, nic_i
must still be zero, because exactly N descriptors (N is the ring size)
are refilled.
This patch therefore fixes the assertions to check on nic_i rather
than on nm_i. The current netmap_reset() may in fact cause nm_i
to be != 0 while the device is resetting: this may happen when
multiple non-cooperating processes open different subsets of the
available netmap rings.
PR: 252518
MFC after: 1 week
When different processes open separate subsets of the
available rings of a same netmap interface, a device
reset may be performed while one of the processes
is actively using some rings (e.g., caused by another
process executing a nmport_open()).
With this patch, such situation will cause the
active process to get a POLLERR, so that it can
have a chance to detect the situation.
We also guarantee that no process is running a txsync
or rxsync (ioctl or poll) while an iflib device reset
is in progress.
PR: 252453
MFC after: 1 week
Doing a 'dd' over iscsi will reliably cause stalls. Tx
cleaning _should_ reliably happen as data is sent.
However, currently if the transmit queue fills it will
wait until the iflib timer (hz/2) runs.
This change causes the the tx taskq thread to be run
if there are completed descriptors.
While here:
- make timer interrupt delay a sysctl
- simplify txd_db_check handling
- comment on INTR types
Background on the change:
Initially doorbell updates were minimized by only writing to the register
on every fourth packet. If txq_drain would return without writing to the
doorbell it scheduled a callout on the next tick to do the doorbell write
to ensure that the write otherwise happened "soon". At that time a sysctl
was added for users to avoid the potential added latency by simply writing
to the doorbell register on every packet. This worked perfectly well for
e1000 and ixgbe ... and appeared to work well on ixl. However, as it
turned out there was a race to this approach that would lockup the ixl MAC.
It was possible for a lower producer index to be written after a higher one.
On e1000 and ixgbe this was harmless - on ixl it was fatal. My initial
response was to add a lock around doorbell writes - fixing the problem but
adding an unacceptable amount of lock contention.
The next iteration was to use transmit interrupts to drive delayed doorbell
writes. If there were no packets in the queue all doorbell writes would be
immediate as the queue started to fill up we could delay doorbell writes
further and further. At the start of drain if we've cleaned any packets we
know we've moved the state machine along and we write the doorbell (an
obvious missing optimization was to skip that doorbell write if db_pending
is zero). This change required that tx interrupts be scheduled periodically
as opposed to just when the hardware txq was full. However, that just leads
to our next problem.
Initially dedicated msix vectors were used for both tx and rx. However, it
was often possible to use up all available vectors before we set up all the
queues we wanted. By having rx and tx share a vector for a given queue we
could halve the number of vectors used by a given configuration. The problem
here is that with this change only e1000 passed the necessary value to have
the fast interrupt drive tx when appropriate.
Reported by: mav@
Tested by: mav@
Reviewed by: gallatin@
MFC after: 1 month
Sponsored by: iXsystems
Differential Revision: https://reviews.freebsd.org/D27683
Improve caching behaviour by using counter_u64 rather than variables
shared between cores.
The result of converting all counters to counter(9) (i.e. this full
patch series) is a significant improvement in throughput. As tested by
olivier@, on Intel Xeon E5-2697Av4 (16Cores, 32 threads) hardware with
Mellanox ConnectX-4 MCX416A-CCAT (100GBase-SR4) nics we see:
x FreeBSD 20201223: inet packets-per-second
+ FreeBSD 20201223 with pf patches: inet packets-per-second
+--------------------------------------------------------------------------+
| + |
| xx + |
|xxx +++|
||A| |
| |A||
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 9216962 9526356 9343902 9371057.6 116720.36
+ 5 19427190 19698400 19502922 19546509 109084.92
Difference at 95.0% confidence
1.01755e+07 +/- 164756
108.584% +/- 2.9359%
(Student's t, pooled s = 112967)
Reviewed by: philip
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27763
Factor out allocating and freeing pfi_kkif structures. This will be
useful when we change the counters to be counter_u64, so we don't have
to deal with that complexity in the multiple locations where we allocate
pfi_kkif structures.
No functional change.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27762
This improves the cache behaviour of pf and results in improved
throughput.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27760
The u_* counters are used only to communicate with userspace, as
userspace cannot use counter_u64. As pf_krule is not passed to userspace
these fields are now obsolete.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27759
As part of the split between user and kernel mode structures we're
moving all user space usable definitions into pf.h.
No functional change intended.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27757
Introduce a kernel version of struct pf_src_node (pf_ksrc_node).
This will allow us to improve the in-kernel data structure without
breaking userspace compatibility.
Reviewed by: philip
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27707