/*-
 * SPDX-License-Identifier: BSD-3-Clause
 *
 * Copyright (c) 1982, 1986, 1991, 1993, 1995
 *	The Regents of the University of California.
 * Copyright (c) 2007-2009 Robert N. M. Watson
 * Copyright (c) 2010-2011 Juniper Networks, Inc.
 * Copyright (c) 2021-2022 Gleb Smirnoff <glebius@FreeBSD.org>
 * All rights reserved.
 *
 * Portions of this software were developed by Robert N. M. Watson under
 * contract to Juniper Networks, Inc.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 *	@(#)in_pcb.c	8.4 (Berkeley) 5/24/95
 */

#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");

#include "opt_ddb.h"
#include "opt_ipsec.h"
#include "opt_inet.h"
#include "opt_inet6.h"
#include "opt_ratelimit.h"
#include "opt_route.h"
#include "opt_rss.h"

#include <sys/param.h>
#include <sys/hash.h>
#include <sys/systm.h>
#include <sys/libkern.h>
#include <sys/lock.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/eventhandler.h>
#include <sys/domain.h>
#include <sys/proc.h>
#include <sys/protosw.h>
#include <sys/smp.h>
#include <sys/smr.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <sys/sockio.h>
#include <sys/priv.h>
#include <sys/refcount.h>
#include <sys/jail.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

#ifdef DDB
#include <ddb/ddb.h>
#endif

#include <vm/uma.h>
#include <vm/vm.h>

#include <net/if.h>
#include <net/if_var.h>
#include <net/if_private.h>
#include <net/if_types.h>
#include <net/if_llatbl.h>
#include <net/route.h>
#include <net/rss_config.h>
#include <net/vnet.h>

#if defined(INET) || defined(INET6)
#include <netinet/in.h>
#include <netinet/in_pcb.h>
#include <netinet/in_pcb_var.h>
#include <netinet/tcp.h>
#ifdef INET
#include <netinet/in_var.h>
#include <netinet/in_fib.h>
#endif
#include <netinet/ip_var.h>
#ifdef INET6
#include <netinet/ip6.h>
#include <netinet6/in6_pcb.h>
#include <netinet6/in6_var.h>
#include <netinet6/ip6_var.h>
#endif /* INET6 */
#include <net/route/nhop.h>
#endif

#include <netipsec/ipsec_support.h>

#include <security/mac/mac_framework.h>

#define	INPCBLBGROUP_SIZMIN	8
#define	INPCBLBGROUP_SIZMAX	256
#define	INP_FREED	0x00000200	/* See in_pcb.h. */

/*
 * These configure the range of local port addresses assigned to
 * "unspecified" outgoing connections/packets/whatever.
 */
VNET_DEFINE(int, ipport_lowfirstauto) = IPPORT_RESERVED - 1;	/* 1023 */
VNET_DEFINE(int, ipport_lowlastauto) = IPPORT_RESERVEDSTART;	/* 600 */
VNET_DEFINE(int, ipport_firstauto) = IPPORT_EPHEMERALFIRST;	/* 10000 */
VNET_DEFINE(int, ipport_lastauto) = IPPORT_EPHEMERALLAST;	/* 65535 */
VNET_DEFINE(int, ipport_hifirstauto) = IPPORT_HIFIRSTAUTO;	/* 49152 */
VNET_DEFINE(int, ipport_hilastauto) = IPPORT_HILASTAUTO;	/* 65535 */
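An editorial aside, not part of in_pcb.c: the three pairs of "auto" bounds above are inclusive ranges from which a local port is drawn when a socket binds or connects without specifying one. A minimal userland sketch of that constraint (`pick_ephemeral_port()` is a hypothetical helper; the kernel's real selection logic lives in in_pcb_lport() and consults these VNET variables):

```c
#include <stdlib.h>

/*
 * Illustrative sketch only, not FreeBSD kernel code: choose a port
 * inside the inclusive [first, last] range, the way the "auto" bounds
 * above constrain ephemeral port selection.  rand() stands in for the
 * kernel's randomized port choice.
 */
static int
pick_ephemeral_port(int first, int last)
{
	return (first + rand() % (last - first + 1));
}
```

These bounds are runtime-tunable per VNET through the net.inet.ip.portrange.* sysctl tree (first/last, hifirst/hilast, lowfirst/lowlast).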
|
1996-01-19 03:00:58 -05:00
|
|
|
|
The ancient and outdated concept of "privileged ports" in UNIX-type
OSes has probably caused more problems than it ever solved. Allow the
user to retire the old behavior by specifying their own privileged
range with,
net.inet.ip.portrange.reservedhigh default = IPPORT_RESERVED - 1
net.inet.ip.portrange.reservedlo default = 0
Now you can run that webserver without ever needing root at all. Or
just imagine, an ftpd that can really drop privileges, rather than
just set the euid, and still do PORT data transfers from 20/tcp.
Two edge cases to note,
# sysctl net.inet.ip.portrange.reservedhigh=0
Opens all ports to everyone, and,
# sysctl net.inet.ip.portrange.reservedhigh=65535
Locks all network activity to root only (which could actually have
been achieved before with ipfw(8), but is somewhat more
complicated).
For those who stick to the old religion that 0-1023 belong to root and
root alone, don't touch the knobs (or even lock them by raising
securelevel(8)), and nothing changes.
2003-02-21 00:28:27 -05:00
|
|
|
/*
|
|
|
|
|
* Reserved ports accessible only to root. There are significant
|
|
|
|
|
* security considerations that must be accounted for when changing these,
|
|
|
|
|
* but the security benefits can be great. Please be careful.
|
|
|
|
|
*/
|
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)

VNET_DEFINE(int, ipport_reservedhigh) = IPPORT_RESERVED - 1;	/* 1023 */
VNET_DEFINE(int, ipport_reservedlow);

The ancient and outdated concept of "privileged ports" in UNIX-type
OSes has probably caused more problems than it ever solved. Allow the
user to retire the old behavior by specifying their own privileged
range with,
net.inet.ip.portrange.reservedhigh default = IPPORT_RESERVED - 1
net.inet.ip.portrange.reservedlow default = 0

/* Enable random ephemeral port allocation by default. */
VNET_DEFINE(int, ipport_randomized) = 1;

#ifdef INET

Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.

static struct inpcb *in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo,
    struct in_addr faddr, u_int fport_arg,
    struct in_addr laddr, u_int lport_arg,
    int lookupflags, uint8_t numa_domain);

#define RANGECHK(var, min, max) \
	if ((var) < (min)) { (var) = (min); } \
	else if ((var) > (max)) { (var) = (max); }

static int
sysctl_net_ipport_check(SYSCTL_HANDLER_ARGS)
{
	int error;

Permit building kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:
1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:
options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet
2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:
INIT_VNET_NET(ifp->if_vnet); becomes
struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];
3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.
4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.
5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.
6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.
Reviewed by: bz, rwatson
Approved by: julian (mentor)
	error = sysctl_handle_int(oidp, arg1, arg2, req);
	if (error == 0) {
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
		RANGECHK(V_ipport_lowfirstauto, 1, IPPORT_RESERVED - 1);
		RANGECHK(V_ipport_lowlastauto, 1, IPPORT_RESERVED - 1);
		RANGECHK(V_ipport_firstauto, IPPORT_RESERVED, IPPORT_MAX);
		RANGECHK(V_ipport_lastauto, IPPORT_RESERVED, IPPORT_MAX);
		RANGECHK(V_ipport_hifirstauto, IPPORT_RESERVED, IPPORT_MAX);
		RANGECHK(V_ipport_hilastauto, IPPORT_RESERVED, IPPORT_MAX);
	}
	return (error);
}

#undef RANGECHK

static SYSCTL_NODE(_net_inet_ip, IPPROTO_IP, portrange,
    CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
    "IP Ports");

SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, lowfirst,
    CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
    &VNET_NAME(ipport_lowfirstauto), 0, &sysctl_net_ipport_check, "I",
    "");
SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, lowlast,
    CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
    &VNET_NAME(ipport_lowlastauto), 0, &sysctl_net_ipport_check, "I",
    "");
SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, first,
    CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
    &VNET_NAME(ipport_firstauto), 0, &sysctl_net_ipport_check, "I",
    "");
SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, last,
    CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
    &VNET_NAME(ipport_lastauto), 0, &sysctl_net_ipport_check, "I",
    "");
SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, hifirst,
    CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
    &VNET_NAME(ipport_hifirstauto), 0, &sysctl_net_ipport_check, "I",
    "");
SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, hilast,
    CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
    &VNET_NAME(ipport_hilastauto), 0, &sysctl_net_ipport_check, "I",
    "");
SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, reservedhigh,
    CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE,
    &VNET_NAME(ipport_reservedhigh), 0, "");
SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, reservedlow,
    CTLFLAG_RW|CTLFLAG_SECURE, &VNET_NAME(ipport_reservedlow), 0, "");
SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, randomized,
    CTLFLAG_VNET | CTLFLAG_RW,
    &VNET_NAME(ipport_randomized), 0, "Enable random port allocation");

#ifdef RATELIMIT
counter_u64_t rate_limit_new;
counter_u64_t rate_limit_chg;
counter_u64_t rate_limit_active;
counter_u64_t rate_limit_alloc_fail;
counter_u64_t rate_limit_set_ok;

static SYSCTL_NODE(_net_inet_ip, OID_AUTO, rl, CTLFLAG_RD | CTLFLAG_MPSAFE, 0,
    "IP Rate Limiting");
SYSCTL_COUNTER_U64(_net_inet_ip_rl, OID_AUTO, active, CTLFLAG_RD,
    &rate_limit_active, "Active rate limited connections");
SYSCTL_COUNTER_U64(_net_inet_ip_rl, OID_AUTO, alloc_fail, CTLFLAG_RD,
    &rate_limit_alloc_fail, "Rate limited connection failures");
SYSCTL_COUNTER_U64(_net_inet_ip_rl, OID_AUTO, set_ok, CTLFLAG_RD,
    &rate_limit_set_ok, "Rate limited setting succeeded");
SYSCTL_COUNTER_U64(_net_inet_ip_rl, OID_AUTO, newrl, CTLFLAG_RD,
    &rate_limit_new, "Total Rate limit new attempts");
SYSCTL_COUNTER_U64(_net_inet_ip_rl, OID_AUTO, chgrl, CTLFLAG_RD,
    &rate_limit_chg, "Total Rate limited change attempts");
#endif /* RATELIMIT */

#endif /* INET */

VNET_DEFINE(uint32_t, in_pcbhashseed);
static void
in_pcbhashseed_init(void)
{

	V_in_pcbhashseed = arc4random();
}
VNET_SYSINIT(in_pcbhashseed_init, SI_SUB_PROTO_DOMAIN, SI_ORDER_FIRST,
    in_pcbhashseed_init, 0);

static void in_pcbremhash(struct inpcb *);

Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash()
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
/*
 * in_pcb.c: manage the Protocol Control Blocks.
 *
 * NOTE: It is assumed that most of these functions will be called with
 * the pcbinfo lock held, and often, the inpcb lock held, as these utility
 * functions often modify hash chains or addresses in pcbs.
 */

static struct inpcblbgroup *
in_pcblbgroup_alloc(struct inpcblbgrouphead *hdr, struct ucred *cred,
    u_char vflag, uint16_t port, const union in_dependaddr *addr, int size,
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21636
    uint8_t numa_domain)
{
	struct inpcblbgroup *grp;
	size_t bytes;

	bytes = __offsetof(struct inpcblbgroup, il_inp[size]);
	grp = malloc(bytes, M_PCB, M_ZERO | M_NOWAIT);
	if (grp == NULL)
		return (NULL);
	grp->il_cred = crhold(cred);
	grp->il_vflag = vflag;
	grp->il_lport = port;
	grp->il_numa_domain = numa_domain;
	grp->il_dependladdr = *addr;
	grp->il_inpsiz = size;
	CK_LIST_INSERT_HEAD(hdr, grp, il_list);
	return (grp);
}

static void
in_pcblbgroup_free_deferred(epoch_context_t ctx)
{
	struct inpcblbgroup *grp;

	grp = __containerof(ctx, struct inpcblbgroup, il_epoch_ctx);
	crfree(grp->il_cred);
	free(grp, M_PCB);
}

static void
in_pcblbgroup_free(struct inpcblbgroup *grp)
{

	CK_LIST_REMOVE(grp, il_list);
	NET_EPOCH_CALL(in_pcblbgroup_free_deferred, &grp->il_epoch_ctx);
}

static struct inpcblbgroup *
in_pcblbgroup_resize(struct inpcblbgrouphead *hdr,
    struct inpcblbgroup *old_grp, int size)
{
	struct inpcblbgroup *grp;
	int i;

	grp = in_pcblbgroup_alloc(hdr, old_grp->il_cred, old_grp->il_vflag,
	    old_grp->il_lport, &old_grp->il_dependladdr, size,
	    old_grp->il_numa_domain);
	if (grp == NULL)
		return (NULL);

	KASSERT(old_grp->il_inpcnt < grp->il_inpsiz,
	    ("invalid new local group size %d and old local group count %d",
	     grp->il_inpsiz, old_grp->il_inpcnt));

	for (i = 0; i < old_grp->il_inpcnt; ++i)
		grp->il_inp[i] = old_grp->il_inp[i];
	grp->il_inpcnt = old_grp->il_inpcnt;
	in_pcblbgroup_free(old_grp);
	return (grp);
}

/*
 * PCB at index 'i' is removed from the group. Pull up the ones below il_inp[i]
 * and shrink group if possible.
 */
static void
in_pcblbgroup_reorder(struct inpcblbgrouphead *hdr, struct inpcblbgroup **grpp,
    int i)
{
	struct inpcblbgroup *grp, *new_grp;

	grp = *grpp;
	for (; i + 1 < grp->il_inpcnt; ++i)
		grp->il_inp[i] = grp->il_inp[i + 1];
	grp->il_inpcnt--;

	if (grp->il_inpsiz > INPCBLBGROUP_SIZMIN &&
	    grp->il_inpcnt <= grp->il_inpsiz / 4) {
		/* Shrink this group. */
		new_grp = in_pcblbgroup_resize(hdr, grp, grp->il_inpsiz / 2);
		if (new_grp != NULL)
			*grpp = new_grp;
	}
}

/*
 * Add PCB to load balance group for SO_REUSEPORT_LB option.
 */
static int
in_pcbinslbgrouphash(struct inpcb *inp, uint8_t numa_domain)
|
2018-06-06 11:45:57 -04:00
|
|
|
{
|
2018-09-07 17:11:41 -04:00
|
|
|
const static struct timeval interval = { 60, 0 };
|
|
|
|
|
static struct timeval lastprint;
|
2018-06-06 11:45:57 -04:00
|
|
|
struct inpcbinfo *pcbinfo;
|
|
|
|
|
struct inpcblbgrouphead *hdr;
|
|
|
|
|
struct inpcblbgroup *grp;
|
2018-11-01 11:51:49 -04:00
|
|
|
uint32_t idx;
|
2018-06-06 11:45:57 -04:00
|
|
|
|
|
|
|
|
pcbinfo = inp->inp_pcbinfo;
|
|
|
|
|
|
|
|
|
|
INP_WLOCK_ASSERT(inp);
|
|
|
|
|
INP_HASH_WLOCK_ASSERT(pcbinfo);
|
|
|
|
|
|
|
|
|
|
#ifdef INET6
|
|
|
|
|
/*
|
|
|
|
|
* Don't allow IPv4 mapped INET6 wild socket.
|
|
|
|
|
*/
|
|
|
|
|
if ((inp->inp_vflag & INP_IPV4) &&
|
|
|
|
|
inp->inp_laddr.s_addr == INADDR_ANY &&
|
|
|
|
|
INP_CHECK_SOCKAF(inp->inp_socket, AF_INET6)) {
|
|
|
|
|
return (0);
|
|
|
|
|
}
|
|
|
|
|
#endif
|
|
|
|
|
|
2018-12-05 12:06:00 -05:00
|
|
|
idx = INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_lbgrouphashmask);
|
2018-11-01 11:51:49 -04:00
|
|
|
hdr = &pcbinfo->ipi_lbgrouphashbase[idx];
|
2018-09-10 15:00:29 -04:00
|
|
|
CK_LIST_FOREACH(grp, hdr, il_list) {
|
2022-11-02 13:08:07 -04:00
|
|
|
if (grp->il_cred->cr_prison == inp->inp_cred->cr_prison &&
|
|
|
|
|
grp->il_vflag == inp->inp_vflag &&
|
2018-06-06 11:45:57 -04:00
|
|
|
grp->il_lport == inp->inp_lport &&
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 17:04:46 -05:00
|
|
|
grp->il_numa_domain == numa_domain &&
|
2018-06-06 11:45:57 -04:00
|
|
|
memcmp(&grp->il_dependladdr,
|
2018-11-01 11:51:49 -04:00
|
|
|
&inp->inp_inc.inc_ie.ie_dependladdr,
|
2022-11-02 13:08:07 -04:00
|
|
|
sizeof(grp->il_dependladdr)) == 0) {
|
2018-06-06 11:45:57 -04:00
|
|
|
break;
|
2022-11-02 13:08:07 -04:00
|
|
|
}
|
2018-06-06 11:45:57 -04:00
|
|
|
}
|
|
|
|
|
if (grp == NULL) {
|
|
|
|
|
/* Create new load balance group. */
|
2022-11-02 13:08:07 -04:00
|
|
|
grp = in_pcblbgroup_alloc(hdr, inp->inp_cred, inp->inp_vflag,
|
2018-06-06 11:45:57 -04:00
|
|
|
inp->inp_lport, &inp->inp_inc.inc_ie.ie_dependladdr,
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 17:04:46 -05:00
|
|
|
INPCBLBGROUP_SIZMIN, numa_domain);
|
2018-11-01 11:51:49 -04:00
|
|
|
if (grp == NULL)
|
2018-06-06 11:45:57 -04:00
|
|
|
return (ENOBUFS);
|
|
|
|
|
} else if (grp->il_inpcnt == grp->il_inpsiz) {
|
|
|
|
|
if (grp->il_inpsiz >= INPCBLBGROUP_SIZMAX) {
|
2018-09-07 17:11:41 -04:00
|
|
|
if (ratecheck(&lastprint, &interval))
|
2018-06-06 11:45:57 -04:00
|
|
|
printf("lb group port %d, limit reached\n",
|
|
|
|
|
ntohs(grp->il_lport));
|
|
|
|
|
return (0);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* Expand this local group. */
|
|
|
|
|
grp = in_pcblbgroup_resize(hdr, grp, grp->il_inpsiz * 2);
|
2018-11-01 11:51:49 -04:00
|
|
|
if (grp == NULL)
|
2018-06-06 11:45:57 -04:00
|
|
|
return (ENOBUFS);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
KASSERT(grp->il_inpcnt < grp->il_inpsiz,
|
2018-11-01 11:51:49 -04:00
|
|
|
("invalid local group size %d and count %d", grp->il_inpsiz,
|
|
|
|
|
grp->il_inpcnt));
|
2018-06-06 11:45:57 -04:00
|
|
|
|
|
|
|
|
grp->il_inp[grp->il_inpcnt] = inp;
|
|
|
|
|
grp->il_inpcnt++;
|
|
|
|
|
return (0);
|
|
|
|
|
}
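Insertion and removal both locate the candidate chain with the same port-keyed bucket computation (`INP_PCBPORTHASH(lport, mask)`). A minimal portable model of that indexing, assuming a power-of-two table where the mask is the table size minus one (the `port_bucket` name is an illustrative stand-in, not a kernel symbol):

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>		/* ntohs(), htons() */

/*
 * Model of a port-keyed bucket index: with a power-of-two hash
 * table, masking the host-order port number selects the chain head.
 * All groups sharing a local port land on the same chain.
 */
static uint32_t
port_bucket(uint16_t lport_net, uint32_t hashmask)
{
	return ((uint32_t)ntohs(lport_net) & hashmask);
}
```

Because the bucket depends only on the local port, a lookup for an incoming connection walks exactly one chain and then disambiguates by credential, address family, local address, and NUMA domain, as in the `CK_LIST_FOREACH` match above.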

/*
 * Remove PCB from load balance group.
 */
static void
in_pcbremlbgrouphash(struct inpcb *inp)
{
	struct inpcbinfo *pcbinfo;
	struct inpcblbgrouphead *hdr;
	struct inpcblbgroup *grp;
	int i;

	pcbinfo = inp->inp_pcbinfo;

	INP_WLOCK_ASSERT(inp);
	INP_HASH_WLOCK_ASSERT(pcbinfo);

	hdr = &pcbinfo->ipi_lbgrouphashbase[
	    INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_lbgrouphashmask)];
	CK_LIST_FOREACH(grp, hdr, il_list) {
		for (i = 0; i < grp->il_inpcnt; ++i) {
			if (grp->il_inp[i] != inp)
				continue;

			if (grp->il_inpcnt == 1) {
				/* We are the last, free this local group. */
				in_pcblbgroup_free(grp);
			} else {
				/* Pull up inpcbs, shrink group if possible. */
				in_pcblbgroup_reorder(hdr, &grp, i);
			}
			return;
		}
	}
}

int
in_pcblbgroup_numa(struct inpcb *inp, int arg)
{
	struct inpcbinfo *pcbinfo;
	struct inpcblbgrouphead *hdr;
	struct inpcblbgroup *grp;
	int err, i;
	uint8_t numa_domain;

	switch (arg) {
	case TCP_REUSPORT_LB_NUMA_NODOM:
		numa_domain = M_NODOM;
		break;
	case TCP_REUSPORT_LB_NUMA_CURDOM:
		numa_domain = PCPU_GET(domain);
		break;
	default:
		if (arg < 0 || arg >= vm_ndomains)
			return (EINVAL);
		numa_domain = arg;
	}

	err = 0;
	pcbinfo = inp->inp_pcbinfo;
	INP_WLOCK_ASSERT(inp);
	INP_HASH_WLOCK(pcbinfo);
	hdr = &pcbinfo->ipi_lbgrouphashbase[
	    INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_lbgrouphashmask)];
	CK_LIST_FOREACH(grp, hdr, il_list) {
		for (i = 0; i < grp->il_inpcnt; ++i) {
			if (grp->il_inp[i] != inp)
				continue;

			if (grp->il_numa_domain == numa_domain) {
				goto abort_with_hash_wlock;
			}

			/* Remove it from the old group. */
			in_pcbremlbgrouphash(inp);

			/* Add it to the new group based on numa domain. */
			in_pcbinslbgrouphash(inp, numa_domain);
			goto abort_with_hash_wlock;
		}
	}
	err = ENOENT;
abort_with_hash_wlock:
	INP_HASH_WUNLOCK(pcbinfo);
	return (err);
}
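The argument mapping at the top of `in_pcblbgroup_numa()` can be mirrored as a small portable sketch. This is a hedged model only: `LB_NUMA_NODOM`, `LB_NUMA_CURDOM`, `NODOM`, `NDOMAINS`, and `map_numa_arg()` are invented stand-ins for the kernel's `TCP_REUSPORT_LB_NUMA_*` constants, `M_NODOM`, `vm_ndomains`, and the real switch, with arbitrary example values.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative stand-ins for kernel definitions. */
#define LB_NUMA_NODOM	(-2)	/* disable NUMA filtering */
#define LB_NUMA_CURDOM	(-1)	/* use the caller's current domain */
#define NODOM		255	/* "no domain" sentinel */
#define NDOMAINS	2	/* pretend the machine has two domains */

/*
 * Map a user-supplied domain argument to an internal domain id,
 * mirroring the switch at the top of in_pcblbgroup_numa():
 * two special values, otherwise a bounds-checked explicit domain.
 */
static int
map_numa_arg(int arg, int curdom, unsigned char *out)
{
	switch (arg) {
	case LB_NUMA_NODOM:
		*out = NODOM;
		return (0);
	case LB_NUMA_CURDOM:
		*out = (unsigned char)curdom;
		return (0);
	default:
		if (arg < 0 || arg >= NDOMAINS)
			return (EINVAL);
		*out = (unsigned char)arg;
		return (0);
	}
}
```

Note that the explicit-domain path validates against the number of domains actually present, so a stale or garbage argument fails with `EINVAL` instead of silently creating an unreachable group.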

/* Make sure it is safe to use hashinit(9) on CK_LIST. */
CTASSERT(sizeof(struct inpcbhead) == sizeof(LIST_HEAD(, inpcb)));

/*
 * Initialize an inpcbinfo - a per-VNET instance of connections db.
 */
void
in_pcbinfo_init(struct inpcbinfo *pcbinfo, struct inpcbstorage *pcbstor,
    u_int hash_nelements, u_int porthash_nelements)
{

	mtx_init(&pcbinfo->ipi_lock, pcbstor->ips_infolock_name, NULL, MTX_DEF);
	mtx_init(&pcbinfo->ipi_hash_lock, pcbstor->ips_hashlock_name,
	    NULL, MTX_DEF);
#ifdef VIMAGE
	pcbinfo->ipi_vnet = curvnet;
#endif
	CK_LIST_INIT(&pcbinfo->ipi_listhead);
	pcbinfo->ipi_count = 0;
	pcbinfo->ipi_hash_exact = hashinit(hash_nelements, M_PCB,
	    &pcbinfo->ipi_hashmask);
	pcbinfo->ipi_hash_wild = hashinit(hash_nelements, M_PCB,
	    &pcbinfo->ipi_hashmask);
	porthash_nelements = imin(porthash_nelements, IPPORT_MAX + 1);
	pcbinfo->ipi_porthashbase = hashinit(porthash_nelements, M_PCB,
	    &pcbinfo->ipi_porthashmask);
	pcbinfo->ipi_lbgrouphashbase = hashinit(porthash_nelements, M_PCB,
	    &pcbinfo->ipi_lbgrouphashmask);
	pcbinfo->ipi_zone = pcbstor->ips_zone;
	pcbinfo->ipi_portzone = pcbstor->ips_portzone;
	pcbinfo->ipi_smr = uma_zone_get_smr(pcbinfo->ipi_zone);
}
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* Destroy an inpcbinfo.
|
|
|
|
|
*/
|
|
|
|
|
void
|
|
|
|
|
in_pcbinfo_destroy(struct inpcbinfo *pcbinfo)
|
|
|
|
|
{
|
|
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
|
|
|
KASSERT(pcbinfo->ipi_count == 0,
|
|
|
|
|
("%s: ipi_count = %u", __func__, pcbinfo->ipi_count));
|
|
|
|
|
|
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worse case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp iterators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
2023-04-20 11:48:01 -04:00
|
|
|
hashdestroy(pcbinfo->ipi_hash_exact, M_PCB, pcbinfo->ipi_hashmask);
|
|
|
|
|
hashdestroy(pcbinfo->ipi_hash_wild, M_PCB, pcbinfo->ipi_hashmask);
|
2010-03-14 14:59:11 -04:00
|
|
|
hashdestroy(pcbinfo->ipi_porthashbase, M_PCB,
|
|
|
|
|
pcbinfo->ipi_porthashmask);
|
2018-06-06 11:45:57 -04:00
|
|
|
hashdestroy(pcbinfo->ipi_lbgrouphashbase, M_PCB,
|
|
|
|
|
pcbinfo->ipi_lbgrouphashmask);
|
2021-12-02 17:45:04 -05:00
|
|
|
mtx_destroy(&pcbinfo->ipi_hash_lock);
|
|
|
|
|
mtx_destroy(&pcbinfo->ipi_lock);
|
2010-03-14 14:59:11 -04:00
|
|
|
}
|
|
|
|
|
|
inpcb: use global UMA zones for protocols
Provide structure inpcbstorage, that holds zones and lock names for
a protocol. Initialize it with global protocol init using macro
INPCBSTORAGE_DEFINE(). Then, at VNET protocol init supply it as
the main argument to the in_pcbinfo_init(). Each VNET pcbinfo uses
its private hash, but they all use the same zone for allocation and
the same SMR section for synchronization.
Note: there is kern.ipc.maxsockets sysctl, which controls UMA limit
on the socket zone, which was always global. Historically the same
maxsockets value has also been applied to every PCB zone. Important fact:
you can't create a pcb without a socket! A pcb may outlive its socket,
however. Given that there are multiple protocols, and only one socket
zone, the per pcb zone limits seem to have little value. Under very
special conditions it may trigger a little bit earlier than socket zone
limit, but in most setups the socket zone limit will be triggered
earlier. When VIMAGE was added to the kernel, PCB zones became per-VNET.
This magnified the existing imbalance further: now we have multiple pcb
zones in multiple vnets limited to maxsockets, but every pcb requires a
socket allocated from the global zone also limited by maxsockets.
IMHO, this per pcb zone limit doesn't bring any value, so this patch
drops it. If anybody can explain the value of this limit, it can be
restored very easily - just a 2-line change to in_pcbstorage_init().
Differential revision: https://reviews.freebsd.org/D33542
2022-01-03 13:15:22 -05:00
|
|
|
/*
|
|
|
|
|
* Initialize a pcbstorage - per protocol zones to allocate inpcbs.
|
|
|
|
|
*/
|
|
|
|
|
static void inpcb_fini(void *, int);
|
|
|
|
|
void
|
|
|
|
|
in_pcbstorage_init(void *arg)
|
|
|
|
|
{
|
|
|
|
|
struct inpcbstorage *pcbstor = arg;
|
|
|
|
|
|
|
|
|
|
pcbstor->ips_zone = uma_zcreate(pcbstor->ips_zone_name,
|
2023-04-20 11:48:33 -04:00
|
|
|
pcbstor->ips_size, NULL, NULL, pcbstor->ips_pcbinit,
|
2022-12-14 14:19:35 -05:00
|
|
|
inpcb_fini, UMA_ALIGN_CACHE, UMA_ZONE_SMR);
|
2022-01-03 13:15:22 -05:00
|
|
|
pcbstor->ips_portzone = uma_zcreate(pcbstor->ips_portzone_name,
|
|
|
|
|
sizeof(struct inpcbport), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
|
|
|
|
|
uma_zone_set_smr(pcbstor->ips_portzone,
|
|
|
|
|
uma_zone_get_smr(pcbstor->ips_zone));
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* Destroy a pcbstorage - used by unloadable protocols.
|
|
|
|
|
*/
|
|
|
|
|
void
|
|
|
|
|
in_pcbstorage_destroy(void *arg)
|
|
|
|
|
{
|
|
|
|
|
struct inpcbstorage *pcbstor = arg;
|
|
|
|
|
|
|
|
|
|
uma_zdestroy(pcbstor->ips_zone);
|
|
|
|
|
uma_zdestroy(pcbstor->ips_portzone);
|
|
|
|
|
}
|
|
|
|
|
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 04:15:13 -05:00
|
|
|
/*
|
|
|
|
|
* Allocate a PCB and associate it with the socket.
|
2006-07-18 18:34:27 -04:00
|
|
|
* On success return with the PCB locked.
|
1998-01-27 04:15:13 -05:00
|
|
|
*/
|
1994-05-24 06:09:53 -04:00
|
|
|
int
|
2006-07-18 18:34:27 -04:00
|
|
|
in_pcballoc(struct socket *so, struct inpcbinfo *pcbinfo)
|
1994-05-24 06:09:53 -04:00
|
|
|
{
|
2006-01-21 20:16:25 -05:00
|
|
|
struct inpcb *inp;
|
2022-04-13 19:08:23 -04:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT) || defined(MAC)
|
2001-07-26 15:19:49 -04:00
|
|
|
int error;
|
2022-04-13 19:08:23 -04:00
|
|
|
#endif
|
Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.
For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.
Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
2003-11-17 19:39:07 -05:00
|
|
|
|
2021-12-02 17:45:04 -05:00
|
|
|
inp = uma_zalloc_smr(pcbinfo->ipi_zone, M_NOWAIT);
|
1994-05-24 06:09:53 -04:00
|
|
|
if (inp == NULL)
|
|
|
|
|
return (ENOBUFS);
|
2017-05-24 13:47:16 -04:00
|
|
|
bzero(&inp->inp_start_zero, inp_zero_size);
|
2019-04-25 11:37:28 -04:00
|
|
|
#ifdef NUMA
|
|
|
|
|
inp->inp_numa_domain = M_NODOM;
|
|
|
|
|
#endif
|
1995-04-08 21:29:31 -04:00
|
|
|
inp->inp_pcbinfo = pcbinfo;
|
1994-05-24 06:09:53 -04:00
|
|
|
inp->inp_socket = so;
|
2008-10-04 11:06:34 -04:00
|
|
|
inp->inp_cred = crhold(so->so_cred);
|
Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)
Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.
From my notes:
-----
One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.
Constraints:
------------
I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.
One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".
One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.
This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.
To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.
The basic change in the ABI compatible version of the change is to
extend that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X], where for all
protocol families except ipv4, Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.
In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.
One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).
You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.
This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.
Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.
Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility called setfib
that acts a bit like nice:
setfib -3 ping target.example.com # will use fib 3 for ping.
It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.
2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)
3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).
4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.
5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being responded to.
6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect with the process that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.
Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)
In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.
In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.
Early testing experience:
-------------------------
Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.
For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.
Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routed
accordingly.
ipfw has grown 2 new keywords:
setfib N ip from any to any
count ip from any to any fib N
In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.
SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. It will be interesting to see how that handles it
when it suddenly actually does something.
Where to next:
--------------------
After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.
Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.
My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.
When the ABI can be changed, it raises the possibility of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the destination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.
Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.
This work was sponsored by Ironport Systems/Cisco
Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco
2008-05-09 19:03:00 -04:00
|
|
|
inp->inp_inc.inc_fibnum = so->so_fibnum;
|
2003-11-17 19:39:07 -05:00
|
|
|
#ifdef MAC
|
2007-10-24 15:04:04 -04:00
|
|
|
error = mac_inpcb_init(inp, M_NOWAIT);
|
2003-11-17 19:39:07 -05:00
|
|
|
if (error != 0)
|
|
|
|
|
goto out;
|
2007-10-24 15:04:04 -04:00
|
|
|
mac_inpcb_create(so, inp);
|
2003-11-17 19:39:07 -05:00
|
|
|
#endif
|
2017-02-06 03:49:57 -05:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
|
|
|
|
error = ipsec_init_pcbpolicy(inp);
|
2007-12-22 05:06:11 -05:00
|
|
|
if (error != 0) {
|
|
|
|
|
#ifdef MAC
|
|
|
|
|
mac_inpcb_destroy(inp);
|
|
|
|
|
#endif
|
2003-11-17 19:39:07 -05:00
|
|
|
goto out;
|
2008-03-17 09:04:56 -04:00
|
|
|
}
|
2007-07-03 08:13:45 -04:00
|
|
|
#endif /*IPSEC*/
|
2006-11-30 05:54:54 -05:00
|
|
|
#ifdef INET6
|
2003-02-19 17:32:43 -05:00
|
|
|
if (INP_SOCKAF(so) == AF_INET6) {
|
2022-08-10 14:09:34 -04:00
|
|
|
inp->inp_vflag |= INP_IPV6PROTO | INP_IPV6;
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 19:27:27 -04:00
|
|
|
if (V_ip6_v6only)
|
2003-02-19 17:32:43 -05:00
|
|
|
inp->inp_flags |= IN6P_IPV6_V6ONLY;
|
2022-08-10 14:09:34 -04:00
|
|
|
#ifdef INET
|
|
|
|
|
else
|
|
|
|
|
inp->inp_vflag |= INP_IPV4;
|
|
|
|
|
#endif
|
|
|
|
|
if (V_ip6_auto_flowlabel)
|
|
|
|
|
inp->inp_flags |= IN6P_AUTOFLOWLABEL;
|
|
|
|
|
inp->in6p_hops = -1; /* use kernel default */
|
2003-02-19 17:32:43 -05:00
|
|
|
}
|
2022-08-10 14:09:34 -04:00
|
|
|
#endif
|
|
|
|
|
#if defined(INET) && defined(INET6)
|
|
|
|
|
else
|
|
|
|
|
#endif
|
|
|
|
|
#ifdef INET
|
|
|
|
|
inp->inp_vflag |= INP_IPV4;
|
2001-06-11 08:39:29 -04:00
|
|
|
#endif
|
2023-04-20 11:48:01 -04:00
|
|
|
inp->inp_smr = SMR_SEQ_INVALID;
|
|
|
|
|
|
2017-03-25 11:06:28 -04:00
|
|
|
/*
|
|
|
|
|
* Routes in inpcb's can cache L2 as well; they are guaranteed
|
|
|
|
|
* to be cleaned up.
|
|
|
|
|
*/
|
|
|
|
|
inp->inp_route.ro_flags = RT_LLE_CACHE;
|
2021-12-02 17:45:04 -05:00
|
|
|
refcount_init(&inp->inp_refcount, 1); /* Reference from socket. */
|
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
|
INP_INFO_WLOCK(pcbinfo);
|
|
|
|
|
pcbinfo->ipi_count++;
|
|
|
|
|
inp->inp_gencnt = ++pcbinfo->ipi_gencnt;
|
|
|
|
|
CK_LIST_INSERT_HEAD(&pcbinfo->ipi_listhead, inp, inp_list);
|
|
|
|
|
INP_INFO_WUNLOCK(pcbinfo);
|
|
|
|
|
so->so_pcb = inp;
|
|
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
|
|
2017-02-14 16:33:10 -05:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT) || defined(MAC)
|
Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.
For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.
Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
2003-11-17 19:39:07 -05:00
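The caching idea above can be sketched with simplified, invented types (not the MAC Framework API): the inpcb keeps a copy of the socket's label so a delivery-time check never dereferences inp->inp_socket, and the socket layer must call a pr_sosetlabel()-style hook to refresh the cache.

```c
#include <assert.h>

/* Simplified model of the inpcb label cache; types are invented. */
struct label { int level; };

struct socket { struct label so_label; };
struct inpcb { struct socket *inp_socket; struct label inp_label; };

/* Analogue of in_pcbsosetlabel(): copy the socket label into the PCB. */
static void
pcb_sosetlabel(struct inpcb *inp)
{
	inp->inp_label = inp->inp_socket->so_label;
}

/* The delivery access check consults only the cached copy. */
static int
pcb_check_deliver(const struct inpcb *inp, int required)
{
	return (inp->inp_label.level >= required);
}

static int
demo(void)
{
	struct socket so = { .so_label = { .level = 1 } };
	struct inpcb inp = { .inp_socket = &so };

	pcb_sosetlabel(&inp);
	if (!pcb_check_deliver(&inp, 1))
		return (1);
	so.so_label.level = 2;		/* label raised at the socket layer */
	if (pcb_check_deliver(&inp, 2))	/* cache is stale until the hook runs */
		return (2);
	pcb_sosetlabel(&inp);		/* pr_sosetlabel()-style update */
	if (!pcb_check_deliver(&inp, 2))
		return (3);
	return (0);
}
```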
|
|
|
out:
|
2021-12-02 17:45:04 -05:00
|
|
|
uma_zfree_smr(pcbinfo->ipi_zone, inp);
|
2021-12-02 16:35:14 -05:00
|
|
|
return (error);
|
2021-12-02 17:45:04 -05:00
|
|
|
#endif
|
1994-05-24 06:09:53 -04:00
|
|
|
}
|
|
|
|
|
|
2011-04-30 07:04:34 -04:00
|
|
|
#ifdef INET
|
1994-05-24 06:09:53 -04:00
|
|
|
int
|
2023-02-15 13:30:16 -05:00
|
|
|
in_pcbbind(struct inpcb *inp, struct sockaddr_in *sin, struct ucred *cred)
|
1994-05-24 06:09:53 -04:00
|
|
|
{
|
2002-10-20 17:44:31 -04:00
|
|
|
int anonport, error;
|
|
|
|
|
|
2023-02-15 13:30:16 -05:00
|
|
|
KASSERT(sin == NULL || sin->sin_family == AF_INET,
|
|
|
|
|
("%s: invalid address family for %p", __func__, sin));
|
|
|
|
|
KASSERT(sin == NULL || sin->sin_len == sizeof(struct sockaddr_in),
|
|
|
|
|
("%s: invalid address length for %p", __func__, sin));
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
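The lookup discipline this commit message describes can be sketched with toy "locks" (plain flags) and invented names: hold the hash lock just long enough to find the PCB and take a reference, then acquire the PCB's own lock only after the hash lock is dropped, keeping hash-lock hold times short and the lock order consistent.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of reference-then-lock lookup; flags stand in for locks. */
struct pcb {
	int locked;
	int refcount;
	int lport;
};

static int hash_locked;
static struct pcb table[2] = { { 0, 1, 80 }, { 0, 1, 443 } };

/* Returns the PCB referenced and locked, or NULL if not found. */
static struct pcb *
pcb_lookup(int lport)
{
	struct pcb *p = NULL;

	hash_locked = 1;
	for (size_t i = 0; i < 2; i++) {
		if (table[i].lport == lport) {
			p = &table[i];
			p->refcount++;	/* pin memory before unlocking */
			break;
		}
	}
	hash_locked = 0;		/* drop the hash lock first... */
	if (p != NULL)
		p->locked = 1;		/* ...then lock the PCB itself */
	return (p);
}

static void
pcb_release(struct pcb *p)
{
	p->refcount--;
	p->locked = 0;
}

static int
demo(void)
{
	struct pcb *p = pcb_lookup(443);

	if (p == NULL || p->refcount != 2 || !p->locked || hash_locked)
		return (1);
	pcb_release(p);
	if (p->refcount != 1 || pcb_lookup(7) != NULL)
		return (2);
	return (0);
}
```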
|
|
|
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
|
2003-11-08 18:02:36 -05:00
|
|
|
|
2002-10-20 17:44:31 -04:00
|
|
|
if (inp->inp_lport != 0 || inp->inp_laddr.s_addr != INADDR_ANY)
|
|
|
|
|
return (EINVAL);
|
2023-02-15 13:30:16 -05:00
|
|
|
anonport = sin == NULL || sin->sin_port == 0;
|
|
|
|
|
error = in_pcbbind_setup(inp, sin, &inp->inp_laddr.s_addr,
|
2004-03-27 16:05:46 -05:00
|
|
|
&inp->inp_lport, cred);
|
2002-10-20 17:44:31 -04:00
|
|
|
if (error)
|
|
|
|
|
return (error);
|
|
|
|
|
if (in_pcbinshash(inp) != 0) {
|
|
|
|
|
inp->inp_laddr.s_addr = INADDR_ANY;
|
|
|
|
|
inp->inp_lport = 0;
|
|
|
|
|
return (EAGAIN);
|
|
|
|
|
}
|
|
|
|
|
if (anonport)
|
|
|
|
|
inp->inp_flags |= INP_ANONPORT;
|
|
|
|
|
return (0);
|
|
|
|
|
}
|
2011-04-30 07:04:34 -04:00
|
|
|
#endif
|
2002-10-20 17:44:31 -04:00
|
|
|
|
2020-05-18 18:53:12 -04:00
|
|
|
#if defined(INET) || defined(INET6)
|
2014-07-29 19:42:51 -04:00
|
|
|
/*
|
2020-05-18 18:53:12 -04:00
|
|
|
* Assign a local port like in_pcb_lport(), but also used with connect()
|
|
|
|
|
* and a foreign address and port. If fsa is non-NULL, choose a local port
|
|
|
|
|
* that is unused with those, otherwise one that is completely unused.
|
2020-05-18 21:05:13 -04:00
|
|
|
* lsa can be NULL for IPv6.
|
2014-07-29 19:42:51 -04:00
|
|
|
*/
|
2011-03-12 16:46:37 -05:00
|
|
|
int
|
2020-05-18 18:53:12 -04:00
|
|
|
in_pcb_lport_dest(struct inpcb *inp, struct sockaddr *lsa, u_short *lportp,
|
|
|
|
|
struct sockaddr *fsa, u_short fport, struct ucred *cred, int lookupflags)
|
2011-03-12 16:46:37 -05:00
|
|
|
{
|
|
|
|
|
struct inpcbinfo *pcbinfo;
|
|
|
|
|
struct inpcb *tmpinp;
|
|
|
|
|
unsigned short *lastport;
|
2022-10-31 11:57:11 -04:00
|
|
|
int count, error;
|
2011-03-12 16:46:37 -05:00
|
|
|
u_short aux, first, last, lport;
|
|
|
|
|
#ifdef INET
|
2020-05-18 18:53:12 -04:00
|
|
|
struct in_addr laddr, faddr;
|
|
|
|
|
#endif
|
|
|
|
|
#ifdef INET6
|
|
|
|
|
struct in6_addr *laddr6, *faddr6;
|
2011-03-12 16:46:37 -05:00
|
|
|
#endif
|
|
|
|
|
|
|
|
|
|
pcbinfo = inp->inp_pcbinfo;
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* Because no actual state changes occur here, a global write lock on
|
|
|
|
|
* the pcbinfo isn't required.
|
|
|
|
|
*/
|
|
|
|
|
INP_LOCK_ASSERT(inp);
|
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
2011-03-12 16:46:37 -05:00
|
|
|
|
|
|
|
|
if (inp->inp_flags & INP_HIGHPORT) {
|
|
|
|
|
first = V_ipport_hifirstauto; /* sysctl */
|
|
|
|
|
last = V_ipport_hilastauto;
|
|
|
|
|
lastport = &pcbinfo->ipi_lasthi;
|
|
|
|
|
} else if (inp->inp_flags & INP_LOWPORT) {
|
2018-12-11 14:32:16 -05:00
|
|
|
error = priv_check_cred(cred, PRIV_NETINET_RESERVEDPORT);
|
2011-03-12 16:46:37 -05:00
|
|
|
if (error)
|
|
|
|
|
return (error);
|
|
|
|
|
first = V_ipport_lowfirstauto; /* 1023 */
|
|
|
|
|
last = V_ipport_lowlastauto; /* 600 */
|
|
|
|
|
lastport = &pcbinfo->ipi_lastlow;
|
|
|
|
|
} else {
|
|
|
|
|
first = V_ipport_firstauto; /* sysctl */
|
|
|
|
|
last = V_ipport_lastauto;
|
|
|
|
|
lastport = &pcbinfo->ipi_lastport;
|
|
|
|
|
}
|
2022-10-31 11:57:11 -04:00
|
|
|
|
2011-03-12 16:46:37 -05:00
|
|
|
/*
|
|
|
|
|
* Instead of having two loops further down counting up or down
|
|
|
|
|
* make sure that first is always <= last and go with only one
|
|
|
|
|
* code path implementing all logic.
|
|
|
|
|
*/
|
|
|
|
|
if (first > last) {
|
|
|
|
|
aux = first;
|
|
|
|
|
first = last;
|
|
|
|
|
last = aux;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#ifdef INET
|
2022-07-29 10:23:23 -04:00
|
|
|
laddr.s_addr = INADDR_ANY; /* used by INET6+INET below too */
|
2011-03-19 15:08:54 -04:00
|
|
|
if ((inp->inp_vflag & (INP_IPV4|INP_IPV6)) == INP_IPV4) {
|
2020-05-18 21:05:13 -04:00
|
|
|
if (lsa != NULL)
|
|
|
|
|
laddr = ((struct sockaddr_in *)lsa)->sin_addr;
|
2020-05-18 18:53:12 -04:00
|
|
|
if (fsa != NULL)
|
|
|
|
|
faddr = ((struct sockaddr_in *)fsa)->sin_addr;
|
|
|
|
|
}
|
|
|
|
|
#endif
|
|
|
|
|
#ifdef INET6
|
2020-05-18 21:05:13 -04:00
|
|
|
laddr6 = NULL;
|
|
|
|
|
if ((inp->inp_vflag & INP_IPV6) != 0) {
|
|
|
|
|
if (lsa != NULL)
|
|
|
|
|
laddr6 = &((struct sockaddr_in6 *)lsa)->sin6_addr;
|
2020-05-18 18:53:12 -04:00
|
|
|
if (fsa != NULL)
|
|
|
|
|
faddr6 = &((struct sockaddr_in6 *)fsa)->sin6_addr;
|
2011-03-12 16:46:37 -05:00
|
|
|
}
|
|
|
|
|
#endif
|
2020-05-18 18:53:12 -04:00
|
|
|
|
|
|
|
|
tmpinp = NULL;
|
2011-03-12 16:46:37 -05:00
|
|
|
lport = *lportp;
|
|
|
|
|
|
2022-10-31 11:57:11 -04:00
|
|
|
if (V_ipport_randomized)
|
2011-03-12 16:46:37 -05:00
|
|
|
*lastport = first + (arc4random() % (last - first));
|
|
|
|
|
|
|
|
|
|
count = last - first;
|
|
|
|
|
|
|
|
|
|
do {
|
|
|
|
|
if (count-- < 0) /* completely used? */
|
|
|
|
|
return (EADDRNOTAVAIL);
|
|
|
|
|
++*lastport;
|
|
|
|
|
if (*lastport < first || *lastport > last)
|
|
|
|
|
*lastport = first;
|
|
|
|
|
lport = htons(*lastport);
|
|
|
|
|
|
2020-05-18 18:53:12 -04:00
|
|
|
if (fsa != NULL) {
|
|
|
|
|
#ifdef INET
|
|
|
|
|
if (lsa->sa_family == AF_INET) {
|
|
|
|
|
tmpinp = in_pcblookup_hash_locked(pcbinfo,
|
|
|
|
|
faddr, fport, laddr, lport, lookupflags,
|
2023-02-09 15:59:27 -05:00
|
|
|
M_NODOM);
|
2020-05-18 18:53:12 -04:00
|
|
|
}
|
|
|
|
|
#endif
|
|
|
|
|
#ifdef INET6
|
|
|
|
|
if (lsa->sa_family == AF_INET6) {
|
|
|
|
|
tmpinp = in6_pcblookup_hash_locked(pcbinfo,
|
|
|
|
|
faddr6, fport, laddr6, lport, lookupflags,
|
2023-02-09 15:59:27 -05:00
|
|
|
M_NODOM);
|
2020-05-18 18:53:12 -04:00
|
|
|
}
|
|
|
|
|
#endif
|
|
|
|
|
} else {
|
2011-03-12 16:46:37 -05:00
|
|
|
#ifdef INET6
|
2022-07-29 10:23:23 -04:00
|
|
|
if ((inp->inp_vflag & INP_IPV6) != 0) {
|
2020-05-18 18:53:12 -04:00
|
|
|
tmpinp = in6_pcblookup_local(pcbinfo,
|
|
|
|
|
&inp->in6p_laddr, lport, lookupflags, cred);
|
2022-07-29 10:23:23 -04:00
|
|
|
#ifdef INET
|
|
|
|
|
if (tmpinp == NULL &&
|
|
|
|
|
(inp->inp_vflag & INP_IPV4))
|
|
|
|
|
tmpinp = in_pcblookup_local(pcbinfo,
|
|
|
|
|
laddr, lport, lookupflags, cred);
|
|
|
|
|
#endif
|
|
|
|
|
}
|
2011-03-12 16:46:37 -05:00
|
|
|
#endif
|
|
|
|
|
#if defined(INET) && defined(INET6)
|
2020-05-18 18:53:12 -04:00
|
|
|
else
|
2011-03-12 16:46:37 -05:00
|
|
|
#endif
|
|
|
|
|
#ifdef INET
|
2020-05-18 18:53:12 -04:00
|
|
|
tmpinp = in_pcblookup_local(pcbinfo, laddr,
|
|
|
|
|
lport, lookupflags, cred);
|
2011-03-12 16:46:37 -05:00
|
|
|
#endif
|
2020-05-18 18:53:12 -04:00
|
|
|
}
|
2011-03-12 16:46:37 -05:00
|
|
|
} while (tmpinp != NULL);
|
|
|
|
|
|
|
|
|
|
*lportp = lport;
|
|
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
|
}
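The do/while loop in in_pcb_lport_dest() above amounts to a wraparound search over [first, last]; a userspace sketch (in_use() stands in for the pcblookup calls) shows the cursor walk and the exhaustion check behind EADDRNOTAVAIL.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch of the ephemeral-port search; not kernel code. */
static int
pick_port(uint16_t first, uint16_t last, uint16_t *cursor,
    int (*in_use)(uint16_t), uint16_t *out)
{
	uint16_t aux;
	int count;

	if (first > last) {		/* normalize, as the kernel does */
		aux = first;
		first = last;
		last = aux;
	}
	count = last - first;
	do {
		if (count-- < 0)	/* completely used? */
			return (-1);
		++*cursor;
		if (*cursor < first || *cursor > last)
			*cursor = first;	/* wrap back into the range */
	} while (in_use(*cursor));
	*out = *cursor;
	return (0);
}

static int
busy_below_10003(uint16_t p)
{
	return (p < 10003);
}

static int
always_busy(uint16_t p)
{
	(void)p;
	return (1);
}

static int
demo(void)
{
	uint16_t cursor = 10001, port = 0;

	if (pick_port(10000, 10003, &cursor, busy_below_10003, &port) != 0 ||
	    port != 10003)
		return (1);
	cursor = 10001;
	if (pick_port(10000, 10003, &cursor, always_busy, &port) != -1)
		return (2);		/* every candidate rejected */
	return (0);
}
```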
|
2013-07-04 14:38:00 -04:00
|
|
|
|
2020-05-18 18:53:12 -04:00
|
|
|
/*
|
|
|
|
|
* Select a local port (number) to use.
|
|
|
|
|
*/
|
|
|
|
|
int
|
|
|
|
|
in_pcb_lport(struct inpcb *inp, struct in_addr *laddrp, u_short *lportp,
|
|
|
|
|
struct ucred *cred, int lookupflags)
|
|
|
|
|
{
|
|
|
|
|
struct sockaddr_in laddr;
|
|
|
|
|
|
|
|
|
|
if (laddrp) {
|
|
|
|
|
bzero(&laddr, sizeof(laddr));
|
|
|
|
|
laddr.sin_family = AF_INET;
|
|
|
|
|
laddr.sin_addr = *laddrp;
|
|
|
|
|
}
|
|
|
|
|
return (in_pcb_lport_dest(inp, laddrp ? (struct sockaddr *) &laddr :
|
|
|
|
|
NULL, lportp, NULL, 0, cred, lookupflags));
|
|
|
|
|
}
|
|
|
|
|
|
2013-07-04 14:38:00 -04:00
|
|
|
/*
|
|
|
|
|
* Return cached socket options.
|
|
|
|
|
*/
|
2018-06-06 11:45:57 -04:00
|
|
|
int
|
2013-07-04 14:38:00 -04:00
|
|
|
inp_so_options(const struct inpcb *inp)
|
|
|
|
|
{
|
2018-06-06 11:45:57 -04:00
|
|
|
int so_options;
|
2013-07-04 14:38:00 -04:00
|
|
|
|
2018-06-06 11:45:57 -04:00
|
|
|
so_options = 0;
|
2013-07-04 14:38:00 -04:00
|
|
|
|
2018-06-06 11:45:57 -04:00
|
|
|
if ((inp->inp_flags2 & INP_REUSEPORT_LB) != 0)
|
|
|
|
|
so_options |= SO_REUSEPORT_LB;
|
|
|
|
|
if ((inp->inp_flags2 & INP_REUSEPORT) != 0)
|
|
|
|
|
so_options |= SO_REUSEPORT;
|
|
|
|
|
if ((inp->inp_flags2 & INP_REUSEADDR) != 0)
|
|
|
|
|
so_options |= SO_REUSEADDR;
|
|
|
|
|
return (so_options);
|
2013-07-04 14:38:00 -04:00
|
|
|
}
|
2011-03-12 16:46:37 -05:00
|
|
|
#endif /* INET || INET6 */
|
|
|
|
|
|
2014-07-12 01:40:13 -04:00
|
|
|
#ifdef INET
|
2002-10-20 17:44:31 -04:00
|
|
|
/*
|
|
|
|
|
* Set up a bind operation on a PCB, performing port allocation
|
|
|
|
|
* as required, but do not actually modify the PCB. Callers can
|
|
|
|
|
* either complete the bind by setting inp_laddr/inp_lport and
|
|
|
|
|
* calling in_pcbinshash(), or they can just use the resulting
|
|
|
|
|
* port and address to authorise the sending of a once-off packet.
|
|
|
|
|
*
|
|
|
|
|
* On error, the values of *laddrp and *lportp are not changed.
|
|
|
|
|
*/
|
|
|
|
|
int
|
2023-02-15 13:30:16 -05:00
|
|
|
in_pcbbind_setup(struct inpcb *inp, struct sockaddr_in *sin, in_addr_t *laddrp,
|
2006-01-21 20:16:25 -05:00
|
|
|
u_short *lportp, struct ucred *cred)
|
2002-10-20 17:44:31 -04:00
|
|
|
{
|
|
|
|
|
struct socket *so = inp->inp_socket;
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 04:15:13 -05:00
|
|
|
struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
|
2002-10-20 17:44:31 -04:00
|
|
|
struct in_addr laddr;
|
1994-05-24 06:09:53 -04:00
|
|
|
u_short lport = 0;
|
2011-05-23 11:23:18 -04:00
|
|
|
int lookupflags = 0, reuseport = (so->so_options & SO_REUSEPORT);
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 09:32:14 -05:00
|
|
|
int error;
|
1994-05-24 06:09:53 -04:00
|
|
|
|
2018-06-06 11:45:57 -04:00
|
|
|
/*
|
|
|
|
|
* XXX: Maybe we could let SO_REUSEPORT_LB set SO_REUSEPORT bit here
|
|
|
|
|
* so that we don't have to add to the (already messy) code below.
|
|
|
|
|
*/
|
|
|
|
|
int reuseport_lb = (so->so_options & SO_REUSEPORT_LB);
|
|
|
|
|
|
2008-04-17 17:38:18 -04:00
|
|
|
/*
|
2011-05-30 05:43:55 -04:00
|
|
|
* No state changes, so read locks are sufficient here.
|
2008-04-17 17:38:18 -04:00
|
|
|
*/
|
2003-11-08 18:02:36 -05:00
|
|
|
INP_LOCK_ASSERT(inp);
|
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
2003-11-08 18:02:36 -05:00
|
|
|
|
2002-10-20 17:44:31 -04:00
|
|
|
laddr.s_addr = *laddrp;
|
2023-02-15 13:30:16 -05:00
|
|
|
if (sin != NULL && laddr.s_addr != INADDR_ANY)
|
1994-05-24 06:09:53 -04:00
|
|
|
return (EINVAL);
|
2018-06-06 11:45:57 -04:00
|
|
|
if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT|SO_REUSEPORT_LB)) == 0)
|
2011-05-23 11:23:18 -04:00
|
|
|
lookupflags = INPLOOKUP_WILDCARD;
|
2023-02-15 13:30:16 -05:00
|
|
|
if (sin == NULL) {
|
2009-02-05 09:25:53 -05:00
|
|
|
if ((error = prison_local_ip4(cred, &laddr)) != 0)
|
|
|
|
|
return (error);
|
|
|
|
|
} else {
|
2021-05-03 12:51:04 -04:00
|
|
|
KASSERT(sin->sin_family == AF_INET,
|
|
|
|
|
("%s: invalid family for address %p", __func__, sin));
|
|
|
|
|
KASSERT(sin->sin_len == sizeof(*sin),
|
|
|
|
|
("%s: invalid length for address %p", __func__, sin));
|
|
|
|
|
|
2009-02-05 09:06:09 -05:00
|
|
|
error = prison_local_ip4(cred, &sin->sin_addr);
|
|
|
|
|
if (error)
|
|
|
|
|
return (error);
|
2002-10-20 17:44:31 -04:00
|
|
|
if (sin->sin_port != *lportp) {
|
|
|
|
|
/* Don't allow the port to change. */
|
|
|
|
|
if (*lportp != 0)
|
|
|
|
|
return (EINVAL);
|
|
|
|
|
lport = sin->sin_port;
|
|
|
|
|
}
|
|
|
|
|
/* NB: lport is left as 0 if the port isn't being changed. */
|
1994-05-24 06:09:53 -04:00
|
|
|
if (IN_MULTICAST(ntohl(sin->sin_addr.s_addr))) {
|
|
|
|
|
/*
|
|
|
|
|
* Treat SO_REUSEADDR as SO_REUSEPORT for multicast;
|
|
|
|
|
* allow complete duplication of binding if
|
|
|
|
|
* SO_REUSEPORT is set, or if SO_REUSEADDR is set
|
|
|
|
|
* and a multicast address is bound on both
|
|
|
|
|
* new and duplicated sockets.
|
|
|
|
|
*/
|
2013-07-12 15:08:33 -04:00
|
|
|
if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT)) != 0)
|
1994-05-24 06:09:53 -04:00
|
|
|
reuseport = SO_REUSEADDR|SO_REUSEPORT;
|
2018-06-06 11:45:57 -04:00
|
|
|
/*
|
|
|
|
|
* XXX: How to deal with SO_REUSEPORT_LB here?
|
|
|
|
|
* Treat same as SO_REUSEPORT for now.
|
|
|
|
|
*/
|
|
|
|
|
if ((so->so_options &
|
|
|
|
|
(SO_REUSEADDR|SO_REUSEPORT_LB)) != 0)
|
|
|
|
|
reuseport_lb = SO_REUSEADDR|SO_REUSEPORT_LB;
|
1994-05-24 06:09:53 -04:00
|
|
|
} else if (sin->sin_addr.s_addr != INADDR_ANY) {
|
|
|
|
|
sin->sin_port = 0; /* yech... */
|
2001-11-05 19:48:01 -05:00
|
|
|
bzero(&sin->sin_zero, sizeof(sin->sin_zero));
|
2009-01-09 12:16:18 -05:00
|
|
|
/*
|
2018-06-06 11:45:57 -04:00
|
|
|
* Is the address a local IP address?
|
2009-06-01 06:30:00 -04:00
|
|
|
* If INP_BINDANY is set, then the socket may be bound
|
2009-01-09 13:38:57 -05:00
|
|
|
* to any endpoint address, local or not.
|
2009-01-09 12:16:18 -05:00
|
|
|
*/
|
2009-06-01 06:30:00 -04:00
|
|
|
if ((inp->inp_flags & INP_BINDANY) == 0 &&
|
2018-06-06 11:45:57 -04:00
|
|
|
ifa_ifwithaddr_check((struct sockaddr *)sin) == 0)
|
1994-05-24 06:09:53 -04:00
|
|
|
return (EADDRNOTAVAIL);
|
|
|
|
|
}
|
2002-10-20 17:44:31 -04:00
|
|
|
laddr = sin->sin_addr;
|
1994-05-24 06:09:53 -04:00
|
|
|
if (lport) {
|
|
|
|
|
struct inpcb *t;
|
2006-04-04 08:26:07 -04:00
|
|
|
|
1994-05-24 06:09:53 -04:00
|
|
|
/* GROSS */
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 19:27:27 -04:00
|
|
|
if (ntohs(lport) <= V_ipport_reservedhigh &&
|
|
|
|
|
ntohs(lport) >= V_ipport_reservedlow &&
|
2018-12-11 14:32:16 -05:00
|
|
|
priv_check_cred(cred, PRIV_NETINET_RESERVEDPORT))
|
1995-09-21 13:55:49 -04:00
|
|
|
return (EACCES);
|
2006-06-27 07:35:53 -04:00
|
|
|
if (!IN_MULTICAST(ntohl(sin->sin_addr.s_addr)) &&
|
2018-12-11 14:32:16 -05:00
|
|
|
priv_check_cred(inp->inp_cred, PRIV_NETINET_REUSEPORT) != 0) {
|
2008-07-10 09:31:11 -04:00
|
|
|
t = in_pcblookup_local(pcbinfo, sin->sin_addr,
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking, etc.
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 09:32:14 -05:00
|
|
|
lport, INPLOOKUP_WILDCARD, cred);
|
2003-02-19 17:32:43 -05:00
|
|
|
/*
|
|
|
|
|
* XXX
|
|
|
|
|
* This entire block sorely needs a rewrite.
|
|
|
|
|
*/
|
2023-02-27 09:52:28 -05:00
|
|
|
if (t != NULL &&
|
2004-05-20 02:35:02 -04:00
|
|
|
(so->so_type != SOCK_STREAM ||
|
|
|
|
|
ntohl(t->inp_faddr.s_addr) == INADDR_ANY) &&
|
2002-05-31 07:52:35 -04:00
|
|
|
(ntohl(sin->sin_addr.s_addr) != INADDR_ANY ||
|
|
|
|
|
ntohl(t->inp_laddr.s_addr) != INADDR_ANY ||
|
2018-06-06 11:45:57 -04:00
|
|
|
(t->inp_flags2 & INP_REUSEPORT) ||
|
|
|
|
|
(t->inp_flags2 & INP_REUSEPORT_LB) == 0) &&
|
2008-10-04 11:06:34 -04:00
|
|
|
(inp->inp_cred->cr_uid !=
|
|
|
|
|
t->inp_cred->cr_uid))
|
2002-05-31 07:52:35 -04:00
|
|
|
return (EADDRINUSE);
|
1998-03-01 14:39:29 -05:00
|
|
|
}
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 04:15:13 -05:00
|
|
|
t = in_pcblookup_local(pcbinfo, sin->sin_addr,
|
2011-05-23 11:23:18 -04:00
|
|
|
lport, lookupflags, cred);
|
2023-02-27 09:52:28 -05:00
|
|
|
if (t != NULL && (reuseport & inp_so_options(t)) == 0 &&
|
2022-10-06 22:22:23 -04:00
|
|
|
(reuseport_lb & inp_so_options(t)) == 0) {
|
2006-11-30 05:54:54 -05:00
|
|
|
#ifdef INET6
|
2002-05-31 07:52:35 -04:00
|
|
|
if (ntohl(sin->sin_addr.s_addr) !=
|
|
|
|
|
INADDR_ANY ||
|
|
|
|
|
ntohl(t->inp_laddr.s_addr) !=
|
|
|
|
|
INADDR_ANY ||
|
2011-11-06 05:47:20 -05:00
|
|
|
(inp->inp_vflag & INP_IPV6PROTO) == 0 ||
|
|
|
|
|
(t->inp_vflag & INP_IPV6PROTO) == 0)
|
2006-11-30 05:54:54 -05:00
|
|
|
#endif
|
2018-06-06 11:45:57 -04:00
|
|
|
return (EADDRINUSE);
|
1999-12-07 12:39:16 -05:00
|
|
|
}
|
1994-05-24 06:09:53 -04:00
|
|
|
}
|
|
|
|
|
}
|
2002-10-20 17:44:31 -04:00
|
|
|
if (*lportp != 0)
|
|
|
|
|
lport = *lportp;
|
1996-02-22 16:32:23 -05:00
|
|
|
if (lport == 0) {
|
2017-02-10 00:58:16 -05:00
|
|
|
error = in_pcb_lport(inp, &laddr, &lport, cred, lookupflags);
|
2011-03-12 16:46:37 -05:00
|
|
|
if (error != 0)
|
|
|
|
|
return (error);
|
1996-02-22 16:32:23 -05:00
|
|
|
}
|
2002-10-20 17:44:31 -04:00
|
|
|
*laddrp = laddr.s_addr;
|
|
|
|
|
*lportp = lport;
|
1994-05-24 06:09:53 -04:00
|
|
|
return (0);
|
|
|
|
|
}
|
|
|
|
|
|
1995-02-08 15:22:09 -05:00
|
|
|
/*
|
2002-10-21 09:55:50 -04:00
|
|
|
* Connect from a socket to a specified address.
|
|
|
|
|
* Both address and port must be specified in argument sin.
|
|
|
|
|
* If don't have a local address for this socket yet,
|
|
|
|
|
* then pick one.
|
1995-02-08 15:22:09 -05:00
|
|
|
*/
|
2002-10-21 09:55:50 -04:00
|
|
|
int
|
2023-02-03 14:33:36 -05:00
|
|
|
in_pcbconnect(struct inpcb *inp, struct sockaddr_in *sin, struct ucred *cred,
|
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worst case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp iterators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
2023-04-20 11:48:01 -04:00
|
|
|
bool rehash __unused)
|
2002-10-21 09:55:50 -04:00
|
|
|
{
|
|
|
|
|
u_short lport, fport;
|
|
|
|
|
in_addr_t laddr, faddr;
|
|
|
|
|
int anonport, error;
|
|
|
|
|
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
|
2023-04-20 11:48:01 -04:00
|
|
|
KASSERT(in_nullhost(inp->inp_faddr),
|
|
|
|
|
("%s: inp is already connected", __func__));
|
2004-08-11 00:35:20 -04:00
|
|
|
|
2002-10-21 09:55:50 -04:00
|
|
|
lport = inp->inp_lport;
|
|
|
|
|
laddr = inp->inp_laddr.s_addr;
|
|
|
|
|
anonport = (lport == 0);
|
2023-02-03 14:33:36 -05:00
|
|
|
error = in_pcbconnect_setup(inp, sin, &laddr, &lport, &faddr, &fport,
|
2023-02-03 14:33:36 -05:00
|
|
|
cred);
|
2002-10-21 09:55:50 -04:00
|
|
|
if (error)
|
|
|
|
|
return (error);
|
|
|
|
|
|
2023-04-20 11:48:01 -04:00
|
|
|
inp->inp_faddr.s_addr = faddr;
|
|
|
|
|
inp->inp_fport = fport;
|
|
|
|
|
|
2002-10-21 09:55:50 -04:00
|
|
|
/* Do the initial binding of the local address if required. */
|
|
|
|
|
if (inp->inp_laddr.s_addr == INADDR_ANY && inp->inp_lport == 0) {
|
|
|
|
|
inp->inp_lport = lport;
|
|
|
|
|
inp->inp_laddr.s_addr = laddr;
|
|
|
|
|
if (in_pcbinshash(inp) != 0) {
|
2023-04-20 11:48:01 -04:00
|
|
|
inp->inp_laddr.s_addr = inp->inp_faddr.s_addr =
|
|
|
|
|
INADDR_ANY;
|
|
|
|
|
inp->inp_lport = inp->inp_fport = 0;
|
2002-10-21 09:55:50 -04:00
|
|
|
return (EAGAIN);
|
|
|
|
|
}
|
2020-01-12 12:52:32 -05:00
|
|
|
} else {
|
2023-04-20 11:48:01 -04:00
|
|
|
inp->inp_lport = lport;
|
|
|
|
|
inp->inp_laddr.s_addr = laddr;
|
|
|
|
|
if ((inp->inp_flags & INP_INHASHLIST) != 0)
|
|
|
|
|
in_pcbrehash(inp);
|
|
|
|
|
else
|
|
|
|
|
in_pcbinshash(inp);
|
2020-01-12 12:52:32 -05:00
|
|
|
}
|
2007-07-01 07:41:27 -04:00
|
|
|
|
2002-10-21 09:55:50 -04:00
|
|
|
if (anonport)
|
|
|
|
|
inp->inp_flags |= INP_ANONPORT;
|
|
|
|
|
return (0);
|
|
|
|
|
}
|
1995-02-08 15:22:09 -05:00
|
|
|
|
2008-10-03 08:21:21 -04:00
|
|
|
/*
|
|
|
|
|
* Do proper source address selection on an unbound socket in case
|
|
|
|
|
* of connect. Take jails into account as well.
|
|
|
|
|
*/
|
2014-04-24 08:52:31 -04:00
|
|
|
int
|
2008-10-03 08:21:21 -04:00
|
|
|
in_pcbladdr(struct inpcb *inp, struct in_addr *faddr, struct in_addr *laddr,
|
|
|
|
|
struct ucred *cred)
|
|
|
|
|
{
|
|
|
|
|
struct ifaddr *ifa;
|
|
|
|
|
struct sockaddr *sa;
|
2020-04-14 19:06:25 -04:00
|
|
|
struct sockaddr_in *sin, dst;
|
|
|
|
|
struct nhop_object *nh;
|
2008-10-03 08:21:21 -04:00
|
|
|
int error;
|
|
|
|
|
|
2020-01-22 01:10:41 -05:00
|
|
|
NET_EPOCH_ASSERT();
|
2008-11-29 09:32:14 -05:00
|
|
|
KASSERT(laddr != NULL, ("%s: laddr NULL", __func__));
|
2022-11-02 13:03:41 -04:00
|
|
|
|
2010-01-17 07:57:11 -05:00
|
|
|
/*
|
|
|
|
|
* Bypass source address selection and use the primary jail IP
|
|
|
|
|
* if requested.
|
|
|
|
|
*/
|
2022-11-02 13:03:41 -04:00
|
|
|
if (!prison_saddrsel_ip4(cred, laddr))
|
2010-01-17 07:57:11 -05:00
|
|
|
return (0);
|
|
|
|
|
|
2008-10-03 08:21:21 -04:00
|
|
|
error = 0;
|
|
|
|
|
|
2020-04-14 19:06:25 -04:00
|
|
|
nh = NULL;
|
|
|
|
|
bzero(&dst, sizeof(dst));
|
|
|
|
|
sin = &dst;
|
2008-10-03 08:21:21 -04:00
|
|
|
sin->sin_family = AF_INET;
|
|
|
|
|
sin->sin_len = sizeof(struct sockaddr_in);
|
|
|
|
|
sin->sin_addr.s_addr = faddr->s_addr;
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* If route is known our src addr is taken from the i/f,
|
|
|
|
|
* else punt.
|
|
|
|
|
*
|
|
|
|
|
* Find out route to destination.
|
|
|
|
|
*/
|
|
|
|
|
if ((inp->inp_socket->so_options & SO_DONTROUTE) == 0)
|
2020-04-14 19:06:25 -04:00
|
|
|
nh = fib4_lookup(inp->inp_inc.inc_fibnum, *faddr,
|
|
|
|
|
0, NHR_NONE, 0);
|
2008-10-03 08:21:21 -04:00
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* If we found a route, use the address corresponding to
|
|
|
|
|
* the outgoing interface.
|
2020-02-12 08:31:36 -05:00
|
|
|
*
|
2008-10-03 08:21:21 -04:00
|
|
|
* Otherwise assume faddr is reachable on a directly connected
|
|
|
|
|
* network and try to find a corresponding interface to take
|
|
|
|
|
* the source address from.
|
|
|
|
|
*/
|
2020-04-14 19:06:25 -04:00
|
|
|
if (nh == NULL || nh->nh_ifp == NULL) {
|
2009-06-23 16:19:09 -04:00
|
|
|
struct in_ifaddr *ia;
|
2008-10-03 08:21:21 -04:00
|
|
|
struct ifnet *ifp;
|
|
|
|
|
|
2014-09-11 16:21:03 -04:00
|
|
|
ia = ifatoia(ifa_ifwithdstaddr((struct sockaddr *)sin,
|
2014-09-16 11:28:19 -04:00
|
|
|
inp->inp_socket->so_fibnum));
|
2018-05-23 17:02:14 -04:00
|
|
|
if (ia == NULL) {
|
2014-09-11 16:21:03 -04:00
|
|
|
ia = ifatoia(ifa_ifwithnet((struct sockaddr *)sin, 0,
|
2014-09-16 11:28:19 -04:00
|
|
|
inp->inp_socket->so_fibnum));
|
2018-05-23 17:02:14 -04:00
|
|
|
}
|
2008-10-03 08:21:21 -04:00
|
|
|
if (ia == NULL) {
|
|
|
|
|
error = ENETUNREACH;
|
|
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
|
2022-11-02 13:03:41 -04:00
|
|
|
if (!prison_flag(cred, PR_IP4)) {
|
2008-10-03 08:21:21 -04:00
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
ifp = ia->ia_ifp;
|
|
|
|
|
ia = NULL;
|
2018-05-18 16:13:34 -04:00
|
|
|
CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) {
|
2008-10-03 08:21:21 -04:00
|
|
|
sa = ifa->ifa_addr;
|
|
|
|
|
if (sa->sa_family != AF_INET)
|
|
|
|
|
continue;
|
|
|
|
|
sin = (struct sockaddr_in *)sa;
|
2009-02-05 09:06:09 -05:00
|
|
|
if (prison_check_ip4(cred, &sin->sin_addr) == 0) {
|
2008-10-03 08:21:21 -04:00
|
|
|
ia = (struct in_ifaddr *)ifa;
|
|
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
if (ia != NULL) {
|
|
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* 3. As a last resort return the 'default' jail address. */
|
2009-02-05 09:06:09 -05:00
|
|
|
error = prison_get_ip4(cred, laddr);
|
2008-10-03 08:21:21 -04:00
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* If the outgoing interface on the route found is not
|
|
|
|
|
* a loopback interface, use the address from that interface.
|
|
|
|
|
* In case of jails, do these three steps:
|
|
|
|
|
* 1. check if the interface address belongs to the jail. If so use it.
|
|
|
|
|
* 2. check if we have any address on the outgoing interface
|
|
|
|
|
* belonging to this jail. If so use it.
|
|
|
|
|
* 3. as a last resort return the 'default' jail address.
|
|
|
|
|
*/
|
2020-04-14 19:06:25 -04:00
|
|
|
if ((nh->nh_ifp->if_flags & IFF_LOOPBACK) == 0) {
|
2009-06-23 16:19:09 -04:00
|
|
|
struct in_ifaddr *ia;
|
2009-04-19 18:25:09 -04:00
|
|
|
struct ifnet *ifp;
|
2008-10-03 08:21:21 -04:00
|
|
|
|
|
|
|
|
/* If not jailed, use the default returned. */
|
2022-11-02 13:03:41 -04:00
|
|
|
if (!prison_flag(cred, PR_IP4)) {
|
2020-04-14 19:06:25 -04:00
|
|
|
ia = (struct in_ifaddr *)nh->nh_ifa;
|
2008-10-03 08:21:21 -04:00
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* Jailed. */
|
|
|
|
|
/* 1. Check if the iface address belongs to the jail. */
|
2020-04-14 19:06:25 -04:00
|
|
|
sin = (struct sockaddr_in *)nh->nh_ifa->ifa_addr;
|
2009-02-05 09:06:09 -05:00
|
|
|
if (prison_check_ip4(cred, &sin->sin_addr) == 0) {
|
2020-04-14 19:06:25 -04:00
|
|
|
ia = (struct in_ifaddr *)nh->nh_ifa;
|
2008-10-03 08:21:21 -04:00
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* 2. Check if we have any address on the outgoing interface
|
|
|
|
|
* belonging to this jail.
|
|
|
|
|
*/
|
2009-06-23 16:19:09 -04:00
|
|
|
ia = NULL;
|
2020-04-14 19:06:25 -04:00
|
|
|
ifp = nh->nh_ifp;
|
2018-05-18 16:13:34 -04:00
|
|
|
CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) {
|
2008-10-03 08:21:21 -04:00
|
|
|
sa = ifa->ifa_addr;
|
|
|
|
|
if (sa->sa_family != AF_INET)
|
|
|
|
|
continue;
|
|
|
|
|
sin = (struct sockaddr_in *)sa;
|
2009-02-05 09:06:09 -05:00
|
|
|
if (prison_check_ip4(cred, &sin->sin_addr) == 0) {
|
2008-10-03 08:21:21 -04:00
|
|
|
ia = (struct in_ifaddr *)ifa;
|
|
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
if (ia != NULL) {
|
|
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* 3. As a last resort return the 'default' jail address. */
|
2009-02-05 09:06:09 -05:00
|
|
|
error = prison_get_ip4(cred, laddr);
|
2008-10-03 08:21:21 -04:00
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* The outgoing interface is marked with 'loopback net', so a route
|
|
|
|
|
* to ourselves was found.
|
|
|
|
|
* Try to find the interface of the destination address and then
|
|
|
|
|
* take the address from there. That interface is not necessarily
|
|
|
|
|
* a loopback interface.
|
|
|
|
|
* In case of jails, check that it is an address of the jail
|
|
|
|
|
* and if we cannot find one, fall back to the 'default' jail address.
|
|
|
|
|
*/
|
2020-04-14 19:06:25 -04:00
|
|
|
if ((nh->nh_ifp->if_flags & IFF_LOOPBACK) != 0) {
|
2009-06-23 16:19:09 -04:00
|
|
|
struct in_ifaddr *ia;
|
2008-10-03 08:21:21 -04:00
|
|
|
|
2020-04-14 19:06:25 -04:00
|
|
|
ia = ifatoia(ifa_ifwithdstaddr(sintosa(&dst),
|
2014-09-16 11:28:19 -04:00
|
|
|
inp->inp_socket->so_fibnum));
|
2008-10-03 08:21:21 -04:00
|
|
|
if (ia == NULL)
|
2020-04-14 19:06:25 -04:00
|
|
|
ia = ifatoia(ifa_ifwithnet(sintosa(&dst), 0,
|
2014-09-16 11:28:19 -04:00
|
|
|
inp->inp_socket->so_fibnum));
|
2009-09-14 18:19:47 -04:00
|
|
|
if (ia == NULL)
|
2020-04-14 19:06:25 -04:00
|
|
|
ia = ifatoia(ifa_ifwithaddr(sintosa(&dst)));
|
2008-10-03 08:21:21 -04:00
|
|
|
|
2022-11-02 13:03:41 -04:00
|
|
|
if (!prison_flag(cred, PR_IP4)) {
|
2008-10-03 08:21:21 -04:00
|
|
|
if (ia == NULL) {
|
|
|
|
|
error = ENETUNREACH;
|
|
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* Jailed. */
|
|
|
|
|
if (ia != NULL) {
|
|
|
|
|
struct ifnet *ifp;
|
|
|
|
|
|
|
|
|
|
ifp = ia->ia_ifp;
|
|
|
|
|
ia = NULL;
|
2018-05-18 16:13:34 -04:00
|
|
|
CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) {
|
2008-10-03 08:21:21 -04:00
|
|
|
sa = ifa->ifa_addr;
|
|
|
|
|
if (sa->sa_family != AF_INET)
|
|
|
|
|
continue;
|
|
|
|
|
sin = (struct sockaddr_in *)sa;
|
2009-02-05 09:06:09 -05:00
|
|
|
if (prison_check_ip4(cred,
|
|
|
|
|
&sin->sin_addr) == 0) {
|
2008-10-03 08:21:21 -04:00
|
|
|
ia = (struct in_ifaddr *)ifa;
|
|
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
if (ia != NULL) {
|
|
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* 3. As a last resort return the 'default' jail address. */
|
2009-02-05 09:06:09 -05:00
|
|
|
error = prison_get_ip4(cred, laddr);
|
2008-10-03 08:21:21 -04:00
|
|
|
goto done;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
done:
|
2023-03-06 15:06:00 -05:00
|
|
|
if (error == 0 && laddr->s_addr == INADDR_ANY)
|
|
|
|
|
return (EHOSTUNREACH);
|
2008-10-03 08:21:21 -04:00
|
|
|
return (error);
|
|
|
|
|
}
|
|
|
|
|
|
2002-10-21 09:55:50 -04:00
|
|
|
/*
|
|
|
|
|
* Set up for a connect from a socket to the specified address.
|
|
|
|
|
* On entry, *laddrp and *lportp should contain the current local
|
|
|
|
|
* address and port for the PCB; these are updated to the values
|
|
|
|
|
* that should be placed in inp_laddr and inp_lport to complete
|
|
|
|
|
* the connect.
|
|
|
|
|
*
|
|
|
|
|
* On success, *faddrp and *fportp will be set to the remote address
|
|
|
|
|
* and port. These are not updated in the error case.
|
|
|
|
|
*/
|
1995-02-08 15:22:09 -05:00
|
|
|
int
|
2023-02-03 14:33:36 -05:00
|
|
|
in_pcbconnect_setup(struct inpcb *inp, struct sockaddr_in *sin,
|
2006-01-21 20:16:25 -05:00
|
|
|
in_addr_t *laddrp, u_short *lportp, in_addr_t *faddrp, u_short *fportp,
|
2023-02-03 14:33:36 -05:00
|
|
|
struct ucred *cred)
|
1995-02-08 15:22:09 -05:00
|
|
|
{
|
1994-05-24 06:09:53 -04:00
|
|
|
struct in_ifaddr *ia;
|
2009-02-05 09:06:09 -05:00
|
|
|
struct in_addr laddr, faddr;
|
2002-10-21 09:55:50 -04:00
|
|
|
u_short lport, fport;
|
|
|
|
|
int error;
|
1994-05-24 06:09:53 -04:00
|
|
|
|
2021-05-03 12:51:04 -04:00
|
|
|
KASSERT(sin->sin_family == AF_INET,
|
|
|
|
|
("%s: invalid address family for %p", __func__, sin));
|
|
|
|
|
KASSERT(sin->sin_len == sizeof(*sin),
|
|
|
|
|
("%s: invalid address length for %p", __func__, sin));
|
|
|
|
|
|
2008-04-17 17:38:18 -04:00
|
|
|
/*
|
|
|
|
|
* Because a global state change doesn't actually occur here, a read
|
|
|
|
|
* lock is sufficient.
|
|
|
|
|
*/
|
2020-01-22 01:10:41 -05:00
|
|
|
NET_EPOCH_ASSERT();
|
2004-08-11 00:35:20 -04:00
|
|
|
INP_LOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_LOCK_ASSERT(inp->inp_pcbinfo);
|
2004-08-11 00:35:20 -04:00
|
|
|
|
1994-05-24 06:09:53 -04:00
|
|
|
if (sin->sin_port == 0)
|
|
|
|
|
return (EADDRNOTAVAIL);
|
2002-10-21 09:55:50 -04:00
|
|
|
laddr.s_addr = *laddrp;
|
|
|
|
|
lport = *lportp;
|
|
|
|
|
faddr = sin->sin_addr;
|
|
|
|
|
fport = sin->sin_port;
|
2020-10-18 13:15:47 -04:00
|
|
|
#ifdef ROUTE_MPATH
|
|
|
|
|
if (CALC_FLOWID_OUTBOUND) {
|
|
|
|
|
uint32_t hash_val, hash_type;
|
2008-10-03 08:21:21 -04:00
|
|
|
|
2020-10-18 13:15:47 -04:00
|
|
|
hash_val = fib4_calc_software_hash(laddr, faddr, 0, fport,
|
|
|
|
|
inp->inp_socket->so_proto->pr_protocol, &hash_type);
|
|
|
|
|
|
|
|
|
|
inp->inp_flowid = hash_val;
|
|
|
|
|
inp->inp_flowtype = hash_type;
|
|
|
|
|
}
|
|
|
|
|
#endif
|
2018-05-18 16:13:34 -04:00
|
|
|
if (!CK_STAILQ_EMPTY(&V_in_ifaddrhead)) {
|
1994-05-24 06:09:53 -04:00
|
|
|
/*
|
|
|
|
|
* If the destination address is INADDR_ANY,
|
|
|
|
|
* use the primary local address.
|
|
|
|
|
* If the supplied address is INADDR_BROADCAST,
|
|
|
|
|
* and the primary interface supports broadcast,
|
|
|
|
|
* choose the broadcast address for that interface.
|
|
|
|
|
*/
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
2008-11-29 09:32:14 -05:00
|
|
|
if (faddr.s_addr == INADDR_ANY) {
|
2009-02-05 09:06:09 -05:00
|
|
|
faddr =
|
2018-05-18 16:13:34 -04:00
|
|
|
IA_SIN(CK_STAILQ_FIRST(&V_in_ifaddrhead))->sin_addr;
|
2022-11-02 13:03:41 -04:00
|
|
|
if ((error = prison_get_ip4(cred, &faddr)) != 0)
|
2009-02-05 09:06:09 -05:00
|
|
|
return (error);
|
2009-06-25 07:52:33 -04:00
|
|
|
} else if (faddr.s_addr == (u_long)INADDR_BROADCAST) {
|
2018-05-18 16:13:34 -04:00
|
|
|
if (CK_STAILQ_FIRST(&V_in_ifaddrhead)->ia_ifp->if_flags &
|
2009-06-25 07:52:33 -04:00
|
|
|
IFF_BROADCAST)
|
2018-05-18 16:13:34 -04:00
|
|
|
faddr = satosin(&CK_STAILQ_FIRST(
|
2009-06-25 07:52:33 -04:00
|
|
|
&V_in_ifaddrhead)->ia_broadaddr)->sin_addr;
|
|
|
|
|
}
|
1994-05-24 06:09:53 -04:00
|
|
|
}
|
2002-10-21 09:55:50 -04:00
|
|
|
if (laddr.s_addr == INADDR_ANY) {
|
2011-01-08 17:33:46 -05:00
|
|
|
error = in_pcbladdr(inp, &faddr, &laddr, cred);
|
1994-05-24 06:09:53 -04:00
|
|
|
/*
|
|
|
|
|
* If the destination address is multicast and an outgoing
|
2011-01-08 17:33:46 -05:00
|
|
|
* interface has been set as a multicast option, prefer the
|
1994-05-24 06:09:53 -04:00
|
|
|
* address of that interface as our source address.
|
|
|
|
|
*/
|
2002-10-21 09:55:50 -04:00
|
|
|
if (IN_MULTICAST(ntohl(faddr.s_addr)) &&
|
1994-05-24 06:09:53 -04:00
|
|
|
inp->inp_moptions != NULL) {
|
|
|
|
|
struct ip_moptions *imo;
|
|
|
|
|
struct ifnet *ifp;
|
|
|
|
|
|
|
|
|
|
imo = inp->inp_moptions;
|
|
|
|
|
if (imo->imo_multicast_ifp != NULL) {
|
|
|
|
|
ifp = imo->imo_multicast_ifp;
|
2018-05-18 16:13:34 -04:00
|
|
|
CK_STAILQ_FOREACH(ia, &V_in_ifaddrhead, ia_link) {
|
2022-11-02 13:03:41 -04:00
|
|
|
if (ia->ia_ifp == ifp &&
|
2011-01-26 12:31:03 -05:00
|
|
|
prison_check_ip4(cred,
|
2022-11-02 13:03:41 -04:00
|
|
|
&ia->ia_addr.sin_addr) == 0)
|
1994-05-24 06:09:53 -04:00
|
|
|
break;
|
2011-01-26 12:31:03 -05:00
|
|
|
}
|
|
|
|
|
if (ia == NULL)
|
2011-01-08 17:33:46 -05:00
|
|
|
error = EADDRNOTAVAIL;
|
2011-01-26 12:31:03 -05:00
|
|
|
else {
|
2011-01-08 17:33:46 -05:00
|
|
|
laddr = ia->ia_addr.sin_addr;
|
|
|
|
|
error = 0;
|
2009-06-25 07:52:33 -04:00
|
|
|
}
|
1994-05-24 06:09:53 -04:00
|
|
|
}
|
|
|
|
|
}
|
2011-01-08 17:33:46 -05:00
|
|
|
if (error)
|
|
|
|
|
return (error);
|
1995-02-08 15:22:09 -05:00
|
|
|
}
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 17:04:46 -05:00
|
|
|
|
2020-05-18 18:53:12 -04:00
|
|
|
if (lport != 0) {
|
2023-02-03 14:33:36 -05:00
|
|
|
if (in_pcblookup_hash_locked(inp->inp_pcbinfo, faddr,
|
2023-02-09 15:59:27 -05:00
|
|
|
fport, laddr, lport, 0, M_NODOM) != NULL)
|
2020-05-18 18:53:12 -04:00
|
|
|
return (EADDRINUSE);
|
|
|
|
|
} else {
|
|
|
|
|
struct sockaddr_in lsin, fsin;
|
|
|
|
|
|
|
|
|
|
bzero(&lsin, sizeof(lsin));
|
|
|
|
|
bzero(&fsin, sizeof(fsin));
|
|
|
|
|
lsin.sin_family = AF_INET;
|
|
|
|
|
lsin.sin_addr = laddr;
|
|
|
|
|
fsin.sin_family = AF_INET;
|
|
|
|
|
fsin.sin_addr = faddr;
|
|
|
|
|
error = in_pcb_lport_dest(inp, (struct sockaddr *) &lsin,
|
|
|
|
|
&lport, (struct sockaddr *)&fsin, fport, cred,
|
|
|
|
|
INPLOOKUP_WILDCARD);
|
2002-10-21 09:55:50 -04:00
|
|
|
if (error)
|
|
|
|
|
return (error);
|
1994-05-24 06:09:53 -04:00
|
|
|
}
|
2002-10-21 09:55:50 -04:00
|
|
|
*laddrp = laddr.s_addr;
|
|
|
|
|
*lportp = lport;
|
|
|
|
|
*faddrp = faddr.s_addr;
|
|
|
|
|
*fportp = fport;
|
1994-05-24 06:09:53 -04:00
|
|
|
return (0);
|
|
|
|
|
}
|
|
|
|
|
|
1994-05-25 05:21:21 -04:00
|
|
|
void
|
2006-01-21 20:16:25 -05:00
|
|
|
in_pcbdisconnect(struct inpcb *inp)
|
1994-05-24 06:09:53 -04:00
|
|
|
{
|
2005-06-01 07:39:42 -04:00
|
|
|
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
|
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worst case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp iterators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
2023-04-20 11:48:01 -04:00
|
|
|
KASSERT(inp->inp_smr == SMR_SEQ_INVALID,
|
|
|
|
|
("%s: inp %p was already disconnected", __func__, inp));
|
|
|
|
|
|
|
|
|
|
in_pcbremhash_locked(inp);
|
1994-05-24 06:09:53 -04:00
|
|
|
|
2023-04-20 11:48:01 -04:00
|
|
|
/* See the comment in in_pcbinshash(). */
|
|
|
|
|
inp->inp_smr = smr_advance(inp->inp_pcbinfo->ipi_smr);
|
2023-02-03 10:57:37 -05:00
|
|
|
inp->inp_laddr.s_addr = INADDR_ANY;
|
1994-05-24 06:09:53 -04:00
|
|
|
inp->inp_faddr.s_addr = INADDR_ANY;
|
|
|
|
|
inp->inp_fport = 0;
|
|
|
|
|
}
|
2012-01-21 21:13:19 -05:00
|
|
|
#endif /* INET */
|
1994-05-24 06:09:53 -04:00
|
|
|
|
2006-04-01 11:04:42 -05:00
|
|
|
/*
|
Add a reference count to struct inpcb, which may be explicitly
incremented using in_pcbref(), and decremented using in_pcbfree()
or in_pcbrele(). Protocols using only the current in_pcballoc() and
in_pcbfree() calls will see the same semantics, but it is now
possible for TCP to call in_pcbref() and in_pcbrele() to prevent
an inpcb from being freed when both tcbinfo and per-inpcb locks
are released. This makes it possible to safely transition from
holding only the inpcb lock to both tcbinfo and inpcb lock
without re-looking up a connection in the input path, timer
path, etc.
Notice that in_pcbrele() does not unlock the connection after
decrementing the refcount, if the connection remains, so that
the caller can continue to use it; in_pcbrele() returns a flag
indicating whether or not the inpcb pointer is still valid, and
in_pcbfree() is now a simple wrapper around in_pcbrele().
MFC after: 1 month
Discussed with: bz, kmacy
Reviewed by: bz, gnn, kmacy
Tested by: kmacy
2008-12-08 15:18:50 -05:00
|
|
|
* in_pcbdetach() is responsible for disassociating a socket from an inpcb.
|
2008-09-29 09:50:17 -04:00
|
|
|
* For most protocols, this will be invoked immediately prior to calling
|
2008-12-08 15:18:50 -05:00
|
|
|
* in_pcbfree(). However, with TCP the inpcb may significantly outlive the
|
|
|
|
|
* socket, in which case in_pcbfree() is deferred.
|
2006-04-01 11:04:42 -05:00
|
|
|
*/
|
1994-05-25 05:21:21 -04:00
|
|
|
void
|
2006-01-21 20:16:25 -05:00
|
|
|
in_pcbdetach(struct inpcb *inp)
|
1994-05-24 06:09:53 -04:00
|
|
|
{
|
2006-04-01 11:04:42 -05:00
|
|
|
|
2008-11-26 07:54:31 -05:00
|
|
|
KASSERT(inp->inp_socket != NULL, ("%s: inp_socket == NULL", __func__));
|
2008-09-29 09:50:17 -04:00
|
|
|
|
Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon the next ip_output() a new "snd_tag" allocation will be attempted.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
2017-01-18 08:31:17 -05:00
|
|
|
#ifdef RATELIMIT
|
|
|
|
|
if (inp->inp_snd_tag != NULL)
|
|
|
|
|
in_pcbdetach_txrtlmt(inp);
|
|
|
|
|
#endif
|
2006-04-01 11:04:42 -05:00
|
|
|
inp->inp_socket->so_pcb = NULL;
|
|
|
|
|
inp->inp_socket = NULL;
|
|
|
|
|
}
|
|
|
|
|
|
2008-12-08 15:18:50 -05:00
|
|
|
/*
|
2021-12-02 17:45:04 -05:00
|
|
|
* inpcb hash lookups are protected by an SMR section.
|
2008-12-08 15:18:50 -05:00
|
|
|
*
|
2021-12-02 17:45:04 -05:00
|
|
|
* Once the desired pcb has been found, switching from the SMR section to a pcb
|
|
|
|
|
* lock is performed with inp_smr_lock(). We cannot use INP_(W|R)LOCK
|
|
|
|
|
* here because SMR is a critical section.
|
|
|
|
|
* In 99%+ cases inp_smr_lock() would obtain the lock immediately.
|
2008-12-08 15:18:50 -05:00
|
|
|
*/
|
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets. To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.
Changing SMR to periodically recycle garbage is not trivial. Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred. This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.
Commit 220d89212943 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there. This patch goes further to handle
lookups of unconnected sockets. Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics. This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.
In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups. When a match is found, we try to lock the inpcb and
re-validate its connection info. In the common case, this works well
and we can simply return the inpcb. If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR. In
particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38572
2023-04-20 11:48:19 -04:00
|
|
|
void
|
2021-12-02 17:45:04 -05:00
|
|
|
inp_lock(struct inpcb *inp, const inp_lookup_t lock)
|
2008-12-08 15:18:50 -05:00
|
|
|
{
|
|
|
|
|
|
2021-12-02 17:45:04 -05:00
|
|
|
lock == INPLOOKUP_RLOCKPCB ?
|
|
|
|
|
rw_rlock(&inp->inp_lock) : rw_wlock(&inp->inp_lock);
|
|
|
|
|
}
|
|
|
|
|
|
2023-04-20 11:48:19 -04:00
|
|
|
void
|
2021-12-02 17:45:04 -05:00
|
|
|
inp_unlock(struct inpcb *inp, const inp_lookup_t lock)
|
|
|
|
|
{
|
SMR protection for inpcbs
With introduction of epoch(9) synchronization to network stack the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc). However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.
Fairly new feature of uma(9) - Safe Memory Reclamation allows to
safely free memory in page-sized batches, with virtually zero
overhead compared to uma_zfree(). However, unlike epoch(9), it
puts stricter requirement on the access to the protected memory,
needing the critical(9) section to access it. Details:
- The database is already built on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
Once the desired inpcb is found we need to transition from SMR
section to r/w lock on the inpcb itself, with a check that inpcb
isn't yet freed. This requires some complexity, since an SMR section
itself is a critical(9) section. The complexity is hidden from
KPI users in inp_smr_lock().
- For a inpcb list traversal (a pcblist sysctl, or broadcast
notification) also a new KPI is provided, that hides internals of
the database - inp_next(struct inp_iterator *).
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33022
2021-12-02 13:48:48 -05:00
|
|
|
|
2021-12-02 17:45:04 -05:00
|
|
|
lock == INPLOOKUP_RLOCKPCB ?
|
|
|
|
|
rw_runlock(&inp->inp_lock) : rw_wunlock(&inp->inp_lock);
|
2008-12-08 15:18:50 -05:00
|
|
|
}
|
|
|
|
|
|
2023-04-20 11:48:19 -04:00
|
|
|
int
|
2021-12-02 17:45:04 -05:00
|
|
|
inp_trylock(struct inpcb *inp, const inp_lookup_t lock)
|
2008-12-08 15:18:50 -05:00
|
|
|
{
|
Continue to refine inpcb reference counting and locking, in preparation for
reworking of inpcbinfo locking:
(1) Convert inpcb reference counting from manually manipulated integers to
the refcount(9) KPI. This allows the refcount to be managed atomically
with an inpcb read lock rather than write lock, or even with no inpcb
lock at all. As a result, in_pcbref() also no longer requires an inpcb
lock, so can be performed solely using the lock used to look up an
inpcb.
(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
in_pcbfree_internal) to the explicit in_pcbfree() context. This means
that the inpcb refcount is increasingly used only to maintain memory
stability, not actually defer the clean up of inpcb protocol parts.
This is desirable as many of those protocol parts required the pcbinfo
lock, which we'd like not to acquire in in_pcbrele() contexts. Document
this in comments better.
(3) Introduce new read-locked and write-locked in_pcbrele() variations,
in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
be properly unlocked as needed. in_pcbrele() is a wrapper around the
latter, and should probably go away at some point. This makes it
easier to use this weak reference model when holding only a read lock,
as will happen in the future.
This may well be safe to MFC, but some more KBI analysis is required.
Reviewed by: bz
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
2011-05-23 15:32:02 -04:00
|
|
|
|
2021-12-02 17:45:04 -05:00
|
|
|
return (lock == INPLOOKUP_RLOCKPCB ?
|
|
|
|
|
rw_try_rlock(&inp->inp_lock) : rw_try_wlock(&inp->inp_lock));
|
|
|
|
|
}
|
|
|
|
|
|
2022-11-08 13:24:39 -05:00
|
|
|
static inline bool
|
|
|
|
|
_inp_smr_lock(struct inpcb *inp, const inp_lookup_t lock, const int ignflags)
|
2021-12-02 17:45:04 -05:00
|
|
|
{
|
|
|
|
|
|
|
|
|
|
MPASS(lock == INPLOOKUP_RLOCKPCB || lock == INPLOOKUP_WLOCKPCB);
|
|
|
|
|
SMR_ASSERT_ENTERED(inp->inp_pcbinfo->ipi_smr);
|
|
|
|
|
|
|
|
|
|
if (__predict_true(inp_trylock(inp, lock))) {
|
2022-11-08 13:24:39 -05:00
|
|
|
if (__predict_false(inp->inp_flags & ignflags)) {
|
2021-12-02 17:45:04 -05:00
|
|
|
smr_exit(inp->inp_pcbinfo->ipi_smr);
|
|
|
|
|
inp_unlock(inp, lock);
|
|
|
|
|
return (false);
|
There is a complex race in in_pcblookup_hash() and in_pcblookup_group().
Both functions need to obtain lock on the found PCB, and they can't do
classic inter-lock with the PCB hash lock, due to lock order reversal.
To keep the PCB stable, these functions put a reference on it and after PCB
lock is acquired drop it. If the reference was the last one, this means
we've raced with in_pcbfree() and the PCB is no longer valid.
This approach works okay only if we are acquiring writer-lock on the PCB.
In case of reader-lock, the following scenario can happen:
- 2 threads locate pcb, and do in_pcbref() on it.
- These 2 threads drop the inp hash lock.
- Another thread comes to delete pcb via in_pcbfree(), it obtains hash lock,
does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which
doesn't free the pcb due to two references on it. Then it unlocks the pcb.
- 2 aforementioned threads acquire reader lock on the pcb and run
in_pcbrele_rlocked(). One gets 1 from in_pcbrele_rlocked() and continues,
second gets 0 and considers pcb freed, returns.
- The thread that got 1 continues working with a detached pcb, which later
leads to panic in the underlying protocol level.
To plug that problem an additional INPCB flag is introduced - INP_FREED. We
check for that flag in the in_pcbrele_rlocked() and if it is set, we pretend
that that was the last reference.
Discussed with: rwatson, jhb
Reported by: Vladimir Medvedkin <medved rambler-co.ru>
2012-10-02 08:03:02 -04:00
|
|
|
}
|
2021-12-02 17:45:04 -05:00
|
|
|
smr_exit(inp->inp_pcbinfo->ipi_smr);
|
|
|
|
|
return (true);
|
2012-10-02 08:03:02 -04:00
|
|
|
}
|
2020-02-12 08:31:36 -05:00
|
|
|
|
2021-12-02 17:45:04 -05:00
|
|
|
if (__predict_true(refcount_acquire_if_not_zero(&inp->inp_refcount))) {
|
|
|
|
|
smr_exit(inp->inp_pcbinfo->ipi_smr);
|
|
|
|
|
inp_lock(inp, lock);
|
|
|
|
|
if (__predict_false(in_pcbrele(inp, lock)))
|
|
|
|
|
return (false);
|
2018-04-19 09:37:59 -04:00
|
|
|
/*
|
2021-12-02 17:45:04 -05:00
|
|
|
* An inp acquired through refcount & lock certainly did not go
|
|
|
|
|
* through uma_zfree(). However, it may already have gone
|
|
|
|
|
* through in_pcbfree(), with another reference that
|
|
|
|
|
* prevented its release by our in_pcbrele().
|
2018-04-19 09:37:59 -04:00
|
|
|
*/
|
2022-11-08 13:24:39 -05:00
|
|
|
if (__predict_false(inp->inp_flags & ignflags)) {
|
2021-12-02 17:45:04 -05:00
|
|
|
inp_unlock(inp, lock);
|
|
|
|
|
return (false);
|
|
|
|
|
}
|
|
|
|
|
return (true);
|
|
|
|
|
} else {
|
|
|
|
|
smr_exit(inp->inp_pcbinfo->ipi_smr);
|
|
|
|
|
return (false);
|
2018-04-19 09:37:59 -04:00
|
|
|
}
|
2011-05-23 15:32:02 -04:00
|
|
|
}
|
|
|
|
|
|
2022-11-08 13:24:39 -05:00
|
|
|
bool
|
|
|
|
|
inp_smr_lock(struct inpcb *inp, const inp_lookup_t lock)
|
|
|
|
|
{
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* The in_pcblookup() family of functions ignores not only freed entries
|
|
|
|
|
* that may be found due to lockless access to the hash, but dropped
|
|
|
|
|
* entries, too.
|
|
|
|
|
*/
|
|
|
|
|
return (_inp_smr_lock(inp, lock, INP_FREED | INP_DROPPED));
|
|
|
|
|
}
|
|
|
|
|
|
2021-12-02 17:45:04 -05:00
|
|
|
/*
|
|
|
|
|
* inp_next() - inpcb hash/list traversal iterator
|
|
|
|
|
*
|
|
|
|
|
* Requires initialized struct inpcb_iterator for context.
|
|
|
|
|
* The structure can be initialized with INP_ITERATOR() or INP_ALL_ITERATOR().
|
|
|
|
|
*
|
|
|
|
|
* - An iterator can have either write-lock or read-lock semantics, which cannot
|
|
|
|
|
* be changed later.
|
|
|
|
|
* - An iterator can iterate either over the list of all pcbs (INP_ALL_LIST), or through
|
|
|
|
|
* a single hash slot. Note: only rip_input() does the latter.
|
|
|
|
|
* - An iterator may have an optional bool matching function. The matching function
|
|
|
|
|
* will be executed for each inpcb in the SMR context, so it cannot acquire
|
|
|
|
|
* locks and can safely access only immutable fields of inpcb.
|
|
|
|
|
*
|
|
|
|
|
* A freshly initialized iterator has a NULL inpcb in its context, and that
|
|
|
|
|
* means that an inp_next() call will return the very first inpcb on the list
|
|
|
|
|
* locked with the desired semantics. In all following calls the context pointer
|
|
|
|
|
* shall hold the current inpcb pointer. The KPI user is not supposed to
|
|
|
|
|
* unlock the current inpcb! Upon end of traversal inp_next() will return NULL
|
|
|
|
|
* and write NULL to its context. After end of traversal an iterator can be
|
|
|
|
|
* reused.
|
|
|
|
|
*
|
|
|
|
|
* List traversals have the following features/constraints:
|
|
|
|
|
* - New entries won't be seen, as they are always added to the head of a list.
|
|
|
|
|
* - Removed entries won't stop traversal as long as they are not added to
|
|
|
|
|
* a different list. This is violated by in_pcbrehash().
|
|
|
|
|
*/
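The traversal contract described above can be sketched with a hypothetical single-threaded model. The `demo_*` names are illustrative, and the SMR/locking aspects are deliberately elided:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Single-threaded model of the iterator contract; demo_* names are
 * illustrative stand-ins for the kernel structures. */
struct demo_inp {
	struct demo_inp *next;
	int lport;
};

struct demo_iterator {
	struct demo_inp *head;			/* list being traversed */
	struct demo_inp *inp;			/* current position; NULL when
						   fresh or after the end */
	bool (*match)(const struct demo_inp *);	/* optional filter */
};

/* Example matching function: accept only even local ports. */
static bool
demo_even(const struct demo_inp *inp)
{
	return ((inp->lport & 1) == 0);
}

/*
 * A fresh iterator (inp == NULL) starts at the head; later calls
 * continue past the current entry.  Returns NULL at the end and
 * resets the context, so the iterator can be reused.
 */
static struct demo_inp *
demo_next(struct demo_iterator *it)
{
	struct demo_inp *inp;

	inp = (it->inp == NULL) ? it->head : it->inp->next;
	for (; inp != NULL; inp = inp->next)
		if (it->match == NULL || it->match(inp))
			break;
	it->inp = inp;
	return (inp);
}
```

As in the real KPI, the caller never manipulates the current entry's position itself; it only keeps calling the iterator until NULL comes back, after which the same iterator context may be reused for another pass.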
|
|
|
|
|
#define II_LIST_FIRST(ipi, hash) \
|
|
|
|
|
(((hash) == INP_ALL_LIST) ? \
|
|
|
|
|
CK_LIST_FIRST(&(ipi)->ipi_listhead) : \
|
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worst case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp interators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
2023-04-20 11:48:01 -04:00
inpcb: Split PCB hash tables

Currently we use a single hash table per PCB database for connected and
bound PCBs.  Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worst case.

Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb.  Now the database has one table each
for connected and unconnected sockets.

When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use.  This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away.  There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.

I also made the "rehash" parameter of in(6)_pcbconnect() unused.  This
parameter seems confusing, and it is simpler to let the inpcb code
figure out what to do using the existing INP_INHASHLIST flag.

UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle.  To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number.  When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.

raw_ip (ab)uses the hash table without using the corresponding
accessors.  Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs.  This will be addressed in some
way in the future.

inp iterators which specify a hash bucket will only visit connected
PCBs.  This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.

Discussed with:	glebius
Tested by:	glebius
Sponsored by:	Klara, Inc.
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D38569
2023-04-20 11:48:01 -04:00

	    CK_LIST_FIRST(&(ipi)->ipi_hash_exact[(hash)]))

#define	II_LIST_NEXT(inp, hash)						\
	(((hash) == INP_ALL_LIST) ?					\
	    CK_LIST_NEXT((inp), inp_list) :				\
	    CK_LIST_NEXT((inp), inp_hash_exact))
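The commit message above says that in_pcbinhash() inspects the inpcb's foreign address to choose between the connected and unconnected tables. A minimal userspace sketch of that dispatch, under the assumption that "connected" means a specified (non-wildcard) foreign address; `pick_table` and `enum which_table` are illustrative names, not the kernel API:

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/* Illustrative stand-ins for the connected/unconnected hash heads. */
enum which_table { TBL_CONNECTED, TBL_UNCONNECTED };

/*
 * A PCB with a specified foreign address goes to the "connected"
 * table; one with a wildcard foreign address to the "unconnected"
 * table.  This models the decision described for in_pcbinhash().
 */
static enum which_table
pick_table(struct in_addr faddr)
{
	return (faddr.s_addr != htonl(INADDR_ANY) ?
	    TBL_CONNECTED : TBL_UNCONNECTED);
}
```

Because the table choice depends only on the foreign address, a connect() or disconnect() is the only event that moves a PCB between tables, which is why the linkage pointers stay stable in between.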
#define	II_LOCK_ASSERT(inp, lock)					\
	rw_assert(&(inp)->inp_lock,					\
	    (lock) == INPLOOKUP_RLOCKPCB ? RA_RLOCKED : RA_WLOCKED)

struct inpcb *
inp_next(struct inpcb_iterator *ii)
Continue to refine inpcb reference counting and locking, in preparation for
reworking of inpcbinfo locking:
(1) Convert inpcb reference counting from manually manipulated integers to
the refcount(9) KPI. This allows the refcount to be managed atomically
with an inpcb read lock rather than write lock, or even with no inpcb
lock at all. As a result, in_pcbref() also no longer requires an inpcb
lock, so can be performed solely using the lock used to look up an
inpcb.
(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
in_pcbfree_internal) to the explicit in_pcbfree() context. This means
that the inpcb refcount is increasingly used only to maintain memory
stability, not actually defer the clean up of inpcb protocol parts.
This is desirable as many of those protocol parts required the pcbinfo
lock, which we'd like not to acquire in in_pcbrele() contexts. Document
this in comments better.
(3) Introduce new read-locked and write-locked in_pcbrele() variations,
in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
be properly unlocked as needed. in_pcbrele() is a wrapper around the
latter, and should probably go away at some point. This makes it
easier to use this weak reference model when holding only a read lock,
as will happen in the future.
This may well be safe to MFC, but some more KBI analysis is required.
Reviewed by: bz
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
2011-05-23 15:32:02 -04:00
{
	const struct inpcbinfo *ipi = ii->ipi;
	inp_match_t *match = ii->match;
	void *ctx = ii->ctx;
	inp_lookup_t lock = ii->lock;
	int hash = ii->hash;
	struct inpcb *inp;

	if (ii->inp == NULL) {		/* First call. */
		smr_enter(ipi->ipi_smr);
		/* This is unrolled CK_LIST_FOREACH(). */
		for (inp = II_LIST_FIRST(ipi, hash);
		    inp != NULL;
		    inp = II_LIST_NEXT(inp, hash)) {
			if (match != NULL && (match)(inp, ctx) == false)
				continue;
			if (__predict_true(_inp_smr_lock(inp, lock, INP_FREED)))
				break;
			else {
				smr_enter(ipi->ipi_smr);
				MPASS(inp != II_LIST_FIRST(ipi, hash));
				inp = II_LIST_FIRST(ipi, hash);
				if (inp == NULL)
					break;
			}
		}
SMR protection for inpcbs

With the introduction of epoch(9) synchronization to the network stack,
the inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc).  However, inpcbs
aren't static in nature; they are created and destroyed all the time,
which creates some traffic on the epoch(9) garbage collector.

A fairly new feature of uma(9) - Safe Memory Reclamation - allows
memory to be freed safely in page-sized batches, with virtually zero
overhead compared to uma_zfree().  However, unlike epoch(9), it puts a
stricter requirement on access to the protected memory, needing a
critical(9) section to access it.  Details:

- The database is already built on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database an SMR section is now required.
  Once the desired inpcb is found we need to transition from the SMR
  section to an r/w lock on the inpcb itself, with a check that the
  inpcb isn't yet freed.  This requires some complexity, since an SMR
  section itself is a critical(9) section.  The complexity is hidden
  from KPI users in inp_smr_lock().
- For an inpcb list traversal (a pcblist sysctl, or broadcast
  notification) a new KPI is also provided, which hides the internals
  of the database - inp_next(struct inp_iterator *).

Reviewed by:	rrs
Differential revision:	https://reviews.freebsd.org/D33022
2021-12-02 13:48:48 -05:00
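The key transition the message describes - leaving the SMR section, taking the object's own lock, and then re-checking whether the object was freed in the meantime - can be modeled in userspace. This is a sketch of the pattern only; `struct obj` and `obj_lock_valid` are illustrative names, and the real kernel code uses rwlocks and the INP_FREED flag:

```c
#include <stdbool.h>

/*
 * Model of "lock, then re-validate": the freed mark must be re-checked
 * after the lock is taken, because the object may have been torn down
 * between the SMR-protected lookup and the lock acquisition.
 */
struct obj {
	bool locked;
	bool freed;	/* set by the destroyer before memory is reused */
};

static bool
obj_lock_valid(struct obj *o)
{
	o->locked = true;	/* stand-in for rw-locking the inpcb */
	if (o->freed) {		/* re-validate under the lock */
		o->locked = false;
		return (false);
	}
	return (true);
}
```

If the re-check fails, a caller such as inp_next() must back off and restart from a stable anchor, since the chain linkage of a freed entry is no longer trustworthy.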
		if (inp == NULL)
			smr_exit(ipi->ipi_smr);
		else
			ii->inp = inp;
		return (inp);
	}
	/* Not a first call. */
	smr_enter(ipi->ipi_smr);
restart:
	inp = ii->inp;
	II_LOCK_ASSERT(inp, lock);
next:
	inp = II_LIST_NEXT(inp, hash);
	if (inp == NULL) {
		smr_exit(ipi->ipi_smr);
		goto found;
	}

	if (match != NULL && (match)(inp, ctx) == false)
		goto next;

	if (__predict_true(inp_trylock(inp, lock))) {
		if (__predict_false(inp->inp_flags & INP_FREED)) {
			/*
			 * Entries are never inserted in the middle of a list,
			 * thus as long as we are in SMR, we can continue
			 * traversal.  A jump to 'restart' should yield the
			 * same result, but could produce unnecessary looping.
			 * Could this looping be unbound?
			 */
			inp_unlock(inp, lock);
			goto next;
		} else {
			smr_exit(ipi->ipi_smr);
			goto found;
		}
	}
	/*
	 * Can't obtain the lock immediately, thus going the hard way.  Once
	 * we exit the SMR section we can no longer jump to 'next', and our
	 * only stable anchoring point is ii->inp, which we keep locked for
	 * this case, so we jump to 'restart'.
	 */
	if (__predict_true(refcount_acquire_if_not_zero(&inp->inp_refcount))) {
		smr_exit(ipi->ipi_smr);
		inp_lock(inp, lock);
		if (__predict_false(in_pcbrele(inp, lock))) {
			smr_enter(ipi->ipi_smr);
			goto restart;
		}
		/*
		 * See comment in inp_smr_lock().
		 */
		if (__predict_false(inp->inp_flags & INP_FREED)) {
			inp_unlock(inp, lock);
			smr_enter(ipi->ipi_smr);
			goto restart;
		}
	} else
		goto next;

found:
	inp_unlock(ii->inp, lock);
	ii->inp = inp;

	return (ii->inp);
}
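Stripped of its synchronization, inp_next() is a cursor-style list iterator with an optional match callback. A self-contained userspace model of just that iteration shape (illustrative names; the SMR/trylock/refcount machinery above is deliberately omitted):

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
	int val;
	struct node *next;
};

typedef bool match_t(const struct node *, void *);

struct iter {
	struct node *head;
	struct node *cur;	/* NULL until the first call */
	match_t *match;
	void *ctx;
};

/*
 * Return the next matching node, remembering the cursor between
 * calls, as inp_next() remembers ii->inp.  Returns NULL when the
 * list is exhausted.
 */
static struct node *
iter_next(struct iter *it)
{
	struct node *n;

	n = (it->cur == NULL) ? it->head : it->cur->next;
	while (n != NULL && it->match != NULL && !it->match(n, it->ctx))
		n = n->next;
	it->cur = n;
	return (n);
}

static bool
is_even(const struct node *n, void *ctx)
{
	(void)ctx;
	return (n->val % 2 == 0);
}
```

The kernel version layers two complications on this: the cursor entry is kept locked between calls so it remains a stable anchor, and a failed trylock forces the `refcount_acquire`/`restart` dance because leaving the SMR section invalidates `goto next`.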
/*
 * in_pcbref() bumps the reference count on an inpcb in order to maintain
 * stability of an inpcb pointer despite the inpcb lock being released or
 * the SMR section exited.
 *
 * To free the reference later, in_pcbrele_(r|w)locked() must be performed.
 */
void
in_pcbref(struct inpcb *inp)
{
	u_int old __diagused;

	old = refcount_acquire(&inp->inp_refcount);
	KASSERT(old > 0, ("%s: refcount 0", __func__));
}
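As a userspace model of the assumed refcount(9) semantics relied on by in_pcbref() and in_pcbrele_*() - acquire requires the count to already be nonzero, and release reports when the last reference was dropped so the object may be freed - consider this C11-atomics sketch (names `ref_acquire`/`ref_release` are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Bump the count; the old value must be nonzero (no resurrection). */
static unsigned
ref_acquire(atomic_uint *cnt)
{
	return (atomic_fetch_add(cnt, 1));	/* returns the old value */
}

/* Drop a reference; true means this was the very last one. */
static bool
ref_release(atomic_uint *cnt)
{
	return (atomic_fetch_sub(cnt, 1) == 1);
}
```

Because both operations are single atomic read-modify-writes, neither needs the inpcb write lock; this is what allows in_pcbref() to be called while holding only the lookup-side lock.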
/*
 * Drop a refcount on an inpcb elevated using in_pcbref(), potentially
 * freeing the pcb, if the reference was the very last one.
 */
bool
in_pcbrele_rlocked(struct inpcb *inp)
{

	INP_RLOCK_ASSERT(inp);
	if (!refcount_release(&inp->inp_refcount))
		return (false);

	MPASS(inp->inp_flags & INP_FREED);
	MPASS(inp->inp_socket == NULL);
	crfree(inp->inp_cred);
#ifdef INVARIANTS
	inp->inp_cred = NULL;
#endif
	INP_RUNLOCK(inp);
	uma_zfree_smr(inp->inp_pcbinfo->ipi_zone, inp);
	return (true);
}

bool
in_pcbrele_wlocked(struct inpcb *inp)
{
	INP_WLOCK_ASSERT(inp);

	if (!refcount_release(&inp->inp_refcount))
		return (false);

	MPASS(inp->inp_flags & INP_FREED);
	MPASS(inp->inp_socket == NULL);
	crfree(inp->inp_cred);
#ifdef INVARIANTS
	inp->inp_cred = NULL;
#endif
	INP_WUNLOCK(inp);
	uma_zfree_smr(inp->inp_pcbinfo->ipi_zone, inp);
	return (true);
}
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets. To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.
Changing SMR to periodically recycle garbage is not trivial. Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred. This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.
Commit 220d89212943 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there. This patch goes further to handle
lookups of unconnected sockets. Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics. This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.
In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups. When a match is found, we try to lock the inpcb and
re-validate its connection info. In the common case, this works well
and we can simply return the inpcb. If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR. In
particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38572
2023-04-20 11:48:19 -04:00
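The commit message above defines a total order for unconnected entries in a hash chain: jailed before non-jailed, and within each group, specified local addresses before unspecified ones. That ordering lets a wild lookup return the first match. A hypothetical flattening of those two rules into a sortable rank (`struct chain_ent` and `rank` are illustrative, not kernel code):

```c
#include <stdbool.h>

struct chain_ent {
	bool jailed;
	bool laddr_specified;
};

/*
 * Lower rank sorts earlier in the chain: jailed status is the major
 * key, local-address specificity the minor key.
 */
static int
rank(const struct chain_ent *e)
{
	return ((e->jailed ? 0 : 2) + (e->laddr_specified ? 0 : 1));
}
```

With insertion keeping chains sorted by this rank, the lookup side stays trivial - return the first entry whose tuple matches - which is the tradeoff the message argues for: bind()-time work in exchange for cheap, SMR-friendly lookups.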
bool
in_pcbrele(struct inpcb *inp, const inp_lookup_t lock)
{
	return (lock == INPLOOKUP_RLOCKPCB ?
	    in_pcbrele_rlocked(inp) : in_pcbrele_wlocked(inp));
}
/*
 * Unconditionally schedule an inpcb to be freed by decrementing its
 * reference count, which should occur only after the inpcb has been
 * detached from its socket.  If another thread holds a temporary
 * reference (acquired using in_pcbref()) then the free is deferred until
 * that reference is released using in_pcbrele_(r|w)locked(), but the
 * inpcb is still unlocked.  Almost all work, including removal from
 * global lists, is done in this context, where the pcbinfo lock is held.
 */
void
in_pcbfree(struct inpcb *inp)
{
	struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
#ifdef INET
	struct ip_moptions *imo;
#endif
#ifdef INET6
	struct ip6_moptions *im6o;
#endif

	INP_WLOCK_ASSERT(inp);
	KASSERT(inp->inp_socket == NULL, ("%s: inp_socket != NULL", __func__));
	KASSERT((inp->inp_flags & INP_FREED) == 0,
	    ("%s: called twice for pcb %p", __func__, inp));

	inp->inp_flags |= INP_FREED;
	INP_INFO_WLOCK(pcbinfo);
	inp->inp_gencnt = ++pcbinfo->ipi_gencnt;
	pcbinfo->ipi_count--;
	CK_LIST_REMOVE(inp, inp_list);
	INP_INFO_WUNLOCK(pcbinfo);
	if (inp->inp_flags & INP_INHASHLIST)
		in_pcbremhash(inp);

	RO_INVALIDATE_CACHE(&inp->inp_route);
#ifdef MAC
	mac_inpcb_destroy(inp);
#endif
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
	if (inp->inp_sp != NULL)
		ipsec_delete_pcbpolicy(inp);
#endif
#ifdef INET
	if (inp->inp_options)
		(void)m_free(inp->inp_options);
	imo = inp->inp_moptions;
#endif
#ifdef INET6
	if (inp->inp_vflag & INP_IPV6PROTO) {
		ip6_freepcbopts(inp->in6p_outputopts);
		im6o = inp->in6p_moptions;
	} else
		im6o = NULL;
#endif

	if (__predict_false(in_pcbrele_wlocked(inp) == false)) {
		INP_WUNLOCK(inp);
	}
#ifdef INET6
	ip6_freemoptions(im6o);
#endif
#ifdef INET
	inp_freemoptions(imo);
#endif
}
/*
 * Different protocols initialize their inpcbs differently, giving
 * different names to the lock.  But they are all disposed of the same way.
 */
static void
inpcb_fini(void *mem, int size)
{
	struct inpcb *inp = mem;

	INP_LOCK_DESTROY(inp);
Add a reference count to struct inpcb, which may be explicitly
incremented using in_pcbref(), and decremented using in_pcbfree()
or inpcbrele(). Protocols using only current in_pcballoc() and
in_pcbfree() calls will see the same semantics, but it is now
possible for TCP to call in_pcbref() and in_pcbrele() to prevent
an inpcb from being freed when both tcbinfo and per-inpcb locks
are released. This makes it possible to safely transition from
holding only the inpcb lock to both tcbinfo and inpcb lock
without re-looking up a connection in the input path, timer
path, etc.
Notice that in_pcbrele() does not unlock the connection after
decrementing the refcount, if the connection remains, so that
the caller can continue to use it; in_pcbrele() returns a flag
indicating whether or not the inpcb pointer is still valid, and
in_pcbfree() is now a simple wrapper around in_pcbrele().
MFC after: 1 month
Discussed with: bz, kmacy
Reviewed by: bz, gnn, kmacy
Tested by: kmacy
2008-12-08 15:18:50 -05:00
}
/*
 * in_pcbdrop() removes an inpcb from hashed lists, releasing its address and
 * port reservation, and preventing it from being returned by inpcb lookups.
 *
 * It is used by TCP to mark an inpcb as unused and avoid future packet
 * delivery or event notification when a socket remains open but TCP has
 * closed.  This might occur as a result of a shutdown()-initiated TCP close
 * or a RST on the wire, and allows the port binding to be reused while still
 * maintaining the invariant that so_pcb always points to a valid inpcb until
 * in_pcbdetach().
 *
 * XXXRW: Possibly in_pcbdrop() should also prevent future notifications by
 * in_pcbnotifyall() and in_pcbpurgeif0()?
 */
void
in_pcbdrop(struct inpcb *inp)
{

	INP_WLOCK_ASSERT(inp);
#ifdef INVARIANTS
	if (inp->inp_socket != NULL && inp->inp_ppcb != NULL)
		MPASS(inp->inp_refcount > 1);
#endif

	inp->inp_flags |= INP_DROPPED;
	if (inp->inp_flags & INP_INHASHLIST)
		in_pcbremhash(inp);
}
#ifdef INET
/*
 * Common routines to return the socket addresses associated with inpcbs.
 */
struct sockaddr *
in_sockaddr(in_port_t port, struct in_addr *addr_p)
{
	struct sockaddr_in *sin;

	sin = malloc(sizeof *sin, M_SONAME, M_WAITOK | M_ZERO);
	sin->sin_family = AF_INET;
	sin->sin_len = sizeof(*sin);
	sin->sin_addr = *addr_p;
	sin->sin_port = port;

	return (struct sockaddr *)sin;
}
int
in_getsockaddr(struct socket *so, struct sockaddr **nam)
{
	struct inpcb *inp;
	struct in_addr addr;
	in_port_t port;

	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("in_getsockaddr: inp == NULL"));

	INP_RLOCK(inp);
	port = inp->inp_lport;
	addr = inp->inp_laddr;
	INP_RUNLOCK(inp);

	*nam = in_sockaddr(port, &addr);
	return 0;
}
int
in_getpeeraddr(struct socket *so, struct sockaddr **nam)
{
	struct inpcb *inp;
	struct in_addr addr;
	in_port_t port;

	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("in_getpeeraddr: inp == NULL"));

	INP_RLOCK(inp);
	port = inp->inp_fport;
	addr = inp->inp_faddr;
	INP_RUNLOCK(inp);

	*nam = in_sockaddr(port, &addr);
	return 0;
}
void
in_pcbnotifyall(struct inpcbinfo *pcbinfo, struct in_addr faddr, int errno,
    struct inpcb *(*notify)(struct inpcb *, int))
{
	struct inpcb *inp, *inp_temp;

	INP_INFO_WLOCK(pcbinfo);
	CK_LIST_FOREACH_SAFE(inp, &pcbinfo->ipi_listhead, inp_list, inp_temp) {
		INP_WLOCK(inp);
#ifdef INET6
		if ((inp->inp_vflag & INP_IPV4) == 0) {
			INP_WUNLOCK(inp);
			continue;
		}
#endif
		if (inp->inp_faddr.s_addr != faddr.s_addr ||
		    inp->inp_socket == NULL) {
			INP_WUNLOCK(inp);
			continue;
		}
		if ((*notify)(inp, errno))
			INP_WUNLOCK(inp);
	}
	INP_INFO_WUNLOCK(pcbinfo);
}
|
|
|
|
|
2021-12-02 17:45:04 -05:00
|
|
|
static bool
|
|
|
|
|
inp_v4_multi_match(const struct inpcb *inp, void *v __unused)
|
|
|
|
|
{
|
|
|
|
|
|
|
|
|
|
if ((inp->inp_vflag & INP_IPV4) && inp->inp_moptions != NULL)
|
|
|
|
|
return (true);
|
|
|
|
|
else
|
|
|
|
|
return (false);
|
|
|
|
|
}

void
in_pcbpurgeif0(struct inpcbinfo *pcbinfo, struct ifnet *ifp)
{
	struct inpcb_iterator inpi = INP_ITERATOR(pcbinfo, INPLOOKUP_WLOCKPCB,
	    inp_v4_multi_match, NULL);
	struct inpcb *inp;
	struct in_multi *inm;
	struct in_mfilter *imf;
	struct ip_moptions *imo;

	IN_MULTI_LOCK_ASSERT();

	while ((inp = inp_next(&inpi)) != NULL) {
		INP_WLOCK_ASSERT(inp);

		imo = inp->inp_moptions;
		/*
		 * Unselect the outgoing interface if it is being
		 * detached.
		 */
		if (imo->imo_multicast_ifp == ifp)
			imo->imo_multicast_ifp = NULL;

		/*
		 * Drop multicast group membership if we joined
		 * through the interface being detached.
		 *
		 * XXX This can all be deferred to an epoch_call
		 */
restart:
		IP_MFILTER_FOREACH(imf, &imo->imo_head) {
			if ((inm = imf->imf_inm) == NULL)
				continue;
			if (inm->inm_ifp != ifp)
				continue;
			ip_mfilter_remove(&imo->imo_head, imf);
			in_leavegroup_locked(inm, NULL);
			ip_mfilter_free(imf);
			goto restart;
		}
	}
}

/*
 * Lookup a PCB based on the local address and port. Caller must hold the
 * hash lock. No inpcb locks or references are acquired.
 */
|
2006-02-04 02:59:17 -05:00
|
|
|
#define INP_LOOKUP_MAPPED_PCB_COST 3
|
1994-05-24 06:09:53 -04:00
|
|
|
struct inpcb *
|
2006-01-21 20:16:25 -05:00
|
|
|
in_pcblookup_local(struct inpcbinfo *pcbinfo, struct in_addr laddr,
|
2011-05-23 11:23:18 -04:00
|
|
|
u_short lport, int lookupflags, struct ucred *cred)
|
1994-05-24 06:09:53 -04:00
|
|
|
{
|
2006-01-21 20:16:25 -05:00
|
|
|
struct inpcb *inp;
|
2006-02-04 02:59:17 -05:00
|
|
|
#ifdef INET6
|
|
|
|
|
int matchwild = 3 + INP_LOOKUP_MAPPED_PCB_COST;
|
|
|
|
|
#else
|
|
|
|
|
int matchwild = 3;
|
|
|
|
|
#endif
|
|
|
|
|
int wildcard;
|
1995-04-10 04:52:45 -04:00
|
|
|
|
2011-05-23 11:23:18 -04:00
|
|
|
KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
|
|
|
|
|
("%s: invalid lookup flags %d", __func__, lookupflags));
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
2003-11-13 00:16:56 -05:00
|
|
|
|
2011-05-23 11:23:18 -04:00
|
|
|
if ((lookupflags & INPLOOKUP_WILDCARD) == 0) {
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 04:15:13 -05:00
|
|
|
struct inpcbhead *head;
|
|
|
|
|
/*
|
|
|
|
|
* Look for an unconnected (wildcard foreign addr) PCB that
|
|
|
|
|
* matches the local address and port we're looking for.
|
|
|
|
|
*/
|
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worse case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp interators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
2023-04-20 11:48:01 -04:00
|
|
|
head = &pcbinfo->ipi_hash_wild[INP_PCBHASH_WILD(lport,
|
2021-12-26 13:47:28 -05:00
|
|
|
pcbinfo->ipi_hashmask)];
|
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worse case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp interators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
2023-04-20 11:48:01 -04:00
|
|
|
CK_LIST_FOREACH(inp, head, inp_hash_wild) {
|
1999-12-07 12:39:16 -05:00
|
|
|
#ifdef INET6
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 09:32:14 -05:00
|
|
|
/* XXX inp locking */
|
1999-12-21 06:14:12 -05:00
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0)
|
1999-12-07 12:39:16 -05:00
|
|
|
continue;
|
|
|
|
|
#endif
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 04:15:13 -05:00
|
|
|
if (inp->inp_faddr.s_addr == INADDR_ANY &&
|
|
|
|
|
inp->inp_laddr.s_addr == laddr.s_addr &&
|
|
|
|
|
inp->inp_lport == lport) {
|
|
|
|
|
/*
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 09:32:14 -05:00
|
|
|
* Found?
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 04:15:13 -05:00
|
|
|
*/
|
2022-11-02 13:03:41 -04:00
|
|
|
if (prison_equal_ip4(cred->cr_prison,
|
|
|
|
|
inp->inp_cred->cr_prison))
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 09:32:14 -05:00
|
|
|
return (inp);
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet(); there are, unfortunately, a few exceptions.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 04:15:13 -05:00
|
|
|
}
|
1995-04-08 21:29:31 -04:00
|
|
|
}
|
|
|
|
/*
|
|
|
|
|
* Not found.
|
|
|
|
|
*/
|
|
|
|
|
return (NULL);
|
|
|
|
|
} else {
|
|
|
|
|
struct inpcbporthead *porthash;
|
|
|
|
|
struct inpcbport *phd;
|
|
|
|
|
struct inpcb *match = NULL;
|
|
|
|
|
/*
|
|
|
|
|
* Best fit PCB lookup.
|
|
|
|
|
*
|
|
|
|
|
* First see if this local port is in use by looking on the
|
|
|
|
|
* port hash list.
|
|
|
|
|
*/
|
2007-04-30 19:12:05 -04:00
|
|
|
porthash = &pcbinfo->ipi_porthashbase[INP_PCBPORTHASH(lport,
|
|
|
|
|
pcbinfo->ipi_porthashmask)];
|
2018-06-12 18:18:20 -04:00
|
|
|
CK_LIST_FOREACH(phd, porthash, phd_hash) {
|
|
|
|
if (phd->phd_port == lport)
|
1994-05-24 06:09:53 -04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
if (phd != NULL) {
|
|
|
|
|
/*
|
|
|
|
|
* Port is in use by one or more PCBs. Look for best
|
|
|
|
|
* fit.
|
|
|
|
|
*/
|
2018-06-12 18:18:20 -04:00
|
|
|
CK_LIST_FOREACH(inp, &phd->phd_pcblist, inp_portlist) {
|
|
|
|
wildcard = 0;
|
2022-11-02 13:03:41 -04:00
|
|
|
if (!prison_equal_ip4(inp->inp_cred->cr_prison,
|
|
|
|
|
cred->cr_prison))
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking, etc.
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around in recent years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 09:32:14 -05:00
|
|
|
continue;
|
1999-12-07 12:39:16 -05:00
|
|
|
#ifdef INET6
|
|
|
|
/* XXX inp locking */
|
1999-12-21 06:14:12 -05:00
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0)
|
1999-12-07 12:39:16 -05:00
|
|
|
continue;
|
2006-02-04 02:59:17 -05:00
|
|
|
/*
|
|
|
|
|
* We never select the PCB that has
|
|
|
|
|
* INP_IPV6 flag and is bound to :: if
|
|
|
|
|
* we have another PCB which is bound
|
|
|
|
|
* to 0.0.0.0. If a PCB has the
|
|
|
|
|
* INP_IPV6 flag, then we set its cost
|
|
|
|
|
* higher than IPv4 only PCBs.
|
|
|
|
|
*
|
|
|
|
|
* Note that the case only happens
|
|
|
|
|
* when a socket is bound to ::, under
|
|
|
|
|
* the condition that the use of the
|
|
|
|
|
* mapped address is allowed.
|
|
|
|
|
*/
|
|
|
|
|
if ((inp->inp_vflag & INP_IPV6) != 0)
|
|
|
|
|
wildcard += INP_LOOKUP_MAPPED_PCB_COST;
|
1999-12-07 12:39:16 -05:00
|
|
|
#endif
|
|
|
|
if (inp->inp_faddr.s_addr != INADDR_ANY)
|
|
|
|
|
wildcard++;
|
|
|
|
|
if (inp->inp_laddr.s_addr != INADDR_ANY) {
|
|
|
|
|
if (laddr.s_addr == INADDR_ANY)
|
|
|
|
|
wildcard++;
|
|
|
|
|
else if (inp->inp_laddr.s_addr != laddr.s_addr)
|
|
|
|
|
continue;
|
|
|
|
|
} else {
|
|
|
|
|
if (laddr.s_addr != INADDR_ANY)
|
|
|
|
|
wildcard++;
|
|
|
|
|
}
|
|
|
|
|
if (wildcard < matchwild) {
|
|
|
|
|
match = inp;
|
|
|
|
|
matchwild = wildcard;
|
|
|
|
if (matchwild == 0)
|
|
|
|
break;
|
|
|
|
|
}
|
1995-03-02 14:29:42 -05:00
|
|
|
}
|
1994-05-24 06:09:53 -04:00
|
|
|
}
|
|
|
|
return (match);
|
1994-05-24 06:09:53 -04:00
|
|
|
}
|
|
|
|
|
}
|
2006-02-04 02:59:17 -05:00
|
|
|
#undef INP_LOOKUP_MAPPED_PCB_COST
|
1995-04-08 21:29:31 -04:00
|
|
|
|
2022-11-02 13:08:07 -04:00
|
|
|
static bool
|
|
|
|
|
in_pcblookup_lb_numa_match(const struct inpcblbgroup *grp, int domain)
|
|
|
|
|
{
|
|
|
|
|
return (domain == M_NODOM || domain == grp->il_numa_domain);
|
|
|
|
|
}
|
|
|
|
|
|
2018-06-06 11:45:57 -04:00
|
|
|
static struct inpcb *
|
|
|
|
|
in_pcblookup_lbgroup(const struct inpcbinfo *pcbinfo,
|
2023-02-09 15:59:27 -05:00
|
|
|
const struct in_addr *faddr, uint16_t fport, const struct in_addr *laddr,
|
|
|
|
|
uint16_t lport, int domain)
|
2018-06-06 11:45:57 -04:00
|
|
|
{
|
|
|
|
|
const struct inpcblbgrouphead *hdr;
|
|
|
|
|
struct inpcblbgroup *grp;
|
2022-11-02 13:08:07 -04:00
|
|
|
struct inpcblbgroup *jail_exact, *jail_wild, *local_exact, *local_wild;
|
2018-06-06 11:45:57 -04:00
|
|
|
|
|
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
|
|
|
|
|
2018-12-05 12:06:00 -05:00
|
|
|
hdr = &pcbinfo->ipi_lbgrouphashbase[
|
|
|
|
|
INP_PCBPORTHASH(lport, pcbinfo->ipi_lbgrouphashmask)];
|
2018-06-06 11:45:57 -04:00
|
|
|
|
|
|
|
|
/*
|
2022-11-02 13:08:07 -04:00
|
|
|
* Search for an LB group match based on the following criteria:
|
|
|
|
|
* - prefer jailed groups to non-jailed groups
|
|
|
|
|
* - prefer exact source address matches to wildcard matches
|
|
|
|
|
* - prefer groups bound to the specified NUMA domain
|
2018-06-06 11:45:57 -04:00
|
|
|
*/
|
2022-11-02 13:08:07 -04:00
|
|
|
jail_exact = jail_wild = local_exact = local_wild = NULL;
|
2018-09-10 15:00:29 -04:00
|
|
|
CK_LIST_FOREACH(grp, hdr, il_list) {
|
2022-11-02 13:08:07 -04:00
|
|
|
bool injail;
|
|
|
|
|
|
2018-06-06 11:45:57 -04:00
|
|
|
#ifdef INET6
|
|
|
|
|
if (!(grp->il_vflag & INP_IPV4))
|
|
|
|
|
continue;
|
|
|
|
|
#endif
|
2018-09-05 11:04:11 -04:00
|
|
|
if (grp->il_lport != lport)
|
|
|
|
|
continue;
|
2018-06-06 11:45:57 -04:00
|
|
|
|
2022-11-02 13:08:07 -04:00
|
|
|
injail = prison_flag(grp->il_cred, PR_IP4) != 0;
|
|
|
|
|
if (injail && prison_check_ip4_locked(grp->il_cred->cr_prison,
|
|
|
|
|
laddr) != 0)
|
|
|
|
|
continue;
|
|
|
|
|
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket option, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 17:04:46 -05:00
|
|
|
if (grp->il_laddr.s_addr == laddr->s_addr) {
|
2022-11-02 13:08:07 -04:00
|
|
|
if (injail) {
|
|
|
|
|
jail_exact = grp;
|
|
|
|
|
if (in_pcblookup_lb_numa_match(grp, domain))
|
|
|
|
|
/* This is a perfect match. */
|
|
|
|
|
goto out;
|
|
|
|
|
} else if (local_exact == NULL ||
|
|
|
|
|
in_pcblookup_lb_numa_match(grp, domain)) {
|
|
|
|
|
local_exact = grp;
|
|
|
|
|
}
|
2023-02-09 15:59:27 -05:00
|
|
|
} else if (grp->il_laddr.s_addr == INADDR_ANY) {
|
2022-11-02 13:08:07 -04:00
|
|
|
if (injail) {
|
|
|
|
|
if (jail_wild == NULL ||
|
|
|
|
|
in_pcblookup_lb_numa_match(grp, domain))
|
|
|
|
|
jail_wild = grp;
|
|
|
|
|
} else if (local_wild == NULL ||
|
|
|
|
|
in_pcblookup_lb_numa_match(grp, domain)) {
|
|
|
|
|
local_wild = grp;
|
|
|
|
}
|
|
|
|
|
}
|
2018-06-06 11:45:57 -04:00
|
|
|
}
|
|
|
|
|
2022-11-02 13:08:07 -04:00
|
|
|
if (jail_exact != NULL)
|
|
|
|
|
grp = jail_exact;
|
|
|
|
|
else if (jail_wild != NULL)
|
|
|
|
|
grp = jail_wild;
|
|
|
|
|
else if (local_exact != NULL)
|
|
|
|
|
grp = local_exact;
|
|
|
|
|
else
|
|
|
|
|
grp = local_wild;
|
|
|
|
|
if (grp == NULL)
|
|
|
|
|
return (NULL);
|
|
|
|
|
out:
|
|
|
|
|
return (grp->il_inp[INP_PCBLBGROUP_PKTHASH(faddr, lport, fport) %
|
|
|
|
|
grp->il_inpcnt]);
|
2018-06-06 11:45:57 -04:00
|
|
|
}
|
|
|
|
|
|
2023-04-20 11:48:08 -04:00
|
|
|
static bool
|
|
|
|
|
in_pcblookup_exact_match(const struct inpcb *inp, struct in_addr faddr,
|
|
|
|
|
u_short fport, struct in_addr laddr, u_short lport)
|
|
|
|
|
{
|
|
|
|
|
#ifdef INET6
|
|
|
|
|
/* XXX inp locking */
|
|
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0)
|
|
|
|
|
return (false);
|
|
|
|
|
#endif
|
|
|
|
|
if (inp->inp_faddr.s_addr == faddr.s_addr &&
|
|
|
|
|
inp->inp_laddr.s_addr == laddr.s_addr &&
|
|
|
|
|
inp->inp_fport == fport &&
|
|
|
|
|
inp->inp_lport == lport)
|
|
|
|
|
return (true);
|
|
|
|
|
return (false);
|
|
|
|
|
}
|
|
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup
flags, supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), though there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilize 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
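The commit message above requires callers to pass exactly one of INPLOOKUP_RLOCKPCB and INPLOOKUP_WLOCKPCB. A hedged sketch of that "exactly one lock flag" validity check; the flag values here are illustrative, not the kernel's actual definitions from in_pcb.h:

```c
#include <assert.h>

/* Illustrative flag values; the real definitions live in netinet/in_pcb.h. */
#define INPLOOKUP_WILDCARD	0x00000001
#define INPLOOKUP_RLOCKPCB	0x00000002
#define INPLOOKUP_WLOCKPCB	0x00000004

/* Exactly one of the two lock flags must be set: neither is invalid,
 * and both together is invalid. */
static int
lookup_flags_valid(int flags)
{
	int lock = flags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB);

	return (lock == INPLOOKUP_RLOCKPCB || lock == INPLOOKUP_WLOCKPCB);
}
```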
|
|
|
static struct inpcb *
|
2023-02-09 15:59:27 -05:00
|
|
|
in_pcblookup_hash_exact(struct inpcbinfo *pcbinfo, struct in_addr faddr,
|
|
|
|
|
u_short fport, struct in_addr laddr, u_short lport)
|
1995-04-08 21:29:31 -04:00
|
|
|
{
|
|
|
|
|
struct inpcbhead *head;
|
2023-04-20 11:48:08 -04:00
|
|
|
struct inpcb *inp;
|
1995-04-08 21:29:31 -04:00
|
|
|
|
2019-11-07 15:49:56 -05:00
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
|
|
|
|
|
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worst case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp iterators that specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
2023-04-20 11:48:01 -04:00
|
|
|
head = &pcbinfo->ipi_hash_exact[INP_PCBHASH(&faddr, lport, fport,
|
2007-04-30 19:12:05 -04:00
|
|
|
pcbinfo->ipi_hashmask)];
|
|
|
|
CK_LIST_FOREACH(inp, head, inp_hash_exact) {
|
2023-04-20 11:48:08 -04:00
|
|
|
if (in_pcblookup_exact_match(inp, faddr, fport, laddr, lport))
|
2023-02-07 12:21:52 -05:00
|
|
|
return (inp);
|
1996-10-07 15:06:12 -04:00
|
|
|
}
|
2023-04-20 11:48:08 -04:00
|
|
|
return (NULL);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
typedef enum {
|
|
|
|
|
INPLOOKUP_MATCH_NONE = 0,
|
|
|
|
|
INPLOOKUP_MATCH_WILD = 1,
|
|
|
|
|
INPLOOKUP_MATCH_LADDR = 2,
|
|
|
|
|
} inp_lookup_match_t;
|
|
|
|
|
|
|
|
|
|
static inp_lookup_match_t
|
|
|
|
|
in_pcblookup_wild_match(const struct inpcb *inp, struct in_addr laddr,
|
|
|
|
|
u_short lport)
|
|
|
|
|
{
|
|
|
|
|
#ifdef INET6
|
|
|
|
|
/* XXX inp locking */
|
|
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0)
|
|
|
|
|
return (INPLOOKUP_MATCH_NONE);
|
|
|
|
|
#endif
|
|
|
|
|
if (inp->inp_faddr.s_addr != INADDR_ANY || inp->inp_lport != lport)
|
|
|
|
|
return (INPLOOKUP_MATCH_NONE);
|
|
|
|
|
if (inp->inp_laddr.s_addr == INADDR_ANY)
|
|
|
|
|
return (INPLOOKUP_MATCH_WILD);
|
|
|
|
|
if (inp->inp_laddr.s_addr == laddr.s_addr)
|
|
|
|
|
return (INPLOOKUP_MATCH_LADDR);
|
|
|
|
|
return (INPLOOKUP_MATCH_NONE);
|
2023-02-09 15:59:27 -05:00
|
|
|
}
|
2006-11-30 05:54:54 -05:00
|
|
|
|
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets. To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.
Changing SMR to periodically recycle garbage is not trivial. Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred. This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.
Commit 220d89212943 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there. This patch goes further to handle
lookups of unconnected sockets. Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics. This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.
In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups. When a match is found, we try to lock the inpcb and
re-validate its connection info. In the common case, this works well
and we can simply return the inpcb. If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR. In
particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38572
2023-04-20 11:48:19 -04:00
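The insertion ordering described above (jailed sockets before non-jailed, specified local addresses before unspecified) can be modeled as a two-bit rank, with lower ranks sorting earlier in the hash chain so that a wild lookup returning the first match honors the priorities. This is an illustrative model, not the kernel's in_pcbinhash() code:

```c
#include <assert.h>

/* Rank a PCB for hash-chain insertion: lower ranks come first.
 * Bit 1: non-jailed sorts after jailed.
 * Bit 0: unspecified (wildcard) laddr sorts after specified. */
static int
chain_rank(int jailed, int laddr_specified)
{
	return (((!jailed) << 1) | !laddr_specified);
}
```

With this rank, a scan that stops at the first match yields jailed/specified, then jailed/wild, then non-jailed/specified, then non-jailed/wild.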
|
|
|
#define INP_LOOKUP_AGAIN ((struct inpcb *)(uintptr_t)-1)
|
|
|
|
|
|
|
|
|
|
static struct inpcb *
|
|
|
|
|
in_pcblookup_hash_wild_smr(struct inpcbinfo *pcbinfo, struct in_addr faddr,
|
|
|
|
|
u_short fport, struct in_addr laddr, u_short lport,
|
|
|
|
|
const inp_lookup_t lockflags)
|
|
|
|
|
{
|
|
|
|
|
struct inpcbhead *head;
|
|
|
|
|
struct inpcb *inp;
|
|
|
|
|
|
|
|
|
|
KASSERT(SMR_ENTERED(pcbinfo->ipi_smr),
|
|
|
|
|
("%s: not in SMR read section", __func__));
|
|
|
|
|
|
|
|
|
|
head = &pcbinfo->ipi_hash_wild[INP_PCBHASH_WILD(lport,
|
|
|
|
|
pcbinfo->ipi_hashmask)];
|
|
|
|
|
CK_LIST_FOREACH(inp, head, inp_hash_wild) {
|
|
|
|
|
inp_lookup_match_t match;
|
|
|
|
|
|
|
|
|
|
match = in_pcblookup_wild_match(inp, laddr, lport);
|
|
|
|
|
if (match == INPLOOKUP_MATCH_NONE)
|
|
|
|
|
continue;
|
|
|
|
|
|
|
|
|
|
if (__predict_true(inp_smr_lock(inp, lockflags))) {
|
2023-05-30 15:15:48 -04:00
|
|
|
match = in_pcblookup_wild_match(inp, laddr, lport);
|
|
|
|
|
if (match != INPLOOKUP_MATCH_NONE &&
|
|
|
|
|
prison_check_ip4_locked(inp->inp_cred->cr_prison,
|
|
|
|
|
&laddr) == 0)
|
|
|
|
return (inp);
|
|
|
|
|
inp_unlock(inp, lockflags);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* The matching socket disappeared out from under us. Fall back
|
|
|
|
|
* to a serialized lookup.
|
|
|
|
|
*/
|
|
|
|
|
return (INP_LOOKUP_AGAIN);
|
|
|
|
|
}
|
|
|
|
|
return (NULL);
|
|
|
|
|
}
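The INP_LOOKUP_AGAIN sentinel lets the SMR fast path signal that a matching PCB vanished out from under the reader; the caller then retries through the serialized path. A toy model of that two-phase control flow, where the stub paths are placeholders for the real SMR and hash-locked lookups:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel distinct from both NULL and any valid pointer. */
#define LOOKUP_AGAIN	((void *)(uintptr_t)-1)

/* Unlocked fast path: may lose a race and ask the caller to retry. */
static void *
lookup_fast(int contended, void *hit)
{
	return (contended ? LOOKUP_AGAIN : hit);
}

/* Serialized slow path: always reaches a definitive answer. */
static void *
lookup_slow(void *hit)
{
	return (hit);
}

static void *
lookup(int contended, void *hit)
{
	void *inp = lookup_fast(contended, hit);

	if (inp == LOOKUP_AGAIN)	/* fast path raced: fall back */
		inp = lookup_slow(hit);
	return (inp);
}
```

The key property is that the fast path never returns a wrong answer: it either returns a validated PCB, NULL, or defers to the slow path.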
|
|
|
|
|
|
2023-02-09 15:59:27 -05:00
|
|
|
static struct inpcb *
|
|
|
|
|
in_pcblookup_hash_wild_locked(struct inpcbinfo *pcbinfo, struct in_addr faddr,
|
|
|
|
|
u_short fport, struct in_addr laddr, u_short lport)
|
|
|
|
|
{
|
|
|
|
|
struct inpcbhead *head;
|
|
|
|
|
struct inpcb *inp, *local_wild, *local_exact, *jail_wild;
|
2006-11-30 05:54:54 -05:00
|
|
|
#ifdef INET6
|
2023-02-09 15:59:27 -05:00
|
|
|
struct inpcb *local_wild_mapped;
|
2006-11-30 05:54:54 -05:00
|
|
|
#endif
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
a restricted process view, no networking, and so on.
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 09:32:14 -05:00
|
|
|
|
2023-02-09 15:59:27 -05:00
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
2022-11-02 13:08:07 -04:00
|
|
|
|
2023-02-09 15:59:27 -05:00
|
|
|
/*
|
|
|
|
|
* Order of socket selection - we always prefer jails.
|
|
|
|
|
* 1. jailed, non-wild.
|
|
|
|
|
* 2. jailed, wild.
|
|
|
|
|
* 3. non-jailed, non-wild.
|
|
|
|
|
* 4. non-jailed, wild.
|
|
|
|
|
*/
|
|
|
|
head = &pcbinfo->ipi_hash_wild[INP_PCBHASH_WILD(lport,
|
2023-02-09 15:59:27 -05:00
|
|
|
pcbinfo->ipi_hashmask)];
|
|
|
|
|
local_wild = local_exact = jail_wild = NULL;
|
|
|
|
|
#ifdef INET6
|
|
|
|
|
local_wild_mapped = NULL;
|
|
|
|
|
#endif
|
|
|
|
CK_LIST_FOREACH(inp, head, inp_hash_wild) {
|
2023-04-20 11:48:08 -04:00
|
|
|
inp_lookup_match_t match;
|
2023-02-09 15:59:27 -05:00
|
|
|
bool injail;
|
1996-10-07 15:06:12 -04:00
|
|
|
|
2023-04-20 11:48:08 -04:00
|
|
|
match = in_pcblookup_wild_match(inp, laddr, lport);
|
|
|
|
|
if (match == INPLOOKUP_MATCH_NONE)
|
2023-02-09 15:59:27 -05:00
|
|
|
continue;
|
|
|
|
|
2023-02-09 15:59:27 -05:00
|
|
|
injail = prison_flag(inp->inp_cred, PR_IP4) != 0;
|
|
|
|
|
if (injail) {
|
|
|
|
|
if (prison_check_ip4_locked(inp->inp_cred->cr_prison,
|
|
|
|
|
&laddr) != 0)
|
|
|
|
|
continue;
|
|
|
|
|
} else {
|
|
|
|
|
if (local_exact != NULL)
|
|
|
|
|
continue;
|
|
|
|
|
}
|
|
|
|
|
2023-04-20 11:48:08 -04:00
|
|
|
if (match == INPLOOKUP_MATCH_LADDR) {
|
2023-02-09 15:59:27 -05:00
|
|
|
if (injail)
|
|
|
|
|
return (inp);
|
|
|
|
|
local_exact = inp;
|
2023-04-20 11:48:08 -04:00
|
|
|
} else {
|
2006-11-30 05:54:54 -05:00
|
|
|
#ifdef INET6
|
2023-02-09 15:59:27 -05:00
|
|
|
/* XXX inp locking, NULL check */
|
|
|
|
|
if (inp->inp_vflag & INP_IPV6PROTO)
|
|
|
|
|
local_wild_mapped = inp;
|
|
|
|
|
else
|
2012-01-21 21:13:19 -05:00
|
|
|
#endif
|
2023-02-09 15:59:27 -05:00
|
|
|
if (injail)
|
|
|
|
|
jail_wild = inp;
|
|
|
|
|
else
|
|
|
|
|
local_wild = inp;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
if (jail_wild != NULL)
|
|
|
|
|
return (jail_wild);
|
|
|
|
|
if (local_exact != NULL)
|
|
|
|
|
return (local_exact);
|
|
|
|
|
if (local_wild != NULL)
|
|
|
|
|
return (local_wild);
|
2006-11-30 05:54:54 -05:00
|
|
|
#ifdef INET6
|
2023-02-09 15:59:27 -05:00
|
|
|
if (local_wild_mapped != NULL)
|
|
|
|
|
return (local_wild_mapped);
|
2012-01-21 21:13:19 -05:00
|
|
|
#endif
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 04:15:13 -05:00
|
|
|
return (NULL);
|
1995-04-08 21:29:31 -04:00
|
|
|
}
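After the scan, the candidates accumulated above are returned in strict preference order (jailed exact matches already returned immediately during the scan). A small stand-alone model of that fallback chain, where const char pointers stand in for the candidate inpcbs:

```c
#include <assert.h>
#include <stddef.h>

/* Mirror of the post-scan preference order in
 * in_pcblookup_hash_wild_locked():
 * jailed wild > local exact > local wild > v6-mapped local wild. */
static const char *
select_candidate(const char *jail_wild, const char *local_exact,
    const char *local_wild, const char *local_wild_mapped)
{
	if (jail_wild != NULL)
		return (jail_wild);
	if (local_exact != NULL)
		return (local_exact);
	if (local_wild != NULL)
		return (local_wild);
	return (local_wild_mapped);	/* may be NULL: no match at all */
}
```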
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup() (there should be one).
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
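The lock-flag contract described above (exactly one of the two lock flags per lookup) can be modeled as a small check. The flag values below match the spirit of the commit but are written out locally for illustration; treat them as assumptions rather than the kernel's authoritative definitions.

```c
#include <assert.h>

/* Simplified model of the lookup flags described above. */
#define INPLOOKUP_WILDCARD	0x00000001
#define INPLOOKUP_RLOCKPCB	0x00000002
#define INPLOOKUP_WLOCKPCB	0x00000004
#define INPLOOKUP_LOCKMASK	(INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)

/*
 * Callers must pass exactly one lock flag: the lookup routine itself
 * decides when to take ipi_hash_lock, and hands back the inpcb locked
 * as requested (read or write).
 */
static int
lookup_flags_valid(int flags)
{
	int lock = flags & INPLOOKUP_LOCKMASK;

	return (lock == INPLOOKUP_RLOCKPCB || lock == INPLOOKUP_WLOCKPCB);
}
```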
|
|
|
|
|
|
|
|
/*
|
2023-02-09 15:59:27 -05:00
|
|
|
* Lookup PCB in hash list, using pcbinfo tables. This variation assumes
|
|
|
|
|
* that the caller has either locked the hash list, which usually happens
|
|
|
|
|
* for bind(2) operations, or is in an SMR section, which happens when sorting
|
|
|
|
|
* out incoming packets.
|
|
|
|
*/
|
|
|
|
|
static struct inpcb *
|
2023-02-09 15:59:27 -05:00
|
|
|
in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo, struct in_addr faddr,
|
|
|
|
|
u_int fport_arg, struct in_addr laddr, u_int lport_arg, int lookupflags,
|
|
|
|
|
uint8_t numa_domain)
|
|
|
|
|
{
|
|
|
|
|
struct inpcb *inp;
|
|
|
|
|
const u_short fport = fport_arg, lport = lport_arg;
|
|
|
|
|
|
|
|
|
|
KASSERT((lookupflags & ~INPLOOKUP_WILDCARD) == 0,
|
|
|
|
|
("%s: invalid lookup flags %d", __func__, lookupflags));
|
|
|
|
|
KASSERT(faddr.s_addr != INADDR_ANY,
|
|
|
|
|
("%s: invalid foreign address", __func__));
|
|
|
|
|
KASSERT(laddr.s_addr != INADDR_ANY,
|
|
|
|
|
("%s: invalid local address", __func__));
|
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets. To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.
Changing SMR to periodically recycle garbage is not trivial. Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred. This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.
Commit 220d89212943 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there. This patch goes further to handle
lookups of unconnected sockets. Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics. This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.
In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups. When a match is found, we try to lock the inpcb and
re-validate its connection info. In the common case, this works well
and we can simply return the inpcb. If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR. In
particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38572
2023-04-20 11:48:19 -04:00
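The chain-ordering invariant described above (jailed before non-jailed, specific local addresses before wildcard) can be expressed as a precedence predicate. The struct and function names here are illustrative only; the kernel encodes this ordering inside in_pcbinhash() rather than in a standalone comparator.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified entry: only the attributes that drive order. */
struct entry {
	bool jailed;		/* socket belongs to a jail */
	bool bound_specific;	/* bound to a specific laddr, not INADDR_ANY */
};

/*
 * Returns nonzero when 'a' must precede 'b' in a hash chain, so a wild
 * lookup can simply return the first match and still preserve the old
 * semantics: jailed sockets first, and within each group, specified
 * local addresses before unspecified ones.
 */
static int
chain_precedes(const struct entry *a, const struct entry *b)
{

	if (a->jailed != b->jailed)
		return (a->jailed);
	if (a->bound_specific != b->bound_specific)
		return (a->bound_specific);
	return (0);
}
```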
|
|
|
INP_HASH_WLOCK_ASSERT(pcbinfo);
|
2023-02-09 15:59:27 -05:00
|
|
|
|
|
|
|
|
inp = in_pcblookup_hash_exact(pcbinfo, faddr, fport, laddr, lport);
|
|
|
|
|
if (inp != NULL)
|
|
|
|
|
return (inp);
|
|
|
|
|
|
|
|
|
|
if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
|
|
|
|
inp = in_pcblookup_lbgroup(pcbinfo, &faddr, fport,
|
|
|
|
|
&laddr, lport, numa_domain);
|
2023-02-09 15:59:27 -05:00
|
|
|
if (inp == NULL) {
|
|
|
|
|
inp = in_pcblookup_hash_wild_locked(pcbinfo, faddr,
|
|
|
|
|
fport, laddr, lport);
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
return (inp);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static struct inpcb *
|
|
|
|
in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr,
|
|
|
|
u_int fport, struct in_addr laddr, u_int lport, int lookupflags,
|
2023-02-09 15:59:27 -05:00
|
|
|
uint8_t numa_domain)
|
|
|
|
{
|
|
|
|
|
struct inpcb *inp;
|
|
|
|
const inp_lookup_t lockflags = lookupflags & INPLOOKUP_LOCKMASK;
|
|
|
|
|
|
|
|
|
|
KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0,
|
|
|
|
|
("%s: LOCKPCB not set", __func__));
|
|
|
|
|
|
|
|
|
|
INP_HASH_WLOCK(pcbinfo);
|
|
|
|
|
inp = in_pcblookup_hash_locked(pcbinfo, faddr, fport, laddr, lport,
|
|
|
|
|
lookupflags & ~INPLOOKUP_LOCKMASK, numa_domain);
|
|
|
|
|
if (inp != NULL && !inp_trylock(inp, lockflags)) {
|
|
|
|
|
in_pcbref(inp);
|
|
|
|
|
INP_HASH_WUNLOCK(pcbinfo);
|
|
|
|
|
inp_lock(inp, lockflags);
|
|
|
|
|
if (in_pcbrele(inp, lockflags))
|
|
|
|
|
/* XXX-MJ or retry until we get a negative match? */
|
|
|
|
|
inp = NULL;
|
|
|
|
|
} else {
|
|
|
|
|
INP_HASH_WUNLOCK(pcbinfo);
|
|
|
|
|
}
|
|
|
|
|
return (inp);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static struct inpcb *
|
|
|
|
|
in_pcblookup_hash_smr(struct inpcbinfo *pcbinfo, struct in_addr faddr,
|
|
|
|
|
u_int fport_arg, struct in_addr laddr, u_int lport_arg, int lookupflags,
|
|
|
|
|
uint8_t numa_domain)
|
|
|
|
|
{
|
|
|
|
|
struct inpcb *inp;
|
|
|
|
|
const inp_lookup_t lockflags = lookupflags & INPLOOKUP_LOCKMASK;
|
|
|
|
|
const u_short fport = fport_arg, lport = lport_arg;
|
|
|
|
|
2023-02-03 10:56:26 -05:00
|
|
|
KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0,
|
|
|
|
|
("%s: invalid lookup flags %d", __func__, lookupflags));
|
|
|
|
|
KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0,
|
|
|
|
|
("%s: LOCKPCB not set", __func__));
|
|
|
|
|
|
2021-12-02 17:45:04 -05:00
|
|
|
smr_enter(pcbinfo->ipi_smr);
|
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets. To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.
Changing SMR to periodically recycle garbage is not trivial. Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred. This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.
Commit 220d89212943 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there. This patch goes further to handle
lookups of unconnected sockets. Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics. This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.
In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups. When a match is found, we try to lock the inpcb and
re-validate its connection info. In the common case, this works well
and we can simply return the inpcb. If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR. In
particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38572
2023-04-20 11:48:19 -04:00
	inp = in_pcblookup_hash_exact(pcbinfo, faddr, fport, laddr, lport);
	if (inp != NULL) {
		if (__predict_true(inp_smr_lock(inp, lockflags))) {
			/*
			 * Revalidate the 4-tuple, the socket could have been
			 * disconnected.
			 */
			if (__predict_true(in_pcblookup_exact_match(inp,
			    faddr, fport, laddr, lport)))
				return (inp);
			inp_unlock(inp, lockflags);
		}

		/*
		 * We failed to lock the inpcb, or its connection state changed
		 * out from under us.  Fall back to a precise search.
		 */
		return (in_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport,
		    lookupflags, numa_domain));
	}

	if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
		inp = in_pcblookup_lbgroup(pcbinfo, &faddr, fport,
		    &laddr, lport, numa_domain);
		if (inp != NULL) {
			if (__predict_true(inp_smr_lock(inp, lockflags))) {
				if (__predict_true(in_pcblookup_wild_match(inp,
				    laddr, lport) != INPLOOKUP_MATCH_NONE))
					return (inp);
				inp_unlock(inp, lockflags);
			}
			inp = INP_LOOKUP_AGAIN;
		} else {
			inp = in_pcblookup_hash_wild_smr(pcbinfo, faddr, fport,
			    laddr, lport, lockflags);
		}
		if (inp == INP_LOOKUP_AGAIN) {
			return (in_pcblookup_hash(pcbinfo, faddr, fport, laddr,
			    lport, lookupflags, numa_domain));
		}
	}

	if (inp == NULL)
		smr_exit(pcbinfo->ipi_smr);

	return (inp);
}

/*
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expense of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 12:33:06 -04:00
 * Public inpcb lookup routines, accepting a 4-tuple, and optionally, an mbuf
 * from which a pre-calculated hash value may be extracted.
|
|
|
*/
|
|
|
|
|
struct inpcb *
|
|
|
|
|
in_pcblookup(struct inpcbinfo *pcbinfo, struct in_addr faddr, u_int fport,
|
2023-02-09 15:59:27 -05:00
|
|
|
struct in_addr laddr, u_int lport, int lookupflags,
|
|
|
|
|
struct ifnet *ifp __unused)
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
|
|
|
{
|
2023-02-09 15:59:27 -05:00
|
|
|
return (in_pcblookup_hash_smr(pcbinfo, faddr, fport, laddr, lport,
|
|
|
|
|
lookupflags, M_NODOM));
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
|
|
|
}
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 12:33:06 -04:00
|
|
|
|
|
|
|
|
struct inpcb *
|
|
|
|
|
in_pcblookup_mbuf(struct inpcbinfo *pcbinfo, struct in_addr faddr,
|
|
|
|
|
u_int fport, struct in_addr laddr, u_int lport, int lookupflags,
|
2023-02-09 15:59:27 -05:00
|
|
|
struct ifnet *ifp __unused, struct mbuf *m)
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 12:33:06 -04:00
|
|
|
{
|
2023-02-09 15:59:27 -05:00
|
|
|
return (in_pcblookup_hash_smr(pcbinfo, faddr, fport, laddr, lport,
|
|
|
|
|
lookupflags, m->m_pkthdr.numa_domain));
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 12:33:06 -04:00
|
|
|
}
|
2011-04-30 07:04:34 -04:00
|
|
|
#endif /* INET */
|
1995-04-08 21:29:31 -04:00
|
|
|
|
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets. To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.
Changing SMR to periodically recycle garbage is not trivial. Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred. This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.
Commit 220d89212943 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there. This patch goes further to handle
lookups of unconnected sockets. Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics. This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.
In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups. When a match is found, we try to lock the inpcb and
re-validate its connection info. In the common case, this works well
and we can simply return the inpcb. If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR. In
particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38572
2023-04-20 11:48:19 -04:00
|
|
|
static bool
|
|
|
|
|
in_pcbjailed(const struct inpcb *inp, unsigned int flag)
|
|
|
|
|
{
|
|
|
|
|
return (prison_flag(inp->inp_cred, flag) != 0);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* Insert the PCB into a hash chain using ordering rules which ensure that
|
|
|
|
|
* in_pcblookup_hash_wild_*() always encounter the highest-ranking PCB first.
|
|
|
|
|
*
|
|
|
|
|
* Specifically, keep jailed PCBs in front of non-jailed PCBs, and keep PCBs
|
2023-04-23 10:36:24 -04:00
|
|
|
* with exact local addresses ahead of wildcard PCBs. Unbound v4-mapped v6 PCBs
|
|
|
|
|
 * always appear last, regardless of whether they are jailed.
|
|
|
|
*/
|
|
|
|
|
static void
|
|
|
|
|
_in_pcbinshash_wild(struct inpcbhead *pcbhash, struct inpcb *inp)
|
|
|
|
|
{
|
|
|
|
|
struct inpcb *last;
|
|
|
|
|
bool bound, injail;
|
|
|
|
|
|
2023-04-23 10:36:24 -04:00
|
|
|
INP_LOCK_ASSERT(inp);
|
|
|
|
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
|
|
|
|
|
|
|
|
|
|
last = NULL;
|
|
|
|
|
bound = inp->inp_laddr.s_addr != INADDR_ANY;
|
2023-04-23 10:36:24 -04:00
|
|
|
if (!bound && (inp->inp_vflag & INP_IPV6PROTO) != 0) {
|
|
|
|
|
CK_LIST_FOREACH(last, pcbhash, inp_hash_wild) {
|
|
|
|
|
if (CK_LIST_NEXT(last, inp_hash_wild) == NULL) {
|
|
|
|
|
CK_LIST_INSERT_AFTER(last, inp, inp_hash_wild);
|
|
|
|
|
return;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
CK_LIST_INSERT_HEAD(pcbhash, inp, inp_hash_wild);
|
|
|
|
|
return;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
injail = in_pcbjailed(inp, PR_IP4);
|
|
|
|
|
if (!injail) {
|
|
|
|
|
CK_LIST_FOREACH(last, pcbhash, inp_hash_wild) {
|
2023-04-23 10:36:24 -04:00
|
|
|
if (!in_pcbjailed(last, PR_IP4))
|
|
|
|
break;
|
|
|
|
|
if (CK_LIST_NEXT(last, inp_hash_wild) == NULL) {
|
|
|
|
|
CK_LIST_INSERT_AFTER(last, inp, inp_hash_wild);
|
|
|
|
|
return;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
} else if (!CK_LIST_EMPTY(pcbhash) &&
|
|
|
|
|
!in_pcbjailed(CK_LIST_FIRST(pcbhash), PR_IP4)) {
|
|
|
|
|
CK_LIST_INSERT_HEAD(pcbhash, inp, inp_hash_wild);
|
|
|
|
|
return;
|
|
|
|
|
}
|
|
|
|
|
if (!bound) {
|
|
|
|
|
CK_LIST_FOREACH_FROM(last, pcbhash, inp_hash_wild) {
|
|
|
|
|
if (last->inp_laddr.s_addr == INADDR_ANY)
|
|
|
|
|
break;
|
|
|
|
|
if (CK_LIST_NEXT(last, inp_hash_wild) == NULL) {
|
|
|
|
|
CK_LIST_INSERT_AFTER(last, inp, inp_hash_wild);
|
|
|
|
|
return;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
if (last == NULL)
|
|
|
|
|
CK_LIST_INSERT_HEAD(pcbhash, inp, inp_hash_wild);
|
|
|
|
|
else
|
|
|
|
|
CK_LIST_INSERT_BEFORE(last, inp, inp_hash_wild);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#ifdef INET6
|
|
|
|
|
/*
|
|
|
|
|
* See the comment above _in_pcbinshash_wild().
|
|
|
|
|
*/
|
|
|
|
|
static void
|
|
|
|
|
_in6_pcbinshash_wild(struct inpcbhead *pcbhash, struct inpcb *inp)
|
|
|
|
|
{
|
|
|
|
|
struct inpcb *last;
|
|
|
|
|
bool bound, injail;
|
|
|
|
|
|
2023-04-23 10:36:24 -04:00
|
|
|
INP_LOCK_ASSERT(inp);
|
|
|
|
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
|
|
|
|
|
|
|
|
|
|
last = NULL;
|
|
|
|
|
bound = !IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr);
|
|
|
|
|
injail = in_pcbjailed(inp, PR_IP6);
|
|
|
|
|
if (!injail) {
|
|
|
|
|
CK_LIST_FOREACH(last, pcbhash, inp_hash_wild) {
|
2023-04-23 10:36:24 -04:00
|
|
|
if (!in_pcbjailed(last, PR_IP6))
|
|
|
|
break;
|
|
|
|
|
if (CK_LIST_NEXT(last, inp_hash_wild) == NULL) {
|
|
|
|
|
CK_LIST_INSERT_AFTER(last, inp, inp_hash_wild);
|
|
|
|
|
return;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
} else if (!CK_LIST_EMPTY(pcbhash) &&
|
|
|
|
|
!in_pcbjailed(CK_LIST_FIRST(pcbhash), PR_IP6)) {
|
|
|
|
|
CK_LIST_INSERT_HEAD(pcbhash, inp, inp_hash_wild);
|
|
|
|
|
return;
|
|
|
|
|
}
|
|
|
|
|
if (!bound) {
|
|
|
|
|
CK_LIST_FOREACH_FROM(last, pcbhash, inp_hash_wild) {
|
|
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(&last->in6p_laddr))
|
|
|
|
|
break;
|
|
|
|
|
if (CK_LIST_NEXT(last, inp_hash_wild) == NULL) {
|
|
|
|
|
CK_LIST_INSERT_AFTER(last, inp, inp_hash_wild);
|
|
|
|
|
return;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
if (last == NULL)
|
|
|
|
|
CK_LIST_INSERT_HEAD(pcbhash, inp, inp_hash_wild);
|
|
|
|
|
else
|
|
|
|
|
CK_LIST_INSERT_BEFORE(last, inp, inp_hash_wild);
|
|
|
|
|
}
|
|
|
|
|
#endif
|
|
|
|
|
|
1995-04-10 04:52:45 -04:00
|
|
|
/*
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash()
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 04:15:13 -05:00
|
|
|
* Insert PCB onto various hash lists.
|
1995-04-10 04:52:45 -04:00
|
|
|
*/
|
2021-12-02 17:45:04 -05:00
|
|
|
int
|
|
|
|
|
in_pcbinshash(struct inpcb *inp)
|
1995-04-08 21:29:31 -04:00
|
|
|
{
|
|
|
|
struct inpcbhead *pcbhash;
|
|
|
|
|
struct inpcbporthead *pcbporthash;
|
|
|
|
|
struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
|
|
|
|
|
struct inpcbport *phd;
|
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worst case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinshash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp iterators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
2023-04-20 11:48:01 -04:00
|
|
|
uint32_t hash;
|
|
|
|
|
bool connected;
|
1995-04-08 21:29:31 -04:00
|
|
|
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_WLOCK_ASSERT(pcbinfo);
|
2009-03-10 20:29:22 -04:00
|
|
|
KASSERT((inp->inp_flags & INP_INHASHLIST) == 0,
|
|
|
|
|
("in_pcbinshash: INP_INHASHLIST"));
|
2006-04-22 15:15:20 -04:00
|
|
|
|
1999-12-07 12:39:16 -05:00
|
|
|
#ifdef INET6
|
	if (inp->inp_vflag & INP_IPV6) {
		hash = INP6_PCBHASH(&inp->in6p_faddr, inp->inp_lport,
		    inp->inp_fport, pcbinfo->ipi_hashmask);
		connected = !IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr);
	} else
#endif
	{
		hash = INP_PCBHASH(&inp->inp_faddr, inp->inp_lport,
		    inp->inp_fport, pcbinfo->ipi_hashmask);
		connected = !in_nullhost(inp->inp_faddr);
	}

	if (connected)
		pcbhash = &pcbinfo->ipi_hash_exact[hash];
	else
		pcbhash = &pcbinfo->ipi_hash_wild[hash];

	pcbporthash = &pcbinfo->ipi_porthashbase[
	    INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_porthashmask)];
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 04:15:13 -05:00
	/*
	 * Add entry to load balance group.
	 * Only do this if SO_REUSEPORT_LB is set.
	 */
	if ((inp->inp_flags2 & INP_REUSEPORT_LB) != 0) {
		int error = in_pcbinslbgrouphash(inp, M_NODOM);

		if (error != 0)
			return (error);
	}
	/*
	 * Go through port list and look for a head for this lport.
	 */
	CK_LIST_FOREACH(phd, pcbporthash, phd_hash) {
		if (phd->phd_port == inp->inp_lport)
			break;
	}
	/*
	 * If none exists, malloc one and tack it on.
	 */
	if (phd == NULL) {
		phd = uma_zalloc_smr(pcbinfo->ipi_portzone, M_NOWAIT);
		if (phd == NULL) {
			if ((inp->inp_flags2 & INP_REUSEPORT_LB) != 0)
				in_pcbremlbgrouphash(inp);
			return (ENOMEM);
		}
		phd->phd_port = inp->inp_lport;
		CK_LIST_INIT(&phd->phd_pcblist);
		CK_LIST_INSERT_HEAD(pcbporthash, phd, phd_hash);
	}
	inp->inp_phd = phd;
	CK_LIST_INSERT_HEAD(&phd->phd_pcblist, inp, inp_portlist);
	/*
	 * The PCB may have been disconnected in the past.  Before we can safely
	 * make it visible in the hash table, we must wait for all readers which
	 * may be traversing this PCB to finish.
	 */
	if (inp->inp_smr != SMR_SEQ_INVALID) {
		smr_wait(pcbinfo->ipi_smr, inp->inp_smr);
		inp->inp_smr = SMR_SEQ_INVALID;
	}

	if (connected)
		CK_LIST_INSERT_HEAD(pcbhash, inp, inp_hash_exact);
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets. To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.
Changing SMR to periodically recycle garbage is not trivial. Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred. This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.
Commit 220d89212943 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there. This patch goes further to handle
lookups of unconnected sockets. Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics. This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.
In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups. When a match is found, we try to lock the inpcb and
re-validate its connection info. In the common case, this works well
and we can simply return the inpcb. If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR. In
particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38572
2023-04-20 11:48:19 -04:00
	else {
#ifdef INET6
		if ((inp->inp_vflag & INP_IPV6) != 0)
			_in6_pcbinshash_wild(pcbhash, inp);
		else
#endif
			_in_pcbinshash_wild(pcbhash, inp);
	}
	inp->inp_flags |= INP_INHASHLIST;

	return (0);
}
void
in_pcbremhash_locked(struct inpcb *inp)
{
	struct inpcbport *phd = inp->inp_phd;

	INP_WLOCK_ASSERT(inp);
	INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
	MPASS(inp->inp_flags & INP_INHASHLIST);

	if ((inp->inp_flags2 & INP_REUSEPORT_LB) != 0)
		in_pcbremlbgrouphash(inp);
#ifdef INET6
	if (inp->inp_vflag & INP_IPV6) {
		if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr))
			CK_LIST_REMOVE(inp, inp_hash_wild);
		else
			CK_LIST_REMOVE(inp, inp_hash_exact);
	} else
#endif
	{
		if (in_nullhost(inp->inp_faddr))
			CK_LIST_REMOVE(inp, inp_hash_wild);
		else
			CK_LIST_REMOVE(inp, inp_hash_exact);
	}
	CK_LIST_REMOVE(inp, inp_portlist);
	if (CK_LIST_FIRST(&phd->phd_pcblist) == NULL) {
		CK_LIST_REMOVE(phd, phd_hash);
		uma_zfree_smr(inp->inp_pcbinfo->ipi_portzone, phd);
	}
	inp->inp_flags &= ~INP_INHASHLIST;
}
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worse case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp iterators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
2023-04-20 11:48:01 -04:00
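The table-selection rule the commit message describes can be sketched in userland C. This is a hedged illustration, not the kernel code; the enum and function names are hypothetical stand-ins for the kernel's wild/exact tables.

```c
#include <assert.h>
#include <stdint.h>

/*
 * An unconnected PCB still carries the unspecified foreign address
 * (INADDR_ANY, all-zeros), so it lives in the "wild" table; any other
 * foreign address means the socket is connected and the PCB goes to
 * the "exact" table.
 */
enum pcb_table { TABLE_WILD, TABLE_EXACT };

static enum pcb_table
pcb_select_table(uint32_t faddr)
{
	/* Mirrors the kernel's in_nullhost() test on inp_faddr. */
	return (faddr == 0 ? TABLE_WILD : TABLE_EXACT);
}
```

Because the choice depends only on whether the foreign address is set, the linkage pointers stay stable until a connect or disconnect, which is exactly what the lookup path needs.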
static void
in_pcbremhash(struct inpcb *inp)
{
	INP_HASH_WLOCK(inp->inp_pcbinfo);
	in_pcbremhash_locked(inp);
	INP_HASH_WUNLOCK(inp->inp_pcbinfo);
}
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash()
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 04:15:13 -05:00
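The hashed port-list idea above reduces the local-port check from a walk of one global list to a walk of one short chain. A hedged sketch of the bucket-index step, using the power-of-two hashmask idiom; the constants and names are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bucket index for a local port.  hashmask is (nbuckets - 1) for a
 * power-of-two bucket array, so the AND below is a cheap modulo.
 */
static uint32_t
port_hashbucket(uint16_t lport, uint32_t hashmask)
{
	return ((uint32_t)lport & hashmask);
}
```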
/*
 * Move PCB to the proper hash bucket when { faddr, fport } have been
 * changed.  NOTE: This does not handle the case of the lport changing (the
 * hashed port list would have to be updated as well), so the lport must
 * not change after in_pcbinshash() has been called.
 */
void
in_pcbrehash(struct inpcb *inp)
{
	struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
	struct inpcbhead *head;
	uint32_t hash;
	bool connected;

	INP_WLOCK_ASSERT(inp);
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
	INP_HASH_WLOCK_ASSERT(pcbinfo);
	KASSERT(inp->inp_flags & INP_INHASHLIST,
	    ("%s: !INP_INHASHLIST", __func__));
	KASSERT(inp->inp_smr == SMR_SEQ_INVALID,
	    ("%s: inp was disconnected", __func__));

#ifdef INET6
	if (inp->inp_vflag & INP_IPV6) {
		hash = INP6_PCBHASH(&inp->in6p_faddr, inp->inp_lport,
		    inp->inp_fport, pcbinfo->ipi_hashmask);
		connected = !IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr);
	} else
#endif
	{
		hash = INP_PCBHASH(&inp->inp_faddr, inp->inp_lport,
		    inp->inp_fport, pcbinfo->ipi_hashmask);
		connected = !in_nullhost(inp->inp_faddr);
	}
	/*
	 * When rehashing, the caller must ensure that either the new or the
	 * old foreign address was unspecified.
	 */
	if (connected)
		CK_LIST_REMOVE(inp, inp_hash_wild);
	else
		CK_LIST_REMOVE(inp, inp_hash_exact);

	if (connected) {
		head = &pcbinfo->ipi_hash_exact[hash];
		CK_LIST_INSERT_HEAD(head, inp, inp_hash_exact);
	} else {
		head = &pcbinfo->ipi_hash_wild[hash];
		CK_LIST_INSERT_HEAD(head, inp, inp_hash_wild);
	}
}
/*
 * Check for alternatives when higher level complains
 * about service problems.  For now, invalidate cached
 * routing information.  If the route was created dynamically
 * (by a redirect), time to try a default gateway again.
 */
void
in_losing(struct inpcb *inp)
{

	RO_INVALIDATE_CACHE(&inp->inp_route);
	return;
}
Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.
For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.
Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
2003-11-17 19:39:07 -05:00
/*
 * A set label operation has occurred at the socket layer, propagate the
 * label change into the in_pcb for the socket.
 */
void
in_pcbsosetlabel(struct socket *so)
{
#ifdef MAC
	struct inpcb *inp;

	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("in_pcbsosetlabel: so->so_pcb == NULL"));

	INP_WLOCK(inp);
	SOCK_LOCK(so);
	mac_inpcb_sosetlabel(so, inp);
	SOCK_UNLOCK(so);
	INP_WUNLOCK(inp);
#endif
}
void
inp_wlock(struct inpcb *inp)
{

	INP_WLOCK(inp);
}

void
inp_wunlock(struct inpcb *inp)
{

	INP_WUNLOCK(inp);
}

void
inp_rlock(struct inpcb *inp)
{

	INP_RLOCK(inp);
}

void
inp_runlock(struct inpcb *inp)
{

	INP_RUNLOCK(inp);
}
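These thin wrappers exist so that code which treats struct inpcb as opaque can still lock and unlock it without seeing the structure layout or the lock macros. A minimal userland analogue of the same pattern, with a pthread mutex standing in for the kernel rwlock; all names here are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/*
 * Consumers elsewhere would see only "struct opaque;" plus the wrapper
 * prototypes; the full layout stays private to this translation unit.
 */
struct opaque {
	pthread_mutex_t lock;
	int value;
};

static void
opaque_wlock(struct opaque *o)
{
	pthread_mutex_lock(&o->lock);
}

static void
opaque_wunlock(struct opaque *o)
{
	pthread_mutex_unlock(&o->lock);
}
```

The cost is one extra function call per lock operation; the benefit is that the lock implementation can change without recompiling consumers.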
#ifdef INVARIANT_SUPPORT
void
inp_lock_assert(struct inpcb *inp)
{

	INP_WLOCK_ASSERT(inp);
}

void
inp_unlock_assert(struct inpcb *inp)
{

	INP_UNLOCK_ASSERT(inp);
}
#endif
void
inp_apply_all(struct inpcbinfo *pcbinfo,
    void (*func)(struct inpcb *, void *), void *arg)
{
	struct inpcb_iterator inpi = INP_ALL_ITERATOR(pcbinfo,
	    INPLOOKUP_WLOCKPCB);
	struct inpcb *inp;

	while ((inp = inp_next(&inpi)) != NULL)
		func(inp, arg);
}
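The shape of inp_apply_all() is a classic apply-all callback: iterate a collection and invoke a caller-supplied function with an opaque argument on each element. A hedged userland sketch, with a plain array standing in for the locked-PCB iterator; the names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Call func(item, arg) for every element, as inp_apply_all() does
 * for every write-locked inpcb handed out by inp_next(). */
static void
apply_all(int *items, size_t n, void (*func)(int *, void *), void *arg)
{
	for (size_t i = 0; i < n; i++)
		func(&items[i], arg);
}

/* Example callback: accumulate a sum through the opaque argument. */
static void
sum_cb(int *item, void *arg)
{
	*(int *)arg += *item;
}
```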
struct socket *
inp_inpcbtosocket(struct inpcb *inp)
{

	INP_WLOCK_ASSERT(inp);
	return (inp->inp_socket);
}
struct tcpcb *
inp_inpcbtotcpcb(struct inpcb *inp)
{

	INP_WLOCK_ASSERT(inp);
	return ((struct tcpcb *)inp->inp_ppcb);
}
int
inp_ip_tos_get(const struct inpcb *inp)
{

	return (inp->inp_ip_tos);
}

void
inp_ip_tos_set(struct inpcb *inp, int val)
{

	inp->inp_ip_tos = val;
}
void
inp_4tuple_get(struct inpcb *inp, uint32_t *laddr, uint16_t *lp,
    uint32_t *faddr, uint16_t *fp)
{

	INP_LOCK_ASSERT(inp);
	*laddr = inp->inp_laddr.s_addr;
	*faddr = inp->inp_faddr.s_addr;
	*lp = inp->inp_lport;
	*fp = inp->inp_fport;
}
|
|
|
|
|
|
2008-07-20 20:49:34 -04:00
|
|
|
struct inpcb *
|
|
|
|
|
so_sotoinpcb(struct socket *so)
|
|
|
|
|
{
|
|
|
|
|
|
|
|
|
|
return (sotoinpcb(so));
|
|
|
|
|
}
|
|
|
|
|
|

/*
 * Create an external-format (``xinpcb'') structure using the information in
 * the kernel-format in_pcb structure pointed to by inp.  This is done to
 * reduce the spew of irrelevant information over this interface, to isolate
 * user code from changes in the kernel structure, and potentially to provide
 * information-hiding if we decide that some of this information should be
 * hidden from users.
 */
void
in_pcbtoxinpcb(const struct inpcb *inp, struct xinpcb *xi)
{

	bzero(xi, sizeof(*xi));
	xi->xi_len = sizeof(struct xinpcb);
	if (inp->inp_socket)
		sotoxsocket(inp->inp_socket, &xi->xi_socket);
	bcopy(&inp->inp_inc, &xi->inp_inc, sizeof(struct in_conninfo));
	xi->inp_gencnt = inp->inp_gencnt;
	xi->inp_ppcb = (uintptr_t)inp->inp_ppcb;
	xi->inp_flow = inp->inp_flow;
	xi->inp_flowid = inp->inp_flowid;
	xi->inp_flowtype = inp->inp_flowtype;
	xi->inp_flags = inp->inp_flags;
	xi->inp_flags2 = inp->inp_flags2;
	xi->in6p_cksum = inp->in6p_cksum;
	xi->in6p_hops = inp->in6p_hops;
	xi->inp_ip_tos = inp->inp_ip_tos;
	xi->inp_vflag = inp->inp_vflag;
	xi->inp_ip_ttl = inp->inp_ip_ttl;
	xi->inp_ip_p = inp->inp_ip_p;
	xi->inp_ip_minttl = inp->inp_ip_minttl;
}

int
sysctl_setsockopt(SYSCTL_HANDLER_ARGS, struct inpcbinfo *pcbinfo,
    int (*ctloutput_set)(struct inpcb *, struct sockopt *))
{
	struct sockopt sopt;
	struct inpcb_iterator inpi = INP_ALL_ITERATOR(pcbinfo,
	    INPLOOKUP_WLOCKPCB);
	struct inpcb *inp;
	struct sockopt_parameters *params;
	struct socket *so;
	int error;
	char buf[1024];

	if (req->oldptr != NULL || req->oldlen != 0)
		return (EINVAL);
	if (req->newptr == NULL)
		return (EPERM);
	if (req->newlen > sizeof(buf))
		return (ENOMEM);
	error = SYSCTL_IN(req, buf, req->newlen);
	if (error != 0)
		return (error);
	if (req->newlen < sizeof(struct sockopt_parameters))
		return (EINVAL);
	params = (struct sockopt_parameters *)buf;
	sopt.sopt_level = params->sop_level;
	sopt.sopt_name = params->sop_optname;
	sopt.sopt_dir = SOPT_SET;
	sopt.sopt_val = params->sop_optval;
	sopt.sopt_valsize = req->newlen - sizeof(struct sockopt_parameters);
	sopt.sopt_td = NULL;
#ifdef INET6
	if (params->sop_inc.inc_flags & INC_ISIPV6) {
		if (IN6_IS_SCOPE_LINKLOCAL(&params->sop_inc.inc6_laddr))
			params->sop_inc.inc6_laddr.s6_addr16[1] =
			    htons(params->sop_inc.inc6_zoneid & 0xffff);
		if (IN6_IS_SCOPE_LINKLOCAL(&params->sop_inc.inc6_faddr))
			params->sop_inc.inc6_faddr.s6_addr16[1] =
			    htons(params->sop_inc.inc6_zoneid & 0xffff);
	}
#endif
	if (params->sop_inc.inc_lport != htons(0)) {
		if (params->sop_inc.inc_fport == htons(0))
			inpi.hash = INP_PCBHASH_WILD(params->sop_inc.inc_lport,
			    pcbinfo->ipi_hashmask);
		else
#ifdef INET6
			if (params->sop_inc.inc_flags & INC_ISIPV6)
				inpi.hash = INP6_PCBHASH(
				    &params->sop_inc.inc6_faddr,
				    params->sop_inc.inc_lport,
				    params->sop_inc.inc_fport,
				    pcbinfo->ipi_hashmask);
			else
#endif
				inpi.hash = INP_PCBHASH(
				    &params->sop_inc.inc_faddr,
				    params->sop_inc.inc_lport,
				    params->sop_inc.inc_fport,
				    pcbinfo->ipi_hashmask);
	}
	while ((inp = inp_next(&inpi)) != NULL)
		if (inp->inp_gencnt == params->sop_id) {
			if (inp->inp_flags & INP_DROPPED) {
				INP_WUNLOCK(inp);
				return (ECONNRESET);
			}
			so = inp->inp_socket;
			KASSERT(so != NULL, ("inp_socket == NULL"));
			soref(so);
			error = (*ctloutput_set)(inp, &sopt);
			sorele(so);
			break;
		}
	if (inp == NULL)
		error = ESRCH;
	return (error);
}

#ifdef DDB
static void
db_print_indent(int indent)
{
	int i;

	for (i = 0; i < indent; i++)
		db_printf(" ");
}

static void
db_print_inconninfo(struct in_conninfo *inc, const char *name, int indent)
{
	char faddr_str[48], laddr_str[48];

	db_print_indent(indent);
	db_printf("%s at %p\n", name, inc);

	indent += 2;

#ifdef INET6
	if (inc->inc_flags & INC_ISIPV6) {
		/* IPv6. */
		ip6_sprintf(laddr_str, &inc->inc6_laddr);
		ip6_sprintf(faddr_str, &inc->inc6_faddr);
	} else
#endif
	{
		/* IPv4. */
		inet_ntoa_r(inc->inc_laddr, laddr_str);
		inet_ntoa_r(inc->inc_faddr, faddr_str);
	}
	db_print_indent(indent);
	db_printf("inc_laddr %s inc_lport %u\n", laddr_str,
	    ntohs(inc->inc_lport));
	db_print_indent(indent);
	db_printf("inc_faddr %s inc_fport %u\n", faddr_str,
	    ntohs(inc->inc_fport));
}

static void
db_print_inpflags(int inp_flags)
{
	int comma;

	comma = 0;
	if (inp_flags & INP_RECVOPTS) {
		db_printf("%sINP_RECVOPTS", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_RECVRETOPTS) {
		db_printf("%sINP_RECVRETOPTS", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_RECVDSTADDR) {
		db_printf("%sINP_RECVDSTADDR", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_ORIGDSTADDR) {
		db_printf("%sINP_ORIGDSTADDR", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_HDRINCL) {
		db_printf("%sINP_HDRINCL", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_HIGHPORT) {
		db_printf("%sINP_HIGHPORT", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_LOWPORT) {
		db_printf("%sINP_LOWPORT", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_ANONPORT) {
		db_printf("%sINP_ANONPORT", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_RECVIF) {
		db_printf("%sINP_RECVIF", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_MTUDISC) {
		db_printf("%sINP_MTUDISC", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_RECVTTL) {
		db_printf("%sINP_RECVTTL", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_DONTFRAG) {
		db_printf("%sINP_DONTFRAG", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_RECVTOS) {
		db_printf("%sINP_RECVTOS", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & IN6P_IPV6_V6ONLY) {
		db_printf("%sIN6P_IPV6_V6ONLY", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & IN6P_PKTINFO) {
		db_printf("%sIN6P_PKTINFO", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & IN6P_HOPLIMIT) {
		db_printf("%sIN6P_HOPLIMIT", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & IN6P_HOPOPTS) {
		db_printf("%sIN6P_HOPOPTS", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & IN6P_DSTOPTS) {
		db_printf("%sIN6P_DSTOPTS", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & IN6P_RTHDR) {
		db_printf("%sIN6P_RTHDR", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & IN6P_RTHDRDSTOPTS) {
		db_printf("%sIN6P_RTHDRDSTOPTS", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & IN6P_TCLASS) {
		db_printf("%sIN6P_TCLASS", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & IN6P_AUTOFLOWLABEL) {
		db_printf("%sIN6P_AUTOFLOWLABEL", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_ONESBCAST) {
		db_printf("%sINP_ONESBCAST", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_DROPPED) {
		db_printf("%sINP_DROPPED", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & INP_SOCKREF) {
		db_printf("%sINP_SOCKREF", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & IN6P_RFC2292) {
		db_printf("%sIN6P_RFC2292", comma ? ", " : "");
		comma = 1;
	}
	if (inp_flags & IN6P_MTU) {
		db_printf("%sIN6P_MTU", comma ? ", " : "");
		comma = 1;
	}
}

static void
db_print_inpvflag(u_char inp_vflag)
{
	int comma;

	comma = 0;
	if (inp_vflag & INP_IPV4) {
		db_printf("%sINP_IPV4", comma ? ", " : "");
		comma = 1;
	}
	if (inp_vflag & INP_IPV6) {
		db_printf("%sINP_IPV6", comma ? ", " : "");
		comma = 1;
	}
	if (inp_vflag & INP_IPV6PROTO) {
		db_printf("%sINP_IPV6PROTO", comma ? ", " : "");
		comma = 1;
	}
}

static void
db_print_inpcb(struct inpcb *inp, const char *name, int indent)
{

	db_print_indent(indent);
	db_printf("%s at %p\n", name, inp);

	indent += 2;

	db_print_indent(indent);
	db_printf("inp_flow: 0x%x\n", inp->inp_flow);

	db_print_inconninfo(&inp->inp_inc, "inp_conninfo", indent);

	db_print_indent(indent);
	db_printf("inp_ppcb: %p inp_pcbinfo: %p inp_socket: %p\n",
	    inp->inp_ppcb, inp->inp_pcbinfo, inp->inp_socket);

	db_print_indent(indent);
	db_printf("inp_label: %p inp_flags: 0x%x (",
	    inp->inp_label, inp->inp_flags);
	db_print_inpflags(inp->inp_flags);
	db_printf(")\n");

	db_print_indent(indent);
	db_printf("inp_sp: %p inp_vflag: 0x%x (", inp->inp_sp,
	    inp->inp_vflag);
	db_print_inpvflag(inp->inp_vflag);
	db_printf(")\n");

	db_print_indent(indent);
	db_printf("inp_ip_ttl: %d inp_ip_p: %d inp_ip_minttl: %d\n",
	    inp->inp_ip_ttl, inp->inp_ip_p, inp->inp_ip_minttl);

	db_print_indent(indent);
#ifdef INET6
	if (inp->inp_vflag & INP_IPV6) {
		db_printf("in6p_options: %p in6p_outputopts: %p "
		    "in6p_moptions: %p\n", inp->in6p_options,
		    inp->in6p_outputopts, inp->in6p_moptions);
		db_printf("in6p_icmp6filt: %p in6p_cksum %d "
		    "in6p_hops %u\n", inp->in6p_icmp6filt, inp->in6p_cksum,
		    inp->in6p_hops);
	} else
#endif
	{
		db_printf("inp_ip_tos: %d inp_ip_options: %p "
		    "inp_ip_moptions: %p\n", inp->inp_ip_tos,
		    inp->inp_options, inp->inp_moptions);
	}

	db_print_indent(indent);
	db_printf("inp_phd: %p inp_gencnt: %ju\n", inp->inp_phd,
	    (uintmax_t)inp->inp_gencnt);
}

DB_SHOW_COMMAND(inpcb, db_show_inpcb)
{
	struct inpcb *inp;

	if (!have_addr) {
		db_printf("usage: show inpcb <addr>\n");
		return;
	}
	inp = (struct inpcb *)addr;

	db_print_inpcb(inp, "inpcb", 0);
}
#endif /* DDB */

#ifdef RATELIMIT
/*
 * Modify TX rate limit based on the existing "inp->inp_snd_tag",
 * if any.
 */
int
in_pcbmodify_txrtlmt(struct inpcb *inp, uint32_t max_pacing_rate)
{
	union if_snd_tag_modify_params params = {
		.rate_limit.max_rate = max_pacing_rate,
		.rate_limit.flags = M_NOWAIT,
	};
	struct m_snd_tag *mst;
	int error;

	mst = inp->inp_snd_tag;
	if (mst == NULL)
		return (EINVAL);

	if (mst->sw->snd_tag_modify == NULL) {
		error = EOPNOTSUPP;
	} else {
		error = mst->sw->snd_tag_modify(mst, &params);
	}
	return (error);
}
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* Query existing TX rate limit based on the existing
|
|
|
|
|
* "inp->inp_snd_tag", if any.
|
|
|
|
|
*/
|
|
|
|
|
int
|
|
|
|
|
in_pcbquery_txrtlmt(struct inpcb *inp, uint32_t *p_max_pacing_rate)
|
|
|
|
|
{
|
|
|
|
|
union if_snd_tag_query_params params = { };
|
|
|
|
|
struct m_snd_tag *mst;
|
|
|
|
|
int error;
|
|
|
|
|
|
|
|
|
|
mst = inp->inp_snd_tag;
|
|
|
|
|
if (mst == NULL)
|
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
2021-09-14 14:43:41 -04:00
|
|
|
if (mst->sw->snd_tag_query == NULL) {
|
Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
2017-01-18 08:31:17 -05:00
|
|
|
error = EOPNOTSUPP;
|
|
|
|
|
} else {
|
2021-09-14 14:43:41 -04:00
|
|
|
error = mst->sw->snd_tag_query(mst, ¶ms);
|
|
|
|
|
if (error == 0 && p_max_pacing_rate != NULL)
|
Implement kernel support for hardware rate limited sockets.
- Add a RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling (RSS) aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4 Gbytes/s, which is suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add a rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which indicates whether a network driver supports rate limiting.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate, which is then stored in the
socket structure in the kernel. Later, when packets are transmitted,
the transmit path checks for rate changes. A rate change triggers a
non-blocking ifp->if_snd_tag_alloc() call to the destination network
interface, which then sets up a custom sendqueue with the given rate
limitation parameter. A "struct m_snd_tag" pointer is returned which
serves as a "snd_tag" hint in the m_pkthdr for the subsequently
transmitted mbufs.
2) When the network driver sees that "m->m_pkthdr.snd_tag" is non-NULL,
it moves the packets into the designated rate limited sendqueue given
by the snd_tag pointer. How the traffic is rate limited is up to the
individual driver.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag does not match
that of the network interface. The network adapter frees the mbuf and
returns EAGAIN, which causes ip_output() to release and clear the send
tag. Upon the next ip_output() a new "snd_tag" allocation will be
attempted.
4) When the PCB is detached, the custom sendqueue is released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
2017-01-18 08:31:17 -05:00
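The userspace side of step 1 above is just a setsockopt(2) call. A minimal
sketch follows; the helper name request_pacing_rate() is invented for
illustration, and availability of SO_MAX_PACING_RATE depends on the
operating system and kernel configuration:

```c
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>

/*
 * Ask the kernel to pace transmissions on "fd" at "rate" bytes/s.
 * Returns 0 on success or an errno value.  The helper name is
 * hypothetical; SO_MAX_PACING_RATE is the real socket option
 * described in the commit message above.
 */
static int
request_pacing_rate(int fd, uint32_t rate)
{
#ifdef SO_MAX_PACING_RATE
	if (setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
	    &rate, sizeof(rate)) == -1)
		return (errno);
	return (0);
#else
	(void)fd;
	(void)rate;
	return (ENOPROTOOPT);
#endif
}
```

After this call, the next transmit on the socket is what triggers the
non-blocking ifp->if_snd_tag_alloc() path described in step 1.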
		*p_max_pacing_rate = params.rate_limit.max_rate;
	}
	return (error);
}

/*
 * Query existing TX queue level based on the existing
 * "inp->inp_snd_tag", if any.
 */
int
in_pcbquery_txrlevel(struct inpcb *inp, uint32_t *p_txqueue_level)
{
	union if_snd_tag_query_params params = { };
	struct m_snd_tag *mst;
	int error;

	mst = inp->inp_snd_tag;
	if (mst == NULL)
		return (EINVAL);

	if (mst->sw->snd_tag_query == NULL)
		return (EOPNOTSUPP);

	error = mst->sw->snd_tag_query(mst, &params);
	if (error == 0 && p_txqueue_level != NULL)
		*p_txqueue_level = params.rate_limit.queue_level;
	return (error);
}

/*
 * Allocate a new TX rate limit send tag from the network interface
 * given by the "ifp" argument and save it in "inp->inp_snd_tag":
 */
int
in_pcbattach_txrtlmt(struct inpcb *inp, struct ifnet *ifp,
    uint32_t flowtype, uint32_t flowid, uint32_t max_pacing_rate, struct m_snd_tag **st)
{
	union if_snd_tag_alloc_params params = {
		.rate_limit.hdr.type = (max_pacing_rate == -1U) ?
		    IF_SND_TAG_TYPE_UNLIMITED : IF_SND_TAG_TYPE_RATE_LIMIT,
		.rate_limit.hdr.flowid = flowid,
		.rate_limit.hdr.flowtype = flowtype,
		.rate_limit.hdr.numa_domain = inp->inp_numa_domain,
		.rate_limit.max_rate = max_pacing_rate,
		.rate_limit.flags = M_NOWAIT,
	};
	int error;

	INP_WLOCK_ASSERT(inp);

	/*
	 * If there is already a send tag, or the INP is being torn
	 * down, allocating a new send tag is not allowed. Else send
	 * tags may leak.
	 */
	if (*st != NULL || (inp->inp_flags & INP_DROPPED) != 0)
		return (EINVAL);

	error = m_snd_tag_alloc(ifp, &params, st);
#ifdef INET
	if (error == 0) {
		counter_u64_add(rate_limit_set_ok, 1);
		counter_u64_add(rate_limit_active, 1);
	} else if (error != EOPNOTSUPP)
		counter_u64_add(rate_limit_alloc_fail, 1);
#endif
	return (error);
}

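Step 3 of the commit message above (drivers detecting a route change via an
ifp mismatch in the incoming send tag and rejecting the packet with EAGAIN)
can be sketched as follows. The struct layouts and the check function here
are simplified userspace stand-ins, not driver code:

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct ifnet { int if_index; };
struct m_snd_tag { struct ifnet *ifp; };	/* interface the tag was allocated on */
struct mbuf { struct m_snd_tag *snd_tag; };	/* m_pkthdr.snd_tag, flattened */

/*
 * Sketch of the check a driver performs in its if_transmit() routine:
 * a tag allocated on another interface means the route changed, so the
 * driver rejects the packet with EAGAIN, after which ip_output()
 * releases the tag and attempts a new allocation on the next transmit.
 */
static int
if_transmit_check(struct ifnet *ifp, struct mbuf *m)
{
	if (m->snd_tag != NULL && m->snd_tag->ifp != ifp)
		return (EAGAIN);
	return (0);	/* enqueue on the tag's rate limited sendqueue */
}
```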
void
in_pcbdetach_tag(struct m_snd_tag *mst)
{

	m_snd_tag_rele(mst);
#ifdef INET
	counter_u64_add(rate_limit_active, -1);
#endif
}

/*
 * Free an existing TX rate limit tag based on the "inp->inp_snd_tag",
 * if any:
 */
void
in_pcbdetach_txrtlmt(struct inpcb *inp)
{
	struct m_snd_tag *mst;

	INP_WLOCK_ASSERT(inp);

	mst = inp->inp_snd_tag;
	inp->inp_snd_tag = NULL;

	if (mst == NULL)
		return;

Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp, meaning that code allocating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
2019-05-24 18:30:40 -04:00
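The reference-count life cycle described above, where the tag must survive
past the inp's release until the last in-flight mbuf drops its reference,
can be sketched with a plain counter. This is a userspace model with
invented names, not the kernel's m_snd_tag implementation:

```c
#include <stdbool.h>

/* Userspace model of the send tag reference count. */
struct tag_model {
	int refs;
	bool freed;	/* set when the driver's snd_tag_free would run */
};

static void
tag_model_init(struct tag_model *t)
{
	t->refs = 1;	/* the inp's initial reference */
	t->freed = false;
}

static void
tag_model_ref(struct tag_model *t)
{
	t->refs++;	/* e.g. an mbuf queued in a driver tx ring */
}

static void
tag_model_rele(struct tag_model *t)
{
	if (--t->refs == 0)
		t->freed = true;	/* last reference: free the tag */
}
```

This models why cxgbe's flush problem arose: as long as queued mbufs hold
references, the free callback cannot run, so the driver must drop the
mbuf's reference when the work is handed to the firmware queue.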
	m_snd_tag_rele(mst);
#ifdef INET
	counter_u64_add(rate_limit_active, -1);
#endif
}

int
in_pcboutput_txrtlmt_locked(struct inpcb *inp, struct ifnet *ifp, struct mbuf *mb, uint32_t max_pacing_rate)
|
|
|
{
	int error;

Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output sees
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp, meaning that code allocating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use-after-free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, the tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
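The reference-count life cycle described above can be modeled in a few lines of userland C. This is an illustrative model only, not the kernel's m_snd_tag_ref()/m_snd_tag_rele() implementation, which uses refcount(9) atomics and also drops an ifp reference via if_snd_tag_free() on the last release:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature model of a reference-counted send tag. */
struct tag_model {
	int refs;
	bool driver_freed;	/* models the if_snd_tag_free() callback */
};

void
tag_model_ref(struct tag_model *t)
{
	t->refs++;
}

void
tag_model_rele(struct tag_model *t)
{
	/* On the last release the driver's free callback runs. */
	if (--t->refs == 0)
		t->driver_freed = true;
}
```

This is the property the commit relies on for cxgbe: the flush request in snd_tag_free only fires once the inp reference and every in-flight mbuf reference are gone.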
	/*
	 * If the existing send tag is for the wrong interface due to
	 * a route change, first drop the existing tag. Set the
	 * CHANGED flag so that we will keep trying to allocate a new
	 * tag if we fail to allocate one this time.
	 */
	if (inp->inp_snd_tag != NULL && inp->inp_snd_tag->ifp != ifp) {
		in_pcbdetach_txrtlmt(inp);
		inp->inp_flags2 |= INP_RATE_LIMIT_CHANGED;
	}

	/*
	 * NOTE: When attaching to a network interface a reference is
	 * made to ensure the network interface doesn't go away until
	 * all ratelimit connections are gone. The network interface
	 * pointers compared below represent valid network interfaces,
	 * except when comparing towards NULL.
	 */
	if (max_pacing_rate == 0 && inp->inp_snd_tag == NULL) {
		error = 0;
	} else if (!(ifp->if_capenable & IFCAP_TXRTLMT)) {
		if (inp->inp_snd_tag != NULL)
			in_pcbdetach_txrtlmt(inp);
		error = 0;
	} else if (inp->inp_snd_tag == NULL) {
		/*
		 * In order to utilize packet pacing with RSS, we need
		 * to wait until there is a valid RSS hash before we
		 * can proceed:
		 */
		if (M_HASHTYPE_GET(mb) == M_HASHTYPE_NONE) {
			error = EAGAIN;
		} else {
			error = in_pcbattach_txrtlmt(inp, ifp, M_HASHTYPE_GET(mb),
			    mb->m_pkthdr.flowid, max_pacing_rate, &inp->inp_snd_tag);
		}
	} else {
		error = in_pcbmodify_txrtlmt(inp, max_pacing_rate);
	}
	if (error == 0 || error == EOPNOTSUPP)
		inp->inp_flags2 &= ~INP_RATE_LIMIT_CHANGED;

	return (error);
}
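The branch structure of in_pcboutput_txrtlmt_locked() condenses into a pure decision function over four inputs. The enum and helper below are hypothetical, written only to make the outcomes of the if/else chain testable in userland; they are not part of the kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical condensation of the in_pcboutput_txrtlmt_locked() chain. */
enum rl_action { RL_NONE, RL_DETACH, RL_DEFER, RL_ATTACH, RL_MODIFY };

enum rl_action
rl_decide(unsigned rate, bool have_tag, bool cap_txrtlmt, bool have_hash)
{
	if (rate == 0 && !have_tag)
		return (RL_NONE);		/* nothing to do */
	if (!cap_txrtlmt)
		return (have_tag ? RL_DETACH : RL_NONE);
	if (!have_tag)
		return (have_hash ? RL_ATTACH : RL_DEFER);	/* DEFER == EAGAIN */
	return (RL_MODIFY);
}
```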

/*
 * This function should be called when the INP_RATE_LIMIT_CHANGED flag
 * is set in the fast path and will attach/detach/modify the TX rate
 * limit send tag based on the socket's so_max_pacing_rate value.
 */
void
in_pcboutput_txrtlmt(struct inpcb *inp, struct ifnet *ifp, struct mbuf *mb)
{
	struct socket *socket;
	uint32_t max_pacing_rate;
	bool did_upgrade;

	if (inp == NULL)
		return;

	socket = inp->inp_socket;
	if (socket == NULL)
		return;

	if (!INP_WLOCKED(inp)) {
		/*
		 * NOTE: If the write locking fails, we need to bail
		 * out and use the non-ratelimited ring for the
		 * transmit until there is a new chance to get the
		 * write lock.
		 */
		if (!INP_TRY_UPGRADE(inp))
			return;
		did_upgrade = true;
	} else {
		did_upgrade = false;
	}

	/*
	 * NOTE: The so_max_pacing_rate value is read unlocked,
	 * because atomic updates are not required since the variable
	 * is checked at every mbuf we send. It is assumed that the
	 * variable read itself will be atomic.
	 */
	max_pacing_rate = socket->so_max_pacing_rate;

	in_pcboutput_txrtlmt_locked(inp, ifp, mb, max_pacing_rate);
	if (did_upgrade)
		INP_DOWNGRADE(inp);
}
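The try-upgrade/downgrade discipline used by both fast-path functions can be modeled with a toy lock state. The names below are hypothetical and only mirror the INP_WLOCKED()/INP_TRY_UPGRADE()/INP_DOWNGRADE() pattern; the real inpcb lock is an rwlock with kernel semantics:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy lock state standing in for the inpcb lock. */
struct toy_lock {
	bool wlocked;		/* already write-locked? */
	bool upgrade_ok;	/* would a try-upgrade succeed? */
};

/*
 * Returns true when the caller may proceed under the write lock,
 * setting *did_upgrade when it had to upgrade (and therefore must
 * downgrade again afterwards). On a failed try-upgrade the caller
 * bails out and uses the non-ratelimited ring for this transmit.
 */
bool
toy_enter_wlocked(struct toy_lock *lk, bool *did_upgrade)
{
	if (!lk->wlocked) {
		if (!lk->upgrade_ok)
			return (false);	/* bail: retry on a later packet */
		lk->wlocked = true;
		*did_upgrade = true;
	} else {
		*did_upgrade = false;
	}
	return (true);
}
```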

/*
 * Track route changes for TX rate limiting.
 */
void
in_pcboutput_eagain(struct inpcb *inp)
{
	bool did_upgrade;

	if (inp == NULL)
		return;

	if (inp->inp_snd_tag == NULL)
		return;

	if (!INP_WLOCKED(inp)) {
		/*
		 * NOTE: If the write locking fails, we need to bail
		 * out and use the non-ratelimited ring for the
		 * transmit until there is a new chance to get the
		 * write lock.
		 */
		if (!INP_TRY_UPGRADE(inp))
			return;
		did_upgrade = true;
	} else {
		did_upgrade = false;
	}

	/* detach rate limiting */
	in_pcbdetach_txrtlmt(inp);

	/* make sure new mbuf send tag allocation is made */
	inp->inp_flags2 |= INP_RATE_LIMIT_CHANGED;

	if (did_upgrade)
		INP_DOWNGRADE(inp);
}

#ifdef INET
static void
rl_init(void *st)
{
	rate_limit_new = counter_u64_alloc(M_WAITOK);
	rate_limit_chg = counter_u64_alloc(M_WAITOK);
	rate_limit_active = counter_u64_alloc(M_WAITOK);
	rate_limit_alloc_fail = counter_u64_alloc(M_WAITOK);
	rate_limit_set_ok = counter_u64_alloc(M_WAITOK);
}

SYSINIT(rl, SI_SUB_PROTO_DOMAININIT, SI_ORDER_ANY, rl_init, NULL);
#endif
#endif /* RATELIMIT */