2005-01-06 20:45:51 -05:00
|
|
|
/*-
|
2017-11-20 14:43:44 -05:00
|
|
|
* SPDX-License-Identifier: BSD-3-Clause
|
|
|
|
|
*
|
1994-05-24 06:09:53 -04:00
|
|
|
* Copyright (c) 1982, 1986, 1988, 1993
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shut down. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 11:36:36 -05:00
|
|
|
* The Regents of the University of California.
|
2007-02-17 16:02:38 -05:00
|
|
|
* Copyright (c) 2006-2007 Robert N. M. Watson
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), though there should be one.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
|
|
|
* Copyright (c) 2010-2011 Juniper Networks, Inc.
|
2006-04-01 11:36:36 -05:00
|
|
|
* All rights reserved.
|
1994-05-24 06:09:53 -04:00
|
|
|
*
|
2011-05-30 05:43:55 -04:00
|
|
|
* Portions of this software were developed by Robert N. M. Watson under
|
|
|
|
|
* contract to Juniper Networks, Inc.
|
|
|
|
|
*
|
1994-05-24 06:09:53 -04:00
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
|
* are met:
|
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
2017-02-28 18:42:47 -05:00
|
|
|
* 3. Neither the name of the University nor the names of its contributors
|
1994-05-24 06:09:53 -04:00
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
|
* without specific prior written permission.
|
|
|
|
|
*
|
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
|
*
|
1995-02-15 20:42:45 -05:00
|
|
|
* From: @(#)tcp_usrreq.c 8.2 (Berkeley) 1/3/94
|
1994-05-24 06:09:53 -04:00
|
|
|
*/
|
|
|
|
|
|
2007-10-07 16:44:24 -04:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
|
2007-02-17 16:02:38 -05:00
|
|
|
#include "opt_ddb.h"
|
Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here, in that there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
2004-02-10 23:26:04 -05:00
|
|
|
#include "opt_inet.h"
|
2000-01-09 14:17:30 -05:00
|
|
|
#include "opt_inet6.h"
|
2017-02-06 03:49:57 -05:00
|
|
|
#include "opt_ipsec.h"
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile(2), sendfile_iodone() calls ktls_enqueue() instead of
pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid disabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
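Pulling together the enablement steps named in this message, a minimal configuration sketch (kernel option and sysctl names taken from the message above; nothing else assumed):

```
# Kernel config: build with KTLS support.
options KERN_TLS

# Runtime: KTLS is off by default; unmapped mbufs are also required.
sysctl kern.ipc.tls.enable=1
sysctl kern.ipc.mb_use_ext_pgs=1
```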
2019-08-26 20:01:56 -04:00
|
|
|
#include "opt_kern_tls.h"
|
1997-09-16 14:36:06 -04:00
|
|
|
|
1994-05-24 06:09:53 -04:00
|
|
|
#include <sys/param.h>
|
|
|
|
|
#include <sys/systm.h>
|
2019-12-02 15:58:04 -05:00
|
|
|
#include <sys/arb.h>
|
2012-02-05 11:53:02 -05:00
|
|
|
#include <sys/limits.h>
|
2002-06-10 16:05:46 -04:00
|
|
|
#include <sys/malloc.h>
|
2015-12-15 19:56:45 -05:00
|
|
|
#include <sys/refcount.h>
|
1995-02-16 19:29:42 -05:00
|
|
|
#include <sys/kernel.h>
|
2019-08-26 20:01:56 -04:00
|
|
|
#include <sys/ktls.h>
|
2019-12-02 15:58:04 -05:00
|
|
|
#include <sys/qmath.h>
|
1995-11-09 15:23:09 -05:00
|
|
|
#include <sys/sysctl.h>
|
1994-05-24 06:09:53 -04:00
|
|
|
#include <sys/mbuf.h>
|
2000-01-09 14:17:30 -05:00
|
|
|
#ifdef INET6
|
|
|
|
|
#include <sys/domain.h>
|
|
|
|
|
#endif /* INET6 */
|
1994-05-24 06:09:53 -04:00
|
|
|
#include <sys/socket.h>
|
|
|
|
|
#include <sys/socketvar.h>
|
|
|
|
|
#include <sys/protosw.h>
|
2001-02-21 01:39:57 -05:00
|
|
|
#include <sys/proc.h>
|
|
|
|
|
#include <sys/jail.h>
|
2019-12-02 15:58:04 -05:00
|
|
|
#include <sys/stats.h>
|
1994-05-24 06:09:53 -04:00
|
|
|
|
2007-02-17 16:02:38 -05:00
|
|
|
#ifdef DDB
|
|
|
|
|
#include <ddb/ddb.h>
|
|
|
|
|
#endif
|
|
|
|
|
|
1994-05-24 06:09:53 -04:00
|
|
|
#include <net/if.h>
|
2013-10-26 13:58:36 -04:00
|
|
|
#include <net/if_var.h>
|
1994-05-24 06:09:53 -04:00
|
|
|
#include <net/route.h>
|
2009-08-01 15:26:27 -04:00
|
|
|
#include <net/vnet.h>
|
1994-05-24 06:09:53 -04:00
|
|
|
|
|
|
|
|
#include <netinet/in.h>
|
2015-09-13 11:50:55 -04:00
|
|
|
#include <netinet/in_kdtrace.h>
|
1994-05-24 06:09:53 -04:00
|
|
|
#include <netinet/in_pcb.h>
|
2011-04-30 07:21:29 -04:00
|
|
|
#include <netinet/in_systm.h>
|
1995-03-16 13:17:34 -05:00
|
|
|
#include <netinet/in_var.h>
|
2022-02-03 13:50:56 -05:00
|
|
|
#include <netinet/ip.h>
|
1994-05-24 06:09:53 -04:00
|
|
|
#include <netinet/ip_var.h>
|
2000-01-09 14:17:30 -05:00
|
|
|
#ifdef INET6
|
2011-04-30 07:21:29 -04:00
|
|
|
#include <netinet/ip6.h>
|
|
|
|
|
#include <netinet6/in6_pcb.h>
|
2000-01-09 14:17:30 -05:00
|
|
|
#include <netinet6/ip6_var.h>
|
2005-07-25 08:31:43 -04:00
|
|
|
#include <netinet6/scope6_var.h>
|
2000-01-09 14:17:30 -05:00
|
|
|
#endif
|
2016-01-21 17:34:51 -05:00
|
|
|
#include <netinet/tcp.h>
|
1994-05-24 06:09:53 -04:00
|
|
|
#include <netinet/tcp_fsm.h>
|
|
|
|
|
#include <netinet/tcp_seq.h>
|
|
|
|
|
#include <netinet/tcp_timer.h>
|
|
|
|
|
#include <netinet/tcp_var.h>
|
2018-03-22 05:40:08 -04:00
|
|
|
#include <netinet/tcp_log_buf.h>
|
1994-05-24 06:09:53 -04:00
|
|
|
#include <netinet/tcpip.h>
|
2016-01-27 12:59:39 -05:00
|
|
|
#include <netinet/cc/cc.h>
|
2018-02-25 21:53:22 -05:00
|
|
|
#include <netinet/tcp_fastopen.h>
|
2018-04-19 11:03:48 -04:00
|
|
|
#include <netinet/tcp_hpts.h>
|
There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.
It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.
To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).
There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.
I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.
The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.
Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren
2015-10-13 20:35:37 -04:00
|
|
|
#ifdef TCPPCAP
|
|
|
|
|
#include <netinet/tcp_pcap.h>
|
|
|
|
|
#endif
|
2012-06-19 03:34:13 -04:00
|
|
|
#ifdef TCP_OFFLOAD
|
2007-12-18 17:59:07 -05:00
|
|
|
#include <netinet/tcp_offload.h>
|
2012-06-19 03:34:13 -04:00
|
|
|
#endif
|
2017-02-06 03:49:57 -05:00
|
|
|
#include <netipsec/ipsec_support.h>
|
1994-05-24 06:09:53 -04:00
|
|
|
|
2019-12-02 15:58:04 -05:00
|
|
|
#include <vm/vm.h>
|
|
|
|
|
#include <vm/vm_param.h>
|
|
|
|
|
#include <vm/pmap.h>
|
|
|
|
|
#include <vm/vm_extern.h>
|
|
|
|
|
#include <vm/vm_map.h>
|
|
|
|
|
#include <vm/vm_page.h>
|
|
|
|
|
|
1994-05-24 06:09:53 -04:00
|
|
|
/*
|
|
|
|
|
* TCP protocol interface to socket abstraction.
|
|
|
|
|
*/
|
2011-04-30 07:21:29 -04:00
|
|
|
#ifdef INET
|
2023-02-03 14:33:36 -05:00
|
|
|
static int tcp_connect(struct tcpcb *, struct sockaddr_in *,
|
2002-03-24 05:19:10 -05:00
|
|
|
struct thread *td);
|
2011-04-30 07:21:29 -04:00
|
|
|
#endif /* INET */
|
2000-01-09 14:17:30 -05:00
|
|
|
#ifdef INET6
|
2023-02-03 14:33:36 -05:00
|
|
|
static int tcp6_connect(struct tcpcb *, struct sockaddr_in6 *,
|
2002-03-24 05:19:10 -05:00
|
|
|
struct thread *td);
|
2000-01-09 14:17:30 -05:00
|
|
|
#endif /* INET6 */
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 11:36:36 -05:00
|
|
|
static void tcp_disconnect(struct tcpcb *);
|
|
|
|
|
static void tcp_usrclosed(struct tcpcb *);
|
2004-11-26 13:58:46 -05:00
|
|
|
static void tcp_fill_info(struct tcpcb *, struct tcp_info *);
|
1996-07-11 12:32:50 -04:00
|
|
|
|
2020-05-04 16:19:57 -04:00
|
|
|
static int tcp_pru_options_support(struct tcpcb *tp, int flags);
|
|
|
|
|
|
2023-02-21 06:07:35 -05:00
|
|
|
static void
|
|
|
|
|
tcp_bblog_pru(struct tcpcb *tp, uint32_t pru, int error)
|
|
|
|
|
{
|
|
|
|
|
struct tcp_log_buffer *lgb;
|
|
|
|
|
|
2023-05-06 05:12:06 -04:00
|
|
|
KASSERT(tp != NULL, ("tcp_bblog_pru: tp == NULL"));
|
2023-02-21 06:07:35 -05:00
|
|
|
INP_WLOCK_ASSERT(tptoinpcb(tp));
|
2023-03-16 11:43:16 -04:00
|
|
|
if (tcp_bblogging_on(tp)) {
|
|
|
|
|
lgb = tcp_log_event(tp, NULL, NULL, NULL, TCP_LOG_PRU, error,
|
2023-02-21 06:07:35 -05:00
|
|
|
0, NULL, false, NULL, NULL, 0, NULL);
|
|
|
|
|
} else {
|
|
|
|
|
lgb = NULL;
|
|
|
|
|
}
|
|
|
|
|
if (lgb != NULL) {
|
|
|
|
|
if (error >= 0) {
|
|
|
|
|
lgb->tlb_errno = (uint32_t)error;
|
|
|
|
|
}
|
|
|
|
|
lgb->tlb_flex1 = pru;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
1996-07-11 12:32:50 -04:00
|
|
|
/*
|
|
|
|
|
* TCP attaches to socket via pru_attach(), reserving space,
|
|
|
|
|
* and an internet control block.
|
|
|
|
|
*/
|
|
|
|
|
static int
|
2001-09-12 04:38:13 -04:00
|
|
|
tcp_usr_attach(struct socket *so, int proto, struct thread *td)
|
1996-07-11 12:32:50 -04:00
|
|
|
{
|
2002-06-10 16:05:46 -04:00
|
|
|
struct inpcb *inp;
|
2006-04-01 11:36:36 -05:00
|
|
|
struct tcpcb *tp = NULL;
|
|
|
|
|
int error;
|
1996-07-11 12:32:50 -04:00
|
|
|
|
2006-04-01 11:36:36 -05:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
KASSERT(inp == NULL, ("tcp_usr_attach: inp != NULL"));
|
1996-07-11 12:32:50 -04:00
|
|
|
|
tcp: utilize new solisten_clone() and solisten_enqueue()
This streamlines cloning of a socket from a listener. Now we do not
drop the inpcb lock during creation of a new socket, do not do useless
state transitions, and put a fully initialized socket+inpcb+tcpcb into
the listen queue.
Before this change, first we would allocate the socket and inpcb+tcpcb via
tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock
pcb and put this onto incomplete queue (see 6f3caa6d815). Then, after
sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED,
insert into inpcb hash, finalize initialization of tcpcb. Then, on a
call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED, we
would call soisconnected(). This call would lock the listening socket
once again with a LOR protection sequence, and then we would relocate
the socket onto the complete queue, at which point it was finally
ready for accept(2).
Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36064
2022-08-10 14:09:34 -04:00
|
|
|
error = soreserve(so, V_tcp_sendspace, V_tcp_recvspace);
|
|
|
|
|
if (error)
|
|
|
|
|
goto out;
|
1996-07-11 12:32:50 -04:00
|
|
|
|
2020-01-22 00:54:58 -05:00
|
|
|
so->so_rcv.sb_flags |= SB_AUTOSIZE;
|
|
|
|
|
so->so_snd.sb_flags |= SB_AUTOSIZE;
|
|
|
|
|
error = in_pcballoc(so, &V_tcbinfo);
|
2020-01-22 01:01:26 -05:00
|
|
|
if (error)
|
2020-01-22 00:54:58 -05:00
|
|
|
goto out;
|
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
tp = tcp_newtcpcb(inp);
|
|
|
|
|
if (tp == NULL) {
|
2020-01-22 01:01:26 -05:00
|
|
|
error = ENOBUFS;
|
2020-01-22 00:54:58 -05:00
|
|
|
in_pcbdetach(inp);
|
|
|
|
|
in_pcbfree(inp);
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
|
|
|
|
tp->t_state = TCPS_CLOSED;
|
2023-02-21 06:07:35 -05:00
|
|
|
tcp_bblog_pru(tp, PRU_ATTACH, error);
|
2020-01-22 00:54:58 -05:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
TCPSTATES_INC(TCPS_CLOSED);
|
1996-07-11 12:32:50 -04:00
|
|
|
out:
|
2015-09-13 11:50:55 -04:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_ATTACH);
|
2020-01-22 00:54:58 -05:00
|
|
|
return (error);
|
1996-07-11 12:32:50 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/*
|
2020-01-22 01:06:27 -05:00
|
|
|
* tcp_usr_detach is called when the socket layer loses its final reference
|
2006-07-21 13:11:15 -04:00
|
|
|
* to the socket, be it a file descriptor reference, a reference from TCP,
|
|
|
|
|
* etc. At this point, there is only one case in which we will keep around
|
|
|
|
|
* inpcb state: time wait.
|
1996-07-11 12:32:50 -04:00
|
|
|
*/
|
Change the protocol switch method pru_detach() so that it returns void
rather than an error. Detaches do not "fail"; they either occur, or
the protocol flags SS_PROTOREF to take ownership of the socket.
soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.
Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.
In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.
netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.
MFC after: 3 months
2006-04-01 10:42:02 -05:00
|
|
|
static void
|
2020-01-22 01:06:27 -05:00
|
|
|
tcp_usr_detach(struct socket *so)
|
1996-07-11 12:32:50 -04:00
|
|
|
{
|
2020-01-22 01:06:27 -05:00
|
|
|
struct inpcb *inp;
|
1996-07-11 12:32:50 -04:00
|
|
|
struct tcpcb *tp;
|
|
|
|
|
|
2020-01-22 01:06:27 -05:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
KASSERT(inp != NULL, ("%s: inp == NULL", __func__));
|
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
|
KASSERT(so->so_pcb == inp && inp->inp_socket == so,
|
|
|
|
|
("%s: socket %p inp %p mismatch", __func__, so, inp));
|
2006-04-02 12:42:51 -04:00
|
|
|
|
2006-07-21 13:11:15 -04:00
|
|
|
tp = intotcpcb(inp);
|
|
|
|
|
|
2022-10-06 22:22:23 -04:00
|
|
|
KASSERT(inp->inp_flags & INP_DROPPED ||
|
|
|
|
|
tp->t_state < TCPS_SYN_SENT,
|
|
|
|
|
("%s: inp %p not dropped or embryonic", __func__, inp));
|
|
|
|
|
|
|
|
|
|
tcp_discardcb(tp);
|
|
|
|
|
in_pcbdetach(inp);
|
|
|
|
|
in_pcbfree(inp);
|
2006-04-24 04:20:02 -04:00
|
|
|
}
|
|
|
|
|
|
2011-04-30 07:21:29 -04:00
|
|
|
#ifdef INET
|
1996-07-11 12:32:50 -04:00
|
|
|
/*
|
|
|
|
|
* Give the socket an address.
|
|
|
|
|
*/
|
|
|
|
|
static int
|
2001-09-12 04:38:13 -04:00
|
|
|
tcp_usr_bind(struct socket *so, struct sockaddr *nam, struct thread *td)
|
1996-07-11 12:32:50 -04:00
|
|
|
{
|
|
|
|
|
int error = 0;
|
2002-06-10 16:05:46 -04:00
|
|
|
struct inpcb *inp;
|
2023-05-06 05:12:06 -04:00
|
|
|
struct tcpcb *tp;
|
1996-07-11 12:32:50 -04:00
|
|
|
struct sockaddr_in *sinp;
|
|
|
|
|
|
2023-05-06 05:12:06 -04:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_bind: inp == NULL"));
|
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (EINVAL);
|
|
|
|
|
}
|
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
|
|
2004-04-04 16:14:55 -04:00
|
|
|
sinp = (struct sockaddr_in *)nam;
|
2021-05-31 18:53:34 -04:00
|
|
|
if (nam->sa_family != AF_INET) {
|
|
|
|
|
/*
|
|
|
|
|
* Preserve compatibility with old programs.
|
|
|
|
|
*/
|
|
|
|
|
if (nam->sa_family != AF_UNSPEC ||
|
2021-08-05 07:42:30 -04:00
|
|
|
nam->sa_len < offsetof(struct sockaddr_in, sin_zero) ||
|
2023-05-06 05:12:06 -04:00
|
|
|
sinp->sin_addr.s_addr != INADDR_ANY) {
|
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
2021-05-31 18:53:34 -04:00
|
|
|
nam->sa_family = AF_INET;
|
|
|
|
|
}
|
2023-05-06 05:12:06 -04:00
|
|
|
if (nam->sa_len != sizeof(*sinp)) {
|
|
|
|
|
error = EINVAL;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
1996-07-11 12:32:50 -04:00
|
|
|
/*
|
|
|
|
|
* Must check for multicast addresses and disallow binding
|
|
|
|
|
* to them.
|
|
|
|
|
*/
|
2023-05-06 05:12:06 -04:00
|
|
|
if (IN_MULTICAST(ntohl(sinp->sin_addr.s_addr))) {
|
|
|
|
|
error = EAFNOSUPPORT;
|
1996-07-11 12:32:50 -04:00
|
|
|
goto out;
|
2006-04-01 11:36:36 -05:00
|
|
|
}
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
2023-02-15 13:30:16 -05:00
|
|
|
error = in_pcbbind(inp, sinp, td->td_ucred);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), though there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shut down. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 11:36:36 -05:00
|
|
|
out:
|
2023-02-21 06:07:35 -05:00
|
|
|
tcp_bblog_pru(tp, PRU_BIND, error);
|
2015-09-13 11:50:55 -04:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_BIND);
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 11:36:36 -05:00
|
|
|
|
|
|
|
|
return (error);
|
1996-07-11 12:32:50 -04:00
|
|
|
}
|
2011-04-30 07:21:29 -04:00
|
|
|
#endif /* INET */
|
1996-07-11 12:32:50 -04:00
|
|
|
|
2000-01-09 14:17:30 -05:00
|
|
|
#ifdef INET6
|
|
|
|
|
static int
|
2001-09-12 04:38:13 -04:00
|
|
|
tcp6_usr_bind(struct socket *so, struct sockaddr *nam, struct thread *td)
|
2000-01-09 14:17:30 -05:00
|
|
|
{
|
|
|
|
|
int error = 0;
|
2002-06-10 16:05:46 -04:00
|
|
|
struct inpcb *inp;
|
2023-05-06 05:12:06 -04:00
|
|
|
struct tcpcb *tp;
|
2019-08-02 03:41:36 -04:00
|
|
|
struct sockaddr_in6 *sin6;
|
2019-10-24 16:05:10 -04:00
|
|
|
u_char vflagsav;
|
2000-01-09 14:17:30 -05:00
|
|
|
|
2023-05-06 05:12:06 -04:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
KASSERT(inp != NULL, ("tcp6_usr_bind: inp == NULL"));
|
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
|
|
|
|
INP_WUNLOCK(inp);
|
2004-04-04 16:14:55 -04:00
|
|
|
return (EINVAL);
|
2023-05-06 05:12:06 -04:00
|
|
|
}
|
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
|
|
|
|
|
|
vflagsav = inp->inp_vflag;
|
2021-05-03 12:51:04 -04:00
|
|
|
|
2023-05-06 05:12:06 -04:00
|
|
|
sin6 = (struct sockaddr_in6 *)nam;
|
|
|
|
|
if (nam->sa_family != AF_INET6) {
|
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
|
|
|
|
if (nam->sa_len != sizeof(*sin6)) {
|
|
|
|
|
error = EINVAL;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
2000-01-09 14:17:30 -05:00
|
|
|
/*
|
|
|
|
|
* Must check for multicast addresses and disallow binding
|
|
|
|
|
* to them.
|
|
|
|
|
*/
|
2023-05-06 05:12:06 -04:00
|
|
|
if (IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr)) {
|
|
|
|
|
error = EAFNOSUPPORT;
|
2006-04-01 11:36:36 -05:00
|
|
|
goto out;
|
|
|
|
|
}
|
2023-05-06 05:12:06 -04:00
|
|
|
|
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
2000-01-09 14:17:30 -05:00
|
|
|
inp->inp_vflag &= ~INP_IPV4;
|
|
|
|
|
inp->inp_vflag |= INP_IPV6;
|
2011-04-30 07:21:29 -04:00
|
|
|
#ifdef INET
|
2002-07-25 14:10:04 -04:00
|
|
|
if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0) {
|
2019-08-02 03:41:36 -04:00
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr))
|
2000-01-09 14:17:30 -05:00
|
|
|
inp->inp_vflag |= INP_IPV4;
|
2019-08-02 03:41:36 -04:00
|
|
|
else if (IN6_IS_ADDR_V4MAPPED(&sin6->sin6_addr)) {
|
2000-01-09 14:17:30 -05:00
|
|
|
struct sockaddr_in sin;
|
|
|
|
|
|
2019-08-02 03:41:36 -04:00
|
|
|
in6_sin6_2_sin(&sin, sin6);
|
2018-07-30 17:27:26 -04:00
|
|
|
if (IN_MULTICAST(ntohl(sin.sin_addr.s_addr))) {
|
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
2000-01-09 14:17:30 -05:00
|
|
|
inp->inp_vflag |= INP_IPV4;
|
|
|
|
|
inp->inp_vflag &= ~INP_IPV6;
|
2023-02-15 13:30:16 -05:00
|
|
|
error = in_pcbbind(inp, &sin, td->td_ucred);
|
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
2000-01-09 14:17:30 -05:00
|
|
|
goto out;
|
|
|
|
|
}
|
|
|
|
|
}
|
2011-04-30 07:21:29 -04:00
|
|
|
#endif
|
2023-02-15 13:30:16 -05:00
|
|
|
error = in6_pcbbind(inp, sin6, td->td_ucred);
|
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
2006-04-01 11:36:36 -05:00
|
|
|
out:
|
2019-10-24 16:05:10 -04:00
|
|
|
if (error != 0)
|
|
|
|
|
inp->inp_vflag = vflagsav;
|
2023-02-21 06:07:35 -05:00
|
|
|
tcp_bblog_pru(tp, PRU_BIND, error);
|
2015-09-13 11:50:55 -04:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_BIND);
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 11:36:36 -05:00
|
|
|
return (error);
|
2000-01-09 14:17:30 -05:00
|
|
|
}
|
|
|
|
|
#endif /* INET6 */
|
|
|
|
|
|
2011-04-30 07:21:29 -04:00
|
|
|
#ifdef INET
|
1996-07-11 12:32:50 -04:00
|
|
|
/*
|
|
|
|
|
* Prepare to accept connections.
|
|
|
|
|
*/
|
|
|
|
|
static int
|
2005-10-30 14:44:40 -05:00
|
|
|
tcp_usr_listen(struct socket *so, int backlog, struct thread *td)
|
1996-07-11 12:32:50 -04:00
|
|
|
{
|
|
|
|
|
int error = 0;
|
2002-06-10 16:05:46 -04:00
|
|
|
struct inpcb *inp;
|
2023-05-06 05:12:06 -04:00
|
|
|
struct tcpcb *tp;
|
1996-07-11 12:32:50 -04:00
|
|
|
|
2006-04-01 11:36:36 -05:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_listen: inp == NULL"));
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
2023-05-06 05:12:06 -04:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (EINVAL);
|
2006-04-01 11:36:36 -05:00
|
|
|
}
|
|
|
|
|
tp = intotcpcb(inp);
|
2023-05-06 05:12:06 -04:00
|
|
|
|
In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.
This change does the following:
- Pushes the socket state transition from the socket layer solisten() to
socket "library" routines called from the protocol.  This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().
- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa.  This prevents additional state-related races in the socket
layer.
This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another.  Similar changes are likely required
elsewhere in the socket/protocol code.
Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn
2005-02-21 16:58:17 -05:00
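The check/commit/abort pattern from the solisten() commit message above can be modeled in userspace: the state test and both layers' state sets run under one lock, so no observer can see the protocol in LISTEN while the socket lacks its accepting flag. All names here (`listen_check()`, `listen_commit()`, `do_listen()`) are hypothetical stand-ins for solisten_proto_check()/solisten_proto(), not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Sketch of the solisten() race fix: the socket-layer check, the
 * protocol-layer transition, and the socket-layer commit all happen
 * under a single lock, making the dual transition atomic.  In this toy
 * model a second listen is refused by the check; the real kernel's
 * rules differ.
 */
#define SO_ACCEPTING	0x01
#define ST_LISTEN	2

struct sock {
	pthread_mutex_t lock;
	int options;
	int state;
};

static int
listen_check(struct sock *so)		/* cf. solisten_proto_check() */
{
	return ((so->options & SO_ACCEPTING) ? EINVAL : 0);
}

static void
listen_commit(struct sock *so)		/* cf. solisten_proto() */
{
	so->options |= SO_ACCEPTING;
}

static int
do_listen(struct sock *so)
{
	int error;

	pthread_mutex_lock(&so->lock);
	error = listen_check(so);
	if (error == 0) {
		so->state = ST_LISTEN;	/* protocol-layer transition */
		listen_commit(so);	/* socket-layer transition */
	}
	pthread_mutex_unlock(&so->lock);
	return (error);
}
```

Because both transitions complete before the lock drops, an incoming SYN can never observe `state == ST_LISTEN` without the accepting flag, which is exactly the window the original panic lived in.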
|
|
|
SOCK_LOCK(so);
|
|
|
|
|
error = solisten_proto_check(so);
|
2021-09-07 14:49:53 -04:00
|
|
|
if (error != 0) {
|
|
|
|
|
SOCK_UNLOCK(so);
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
|
|
|
|
if (inp->inp_lport == 0) {
|
|
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
|
|
|
|
error = in_pcbbind(inp, NULL, td->td_ucred);
|
|
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
|
|
|
|
}
|
2005-02-21 16:58:17 -05:00
|
|
|
if (error == 0) {
|
2013-08-25 17:54:41 -04:00
|
|
|
tcp_state_change(tp, TCPS_LISTEN);
|
2005-10-30 14:44:40 -05:00
|
|
|
solisten_proto(so, backlog);
|
2012-06-19 03:34:13 -04:00
|
|
|
#ifdef TCP_OFFLOAD
|
2013-01-25 15:23:33 -05:00
|
|
|
if ((so->so_options & SO_NO_OFFLOAD) == 0)
|
|
|
|
|
tcp_offload_listen_start(tp);
|
2012-06-19 03:34:13 -04:00
|
|
|
#endif
|
2021-09-07 14:49:53 -04:00
|
|
|
} else {
|
|
|
|
|
solisten_proto_abort(so);
|
2005-02-21 16:58:17 -05:00
|
|
|
}
|
|
|
|
|
SOCK_UNLOCK(so);
|
2006-04-01 11:36:36 -05:00
|
|
|
|
2016-10-12 15:06:50 -04:00
|
|
|
if (IS_FASTOPEN(tp->t_flags))
|
2015-12-24 14:09:48 -05:00
|
|
|
tp->t_tfo_pending = tcp_fastopen_alloc_counter();
|
2018-02-25 22:03:41 -05:00
|
|
|
|
2006-04-01 11:36:36 -05:00
|
|
|
out:
|
2023-02-21 06:07:35 -05:00
|
|
|
tcp_bblog_pru(tp, PRU_LISTEN, error);
|
2015-09-13 11:50:55 -04:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_LISTEN);
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 11:36:36 -05:00
|
|
|
return (error);
|
1996-07-11 12:32:50 -04:00
|
|
|
}
|
2011-04-30 07:21:29 -04:00
|
|
|
#endif /* INET */
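From userspace, the path into tcp_usr_listen() above is just the listen(2) system call on a TCP socket. A minimal sketch: binding to port 0 lets the kernel pick an ephemeral port, exercising the in_pcbbind() branch taken when `inp_lport == 0` before the state moves to TCPS_LISTEN.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a loopback TCP listener.  socket(2) allocates the socket,
 * inpcb, and tcpcb; bind(2) with port 0 leaves the port choice to the
 * kernel; listen(2) dispatches to the protocol's listen method
 * (tcp_usr_listen() for INET).  Returns the descriptor, or -1 on error.
 */
int
make_listener(void)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(0);	/* kernel chooses the port */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    listen(s, 128) < 0) {
		close(s);
		return (-1);
	}
	return (s);
}
```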
|
1996-07-11 12:32:50 -04:00
|
|
|
|
2000-01-09 14:17:30 -05:00
|
|
|
#ifdef INET6
|
|
|
|
|
static int
|
2005-10-30 14:44:40 -05:00
|
|
|
tcp6_usr_listen(struct socket *so, int backlog, struct thread *td)
|
2000-01-09 14:17:30 -05:00
|
|
|
{
|
|
|
|
|
int error = 0;
|
2002-06-10 16:05:46 -04:00
|
|
|
struct inpcb *inp;
|
2023-05-06 05:12:06 -04:00
|
|
|
struct tcpcb *tp;
|
2019-10-24 16:05:10 -04:00
|
|
|
u_char vflagsav;
|
2000-01-09 14:17:30 -05:00
|
|
|
|
2006-04-01 11:36:36 -05:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
KASSERT(inp != NULL, ("tcp6_usr_listen: inp == NULL"));
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
2023-05-06 05:12:06 -04:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (EINVAL);
|
2006-04-01 11:36:36 -05:00
|
|
|
}
|
|
|
|
|
tp = intotcpcb(inp);
|
2023-05-06 05:12:06 -04:00
|
|
|
|
|
|
|
|
vflagsav = inp->inp_vflag;
|
|
|
|
|
|
In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.
This change does the following:
- Pushes the socket state transition from the socket layer solisten()
to socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().
- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket
layer.
This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another.  Similar changes are likely required
elsewhere in the socket/protocol code.
Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn
2005-02-21 16:58:17 -05:00
|
|
|
SOCK_LOCK(so);
|
|
|
|
|
error = solisten_proto_check(so);
|
2021-09-07 14:49:53 -04:00
|
|
|
if (error != 0) {
|
|
|
|
|
SOCK_UNLOCK(so);
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
2021-09-07 14:49:53 -04:00
|
|
|
if (inp->inp_lport == 0) {
|
2000-01-09 14:17:30 -05:00
|
|
|
inp->inp_vflag &= ~INP_IPV4;
|
2002-07-25 14:10:04 -04:00
|
|
|
if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0)
|
2000-01-09 14:17:30 -05:00
|
|
|
inp->inp_vflag |= INP_IPV4;
|
2021-09-07 14:49:53 -04:00
|
|
|
error = in6_pcbbind(inp, NULL, td->td_ucred);
|
2000-01-09 14:17:30 -05:00
|
|
|
}
|
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
2005-02-21 16:58:17 -05:00
|
|
|
if (error == 0) {
|
2013-08-25 17:54:41 -04:00
|
|
|
tcp_state_change(tp, TCPS_LISTEN);
|
2005-10-30 14:44:40 -05:00
|
|
|
solisten_proto(so, backlog);
|
2012-06-19 03:34:13 -04:00
|
|
|
#ifdef TCP_OFFLOAD
|
2013-01-25 15:23:33 -05:00
|
|
|
if ((so->so_options & SO_NO_OFFLOAD) == 0)
|
|
|
|
|
tcp_offload_listen_start(tp);
|
2012-06-19 03:34:13 -04:00
|
|
|
#endif
|
2021-09-07 14:49:53 -04:00
|
|
|
} else {
|
|
|
|
|
solisten_proto_abort(so);
|
2005-02-21 16:58:17 -05:00
|
|
|
}
|
|
|
|
|
SOCK_UNLOCK(so);
|
2006-04-01 11:36:36 -05:00
|
|
|
|
2016-10-12 15:06:50 -04:00
|
|
|
if (IS_FASTOPEN(tp->t_flags))
|
2015-12-24 14:09:48 -05:00
|
|
|
tp->t_tfo_pending = tcp_fastopen_alloc_counter();
|
2018-02-25 22:03:41 -05:00
|
|
|
|
2019-10-24 16:05:10 -04:00
|
|
|
if (error != 0)
|
|
|
|
|
inp->inp_vflag = vflagsav;
|
|
|
|
|
|
2006-04-01 11:36:36 -05:00
|
|
|
out:
|
2023-02-21 06:07:35 -05:00
|
|
|
tcp_bblog_pru(tp, PRU_LISTEN, error);
|
2015-09-13 11:50:55 -04:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_LISTEN);
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 11:36:36 -05:00
|
|
|
return (error);
|
2000-01-09 14:17:30 -05:00
|
|
|
}
|
|
|
|
|
#endif /* INET6 */
|
|
|
|
|
|
2011-04-30 07:21:29 -04:00
|
|
|
#ifdef INET
|
1996-07-11 12:32:50 -04:00
|
|
|
/*
|
|
|
|
|
* Initiate connection to peer.
|
|
|
|
|
* Create a template for use in transmissions on this connection.
|
|
|
|
|
* Enter SYN_SENT state, and mark socket as connecting.
|
|
|
|
|
* Start keep-alive timer, and seed output sequence space.
|
|
|
|
|
* Send initial segment on connection.
|
|
|
|
|
*/
|
|
|
|
|
static int
|
2001-09-12 04:38:13 -04:00
|
|
|
tcp_usr_connect(struct socket *so, struct sockaddr *nam, struct thread *td)
|
1996-07-11 12:32:50 -04:00
|
|
|
{
|
2020-01-22 00:53:16 -05:00
|
|
|
struct epoch_tracker et;
|
1996-07-11 12:32:50 -04:00
|
|
|
int error = 0;
|
2002-06-10 16:05:46 -04:00
|
|
|
struct inpcb *inp;
|
2023-05-06 05:12:06 -04:00
|
|
|
struct tcpcb *tp;
|
1996-07-11 12:32:50 -04:00
|
|
|
struct sockaddr_in *sinp;
|
|
|
|
|
|
2006-04-01 11:36:36 -05:00
|
|
|
	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("tcp_usr_connect: inp == NULL"));
	INP_WLOCK(inp);
	if (inp->inp_flags & INP_DROPPED) {
		INP_WUNLOCK(inp);
		return (ECONNREFUSED);
	}
	tp = intotcpcb(inp);

	sinp = (struct sockaddr_in *)nam;
	if (nam->sa_family != AF_INET) {
		error = EAFNOSUPPORT;
		goto out;
	}
	if (nam->sa_len != sizeof (*sinp)) {
		error = EINVAL;
		goto out;
	}
	/*
	 * Must disallow TCP ``connections'' to multicast addresses.
	 */
	if (IN_MULTICAST(ntohl(sinp->sin_addr.s_addr))) {
		error = EAFNOSUPPORT;
		goto out;
	}
	if (ntohl(sinp->sin_addr.s_addr) == INADDR_BROADCAST) {
		error = EACCES;
		goto out;
	}
	if ((error = prison_remote_ip4(td->td_ucred, &sinp->sin_addr)) != 0)
		goto out;
	if (SOLISTENING(so)) {
		error = EOPNOTSUPP;
		goto out;
	}
	NET_EPOCH_ENTER(et);
	if ((error = tcp_connect(tp, sinp, td)) != 0)
		goto out_in_epoch;
#ifdef TCP_OFFLOAD
	if (registered_toedevs > 0 &&
	    (so->so_options & SO_NO_OFFLOAD) == 0 &&
	    (error = tcp_offload_connect(so, nam)) == 0)
		goto out_in_epoch;
#endif
	tcp_timer_activate(tp, TT_KEEP, TP_KEEPINIT(tp));
	error = tcp_output(tp);
	KASSERT(error >= 0, ("TCP stack %s requested tcp_drop(%p) at connect()"
	    ", error code %d", tp->t_fb->tfb_tcp_block_name, tp, -error));
out_in_epoch:
	NET_EPOCH_EXIT(et);
out:
	tcp_bblog_pru(tp, PRU_CONNECT, error);
	TCP_PROBE2(debug__user, tp, PRU_CONNECT);
	INP_WUNLOCK(inp);
	return (error);
}
#endif /* INET */

#ifdef INET6
static int
tcp6_usr_connect(struct socket *so, struct sockaddr *nam, struct thread *td)
{
	struct epoch_tracker et;
	int error = 0;
	struct inpcb *inp;
	struct tcpcb *tp;
	struct sockaddr_in6 *sin6;
	u_int8_t incflagsav;
	u_char vflagsav;

	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("tcp6_usr_connect: inp == NULL"));
	INP_WLOCK(inp);
	if (inp->inp_flags & INP_DROPPED) {
		INP_WUNLOCK(inp);
		return (ECONNREFUSED);
	}
	tp = intotcpcb(inp);

	vflagsav = inp->inp_vflag;
	incflagsav = inp->inp_inc.inc_flags;

	sin6 = (struct sockaddr_in6 *)nam;
	if (nam->sa_family != AF_INET6) {
		error = EAFNOSUPPORT;
		goto out;
	}
	if (nam->sa_len != sizeof (*sin6)) {
		error = EINVAL;
		goto out;
	}
	/*
	 * Must disallow TCP ``connections'' to multicast addresses.
	 */
	if (IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr)) {
		error = EAFNOSUPPORT;
		goto out;
	}
	if (SOLISTENING(so)) {
		error = EINVAL;
		goto out;
	}
#ifdef INET
	/*
	 * XXXRW: Some confusion: V4/V6 flags relate to binding, and
	 * therefore probably require the hash lock, which isn't held here.
	 * Is this a significant problem?
	 */
	if (IN6_IS_ADDR_V4MAPPED(&sin6->sin6_addr)) {
		struct sockaddr_in sin;

		if ((inp->inp_flags & IN6P_IPV6_V6ONLY) != 0) {
			error = EINVAL;
			goto out;
		}
		if ((inp->inp_vflag & INP_IPV4) == 0) {
			error = EAFNOSUPPORT;
			goto out;
		}

		in6_sin6_2_sin(&sin, sin6);
		if (IN_MULTICAST(ntohl(sin.sin_addr.s_addr))) {
			error = EAFNOSUPPORT;
			goto out;
		}
		if (ntohl(sin.sin_addr.s_addr) == INADDR_BROADCAST) {
			error = EACCES;
			goto out;
		}
		if ((error = prison_remote_ip4(td->td_ucred,
		    &sin.sin_addr)) != 0)
			goto out;
		inp->inp_vflag |= INP_IPV4;
		inp->inp_vflag &= ~INP_IPV6;
		NET_EPOCH_ENTER(et);
		if ((error = tcp_connect(tp, &sin, td)) != 0)
			goto out_in_epoch;
#ifdef TCP_OFFLOAD
		if (registered_toedevs > 0 &&
		    (so->so_options & SO_NO_OFFLOAD) == 0 &&
		    (error = tcp_offload_connect(so, nam)) == 0)
			goto out_in_epoch;
#endif
		error = tcp_output(tp);
		goto out_in_epoch;
	} else {
		if ((inp->inp_vflag & INP_IPV6) == 0) {
			error = EAFNOSUPPORT;
			goto out;
		}
	}
#endif
	if ((error = prison_remote_ip6(td->td_ucred, &sin6->sin6_addr)) != 0)
		goto out;
	inp->inp_vflag &= ~INP_IPV4;
	inp->inp_vflag |= INP_IPV6;
	inp->inp_inc.inc_flags |= INC_ISIPV6;
	NET_EPOCH_ENTER(et);
	if ((error = tcp6_connect(tp, sin6, td)) != 0)
		goto out_in_epoch;
#ifdef TCP_OFFLOAD
	if (registered_toedevs > 0 &&
	    (so->so_options & SO_NO_OFFLOAD) == 0 &&
	    (error = tcp_offload_connect(so, nam)) == 0)
		goto out_in_epoch;
#endif
	tcp_timer_activate(tp, TT_KEEP, TP_KEEPINIT(tp));
	error = tcp_output(tp);
out_in_epoch:
	NET_EPOCH_EXIT(et);
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL.  For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address.  This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown.  This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped.  This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed.  Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close.  The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged.  If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted.  This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed.  Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL.  This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs.  Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 11:36:36 -05:00
out:
	KASSERT(error >= 0, ("TCP stack %s requested tcp_drop(%p) at connect()"
	    ", error code %d", tp->t_fb->tfb_tcp_block_name, tp, -error));
	/*
	 * If the implicit bind in the connect call fails, restore
	 * the flags we modified.
	 */
	if (error != 0 && inp->inp_lport == 0) {
		inp->inp_vflag = vflagsav;
		inp->inp_inc.inc_flags = incflagsav;
	}

	tcp_bblog_pru(tp, PRU_CONNECT, error);
	TCP_PROBE2(debug__user, tp, PRU_CONNECT);
	INP_WUNLOCK(inp);
	return (error);
}
#endif /* INET6 */

/*
 * Initiate disconnect from peer.
 * If connection never passed embryonic stage, just drop;
 * else if don't need to let data drain, then can just drop anyways,
 * else have to begin TCP shutdown process: mark socket disconnecting,
 * drain unread data, state switch to reflect user close, and
 * send segment (e.g. FIN) to peer.  Socket will be really disconnected
 * when peer sends FIN and acks ours.
 *
 * SHOULD IMPLEMENT LATER PRU_CONNECT VIA REALLOC TCPCB.
 */
static int
tcp_usr_disconnect(struct socket *so)
{
	struct inpcb *inp;
	struct tcpcb *tp = NULL;
	struct epoch_tracker et;
	int error = 0;

	NET_EPOCH_ENTER(et);
	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("tcp_usr_disconnect: inp == NULL"));
	INP_WLOCK(inp);
	if (inp->inp_flags & INP_DROPPED) {
		INP_WUNLOCK(inp);
		NET_EPOCH_EXIT(et);
		return (ECONNRESET);
	}
	tp = intotcpcb(inp);

	if (tp->t_state == TCPS_TIME_WAIT)
		goto out;
	tcp_disconnect(tp);
out:
	tcp_bblog_pru(tp, PRU_DISCONNECT, error);
	TCP_PROBE2(debug__user, tp, PRU_DISCONNECT);
	INP_WUNLOCK(inp);
	NET_EPOCH_EXIT(et);
	return (error);
}

#ifdef INET
/*
 * Accept a connection.  Essentially all the work is done at higher levels;
 * just return the address of the peer, storing through addr.
 */
static int
tcp_usr_accept(struct socket *so, struct sockaddr **nam)
{
	int error = 0;
	struct inpcb *inp;
	struct tcpcb *tp;
	struct in_addr addr;
	in_port_t port = 0;

	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("tcp_usr_accept: inp == NULL"));
	INP_WLOCK(inp);
	if (inp->inp_flags & INP_DROPPED) {
		INP_WUNLOCK(inp);
		return (ECONNABORTED);
	}
	tp = intotcpcb(inp);

	if (so->so_state & SS_ISDISCONNECTED) {
		error = ECONNABORTED;
		goto out;
	}
	/*
	 * We inline in_getpeeraddr and COMMON_END here, so that we can
	 * copy the data of interest and defer the malloc until after we
	 * release the lock.
	 */
	port = inp->inp_fport;
	addr = inp->inp_faddr;
out:
	tcp_bblog_pru(tp, PRU_ACCEPT, error);
	TCP_PROBE2(debug__user, tp, PRU_ACCEPT);
	INP_WUNLOCK(inp);
	if (error == 0)
		*nam = in_sockaddr(port, &addr);
	return (error);
}
#endif /* INET */

#ifdef INET6
static int
tcp6_usr_accept(struct socket *so, struct sockaddr **nam)
{
	struct inpcb *inp;
	int error = 0;
	struct tcpcb *tp;
	struct in_addr addr;
	struct in6_addr addr6;
	struct epoch_tracker et;
	in_port_t port = 0;
	int v4 = 0;

	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("tcp6_usr_accept: inp == NULL"));
	NET_EPOCH_ENTER(et);	/* XXXMT Why is this needed? */
	INP_WLOCK(inp);
	if (inp->inp_flags & INP_DROPPED) {
		INP_WUNLOCK(inp);
		NET_EPOCH_EXIT(et);
		return (ECONNABORTED);
	}
	tp = intotcpcb(inp);

	if (so->so_state & SS_ISDISCONNECTED) {
		error = ECONNABORTED;
		goto out;
	}
	/*
	 * We inline in6_mapped_peeraddr and COMMON_END here, so that we can
	 * copy the data of interest and defer the malloc until after we
	 * release the lock.
	 */
	if (inp->inp_vflag & INP_IPV4) {
		v4 = 1;
		port = inp->inp_fport;
		addr = inp->inp_faddr;
	} else {
		port = inp->inp_fport;
		addr6 = inp->in6p_faddr;
	}
out:
	tcp_bblog_pru(tp, PRU_ACCEPT, error);
	TCP_PROBE2(debug__user, tp, PRU_ACCEPT);
	INP_WUNLOCK(inp);
	NET_EPOCH_EXIT(et);
	if (error == 0) {
		if (v4)
			*nam = in6_v4mapsin6_sockaddr(port, &addr);
		else
			*nam = in6_sockaddr(port, &addr6);
	}
	return (error);
}
#endif /* INET6 */

/*
 * Mark the connection as being incapable of further output.
 */
static int
tcp_usr_shutdown(struct socket *so)
{
	int error = 0;
	struct inpcb *inp;
	struct tcpcb *tp;
	struct epoch_tracker et;

|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
KASSERT(inp != NULL, ("inp == NULL"));
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
2021-12-27 19:58:09 -05:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (ECONNRESET);
|
2006-04-01 11:36:36 -05:00
|
|
|
}
|
2021-12-27 19:58:09 -05:00
|
|
|
tp = intotcpcb(inp);
|
2023-05-06 05:12:06 -04:00
|
|
|
|
2021-12-27 19:58:09 -05:00
|
|
|
NET_EPOCH_ENTER(et);
|
1996-07-11 12:32:50 -04:00
|
|
|
socantsendmore(so);
|
2006-04-01 11:36:36 -05:00
|
|
|
tcp_usrclosed(tp);
|
2009-03-15 05:58:31 -04:00
|
|
|
if (!(inp->inp_flags & INP_DROPPED))
|
2021-12-26 11:48:19 -05:00
|
|
|
error = tcp_output_nodrop(tp);
|
2023-02-21 06:07:35 -05:00
|
|
|
tcp_bblog_pru(tp, PRU_SHUTDOWN, error);
|
2015-09-13 11:50:55 -04:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_SHUTDOWN);
|
2021-12-26 11:48:19 -05:00
|
|
|
error = tcp_unlock_or_drop(tp, error);
|
2019-11-06 19:10:14 -05:00
|
|
|
NET_EPOCH_EXIT(et);
|
2006-04-01 11:36:36 -05:00
|
|
|
|
|
|
|
|
return (error);
|
1996-07-11 12:32:50 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* After a receive, possibly send window update to peer.
|
|
|
|
|
*/
|
|
|
|
|
static int
|
|
|
|
|
tcp_usr_rcvd(struct socket *so, int flags)
|
|
|
|
|
{
|
2020-01-22 00:53:16 -05:00
|
|
|
struct epoch_tracker et;
|
2002-06-10 16:05:46 -04:00
|
|
|
struct inpcb *inp;
|
2023-05-06 05:12:06 -04:00
|
|
|
struct tcpcb *tp;
|
2021-12-26 11:48:19 -05:00
|
|
|
int outrv = 0, error = 0;
|
1996-07-11 12:32:50 -04:00
|
|
|
|
2006-04-01 11:36:36 -05:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_rcvd: inp == NULL"));
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
2021-12-27 13:41:51 -05:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (ECONNRESET);
|
2006-04-01 11:36:36 -05:00
|
|
|
}
|
2021-12-27 13:41:51 -05:00
|
|
|
tp = intotcpcb(inp);
|
2023-05-06 05:12:06 -04:00
|
|
|
|
2021-12-27 13:41:51 -05:00
|
|
|
NET_EPOCH_ENTER(et);
|
2015-12-24 14:09:48 -05:00
|
|
|
/*
|
|
|
|
|
* For passively-created TFO connections, don't attempt a window
|
|
|
|
|
* update while still in SYN_RECEIVED as this may trigger an early
|
|
|
|
|
* SYN|ACK. It is preferable to have the SYN|ACK be sent along with
|
|
|
|
|
* application response data, or failing that, when the DELACK timer
|
|
|
|
|
* expires.
|
|
|
|
|
*/
|
2016-10-12 15:06:50 -04:00
|
|
|
if (IS_FASTOPEN(tp->t_flags) &&
|
2015-12-24 14:09:48 -05:00
|
|
|
(tp->t_state == TCPS_SYN_RECEIVED))
|
|
|
|
|
goto out;
|
2012-06-19 03:34:13 -04:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
|
if (tp->t_flags & TF_TOE)
|
|
|
|
|
tcp_offload_rcvd(tp);
|
2013-01-25 17:50:52 -05:00
|
|
|
else
|
2012-06-19 03:34:13 -04:00
|
|
|
#endif
|
2021-12-26 11:48:19 -05:00
|
|
|
outrv = tcp_output_nodrop(tp);
|
2006-04-01 11:36:36 -05:00
|
|
|
out:
|
2023-02-21 06:07:35 -05:00
|
|
|
tcp_bblog_pru(tp, PRU_RCVD, error);
|
2015-09-13 11:50:55 -04:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_RCVD);
|
2021-12-26 11:48:19 -05:00
|
|
|
(void) tcp_unlock_or_drop(tp, outrv);
|
|
|
|
|
NET_EPOCH_EXIT(et);
|
2006-04-01 11:36:36 -05:00
|
|
|
return (error);
|
1996-07-11 12:32:50 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* Do a send by putting data in output queue and updating urgent
|
1999-06-03 22:27:06 -04:00
|
|
|
* marker if URG set. Possibly send more data. Unlike the other
|
|
|
|
|
* pru_*() routines, the mbuf chains are our responsibility. We
|
|
|
|
|
* must either enqueue them or free them. The other pru_* routines
|
|
|
|
|
* generally are caller-frees.
|
1996-07-11 12:32:50 -04:00
|
|
|
*/
|
|
|
|
|
static int
|
2004-08-16 14:32:07 -04:00
|
|
|
tcp_usr_send(struct socket *so, int flags, struct mbuf *m,
|
2007-03-21 15:37:55 -04:00
|
|
|
struct sockaddr *nam, struct mbuf *control, struct thread *td)
|
1996-07-11 12:32:50 -04:00
|
|
|
{
|
2019-11-06 19:10:14 -05:00
|
|
|
struct epoch_tracker et;
|
1996-07-11 12:32:50 -04:00
|
|
|
int error = 0;
|
2002-06-10 16:05:46 -04:00
|
|
|
struct inpcb *inp;
|
2023-05-06 05:12:06 -04:00
|
|
|
struct tcpcb *tp;
|
2018-07-30 17:27:26 -04:00
|
|
|
#ifdef INET
|
2018-07-31 02:27:05 -04:00
|
|
|
#ifdef INET6
|
|
|
|
|
struct sockaddr_in sin;
|
|
|
|
|
#endif
|
|
|
|
|
struct sockaddr_in *sinp;
|
2018-07-30 17:27:26 -04:00
|
|
|
#endif
|
2000-01-09 14:17:30 -05:00
|
|
|
#ifdef INET6
|
2023-02-03 14:33:36 -05:00
|
|
|
struct sockaddr_in6 *sin6;
|
2000-01-09 14:17:30 -05:00
|
|
|
int isipv6;
|
|
|
|
|
#endif
|
2019-10-24 16:05:10 -04:00
|
|
|
u_int8_t incflagsav;
|
|
|
|
|
u_char vflagsav;
|
|
|
|
|
bool restoreflags;
|
1996-07-11 12:32:50 -04:00
|
|
|
|
2021-12-28 11:50:02 -05:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_send: inp == NULL"));
|
|
|
|
|
INP_WLOCK(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
2021-12-28 11:50:02 -05:00
|
|
|
if (m != NULL && (flags & PRUS_NOTREADY) == 0)
|
|
|
|
|
m_freem(m);
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (ECONNRESET);
|
|
|
|
|
}
|
2023-05-06 05:12:06 -04:00
|
|
|
tp = intotcpcb(inp);
|
2021-12-28 11:50:02 -05:00
|
|
|
|
|
|
|
|
vflagsav = inp->inp_vflag;
|
|
|
|
|
incflagsav = inp->inp_inc.inc_flags;
|
|
|
|
|
restoreflags = false;
|
|
|
|
|
|
|
|
|
|
NET_EPOCH_ENTER(et);
|
2023-05-06 05:12:06 -04:00
|
|
|
if (control != NULL) {
|
|
|
|
|
/* TCP doesn't do control messages (rights, creds, etc) */
|
|
|
|
|
if (control->m_len > 0) {
|
|
|
|
|
m_freem(control);
|
|
|
|
|
error = EINVAL;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
|
|
|
|
m_freem(control); /* empty control, just free it */
|
|
|
|
|
}
|
|
|
|
|
|
2021-05-21 17:44:40 -04:00
|
|
|
if ((flags & PRUS_OOB) != 0 &&
|
|
|
|
|
(error = tcp_pru_options_support(tp, PRUS_OOB)) != 0)
|
|
|
|
|
goto out;
|
|
|
|
|
|
2018-07-30 17:27:26 -04:00
|
|
|
if (nam != NULL && tp->t_state < TCPS_SYN_SENT) {
|
2021-09-07 14:49:53 -04:00
|
|
|
if (tp->t_state == TCPS_LISTEN) {
|
|
|
|
|
error = EINVAL;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
2018-07-30 17:27:26 -04:00
|
|
|
switch (nam->sa_family) {
|
|
|
|
|
#ifdef INET
|
|
|
|
|
case AF_INET:
|
|
|
|
|
sinp = (struct sockaddr_in *)nam;
|
|
|
|
|
if (sinp->sin_len != sizeof(struct sockaddr_in)) {
|
|
|
|
|
error = EINVAL;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
|
|
|
|
if ((inp->inp_vflag & INP_IPV6) != 0) {
|
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
|
|
|
|
if (IN_MULTICAST(ntohl(sinp->sin_addr.s_addr))) {
|
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
2020-07-16 12:46:24 -04:00
|
|
|
if (ntohl(sinp->sin_addr.s_addr) == INADDR_BROADCAST) {
|
|
|
|
|
error = EACCES;
|
2020-06-03 10:16:40 -04:00
|
|
|
goto out;
|
|
|
|
|
}
|
2018-07-30 17:27:26 -04:00
|
|
|
if ((error = prison_remote_ip4(td->td_ucred,
|
2021-05-21 17:44:40 -04:00
|
|
|
&sinp->sin_addr)))
|
2018-07-30 17:27:26 -04:00
|
|
|
goto out;
|
|
|
|
|
#ifdef INET6
|
|
|
|
|
isipv6 = 0;
|
|
|
|
|
#endif
|
|
|
|
|
break;
|
|
|
|
|
#endif /* INET */
|
|
|
|
|
#ifdef INET6
|
|
|
|
|
case AF_INET6:
|
2019-08-02 03:41:36 -04:00
|
|
|
sin6 = (struct sockaddr_in6 *)nam;
|
|
|
|
|
if (sin6->sin6_len != sizeof(*sin6)) {
|
2018-07-30 17:27:26 -04:00
|
|
|
error = EINVAL;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
2020-05-15 10:06:37 -04:00
|
|
|
if ((inp->inp_vflag & INP_IPV6PROTO) == 0) {
|
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
2019-08-02 03:41:36 -04:00
|
|
|
if (IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr)) {
|
2018-07-30 17:27:26 -04:00
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
2019-08-02 03:41:36 -04:00
|
|
|
if (IN6_IS_ADDR_V4MAPPED(&sin6->sin6_addr)) {
|
2018-07-30 17:27:26 -04:00
|
|
|
#ifdef INET
|
|
|
|
|
if ((inp->inp_flags & IN6P_IPV6_V6ONLY) != 0) {
|
|
|
|
|
error = EINVAL;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
|
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0) {
|
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
2019-10-24 16:05:10 -04:00
|
|
|
restoreflags = true;
|
2018-07-30 17:27:26 -04:00
|
|
|
inp->inp_vflag &= ~INP_IPV6;
|
|
|
|
|
sinp = &sin;
|
2019-08-02 03:41:36 -04:00
|
|
|
in6_sin6_2_sin(sinp, sin6);
|
2018-07-30 17:27:26 -04:00
|
|
|
if (IN_MULTICAST(
|
|
|
|
|
ntohl(sinp->sin_addr.s_addr))) {
|
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
|
|
|
|
if ((error = prison_remote_ip4(td->td_ucred,
|
2021-05-21 17:44:40 -04:00
|
|
|
&sinp->sin_addr)))
|
2018-07-30 17:27:26 -04:00
|
|
|
goto out;
|
|
|
|
|
isipv6 = 0;
|
|
|
|
|
#else /* !INET */
|
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
goto out;
|
|
|
|
|
#endif /* INET */
|
|
|
|
|
} else {
|
|
|
|
|
if ((inp->inp_vflag & INP_IPV6) == 0) {
|
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
2019-10-24 16:05:10 -04:00
|
|
|
restoreflags = true;
|
2018-07-30 17:27:26 -04:00
|
|
|
inp->inp_vflag &= ~INP_IPV4;
|
|
|
|
|
inp->inp_inc.inc_flags |= INC_ISIPV6;
|
|
|
|
|
if ((error = prison_remote_ip6(td->td_ucred,
|
2021-05-21 17:44:40 -04:00
|
|
|
&sin6->sin6_addr)))
|
2018-07-30 17:27:26 -04:00
|
|
|
goto out;
|
|
|
|
|
isipv6 = 1;
|
|
|
|
|
}
|
|
|
|
|
break;
|
|
|
|
|
#endif /* INET6 */
|
|
|
|
|
default:
|
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
|
goto out;
|
|
|
|
|
}
|
|
|
|
|
}
	if (!(flags & PRUS_OOB)) {
		if (tp->t_acktime == 0)
			tp->t_acktime = ticks;
		sbappendstream(&so->so_snd, m, flags);
		m = NULL;
		if (nam && tp->t_state < TCPS_SYN_SENT) {
			KASSERT(tp->t_state == TCPS_CLOSED,
			    ("%s: tp %p is listening", __func__, tp));

			/*
			 * Do implied connect if not yet connected,
			 * initialize window to default value, and
			 * initialize maxseg using peer's cached MSS.
			 */
#ifdef INET6
			if (isipv6)
				error = tcp6_connect(tp, sin6, td);
#endif /* INET6 */
#if defined(INET6) && defined(INET)
			else
#endif
#ifdef INET
				error = tcp_connect(tp, sinp, td);
#endif
			/*
			 * The bind operation in tcp_connect succeeded.  We
			 * no longer want to restore the flags if later
			 * operations fail.
			 */
			if (error == 0 || inp->inp_lport != 0)
				restoreflags = false;

			if (error) {
				/* m is freed if PRUS_NOTREADY is unset. */
				sbflush(&so->so_snd);
				goto out;
			}
			if (IS_FASTOPEN(tp->t_flags))
				tcp_fastopen_connect(tp);
			else {
				tp->snd_wnd = TTCP_CLIENT_SND_WND;
				tcp_mss(tp, -1);
			}
		}
		if (flags & PRUS_EOF) {
			/*
			 * Close the send side of the connection after
			 * the data is sent.
			 */
			socantsendmore(so);
			tcp_usrclosed(tp);
		}
		if (TCPS_HAVEESTABLISHED(tp->t_state) &&
		    ((tp->t_flags2 & TF2_FBYTES_COMPLETE) == 0) &&
		    (tp->t_fbyte_out == 0) &&
		    (so->so_snd.sb_ccc > 0)) {
			tp->t_fbyte_out = ticks;
			if (tp->t_fbyte_out == 0)
				tp->t_fbyte_out = 1;
			if (tp->t_fbyte_out && tp->t_fbyte_in)
				tp->t_flags2 |= TF2_FBYTES_COMPLETE;
		}
		if (!(inp->inp_flags & INP_DROPPED) &&
		    !(flags & PRUS_NOTREADY)) {
			if (flags & PRUS_MORETOCOME)
				tp->t_flags |= TF_MORETOCOME;
			error = tcp_output_nodrop(tp);
			if (flags & PRUS_MORETOCOME)
				tp->t_flags &= ~TF_MORETOCOME;
		}
	} else {
		/*
		 * XXXRW: PRUS_EOF not implemented with PRUS_OOB?
		 */
		SOCKBUF_LOCK(&so->so_snd);
		if (sbspace(&so->so_snd) < -512) {
			SOCKBUF_UNLOCK(&so->so_snd);
			error = ENOBUFS;
			goto out;
		}
		/*
		 * According to RFC961 (Assigned Protocols),
		 * the urgent pointer points to the last octet
		 * of urgent data.  We continue, however,
		 * to consider it to indicate the first octet
		 * of data past the urgent section.
		 * Otherwise, snd_up should be one lower.
		 */
		if (tp->t_acktime == 0)
			tp->t_acktime = ticks;
		sbappendstream_locked(&so->so_snd, m, flags);
		SOCKBUF_UNLOCK(&so->so_snd);
		m = NULL;
		if (nam && tp->t_state < TCPS_SYN_SENT) {
			/*
			 * Do implied connect if not yet connected,
			 * initialize window to default value, and
			 * initialize maxseg using peer's cached MSS.
			 */

			/*
			 * Not going to contemplate SYN|URG
			 */
			if (IS_FASTOPEN(tp->t_flags))
				tp->t_flags &= ~TF_FASTOPEN;
#ifdef INET6
			if (isipv6)
				error = tcp6_connect(tp, sin6, td);
#endif /* INET6 */
#if defined(INET6) && defined(INET)
			else
#endif
#ifdef INET
				error = tcp_connect(tp, sinp, td);
#endif
			/*
			 * The bind operation in tcp_connect succeeded.  We
			 * no longer want to restore the flags if later
			 * operations fail.
			 */
			if (error == 0 || inp->inp_lport != 0)
				restoreflags = false;

			if (error != 0) {
				/* m is freed if PRUS_NOTREADY is unset. */
				sbflush(&so->so_snd);
				goto out;
			}
			tp->snd_wnd = TTCP_CLIENT_SND_WND;
			tcp_mss(tp, -1);
		}
		tp->snd_up = tp->snd_una + sbavail(&so->so_snd);
		if ((flags & PRUS_NOTREADY) == 0) {
			tp->t_flags |= TF_FORCEDATA;
			error = tcp_output_nodrop(tp);
			tp->t_flags &= ~TF_FORCEDATA;
		}
	}
	TCP_LOG_EVENT(tp, NULL,
	    &inp->inp_socket->so_rcv,
	    &inp->inp_socket->so_snd,
	    TCP_LOG_USERSEND, error,
	    0, NULL, false);

out:
	/*
	 * In case of PRUS_NOTREADY, the caller or tcp_usr_ready() is
	 * responsible for freeing memory.
	 */
	if (m != NULL && (flags & PRUS_NOTREADY) == 0)
		m_freem(m);

	/*
	 * If the request was unsuccessful and we changed flags,
	 * restore the original flags.
	 */
	if (error != 0 && restoreflags) {
		inp->inp_vflag = vflagsav;
		inp->inp_inc.inc_flags = incflagsav;
	}
	tcp_bblog_pru(tp, (flags & PRUS_OOB) ? PRU_SENDOOB :
	    ((flags & PRUS_EOF) ? PRU_SEND_EOF : PRU_SEND), error);
	TCP_PROBE2(debug__user, tp, (flags & PRUS_OOB) ? PRU_SENDOOB :
	    ((flags & PRUS_EOF) ? PRU_SEND_EOF : PRU_SEND));
	error = tcp_unlock_or_drop(tp, error);
	NET_EPOCH_EXIT(et);
	return (error);
}

static int
tcp_usr_ready(struct socket *so, struct mbuf *m, int count)
{
	struct epoch_tracker et;
	struct inpcb *inp;
	struct tcpcb *tp;
	int error;

	inp = sotoinpcb(so);
	INP_WLOCK(inp);
	if (inp->inp_flags & INP_DROPPED) {
		INP_WUNLOCK(inp);
		mb_free_notready(m, count);
		return (ECONNRESET);
	}
	tp = intotcpcb(inp);

	SOCKBUF_LOCK(&so->so_snd);
	error = sbready(&so->so_snd, m, count);
	SOCKBUF_UNLOCK(&so->so_snd);
	if (error) {
		INP_WUNLOCK(inp);
		return (error);
	}
	NET_EPOCH_ENTER(et);
	error = tcp_output_unlock(tp);
	NET_EPOCH_EXIT(et);

	return (error);
}

/*
 * Abort the TCP.  Drop the connection abruptly.
 */
static void
tcp_usr_abort(struct socket *so)
{
	struct inpcb *inp;
	struct tcpcb *tp;
	struct epoch_tracker et;

	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("tcp_usr_abort: inp == NULL"));

	NET_EPOCH_ENTER(et);
	INP_WLOCK(inp);
	KASSERT(inp->inp_socket != NULL,
	    ("tcp_usr_abort: inp_socket == NULL"));

	/*
	 * If we still have full TCP state, and we're not dropped, drop.
	 */
	if (!(inp->inp_flags & INP_DROPPED)) {
		tp = intotcpcb(inp);
		tp = tcp_drop(tp, ECONNABORTED);
		if (tp == NULL)
			goto dropped;
		tcp_bblog_pru(tp, PRU_ABORT, 0);
		TCP_PROBE2(debug__user, tp, PRU_ABORT);
	}
	if (!(inp->inp_flags & INP_DROPPED)) {
		soref(so);
		inp->inp_flags |= INP_SOCKREF;
	}
	INP_WUNLOCK(inp);
dropped:
	NET_EPOCH_EXIT(et);
}

/*
 * TCP socket is closed.  Start friendly disconnect.
 */
static void
tcp_usr_close(struct socket *so)
{
	struct inpcb *inp;
	struct tcpcb *tp;
	struct epoch_tracker et;

	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("tcp_usr_close: inp == NULL"));

	NET_EPOCH_ENTER(et);
	INP_WLOCK(inp);
	KASSERT(inp->inp_socket != NULL,
	    ("tcp_usr_close: inp_socket == NULL"));

	/*
	 * If we are still connected and we're not dropped, initiate
	 * a disconnect.
	 */
	if (!(inp->inp_flags & INP_DROPPED)) {
		tp = intotcpcb(inp);
		if (tp->t_state != TCPS_TIME_WAIT) {
			tp->t_flags |= TF_CLOSED;
			tcp_disconnect(tp);
			tcp_bblog_pru(tp, PRU_CLOSE, 0);
			TCP_PROBE2(debug__user, tp, PRU_CLOSE);
		}
	}
	if (!(inp->inp_flags & INP_DROPPED)) {
		soref(so);
		inp->inp_flags |= INP_SOCKREF;
	}
	INP_WUNLOCK(inp);
	NET_EPOCH_EXIT(et);
}

static int
tcp_pru_options_support(struct tcpcb *tp, int flags)
{
	/*
	 * If the specific TCP stack has a pru_options
	 * specified then it does not always support
	 * all the PRU_XX options and we must ask it.
	 * If the function is not specified then all
	 * of the PRU_XX options are supported.
	 */
	int ret = 0;

	if (tp->t_fb->tfb_pru_options) {
		ret = (*tp->t_fb->tfb_pru_options)(tp, flags);
	}
	return (ret);
}

/*
 * Receive out-of-band data.
 */
static int
tcp_usr_rcvoob(struct socket *so, struct mbuf *m, int flags)
{
	int error = 0;
	struct inpcb *inp;
	struct tcpcb *tp;
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 11:36:36 -05:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_rcvoob: inp == NULL"));
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
2023-05-06 05:12:06 -04:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (ECONNRESET);
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 11:36:36 -05:00
|
|
|
}
	tp = intotcpcb(inp);

	error = tcp_pru_options_support(tp, PRUS_OOB);
	if (error) {
		goto out;
	}
	if ((so->so_oobmark == 0 &&
	    (so->so_rcv.sb_state & SBS_RCVATMARK) == 0) ||
	    so->so_options & SO_OOBINLINE ||
	    tp->t_oobflags & TCPOOB_HADDATA) {
		error = EINVAL;
		goto out;
	}
	if ((tp->t_oobflags & TCPOOB_HAVEDATA) == 0) {
		error = EWOULDBLOCK;
		goto out;
	}
	m->m_len = 1;
	*mtod(m, caddr_t) = tp->t_iobc;
	if ((flags & MSG_PEEK) == 0)
		tp->t_oobflags ^= (TCPOOB_HAVEDATA | TCPOOB_HADDATA);

out:
	tcp_bblog_pru(tp, PRU_RCVOOB, error);
	TCP_PROBE2(debug__user, tp, PRU_RCVOOB);
	INP_WUNLOCK(inp);
	return (error);
}

#ifdef INET
struct protosw tcp_protosw = {
	.pr_type =		SOCK_STREAM,
	.pr_protocol =		IPPROTO_TCP,
	.pr_flags =		PR_CONNREQUIRED | PR_IMPLOPCL | PR_WANTRCVD |
				    PR_CAPATTACH,
	.pr_ctloutput =		tcp_ctloutput,
	.pr_abort =		tcp_usr_abort,
	.pr_accept =		tcp_usr_accept,
	.pr_attach =		tcp_usr_attach,
	.pr_bind =		tcp_usr_bind,
	.pr_connect =		tcp_usr_connect,
	.pr_control =		in_control,
	.pr_detach =		tcp_usr_detach,
	.pr_disconnect =	tcp_usr_disconnect,
	.pr_listen =		tcp_usr_listen,
	.pr_peeraddr =		in_getpeeraddr,
	.pr_rcvd =		tcp_usr_rcvd,
	.pr_rcvoob =		tcp_usr_rcvoob,
	.pr_send =		tcp_usr_send,
	.pr_ready =		tcp_usr_ready,
	.pr_shutdown =		tcp_usr_shutdown,
	.pr_sockaddr =		in_getsockaddr,
	.pr_sosetlabel =	in_pcbsosetlabel,
	.pr_close =		tcp_usr_close,
};
#endif /* INET */

#ifdef INET6
struct protosw tcp6_protosw = {
	.pr_type =		SOCK_STREAM,
	.pr_protocol =		IPPROTO_TCP,
	.pr_flags =		PR_CONNREQUIRED | PR_IMPLOPCL | PR_WANTRCVD |
				    PR_CAPATTACH,
	.pr_ctloutput =		tcp_ctloutput,
	.pr_abort =		tcp_usr_abort,
	.pr_accept =		tcp6_usr_accept,
	.pr_attach =		tcp_usr_attach,
	.pr_bind =		tcp6_usr_bind,
	.pr_connect =		tcp6_usr_connect,
	.pr_control =		in6_control,
	.pr_detach =		tcp_usr_detach,
	.pr_disconnect =	tcp_usr_disconnect,
	.pr_listen =		tcp6_usr_listen,
	.pr_peeraddr =		in6_mapped_peeraddr,
	.pr_rcvd =		tcp_usr_rcvd,
	.pr_rcvoob =		tcp_usr_rcvoob,
	.pr_send =		tcp_usr_send,
	.pr_ready =		tcp_usr_ready,
	.pr_shutdown =		tcp_usr_shutdown,
	.pr_sockaddr =		in6_mapped_sockaddr,
	.pr_sosetlabel =	in_pcbsosetlabel,
	.pr_close =		tcp_usr_close,
};
#endif /* INET6 */

#ifdef INET
/*
 * Common subroutine to open a TCP connection to remote host specified
 * by struct sockaddr_in.  Call in_pcbconnect() to choose local host address
 * and assign a local port number and install the inpcb into the hash.
 * Initialize connection parameters and enter SYN-SENT state.
 */
static int
tcp_connect(struct tcpcb *tp, struct sockaddr_in *sin, struct thread *td)
{
	struct inpcb *inp = tptoinpcb(tp);
	struct socket *so = tptosocket(tp);
	int error;

	NET_EPOCH_ASSERT();
	INP_WLOCK_ASSERT(inp);

	if (__predict_false((so->so_state &
	    (SS_ISCONNECTING | SS_ISCONNECTED | SS_ISDISCONNECTING |
	    SS_ISDISCONNECTED)) != 0))
		return (EISCONN);

	INP_HASH_WLOCK(&V_tcbinfo);
	error = in_pcbconnect(inp, sin, td->td_ucred, true);
	INP_HASH_WUNLOCK(&V_tcbinfo);
	if (error != 0)
		return (error);

	/*
	 * Compute window scaling to request:
	 * Scale to fit into sweet spot.  See tcp_syncache.c.
	 * XXX: This should move to tcp_output().
	 */
	while (tp->request_r_scale < TCP_MAX_WINSHIFT &&
	    (TCP_MAXWIN << tp->request_r_scale) < sb_max)
		tp->request_r_scale++;

	soisconnecting(so);
	TCPSTAT_INC(tcps_connattempt);
	tcp_state_change(tp, TCPS_SYN_SENT);
	tp->iss = tcp_new_isn(&inp->inp_inc);
	if (tp->t_flags & TF_REQ_TSTMP)
		tp->ts_offset = tcp_new_ts_offset(&inp->inp_inc);
	tcp_sendseqinit(tp);

	return (0);
}
#endif /* INET */
1995-02-09 18:13:27 -05:00
|
|
|
|
2000-01-09 14:17:30 -05:00
|
|
|
#ifdef INET6
|
|
|
|
|
static int
|
2023-02-03 14:33:36 -05:00
|
|
|
tcp6_connect(struct tcpcb *tp, struct sockaddr_in6 *sin6, struct thread *td)
|
2000-01-09 14:17:30 -05:00
|
|
|
{
|
2022-11-08 13:24:40 -05:00
|
|
|
struct inpcb *inp = tptoinpcb(tp);
|
2023-02-14 09:27:47 -05:00
|
|
|
struct socket *so = tptosocket(tp);
|
2000-01-09 14:17:30 -05:00
|
|
|
int error;
|
|
|
|
|
|
2023-02-13 16:21:10 -05:00
|
|
|
NET_EPOCH_ASSERT();
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 11:36:36 -05:00
|
|
|
|
2023-02-14 09:27:47 -05:00
|
|
|
if (__predict_false((so->so_state &
|
|
|
|
|
(SS_ISCONNECTING | SS_ISCONNECTED)) != 0))
|
|
|
|
|
return (EISCONN);
|
|
|
|
|
|
2023-02-03 14:33:35 -05:00
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
2023-02-03 14:33:36 -05:00
|
|
|
error = in6_pcbconnect(inp, sin6, td->td_ucred, true);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 05:43:55 -04:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
2023-02-03 14:33:35 -05:00
|
|
|
if (error != 0)
|
|
|
|
|
return (error);
|
2000-01-09 14:17:30 -05:00
|
|
|
|
|
|
|
|
/* Compute window scaling to request. */
|
|
|
|
|
while (tp->request_r_scale < TCP_MAX_WINSHIFT &&
|
2009-04-07 10:42:40 -04:00
|
|
|
(TCP_MAXWIN << tp->request_r_scale) < sb_max)
|
2000-01-09 14:17:30 -05:00
|
|
|
tp->request_r_scale++;
|
|
|
|
|
|
2023-02-14 09:27:47 -05:00
|
|
|
soisconnecting(so);
|
2009-04-11 18:07:19 -04:00
|
|
|
TCPSTAT_INC(tcps_connattempt);
|
2013-08-25 17:54:41 -04:00
|
|
|
tcp_state_change(tp, TCPS_SYN_SENT);
|
2018-08-19 10:56:10 -04:00
|
|
|
tp->iss = tcp_new_isn(&inp->inp_inc);
|
|
|
|
|
if (tp->t_flags & TF_REQ_TSTMP)
|
|
|
|
|
tp->ts_offset = tcp_new_ts_offset(&inp->inp_inc);
|
2000-01-09 14:17:30 -05:00
|
|
|
tcp_sendseqinit(tp);
|
|
|
|
|
|
2023-02-03 14:33:35 -05:00
|
|
|
return (0);
|
2000-01-09 14:17:30 -05:00
|
|
|
}
|
|
|
|
|
#endif /* INET6 */
|
|
|
|
|
|
2004-11-26 13:58:46 -05:00
|
|
|
/*
|
|
|
|
|
* Export TCP internal state information via a struct tcp_info, based on the
|
|
|
|
|
* Linux 2.6 API. Not ABI compatible as our constants are mapped differently
|
|
|
|
|
* (TCP state machine, etc). We export all information using FreeBSD-native
|
|
|
|
|
* constants -- for example, the numeric values for tcpi_state will differ
|
|
|
|
|
* from Linux.
|
|
|
|
|
*/
|
|
|
|
|
static void
|
2007-03-21 15:37:55 -04:00
|
|
|
tcp_fill_info(struct tcpcb *tp, struct tcp_info *ti)
|
2004-11-26 13:58:46 -05:00
|
|
|
{
|
|
|
|
|
|
2022-11-08 13:24:40 -05:00
|
|
|
INP_WLOCK_ASSERT(tptoinpcb(tp));
|
2004-11-26 13:58:46 -05:00
|
|
|
bzero(ti, sizeof(*ti));
|
|
|
|
|
|
|
|
|
|
ti->tcpi_state = tp->t_state;
|
|
|
|
|
if ((tp->t_flags & TF_REQ_TSTMP) && (tp->t_flags & TF_RCVD_TSTMP))
|
|
|
|
|
ti->tcpi_options |= TCPI_OPT_TIMESTAMPS;
|
2007-05-06 11:56:31 -04:00
|
|
|
if (tp->t_flags & TF_SACK_PERMIT)
|
2004-11-26 13:58:46 -05:00
|
|
|
ti->tcpi_options |= TCPI_OPT_SACK;
|
|
|
|
|
if ((tp->t_flags & TF_REQ_SCALE) && (tp->t_flags & TF_RCVD_SCALE)) {
|
|
|
|
|
ti->tcpi_options |= TCPI_OPT_WSCALE;
|
|
|
|
|
ti->tcpi_snd_wscale = tp->snd_scale;
|
|
|
|
|
ti->tcpi_rcv_wscale = tp->rcv_scale;
|
|
|
|
|
}
|
2023-06-20 17:27:11 -04:00
|
|
|
switch (tp->t_flags2 & (TF2_ECN_PERMIT | TF2_ACE_PERMIT)) {
|
|
|
|
|
case TF2_ECN_PERMIT:
|
|
|
|
|
ti->tcpi_options |= TCPI_OPT_ECN;
|
|
|
|
|
break;
|
|
|
|
|
case TF2_ACE_PERMIT:
|
|
|
|
|
/* FALLTHROUGH */
|
|
|
|
|
case TF2_ECN_PERMIT | TF2_ACE_PERMIT:
|
|
|
|
|
ti->tcpi_options |= TCPI_OPT_ACE;
|
|
|
|
|
break;
|
|
|
|
|
default:
|
|
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
if (IS_FASTOPEN(tp->t_flags))
|
|
|
|
|
ti->tcpi_options |= TCPI_OPT_TFO;
|
2007-02-02 13:34:18 -05:00
|
|
|
|
2009-12-22 10:47:40 -05:00
|
|
|
ti->tcpi_rto = tp->t_rxtcur * tick;
|
2016-10-06 12:28:34 -04:00
|
|
|
ti->tcpi_last_data_recv = ((uint32_t)ticks - tp->t_rcvtime) * tick;
|
2007-02-02 13:34:18 -05:00
|
|
|
ti->tcpi_rtt = ((u_int64_t)tp->t_srtt * tick) >> TCP_RTT_SHIFT;
|
|
|
|
|
ti->tcpi_rttvar = ((u_int64_t)tp->t_rttvar * tick) >> TCP_RTTVAR_SHIFT;
|
|
|
|
|
|
2004-11-26 13:58:46 -05:00
|
|
|
ti->tcpi_snd_ssthresh = tp->snd_ssthresh;
|
|
|
|
|
ti->tcpi_snd_cwnd = tp->snd_cwnd;
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
|
* FreeBSD-specific extension fields for tcp_info.
|
|
|
|
|
*/
|
2004-11-27 15:20:11 -05:00
|
|
|
ti->tcpi_rcv_space = tp->rcv_wnd;
|
2008-05-05 16:13:31 -04:00
|
|
|
ti->tcpi_rcv_nxt = tp->rcv_nxt;
|
2004-11-26 13:58:46 -05:00
|
|
|
ti->tcpi_snd_wnd = tp->snd_wnd;
|
2010-09-16 17:06:45 -04:00
|
|
|
ti->tcpi_snd_bwnd = 0; /* Unused, kept for compat. */
|
2008-05-05 19:13:27 -04:00
|
|
|
ti->tcpi_snd_nxt = tp->snd_nxt;
|
2009-12-22 10:47:40 -05:00
|
|
|
ti->tcpi_snd_mss = tp->t_maxseg;
|
|
|
|
|
ti->tcpi_rcv_mss = tp->t_maxseg;
|
2010-11-17 13:55:12 -05:00
|
|
|
ti->tcpi_snd_rexmitpack = tp->t_sndrexmitpack;
|
|
|
|
|
ti->tcpi_rcv_ooopack = tp->t_rcvoopack;
|
|
|
|
|
ti->tcpi_snd_zerowin = tp->t_sndzerowin;
|
2018-04-02 21:08:54 -04:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
|
if (tp->t_flags & TF_TOE) {
|
|
|
|
|
ti->tcpi_options |= TCPI_OPT_TOE;
|
|
|
|
|
tcp_offload_tcp_info(tp, ti);
|
|
|
|
|
}
|
|
|
|
|
#endif
|
2022-11-06 05:55:52 -05:00
|
|
|
/*
|
|
|
|
|
* AccECN related counters.
|
|
|
|
|
*/
|
|
|
|
|
if ((tp->t_flags2 & (TF2_ECN_PERMIT | TF2_ACE_PERMIT)) ==
|
|
|
|
|
(TF2_ECN_PERMIT | TF2_ACE_PERMIT))
|
|
|
|
|
/*
|
|
|
|
|
* Internal counter starts at 5 for AccECN
|
|
|
|
|
* but 0 for RFC3168 ECN.
|
|
|
|
|
*/
|
|
|
|
|
ti->tcpi_delivered_ce = tp->t_scep - 5;
|
|
|
|
|
else
|
|
|
|
|
ti->tcpi_delivered_ce = tp->t_scep;
|
|
|
|
|
ti->tcpi_received_ce = tp->t_rcep;
|
2004-11-26 13:58:46 -05:00
|
|
|
}
|
|
|
|
|
|
1998-08-22 23:07:17 -04:00
|
|
|
/*
|
2008-01-18 07:19:50 -05:00
|
|
|
* tcp_ctloutput() must drop the inpcb lock before performing copyin on
|
|
|
|
|
* socket option arguments. When it re-acquires the lock after the copy, it
|
|
|
|
|
* has to revalidate that the connection is still valid for the socket
|
|
|
|
|
* option.
|
1998-08-22 23:07:17 -04:00
|
|
|
*/
|
2016-04-26 19:02:18 -04:00
|
|
|
#define INP_WLOCK_RECHECK_CLEANUP(inp, cleanup) do { \
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK(inp); \
|
2022-10-06 22:22:23 -04:00
|
|
|
if (inp->inp_flags & INP_DROPPED) { \
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WUNLOCK(inp); \
|
2016-04-26 19:02:18 -04:00
|
|
|
cleanup; \
|
2008-01-18 07:19:50 -05:00
|
|
|
return (ECONNRESET); \
|
|
|
|
|
} \
|
|
|
|
|
tp = intotcpcb(inp); \
|
|
|
|
|
} while(0)
|
2016-04-26 19:02:18 -04:00
|
|
|
#define INP_WLOCK_RECHECK(inp) INP_WLOCK_RECHECK_CLEANUP((inp), /* noop */)
|
2008-01-18 07:19:50 -05:00
|
|
|
|
2022-02-08 12:49:44 -05:00
|
|
|
int
|
2021-10-25 23:38:31 -04:00
|
|
|
tcp_ctloutput_set(struct inpcb *inp, struct sockopt *sopt)
|
1994-05-24 06:09:53 -04:00
|
|
|
{
|
2022-02-08 12:49:44 -05:00
|
|
|
struct socket *so = inp->inp_socket;
|
|
|
|
|
struct tcpcb *tp = intotcpcb(inp);
|
2021-10-25 23:38:31 -04:00
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
|
|
MPASS(sopt->sopt_dir == SOPT_SET);
|
2022-02-02 03:20:43 -05:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
KASSERT((inp->inp_flags & INP_DROPPED) == 0,
|
2022-02-08 12:49:44 -05:00
|
|
|
("inp_flags == %x", inp->inp_flags));
|
|
|
|
|
KASSERT(so != NULL, ("inp_socket == NULL"));
|
1994-05-24 06:09:53 -04:00
|
|
|
|
1998-08-22 23:07:17 -04:00
|
|
|
if (sopt->sopt_level != IPPROTO_TCP) {
|
2022-02-02 03:20:43 -05:00
|
|
|
INP_WUNLOCK(inp);
|
2000-01-09 14:17:30 -05:00
|
|
|
#ifdef INET6
|
2021-10-25 23:40:12 -04:00
|
|
|
if (inp->inp_vflag & INP_IPV6PROTO)
|
2022-02-08 12:49:44 -05:00
|
|
|
error = ip6_ctloutput(so, sopt);
|
2021-10-25 23:40:12 -04:00
|
|
|
#endif
|
|
|
|
|
#if defined(INET6) && defined(INET)
|
|
|
|
|
else
|
|
|
|
|
#endif
|
|
|
|
|
#ifdef INET
|
2022-02-08 12:49:44 -05:00
|
|
|
error = ip_ctloutput(so, sopt);
|
2021-10-25 23:40:12 -04:00
|
|
|
#endif
|
|
|
|
|
/*
|
|
|
|
|
* When an IP-level socket option affects TCP, pass control
|
|
|
|
|
* down to the stack's tfb_tcp_ctloutput, otherwise return what
|
|
|
|
|
* the IP level returned.
|
|
|
|
|
*/
|
|
|
|
|
switch (sopt->sopt_level) {
|
|
|
|
|
#ifdef INET6
|
|
|
|
|
case IPPROTO_IPV6:
|
|
|
|
|
if ((inp->inp_vflag & INP_IPV6PROTO) == 0)
|
|
|
|
|
return (error);
|
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
|
case IPV6_TCLASS:
|
|
|
|
|
/* Notify tcp stacks that care (e.g. RACK). */
|
|
|
|
|
break;
|
|
|
|
|
case IPV6_USE_MIN_MTU:
|
2021-10-25 23:53:07 -04:00
|
|
|
/* Update t_maxseg accordingly. */
|
|
|
|
|
break;
|
2021-10-25 23:40:12 -04:00
|
|
|
default:
|
|
|
|
|
return (error);
|
2018-08-21 10:12:30 -04:00
|
|
|
}
|
2021-10-25 23:40:12 -04:00
|
|
|
break;
|
2011-04-30 07:21:29 -04:00
|
|
|
#endif
|
|
|
|
|
#ifdef INET
|
2021-10-25 23:40:12 -04:00
|
|
|
case IPPROTO_IP:
|
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
|
case IP_TOS:
|
2022-02-03 13:50:56 -05:00
|
|
|
inp->inp_ip_tos &= ~IPTOS_ECN_MASK;
|
|
|
|
|
break;
|
2021-10-25 23:40:12 -04:00
|
|
|
case IP_TTL:
|
|
|
|
|
/* Notify tcp stacks that care (e.g. RACK). */
|
|
|
|
|
break;
|
|
|
|
|
default:
|
|
|
|
|
return (error);
|
|
|
|
|
}
|
|
|
|
|
break;
|
2008-01-18 07:19:50 -05:00
|
|
|
#endif
|
2021-10-25 23:40:12 -04:00
|
|
|
default:
|
|
|
|
|
return (error);
|
|
|
|
|
}
|
2022-02-02 03:20:43 -05:00
|
|
|
INP_WLOCK(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
2022-02-02 03:20:43 -05:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (ECONNRESET);
|
|
|
|
|
}
|
2021-10-25 23:38:31 -04:00
|
|
|
} else if (sopt->sopt_name == TCP_FUNCTION_BLK) {
|
|
|
|
|
/*
|
|
|
|
|
* Protect the TCP option TCP_FUNCTION_BLK so
|
|
|
|
|
* that a sub-function can *never* overwrite this.
|
|
|
|
|
*/
|
|
|
|
|
struct tcp_function_set fsn;
|
|
|
|
|
struct tcp_function_block *blk;
|
2023-04-01 01:46:38 -04:00
|
|
|
void *ptr = NULL;
|
2021-10-25 23:38:31 -04:00
|
|
|
|
2022-02-02 03:20:43 -05:00
|
|
|
INP_WUNLOCK(inp);
|
2021-10-25 23:38:31 -04:00
|
|
|
error = sooptcopyin(sopt, &fsn, sizeof fsn, sizeof fsn);
|
2015-12-15 19:56:45 -05:00
|
|
|
if (error)
|
|
|
|
|
return (error);
|
2021-10-25 23:38:31 -04:00
|
|
|
|
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
|
|
2015-12-15 19:56:45 -05:00
|
|
|
blk = find_and_ref_tcp_functions(&fsn);
|
|
|
|
|
if (blk == NULL) {
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (ENOENT);
|
|
|
|
|
}
|
2016-08-16 11:11:46 -04:00
|
|
|
if (tp->t_fb == blk) {
|
|
|
|
|
/* You already have this */
|
|
|
|
|
refcount_release(&blk->tfb_refcnt);
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (0);
|
|
|
|
|
}
|
|
|
|
|
if (tp->t_state != TCPS_CLOSED) {
|
2020-02-12 08:31:36 -05:00
|
|
|
/*
|
2016-08-16 11:11:46 -04:00
|
|
|
* The user has advanced the state
|
|
|
|
|
* past the initial point, so we may not
|
2020-02-12 08:31:36 -05:00
|
|
|
* be able to switch.
|
2016-08-16 11:11:46 -04:00
|
|
|
*/
|
|
|
|
|
if (blk->tfb_tcp_handoff_ok != NULL) {
|
2020-02-12 08:31:36 -05:00
|
|
|
/*
|
2016-08-16 11:11:46 -04:00
|
|
|
* Does the stack provide a
|
|
|
|
|
* query mechanism? If so, it may
|
|
|
|
|
* still be possible.
|
|
|
|
|
*/
|
|
|
|
|
error = (*blk->tfb_tcp_handoff_ok)(tp);
|
2018-08-24 06:50:19 -04:00
|
|
|
} else
|
|
|
|
|
error = EINVAL;
|
2016-08-16 11:11:46 -04:00
|
|
|
if (error) {
|
2015-12-15 19:56:45 -05:00
|
|
|
refcount_release(&blk->tfb_refcnt);
|
|
|
|
|
INP_WUNLOCK(inp);
|
2016-08-16 11:11:46 -04:00
|
|
|
return(error);
|
2015-12-15 19:56:45 -05:00
|
|
|
}
|
2016-08-16 11:11:46 -04:00
|
|
|
}
|
|
|
|
|
if (blk->tfb_flags & TCP_FUNC_BEING_REMOVED) {
|
|
|
|
|
refcount_release(&blk->tfb_refcnt);
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (ENOENT);
|
|
|
|
|
}
|
2020-02-12 08:31:36 -05:00
|
|
|
/*
|
2023-04-01 01:46:38 -04:00
|
|
|
* Ensure the new stack takes ownership with a
|
|
|
|
|
* clean slate on peak rate threshold.
|
2016-08-16 11:11:46 -04:00
|
|
|
*/
|
2020-02-12 08:31:36 -05:00
|
|
|
#ifdef TCPHPTS
|
2018-04-19 09:37:59 -04:00
|
|
|
/* Assure that we are not on any hpts */
|
2023-04-25 15:18:33 -04:00
|
|
|
tcp_hpts_remove(tp);
|
2018-04-19 09:37:59 -04:00
|
|
|
#endif
|
|
|
|
|
if (blk->tfb_tcp_fb_init) {
|
2023-04-01 01:46:38 -04:00
|
|
|
error = (*blk->tfb_tcp_fb_init)(tp, &ptr);
|
2018-04-19 09:37:59 -04:00
|
|
|
if (error) {
|
2023-04-01 01:46:38 -04:00
|
|
|
/*
|
|
|
|
|
* Release the ref count the lookup
|
|
|
|
|
* acquired.
|
|
|
|
|
*/
|
2018-04-19 09:37:59 -04:00
|
|
|
refcount_release(&blk->tfb_refcnt);
|
2023-04-01 01:46:38 -04:00
|
|
|
/*
|
|
|
|
|
* Now there is a chance that the
|
|
|
|
|
* init() function mucked with some
|
|
|
|
|
* things before it failed, such as
|
|
|
|
|
* hpts or inp_flags2 or timer granularity.
|
|
|
|
|
* It should not have, but let's give the old
|
|
|
|
|
* stack a chance to reset to a known good state.
|
|
|
|
|
*/
|
|
|
|
|
if (tp->t_fb->tfb_switch_failed) {
|
|
|
|
|
(*tp->t_fb->tfb_switch_failed)(tp);
|
2018-04-19 09:37:59 -04:00
|
|
|
}
|
2023-04-01 01:46:38 -04:00
|
|
|
goto err_out;
|
2018-04-19 09:37:59 -04:00
|
|
|
}
|
|
|
|
|
}
|
2023-04-01 01:46:38 -04:00
|
|
|
if (tp->t_fb->tfb_tcp_fb_fini) {
|
|
|
|
|
struct epoch_tracker et;
|
|
|
|
|
/*
|
|
|
|
|
* Tell the stack to clean up with 0, i.e.
|
|
|
|
|
* the tcb is not going away.
|
|
|
|
|
*/
|
|
|
|
|
NET_EPOCH_ENTER(et);
|
|
|
|
|
(*tp->t_fb->tfb_tcp_fb_fini)(tp, 0);
|
|
|
|
|
NET_EPOCH_EXIT(et);
|
|
|
|
|
}
|
|
|
|
|
/*
|
|
|
|
|
* Release the old refcnt, the
|
|
|
|
|
* lookup acquired a ref on the
|
|
|
|
|
* new one already.
|
|
|
|
|
*/
|
2016-08-16 11:11:46 -04:00
|
|
|
refcount_release(&tp->t_fb->tfb_refcnt);
|
2023-04-01 01:46:38 -04:00
|
|
|
/*
|
|
|
|
|
* Set in the new stack.
|
|
|
|
|
*/
|
2016-08-16 11:11:46 -04:00
|
|
|
tp->t_fb = blk;
|
2023-04-01 01:46:38 -04:00
|
|
|
tp->t_fb_ptr = ptr;
|
2015-12-15 19:56:45 -05:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
|
if (tp->t_flags & TF_TOE) {
|
|
|
|
|
tcp_offload_ctloutput(tp, sopt->sopt_dir,
|
|
|
|
|
sopt->sopt_name);
|
|
|
|
|
}
|
|
|
|
|
#endif
|
2018-04-19 09:37:59 -04:00
|
|
|
err_out:
|
2015-12-15 19:56:45 -05:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (error);
|
2023-04-01 01:46:38 -04:00
|
|
|
|
2021-10-25 23:38:31 -04:00
|
|
|
}
|
|
|
|
|
|
2022-02-02 03:20:43 -05:00
|
|
|
/* Pass in the INP locked, callee must unlock it. */
|
2023-04-07 15:18:10 -04:00
|
|
|
return (tp->t_fb->tfb_tcp_ctloutput(tp, sopt));
|
2021-10-25 23:38:31 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static int
|
|
|
|
|
tcp_ctloutput_get(struct inpcb *inp, struct sockopt *sopt)
|
|
|
|
|
{
|
2022-02-08 12:49:44 -05:00
|
|
|
struct socket *so = inp->inp_socket;
|
|
|
|
|
struct tcpcb *tp = intotcpcb(inp);
|
|
|
|
|
int error = 0;
|
2021-10-25 23:38:31 -04:00
|
|
|
|
|
|
|
|
MPASS(sopt->sopt_dir == SOPT_GET);
|
2022-02-02 03:20:43 -05:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
KASSERT((inp->inp_flags & INP_DROPPED) == 0,
|
2022-02-08 12:49:44 -05:00
|
|
|
("inp_flags == %x", inp->inp_flags));
|
|
|
|
|
KASSERT(so != NULL, ("inp_socket == NULL"));
|
2021-10-25 23:38:31 -04:00
|
|
|
|
|
|
|
|
if (sopt->sopt_level != IPPROTO_TCP) {
|
2022-02-02 03:20:43 -05:00
|
|
|
INP_WUNLOCK(inp);
|
2021-10-25 23:38:31 -04:00
|
|
|
#ifdef INET6
|
|
|
|
|
if (inp->inp_vflag & INP_IPV6PROTO)
|
2022-02-08 12:49:44 -05:00
|
|
|
error = ip6_ctloutput(so, sopt);
|
2021-10-25 23:38:31 -04:00
|
|
|
#endif /* INET6 */
|
|
|
|
|
#if defined(INET6) && defined(INET)
|
|
|
|
|
else
|
|
|
|
|
#endif
|
|
|
|
|
#ifdef INET
|
2022-02-08 12:49:44 -05:00
|
|
|
error = ip_ctloutput(so, sopt);
|
2021-10-25 23:38:31 -04:00
|
|
|
#endif
|
|
|
|
|
return (error);
|
|
|
|
|
}
|
|
|
|
|
if (((sopt->sopt_name == TCP_FUNCTION_BLK) ||
|
2021-10-25 23:08:54 -04:00
|
|
|
(sopt->sopt_name == TCP_FUNCTION_ALIAS))) {
|
2021-10-25 23:38:31 -04:00
|
|
|
struct tcp_function_set fsn;
|
|
|
|
|
|
2021-10-25 23:08:54 -04:00
|
|
|
if (sopt->sopt_name == TCP_FUNCTION_ALIAS) {
|
|
|
|
|
memset(&fsn, 0, sizeof(fsn));
|
|
|
|
|
find_tcp_function_alias(tp->t_fb, &fsn);
|
|
|
|
|
} else {
|
|
|
|
|
strncpy(fsn.function_set_name,
|
|
|
|
|
tp->t_fb->tfb_tcp_block_name,
|
|
|
|
|
TCP_FUNCTION_NAME_LEN_MAX);
|
|
|
|
|
fsn.function_set_name[TCP_FUNCTION_NAME_LEN_MAX - 1] = '\0';
|
|
|
|
|
}
|
2015-12-15 19:56:45 -05:00
|
|
|
fsn.pcbcnt = tp->t_fb->tfb_refcnt;
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
error = sooptcopyout(sopt, &fsn, sizeof fsn);
|
|
|
|
|
return (error);
|
|
|
|
|
}
|
2021-10-25 23:38:31 -04:00
|
|
|
|
2022-02-02 03:20:43 -05:00
|
|
|
/* Pass in the INP locked, callee must unlock it. */
|
2023-04-07 15:18:10 -04:00
|
|
|
return (tp->t_fb->tfb_tcp_ctloutput(tp, sopt));
|
2021-10-25 23:38:31 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
int
|
|
|
|
|
tcp_ctloutput(struct socket *so, struct sockopt *sopt)
|
|
|
|
|
{
|
|
|
|
|
struct inpcb *inp;
|
|
|
|
|
|
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
|
KASSERT(inp != NULL, ("tcp_ctloutput: inp == NULL"));
|
|
|
|
|
|
2022-02-02 03:20:43 -05:00
|
|
|
INP_WLOCK(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
2022-02-02 03:20:43 -05:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (ECONNRESET);
|
|
|
|
|
}
|
2021-10-25 23:38:31 -04:00
|
|
|
if (sopt->sopt_dir == SOPT_SET)
|
|
|
|
|
return (tcp_ctloutput_set(inp, sopt));
|
|
|
|
|
else if (sopt->sopt_dir == SOPT_GET)
|
|
|
|
|
return (tcp_ctloutput_get(inp, sopt));
|
|
|
|
|
else
|
|
|
|
|
panic("%s: sopt_dir %d", __func__, sopt->sopt_dir);
|
2015-12-15 19:56:45 -05:00
|
|
|
}
|
1994-05-24 06:09:53 -04:00
|
|
|
|
2018-03-22 05:40:08 -04:00
|
|
|
/*
|
|
|
|
|
* If this assert becomes untrue, we need to change the size of the buf
|
|
|
|
|
* variable in tcp_default_ctloutput().
|
|
|
|
|
*/
|
|
|
|
|
#ifdef CTASSERT
|
|
|
|
|
CTASSERT(TCP_CA_NAME_MAX <= TCP_LOG_ID_LEN);
|
|
|
|
|
CTASSERT(TCP_LOG_REASON_LEN <= TCP_LOG_ID_LEN);
|
|
|
|
|
#endif
|
|
|
|
|
|
2020-04-27 18:31:42 -04:00
|
|
|
#ifdef KERN_TLS
|
|
|
|
|
static int
|
|
|
|
|
copyin_tls_enable(struct sockopt *sopt, struct tls_enable *tls)
|
|
|
|
|
{
|
|
|
|
|
struct tls_enable_v0 tls_v0;
|
|
|
|
|
int error;
|
|
|
|
|
|
|
|
|
|
if (sopt->sopt_valsize == sizeof(tls_v0)) {
|
|
|
|
|
error = sooptcopyin(sopt, &tls_v0, sizeof(tls_v0),
|
|
|
|
|
sizeof(tls_v0));
|
|
|
|
|
if (error)
|
|
|
|
|
return (error);
|
|
|
|
|
memset(tls, 0, sizeof(*tls));
|
|
|
|
|
tls->cipher_key = tls_v0.cipher_key;
|
|
|
|
|
tls->iv = tls_v0.iv;
|
|
|
|
|
tls->auth_key = tls_v0.auth_key;
|
|
|
|
|
tls->cipher_algorithm = tls_v0.cipher_algorithm;
|
|
|
|
|
tls->cipher_key_len = tls_v0.cipher_key_len;
|
|
|
|
|
tls->iv_len = tls_v0.iv_len;
|
|
|
|
|
tls->auth_algorithm = tls_v0.auth_algorithm;
|
|
|
|
|
tls->auth_key_len = tls_v0.auth_key_len;
|
|
|
|
|
tls->flags = tls_v0.flags;
|
|
|
|
|
tls->tls_vmajor = tls_v0.tls_vmajor;
|
|
|
|
|
tls->tls_vminor = tls_v0.tls_vminor;
|
|
|
|
|
return (0);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
return (sooptcopyin(sopt, tls, sizeof(*tls), sizeof(*tls)));
|
|
|
|
|
}
|
|
|
|
|
#endif
|
|
|
|
|
|
2021-11-11 06:28:18 -05:00
|
|
|
extern struct cc_algo newreno_cc_algo;
|
|
|
|
|
|
|
|
|
|
static int
|
2022-02-21 06:30:17 -05:00
|
|
|
tcp_set_cc_mod(struct inpcb *inp, struct sockopt *sopt)
|
2021-11-11 06:28:18 -05:00
|
|
|
{
|
|
|
|
|
struct cc_algo *algo;
|
|
|
|
|
void *ptr = NULL;
|
2022-02-02 03:20:43 -05:00
|
|
|
struct tcpcb *tp;
|
2021-11-11 06:28:18 -05:00
|
|
|
struct cc_var cc_mem;
|
|
|
|
|
char buf[TCP_CA_NAME_MAX];
|
|
|
|
|
size_t mem_sz;
|
|
|
|
|
int error;
|
|
|
|
|
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
error = sooptcopyin(sopt, buf, TCP_CA_NAME_MAX - 1, 1);
|
|
|
|
|
if (error)
|
|
|
|
|
return(error);
|
|
|
|
|
buf[sopt->sopt_valsize] = '\0';
|
|
|
|
|
CC_LIST_RLOCK();
|
2022-02-21 06:30:17 -05:00
|
|
|
STAILQ_FOREACH(algo, &cc_list, entries) {
|
2021-11-11 06:28:18 -05:00
|
|
|
if (strncmp(buf, algo->name,
|
|
|
|
|
TCP_CA_NAME_MAX) == 0) {
|
|
|
|
|
if (algo->flags & CC_MODULE_BEING_REMOVED) {
|
|
|
|
|
/* We can't "see" modules being unloaded */
|
|
|
|
|
continue;
|
|
|
|
|
}
|
|
|
|
|
break;
|
|
|
|
|
}
|
2022-02-21 06:30:17 -05:00
|
|
|
}
|
2021-11-11 06:28:18 -05:00
|
|
|
if (algo == NULL) {
|
|
|
|
|
CC_LIST_RUNLOCK();
|
|
|
|
|
return(ESRCH);
|
|
|
|
|
}
|
2022-02-21 06:30:17 -05:00
|
|
|
/*
|
|
|
|
|
* With a reference the algorithm cannot be removed
|
|
|
|
|
* so we hold a reference through the change process.
|
|
|
|
|
*/
|
|
|
|
|
cc_refer(algo);
|
|
|
|
|
CC_LIST_RUNLOCK();
|
2021-11-11 06:28:18 -05:00
|
|
|
if (algo->cb_init != NULL) {
|
|
|
|
|
/* We can now pre-get the memory for the CC */
|
|
|
|
|
mem_sz = (*algo->cc_data_sz)();
|
|
|
|
|
if (mem_sz == 0) {
|
|
|
|
|
goto no_mem_needed;
|
|
|
|
|
}
|
|
|
|
|
ptr = malloc(mem_sz, M_CC_MEM, M_WAITOK);
|
|
|
|
|
} else {
|
|
|
|
|
no_mem_needed:
|
|
|
|
|
mem_sz = 0;
|
|
|
|
|
ptr = NULL;
|
|
|
|
|
}
|
|
|
|
|
/*
|
|
|
|
|
* Make sure it's all clean and zeroed and also get
|
|
|
|
|
* back the inp lock.
|
|
|
|
|
*/
|
|
|
|
|
memset(&cc_mem, 0, sizeof(cc_mem));
|
2021-11-12 16:08:18 -05:00
|
|
|
INP_WLOCK(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
2021-11-12 16:08:18 -05:00
|
|
|
INP_WUNLOCK(inp);
|
2022-02-21 06:30:17 -05:00
|
|
|
if (ptr)
|
|
|
|
|
free(ptr, M_CC_MEM);
|
|
|
|
|
/* Release our temp reference */
|
|
|
|
|
CC_LIST_RLOCK();
|
|
|
|
|
cc_release(algo);
|
2021-11-12 16:08:18 -05:00
|
|
|
CC_LIST_RUNLOCK();
|
|
|
|
|
return (ECONNRESET);
|
|
|
|
|
}
|
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
|
if (ptr != NULL)
|
2021-11-11 06:28:18 -05:00
|
|
|
memset(ptr, 0, mem_sz);
|
|
|
|
|
cc_mem.ccvc.tcp = tp;
|
|
|
|
|
/*
|
|
|
|
|
* We once again hold a write lock over the tcb so it's
|
|
|
|
|
* safe to do these things without ordering concerns.
|
|
|
|
|
* Note here we init into stack memory.
|
|
|
|
|
*/
|
|
|
|
|
if (algo->cb_init != NULL)
|
|
|
|
|
error = algo->cb_init(&cc_mem, ptr);
|
|
|
|
|
else
|
|
|
|
|
error = 0;
|
|
|
|
|
/*
|
|
|
|
|
* The CC algorithms, when given their memory
|
|
|
|
|
* should not fail; we could in theory have a
|
|
|
|
|
* KASSERT here.
|
|
|
|
|
*/
|
|
|
|
|
if (error == 0) {
|
|
|
|
|
/*
|
|
|
|
|
* Touchdown, let's go ahead and move the
|
|
|
|
|
* connection to the new CC module by
|
|
|
|
|
* copying in the cc_mem after we call
|
|
|
|
|
* the old one's cleanup (if any).
|
|
|
|
|
*/
|
|
|
|
|
if (CC_ALGO(tp)->cb_destroy != NULL)
|
tcp: embed inpcb into tcpcb
For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.
The most important one is the inpcb itself. With embedding we can provide
a strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce the number of allocs/frees per connection.
The embedded inpcb is placed at the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such a move.
The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and the temporary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through an extra pointer
every time we accessed this data.
One interesting side effect is that now TCP data is allocated from
an SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.
Large part of the change was done with sed script:
s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g
Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, which added several <netinet/in_pcb.h> includes.
Differential revision: https://reviews.freebsd.org/D37127
2022-12-07 12:00:48 -05:00
|
|
|
CC_ALGO(tp)->cb_destroy(&tp->t_ccv);
|
2022-02-21 06:30:17 -05:00
|
|
|
/* Detach the old CC from the tcpcb */
|
|
|
|
|
cc_detach(tp);
|
|
|
|
|
/* Copy in our temp memory that was inited */
|
2022-12-07 12:00:48 -05:00
|
|
|
memcpy(&tp->t_ccv, &cc_mem, sizeof(struct cc_var));
|
2022-02-21 06:30:17 -05:00
|
|
|
/* Now attach the new, which takes a reference */
|
|
|
|
|
cc_attach(tp, algo);
|
2021-11-11 06:28:18 -05:00
|
|
|
/* OK, are we past the point where conn_init has run? */
|
|
|
|
|
if (TCPS_HAVEESTABLISHED(tp->t_state) && (CC_ALGO(tp)->conn_init != NULL)) {
|
|
|
|
|
/* Yep run the connection init for the new CC */
|
2022-12-07 12:00:48 -05:00
|
|
|
CC_ALGO(tp)->conn_init(&tp->t_ccv);
|
2021-11-11 06:28:18 -05:00
|
|
|
}
|
|
|
|
|
} else if (ptr)
|
|
|
|
|
free(ptr, M_CC_MEM);
|
|
|
|
|
INP_WUNLOCK(inp);
|
2022-02-21 06:30:17 -05:00
|
|
|
/* Now lets release our temp reference */
|
|
|
|
|
CC_LIST_RLOCK();
|
|
|
|
|
cc_release(algo);
|
|
|
|
|
CC_LIST_RUNLOCK();
|
2021-11-11 06:28:18 -05:00
|
|
|
return (error);
|
|
|
|
|
}
|
|
|
|
|
|
2015-12-15 19:56:45 -05:00
|
|
|
int
|
2023-04-07 15:18:10 -04:00
|
|
|
tcp_default_ctloutput(struct tcpcb *tp, struct sockopt *sopt)
|
2015-12-15 19:56:45 -05:00
|
|
|
{
|
2023-04-07 15:18:10 -04:00
|
|
|
struct inpcb *inp = tptoinpcb(tp);
|
2015-12-15 19:56:45 -05:00
|
|
|
int error, opt, optval;
|
|
|
|
|
u_int ui;
|
|
|
|
|
struct tcp_info ti;
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of sendfile,
sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid disabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
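Querying and toggling the mode via TCP_TXTLS_MODE can be sketched as below. The fallback option value and the mention of TCP_TLS_MODE_* constants are assumptions for illustration; on FreeBSD both come from <netinet/tcp.h>:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* FreeBSD-specific; fallback value is an assumption for portability. */
#ifndef TCP_TXTLS_MODE
#define TCP_TXTLS_MODE 86
#endif

/*
 * Return the current transmit KTLS mode of fd, or -1 (errno set) if the
 * query fails (e.g. KTLS is not enabled on this connection).
 */
static int
get_txtls_mode(int fd)
{
	unsigned int mode;
	socklen_t len = sizeof(mode);

	if (getsockopt(fd, IPPROTO_TCP, TCP_TXTLS_MODE, &mode, &len) == -1)
		return (-1);
	return ((int)mode);
}

/*
 * Ask the kernel to switch the connection between software and ifnet
 * TLS; mode would be one of the TCP_TLS_MODE_* values on FreeBSD.
 */
static int
set_txtls_mode(int fd, unsigned int mode)
{
	return (setsockopt(fd, IPPROTO_TCP, TCP_TXTLS_MODE,
	    &mode, sizeof(mode)));
}
```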
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
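Taken together, the knobs named above amount to the following configuration sketch (the option and sysctl names are exactly those listed in the text):

```
# Kernel configuration file: build the kernel with KTLS support
options KERN_TLS

# /etc/sysctl.conf (or set at runtime with sysctl(8))
kern.ipc.tls.enable=1
kern.ipc.mb_use_ext_pgs=1
```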
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-26 20:01:56 -04:00
|
|
|
#ifdef KERN_TLS
|
|
|
|
|
struct tls_enable tls;
|
2022-02-09 06:16:43 -05:00
|
|
|
struct socket *so = inp->inp_socket;
|
Add kernel-side support for in-kernel TLS.
2019-08-26 20:01:56 -04:00
|
|
|
#endif
|
2018-03-22 05:40:08 -04:00
|
|
|
char *pbuf, buf[TCP_LOG_ID_LEN];
|
2019-12-02 15:58:04 -05:00
|
|
|
#ifdef STATS
|
|
|
|
|
struct statsblob *sbp;
|
|
|
|
|
#endif
|
2016-01-27 02:34:00 -05:00
|
|
|
size_t len;
|
2016-01-21 21:07:48 -05:00
|
|
|
|
2021-10-25 23:53:07 -04:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
2022-10-06 22:22:23 -04:00
|
|
|
KASSERT((inp->inp_flags & INP_DROPPED) == 0,
|
2022-02-08 12:49:44 -05:00
|
|
|
("inp_flags == %x", inp->inp_flags));
|
2022-02-09 06:16:43 -05:00
|
|
|
KASSERT(inp->inp_socket != NULL, ("inp_socket == NULL"));
|
2021-10-25 23:53:07 -04:00
|
|
|
|
|
|
|
|
switch (sopt->sopt_level) {
|
|
|
|
|
#ifdef INET6
|
|
|
|
|
case IPPROTO_IPV6:
|
|
|
|
|
MPASS(inp->inp_vflag & INP_IPV6PROTO);
|
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
|
case IPV6_USE_MIN_MTU:
|
|
|
|
|
tcp6_use_min_mtu(tp);
|
|
|
|
|
/* FALLTHROUGH */
|
|
|
|
|
}
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (0);
|
|
|
|
|
#endif
|
|
|
|
|
#ifdef INET
|
|
|
|
|
case IPPROTO_IP:
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
return (0);
|
|
|
|
|
#endif
|
|
|
|
|
}
|
|
|
|
|
|
2016-01-21 21:07:48 -05:00
|
|
|
/*
|
|
|
|
|
* For TCP_CCALGOOPT forward the control to CC module, for both
|
|
|
|
|
* SOPT_SET and SOPT_GET.
|
|
|
|
|
*/
|
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
|
case TCP_CCALGOOPT:
|
|
|
|
|
INP_WUNLOCK(inp);
|
2018-11-30 05:50:07 -05:00
|
|
|
if (sopt->sopt_valsize > CC_ALGOOPT_LIMIT)
|
|
|
|
|
return (EINVAL);
|
2016-01-27 02:34:00 -05:00
|
|
|
pbuf = malloc(sopt->sopt_valsize, M_TEMP, M_WAITOK | M_ZERO);
|
|
|
|
|
error = sooptcopyin(sopt, pbuf, sopt->sopt_valsize,
|
2016-01-21 21:07:48 -05:00
|
|
|
sopt->sopt_valsize);
|
|
|
|
|
if (error) {
|
2016-01-27 02:34:00 -05:00
|
|
|
free(pbuf, M_TEMP);
|
2016-01-21 21:07:48 -05:00
|
|
|
return (error);
|
|
|
|
|
}
|
2016-04-26 19:02:18 -04:00
|
|
|
INP_WLOCK_RECHECK_CLEANUP(inp, free(pbuf, M_TEMP));
|
2016-01-21 21:07:48 -05:00
|
|
|
if (CC_ALGO(tp)->ctl_output != NULL)
|
tcp: embed inpcb into tcpcb
For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.
The most important one is the inpcb itself. With embedding we can
provide a strong guarantee that with a valid TCP inpcb the tcpcb is
always valid and vice versa. Also we reduce the number of allocs/frees
per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.
The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and the temporary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through an extra
pointer every time we accessed this data.
One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.
A large part of the change was done with a sed script:
s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g
Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.
Differential revision: https://reviews.freebsd.org/D37127
2022-12-07 12:00:48 -05:00
|
|
|
error = CC_ALGO(tp)->ctl_output(&tp->t_ccv, sopt, pbuf);
|
2016-01-21 21:07:48 -05:00
|
|
|
else
|
|
|
|
|
error = ENOENT;
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
if (error == 0 && sopt->sopt_dir == SOPT_GET)
|
2016-01-27 02:34:00 -05:00
|
|
|
error = sooptcopyout(sopt, pbuf, sopt->sopt_valsize);
|
|
|
|
|
free(pbuf, M_TEMP);
|
2016-01-21 21:07:48 -05:00
|
|
|
return (error);
|
|
|
|
|
}
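From userspace, TCP_CCALGOOPT is just an opaque buffer forwarded to the congestion control module, as the comment above notes. A minimal wrapper might look like this; the fallback option value is an assumption (FreeBSD defines it in <netinet/tcp.h>), and the kernel-side size cap is CC_ALGOOPT_LIMIT as in the code above:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* FreeBSD-specific; fallback value is an assumption for portability. */
#ifndef TCP_CCALGOOPT
#define TCP_CCALGOOPT 65
#endif

/*
 * Forward an opaque option blob to the connection's congestion control
 * module; the kernel rejects buffers larger than CC_ALGOOPT_LIMIT.
 * Works for both set (SOPT_SET) and get (SOPT_GET) directions via
 * setsockopt/getsockopt respectively; only the set side is shown.
 */
static int
set_cc_algo_opt(int fd, const void *buf, socklen_t len)
{
	return (setsockopt(fd, IPPROTO_TCP, TCP_CCALGOOPT, buf, len));
}
```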
|
|
|
|
|
|
1998-08-22 23:07:17 -04:00
|
|
|
switch (sopt->sopt_dir) {
|
|
|
|
|
case SOPT_SET:
|
|
|
|
|
switch (sopt->sopt_name) {
|
2017-02-06 03:49:57 -05:00
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
2004-02-16 17:21:16 -05:00
|
|
|
case TCP_MD5SIG:
|
2022-06-23 10:50:47 -04:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
if (!TCPMD5_ENABLED())
|
2017-02-06 03:49:57 -05:00
|
|
|
return (ENOPROTOOPT);
|
|
|
|
|
error = TCPMD5_PCBCTL(inp, sopt);
|
Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
2004-02-10 23:26:04 -05:00
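Per the commit message above, on FreeBSD the TCP_MD5SIG socket option is a per-socket toggle while the shared secrets themselves are installed in the SADB via the PF_KEY interface (e.g. with setkey(8)). A hedged sketch of the userland side follows; note that other systems (e.g. Linux) give TCP_MD5SIG a different value and semantics, so the fallback define is only an assumption:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Fallback is an assumption; FreeBSD defines this in <netinet/tcp.h>. */
#ifndef TCP_MD5SIG
#define TCP_MD5SIG 16
#endif

/*
 * Enable RFC 2385 TCP-MD5 signatures on a socket (FreeBSD semantics:
 * a boolean toggle; the keys live in the SADB, managed via PF_KEY).
 */
static int
enable_tcp_md5(int fd)
{
	int on = 1;

	return (setsockopt(fd, IPPROTO_TCP, TCP_MD5SIG, &on, sizeof(on)));
}
```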
|
|
|
if (error)
|
2008-01-18 07:19:50 -05:00
|
|
|
return (error);
|
2022-06-23 10:50:47 -04:00
|
|
|
INP_WLOCK_RECHECK(inp);
|
2012-06-19 03:34:13 -04:00
|
|
|
goto unlock_and_done;
|
2017-02-06 03:49:57 -05:00
|
|
|
#endif /* IPSEC */
|
2012-06-19 03:34:13 -04:00
|
|
|
|
1994-05-24 06:09:53 -04:00
|
|
|
case TCP_NODELAY:
|
1998-08-22 23:07:17 -04:00
|
|
|
case TCP_NOOPT:
|
2021-05-10 12:47:47 -04:00
|
|
|
case TCP_LRD:
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WUNLOCK(inp);
|
1998-08-22 23:07:17 -04:00
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
2008-01-18 07:19:50 -05:00
|
|
|
sizeof optval);
|
1998-08-22 23:07:17 -04:00
|
|
|
if (error)
|
2008-01-18 07:19:50 -05:00
|
|
|
return (error);
|
1998-08-22 23:07:17 -04:00
|
|
|
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK_RECHECK(inp);
|
1998-08-22 23:07:17 -04:00
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
|
case TCP_NODELAY:
|
|
|
|
|
opt = TF_NODELAY;
|
|
|
|
|
break;
|
|
|
|
|
case TCP_NOOPT:
|
|
|
|
|
opt = TF_NOOPT;
|
|
|
|
|
break;
|
2021-05-10 12:47:47 -04:00
|
|
|
case TCP_LRD:
|
|
|
|
|
opt = TF_LRD;
|
|
|
|
|
break;
|
1998-08-22 23:07:17 -04:00
|
|
|
default:
|
|
|
|
|
opt = 0; /* dead code to fool gcc */
|
|
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
if (optval)
|
|
|
|
|
tp->t_flags |= opt;
|
1994-05-24 06:09:53 -04:00
|
|
|
else
|
1998-08-22 23:07:17 -04:00
|
|
|
tp->t_flags &= ~opt;
|
2012-06-19 03:34:13 -04:00
|
|
|
unlock_and_done:
|
|
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
|
if (tp->t_flags & TF_TOE) {
|
|
|
|
|
tcp_offload_ctloutput(tp, sopt->sopt_dir,
|
|
|
|
|
sopt->sopt_name);
|
|
|
|
|
}
|
|
|
|
|
#endif
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WUNLOCK(inp);
|
1994-05-24 06:09:53 -04:00
|
|
|
break;
|
|
|
|
|
|
2001-02-02 13:48:25 -05:00
|
|
|
case TCP_NOPUSH:
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WUNLOCK(inp);
|
2001-02-02 13:48:25 -05:00
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
2008-01-18 07:19:50 -05:00
|
|
|
sizeof optval);
|
2001-02-02 13:48:25 -05:00
|
|
|
if (error)
|
2008-01-18 07:19:50 -05:00
|
|
|
return (error);
|
2001-02-02 13:48:25 -05:00
|
|
|
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK_RECHECK(inp);
|
2001-02-02 13:48:25 -05:00
|
|
|
if (optval)
|
|
|
|
|
tp->t_flags |= TF_NOPUSH;
|
2011-02-04 09:13:15 -05:00
|
|
|
else if (tp->t_flags & TF_NOPUSH) {
|
2001-02-02 13:48:25 -05:00
|
|
|
tp->t_flags &= ~TF_NOPUSH;
|
2020-01-22 00:53:16 -05:00
|
|
|
if (TCPS_HAVEESTABLISHED(tp->t_state)) {
|
|
|
|
|
struct epoch_tracker et;
|
|
|
|
|
|
|
|
|
|
NET_EPOCH_ENTER(et);
|
2021-12-26 11:48:19 -05:00
|
|
|
error = tcp_output_nodrop(tp);
|
2020-01-22 00:53:16 -05:00
|
|
|
NET_EPOCH_EXIT(et);
|
|
|
|
|
}
|
2001-02-02 13:48:25 -05:00
|
|
|
}
|
2012-06-19 03:34:13 -04:00
|
|
|
goto unlock_and_done;
|
2001-02-02 13:48:25 -05:00
|
|
|
|
2021-04-18 10:08:08 -04:00
|
|
|
case TCP_REMOTE_UDP_ENCAPS_PORT:
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
|
|
|
|
sizeof optval);
|
|
|
|
|
if (error)
|
|
|
|
|
return (error);
|
|
|
|
|
if ((optval < TCP_TUNNELING_PORT_MIN) ||
|
|
|
|
|
(optval > TCP_TUNNELING_PORT_MAX)) {
|
|
|
|
|
/* It's got to be in range */
|
|
|
|
|
return (EINVAL);
|
|
|
|
|
}
|
|
|
|
|
if ((V_tcp_udp_tunneling_port == 0) && (optval != 0)) {
|
|
|
|
|
/* You have to have enabled a UDP tunneling port first */
|
|
|
|
|
return (EINVAL);
|
|
|
|
|
}
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
|
if (tp->t_state != TCPS_CLOSED) {
|
|
|
|
|
/* You can't change after you are connected */
|
|
|
|
|
error = EINVAL;
|
|
|
|
|
} else {
|
|
|
|
|
/* OK, we are all good; set the port */
|
|
|
|
|
tp->t_port = htons(optval);
|
|
|
|
|
}
|
|
|
|
|
goto unlock_and_done;
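The sanity checks in the TCP_REMOTE_UDP_ENCAPS_PORT case above can be mirrored as a pure helper. The TCP_TUNNELING_PORT_MIN/MAX bounds and the global tunneling port are passed in as parameters, since their concrete values are not given here:

```c
/*
 * Mirror the kernel's TCP_REMOTE_UDP_ENCAPS_PORT checks: the value must
 * lie within [minport, maxport], and a nonzero value is only accepted
 * once a global UDP tunneling port has been configured. (The kernel
 * additionally requires the connection to still be in TCPS_CLOSED.)
 */
static int
udp_encaps_port_ok(int optval, int minport, int maxport, int global_port)
{
	if (optval < minport || optval > maxport)
		return (0);
	if (global_port == 0 && optval != 0)
		return (0);
	return (1);
}
```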
|
|
|
|
|
|
1994-05-24 06:09:53 -04:00
|
|
|
case TCP_MAXSEG:
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WUNLOCK(inp);
|
1998-08-22 23:07:17 -04:00
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
2008-01-18 07:19:50 -05:00
|
|
|
sizeof optval);
|
1998-08-22 23:07:17 -04:00
|
|
|
if (error)
|
2008-01-18 07:19:50 -05:00
|
|
|
return (error);
|
1994-05-24 06:09:53 -04:00
|
|
|
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WLOCK_RECHECK(inp);
|
2004-01-08 12:40:07 -05:00
|
|
|
if (optval > 0 && optval <= tp->t_maxseg &&
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 19:27:27 -04:00
|
|
|
optval + 40 >= V_tcp_minmss)
|
1998-08-22 23:07:17 -04:00
|
|
|
tp->t_maxseg = optval;
|
1995-02-09 18:13:27 -05:00
|
|
|
else
|
|
|
|
|
error = EINVAL;
|
2012-06-19 03:34:13 -04:00
|
|
|
goto unlock_and_done;
|
1995-02-09 18:13:27 -05:00
|
|
|
|
2004-11-26 13:58:46 -05:00
|
|
|
case TCP_INFO:
|
2008-04-17 17:38:18 -04:00
|
|
|
INP_WUNLOCK(inp);
|
2004-11-26 13:58:46 -05:00
|
|
|
error = EINVAL;
|
|
|
|
|
break;
|
|
|
|
|
|
2019-12-02 15:58:04 -05:00
|
|
|
case TCP_STATS:
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
#ifdef STATS
|
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
|
|
|
|
sizeof optval);
|
|
|
|
|
if (error)
|
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
|
|
if (optval > 0)
|
|
|
|
|
sbp = stats_blob_alloc(
|
|
|
|
|
V_tcp_perconn_stats_dflt_tpl, 0);
|
|
|
|
|
else
|
|
|
|
|
sbp = NULL;
|
|
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
|
if ((tp->t_stats != NULL && sbp == NULL) ||
|
|
|
|
|
(tp->t_stats == NULL && sbp != NULL)) {
|
|
|
|
|
struct statsblob *t = tp->t_stats;
|
|
|
|
|
tp->t_stats = sbp;
|
|
|
|
|
sbp = t;
|
|
|
|
|
}
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
|
|
|
|
|
stats_blob_destroy(sbp);
|
|
|
|
|
#else
|
|
|
|
|
return (EOPNOTSUPP);
|
|
|
|
|
#endif /* !STATS */
|
|
|
|
|
break;
|
|
|
|
|
|
2010-11-12 01:41:55 -05:00
|
|
|
case TCP_CONGESTION:
|
2022-02-21 06:30:17 -05:00
|
|
|
error = tcp_set_cc_mod(inp, sopt);
|
2016-01-21 17:53:12 -05:00
|
|
|
break;
|
2010-11-12 01:41:55 -05:00
|
|
|
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
it will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket option, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 17:04:46 -05:00
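A server sharing an SO_REUSEPORT_LB listen socket could apply the filter described above roughly as follows. The fallback option value is an assumption for the sketch; the spelling TCP_REUSPORT_LB_NUMA matches the code below, and FreeBSD defines the real value in <netinet/tcp.h>:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* FreeBSD-specific; fallback value is an assumption for portability. */
#ifndef TCP_REUSPORT_LB_NUMA
#define TCP_REUSPORT_LB_NUMA 1026
#endif

/*
 * Restrict an SO_REUSEPORT_LB listener so that it only receives
 * incoming connections associated with the given NUMA domain.
 */
static int
filter_listener_to_numa(int fd, int domain)
{
	return (setsockopt(fd, IPPROTO_TCP, TCP_REUSPORT_LB_NUMA,
	    &domain, sizeof(domain)));
}
```

A worker pinned to the cores of one domain would call this on its copy of the listen socket before accept(2)-ing connections.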
|
|
|
case TCP_REUSPORT_LB_NUMA:
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof(optval),
|
|
|
|
|
sizeof(optval));
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
|
if (!error)
|
|
|
|
|
error = in_pcblbgroup_numa(inp, optval);
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
break;
|
|
|
|
|
|
Add kernel-side support for in-kernel TLS.
2019-08-26 20:01:56 -04:00
|
|
|
#ifdef KERN_TLS
|
|
|
|
|
case TCP_TXTLS_ENABLE:
|
|
|
|
|
INP_WUNLOCK(inp);
|
2020-04-27 18:31:42 -04:00
|
|
|
error = copyin_tls_enable(sopt, &tls);
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid disabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default), and unmapped mbufs must also be
enabled via kern.ipc.mb_use_ext_pgs before KTLS can be used.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-26 20:01:56 -04:00
|
|
|
if (error)
|
|
|
|
|
break;
|
2022-02-08 12:49:44 -05:00
|
|
|
error = ktls_enable_tx(so, &tls);
|
2019-08-26 20:01:56 -04:00
|
|
|
break;
|
|
|
|
|
case TCP_TXTLS_MODE:
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
error = sooptcopyin(sopt, &ui, sizeof(ui), sizeof(ui));
|
|
|
|
|
if (error)
|
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
2022-02-08 12:49:44 -05:00
|
|
|
error = ktls_set_tx_mode(so, ui);
|
2019-08-26 20:01:56 -04:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
break;
|
2020-04-27 19:17:19 -04:00
|
|
|
case TCP_RXTLS_ENABLE:
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
error = sooptcopyin(sopt, &tls, sizeof(tls),
|
|
|
|
|
sizeof(tls));
|
|
|
|
|
if (error)
|
|
|
|
|
break;
|
2022-02-08 12:49:44 -05:00
|
|
|
error = ktls_enable_rx(so, &tls);
|
2020-04-27 19:17:19 -04:00
|
|
|
break;
|
2019-08-26 20:01:56 -04:00
|
|
|
#endif
|
2022-09-27 13:38:20 -04:00
|
|
|
case TCP_MAXUNACKTIME:
|
2012-02-05 11:53:02 -05:00
|
|
|
case TCP_KEEPIDLE:
|
|
|
|
|
case TCP_KEEPINTVL:
|
|
|
|
|
case TCP_KEEPINIT:
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
error = sooptcopyin(sopt, &ui, sizeof(ui), sizeof(ui));
|
|
|
|
|
if (error)
|
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
|
|
if (ui > (UINT_MAX / hz)) {
|
|
|
|
|
error = EINVAL;
|
|
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
ui *= hz;
|
|
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
|
switch (sopt->sopt_name) {
|
2022-09-27 13:38:20 -04:00
|
|
|
case TCP_MAXUNACKTIME:
|
|
|
|
|
tp->t_maxunacktime = ui;
|
|
|
|
|
break;
|
|
|
|
|
|
2012-02-05 11:53:02 -05:00
|
|
|
case TCP_KEEPIDLE:
|
|
|
|
|
tp->t_keepidle = ui;
|
|
|
|
|
/*
|
|
|
|
|
* XXX: better check current remaining
|
|
|
|
|
* timeout and "merge" it with new value.
|
|
|
|
|
*/
|
|
|
|
|
if ((tp->t_state > TCPS_LISTEN) &&
|
|
|
|
|
(tp->t_state <= TCPS_CLOSING))
|
|
|
|
|
tcp_timer_activate(tp, TT_KEEP,
|
|
|
|
|
TP_KEEPIDLE(tp));
|
|
|
|
|
break;
|
|
|
|
|
case TCP_KEEPINTVL:
|
|
|
|
|
tp->t_keepintvl = ui;
|
|
|
|
|
if ((tp->t_state == TCPS_FIN_WAIT_2) &&
|
|
|
|
|
(TP_MAXIDLE(tp) > 0))
|
|
|
|
|
tcp_timer_activate(tp, TT_2MSL,
|
|
|
|
|
TP_MAXIDLE(tp));
|
|
|
|
|
break;
|
|
|
|
|
case TCP_KEEPINIT:
|
|
|
|
|
tp->t_keepinit = ui;
|
|
|
|
|
if (tp->t_state == TCPS_SYN_RECEIVED ||
|
|
|
|
|
tp->t_state == TCPS_SYN_SENT)
|
|
|
|
|
tcp_timer_activate(tp, TT_KEEP,
|
|
|
|
|
TP_KEEPINIT(tp));
|
|
|
|
|
break;
|
|
|
|
|
}
|
2012-06-19 03:34:13 -04:00
|
|
|
goto unlock_and_done;
|
2012-02-05 11:53:02 -05:00
|
|
|
|
2012-09-27 03:13:21 -04:00
|
|
|
case TCP_KEEPCNT:
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
error = sooptcopyin(sopt, &ui, sizeof(ui), sizeof(ui));
|
|
|
|
|
if (error)
|
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
|
tp->t_keepcnt = ui;
|
|
|
|
|
if ((tp->t_state == TCPS_FIN_WAIT_2) &&
|
|
|
|
|
(TP_MAXIDLE(tp) > 0))
|
|
|
|
|
tcp_timer_activate(tp, TT_2MSL,
|
|
|
|
|
TP_MAXIDLE(tp));
|
|
|
|
|
goto unlock_and_done;
|
|
|
|
|
|
There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.
It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.
To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).
There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.
I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.
The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.
Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren
2015-10-13 20:35:37 -04:00
|
|
|
#ifdef TCPPCAP
|
|
|
|
|
case TCP_PCAP_OUT:
|
|
|
|
|
case TCP_PCAP_IN:
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
|
|
|
|
sizeof optval);
|
|
|
|
|
if (error)
|
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
|
if (optval >= 0)
|
2023-02-28 13:57:30 -05:00
|
|
|
tcp_pcap_set_sock_max(
|
|
|
|
|
(sopt->sopt_name == TCP_PCAP_OUT) ?
|
2015-10-13 20:35:37 -04:00
|
|
|
&(tp->t_outpkts) : &(tp->t_inpkts),
|
|
|
|
|
optval);
|
|
|
|
|
else
|
|
|
|
|
error = EINVAL;
|
|
|
|
|
goto unlock_and_done;
|
|
|
|
|
#endif
|
|
|
|
|
|
2018-02-25 21:53:22 -05:00
|
|
|
case TCP_FASTOPEN: {
|
|
|
|
|
struct tcp_fastopen tfo_optval;
|
|
|
|
|
|
2015-12-24 14:09:48 -05:00
|
|
|
INP_WUNLOCK(inp);
|
2018-02-25 21:53:22 -05:00
|
|
|
if (!V_tcp_fastopen_client_enable &&
|
|
|
|
|
!V_tcp_fastopen_server_enable)
|
2015-12-24 14:09:48 -05:00
|
|
|
return (EPERM);
|
|
|
|
|
|
2018-02-25 21:53:22 -05:00
|
|
|
error = sooptcopyin(sopt, &tfo_optval,
|
|
|
|
|
sizeof(tfo_optval), sizeof(int));
|
2015-12-24 14:09:48 -05:00
|
|
|
if (error)
|
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
2020-06-03 09:51:53 -04:00
|
|
|
if ((tp->t_state != TCPS_CLOSED) &&
|
|
|
|
|
(tp->t_state != TCPS_LISTEN)) {
|
|
|
|
|
error = EINVAL;
|
|
|
|
|
goto unlock_and_done;
|
|
|
|
|
}
|
2018-02-25 21:53:22 -05:00
|
|
|
if (tfo_optval.enable) {
|
|
|
|
|
if (tp->t_state == TCPS_LISTEN) {
|
|
|
|
|
if (!V_tcp_fastopen_server_enable) {
|
|
|
|
|
error = EPERM;
|
|
|
|
|
goto unlock_and_done;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
if (tp->t_tfo_pending == NULL)
|
|
|
|
|
tp->t_tfo_pending =
|
|
|
|
|
tcp_fastopen_alloc_counter();
|
|
|
|
|
} else {
|
|
|
|
|
/*
|
|
|
|
|
* If a pre-shared key was provided,
|
|
|
|
|
* stash it in the client cookie
|
|
|
|
|
* field of the tcpcb for use during
|
|
|
|
|
* connect.
|
|
|
|
|
*/
|
|
|
|
|
if (sopt->sopt_valsize ==
|
|
|
|
|
sizeof(tfo_optval)) {
|
|
|
|
|
memcpy(tp->t_tfo_cookie.client,
|
|
|
|
|
tfo_optval.psk,
|
|
|
|
|
TCP_FASTOPEN_PSK_LEN);
|
|
|
|
|
tp->t_tfo_client_cookie_len =
|
|
|
|
|
TCP_FASTOPEN_PSK_LEN;
|
|
|
|
|
}
|
|
|
|
|
}
|
2020-06-03 09:51:53 -04:00
|
|
|
tp->t_flags |= TF_FASTOPEN;
|
2015-12-24 14:09:48 -05:00
|
|
|
} else
|
|
|
|
|
tp->t_flags &= ~TF_FASTOPEN;
|
|
|
|
|
goto unlock_and_done;
|
2018-02-25 21:53:22 -05:00
|
|
|
}
|
2015-12-24 14:09:48 -05:00
|
|
|
|
2018-03-24 08:48:10 -04:00
|
|
|
#ifdef TCP_BLACKBOX
|
2018-03-22 05:40:08 -04:00
|
|
|
case TCP_LOG:
|
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
|
|
|
|
			    sizeof optval);
			if (error)
				return (error);

			INP_WLOCK_RECHECK(inp);
			error = tcp_log_state_change(tp, optval);
			goto unlock_and_done;

		case TCP_LOGBUF:
			INP_WUNLOCK(inp);
			error = EINVAL;
			break;

		case TCP_LOGID:
			INP_WUNLOCK(inp);
			error = sooptcopyin(sopt, buf, TCP_LOG_ID_LEN - 1, 0);
			if (error)
				break;
			buf[sopt->sopt_valsize] = '\0';
			INP_WLOCK_RECHECK(inp);
			error = tcp_log_set_id(tp, buf);
			/* tcp_log_set_id() unlocks the INP. */
			break;

		case TCP_LOGDUMP:
		case TCP_LOGDUMPID:
			INP_WUNLOCK(inp);
			error =
			    sooptcopyin(sopt, buf, TCP_LOG_REASON_LEN - 1, 0);
			if (error)
				break;
			buf[sopt->sopt_valsize] = '\0';
			INP_WLOCK_RECHECK(inp);
			if (sopt->sopt_name == TCP_LOGDUMP) {
				error = tcp_log_dump_tp_logbuf(tp, buf,
				    M_WAITOK, true);
				INP_WUNLOCK(inp);
			} else {
				tcp_log_dump_tp_bucket_logbufs(tp, buf);
				/*
				 * tcp_log_dump_tp_bucket_logbufs() drops the
				 * INP lock.
				 */
			}
			break;
#endif

		default:
			INP_WUNLOCK(inp);
			error = ENOPROTOOPT;
			break;
		}
		break;

	case SOPT_GET:
		tp = intotcpcb(inp);
		switch (sopt->sopt_name) {
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
		case TCP_MD5SIG:
			INP_WUNLOCK(inp);
			if (!TCPMD5_ENABLED())
				return (ENOPROTOOPT);
			error = TCPMD5_PCBCTL(inp, sopt);
Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
			break;
#endif

		case TCP_NODELAY:
			optval = tp->t_flags & TF_NODELAY;
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, &optval, sizeof optval);
			break;
		case TCP_MAXSEG:
			optval = tp->t_maxseg;
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, &optval, sizeof optval);
			break;
		case TCP_REMOTE_UDP_ENCAPS_PORT:
			optval = ntohs(tp->t_port);
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, &optval, sizeof optval);
			break;
		case TCP_NOOPT:
			optval = tp->t_flags & TF_NOOPT;
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, &optval, sizeof optval);
			break;
		case TCP_NOPUSH:
			optval = tp->t_flags & TF_NOPUSH;
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, &optval, sizeof optval);
			break;
		case TCP_INFO:
			tcp_fill_info(tp, &ti);
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, &ti, sizeof ti);
			break;
		case TCP_STATS:
			{
#ifdef STATS
			int nheld;
			TYPEOF_MEMBER(struct statsblob, flags) sbflags = 0;

			error = 0;
			socklen_t outsbsz = sopt->sopt_valsize;
			if (tp->t_stats == NULL)
				error = ENOENT;
			else if (outsbsz >= tp->t_stats->cursz)
				outsbsz = tp->t_stats->cursz;
			else if (outsbsz >= sizeof(struct statsblob))
				outsbsz = sizeof(struct statsblob);
			else
				error = EINVAL;
			INP_WUNLOCK(inp);
			if (error)
				break;

			sbp = sopt->sopt_val;
			nheld = atop(round_page(((vm_offset_t)sbp) +
			    (vm_size_t)outsbsz) - trunc_page((vm_offset_t)sbp));
			vm_page_t ma[nheld];
			if (vm_fault_quick_hold_pages(
			    &curproc->p_vmspace->vm_map, (vm_offset_t)sbp,
			    outsbsz, VM_PROT_READ | VM_PROT_WRITE, ma,
			    nheld) < 0) {
				error = EFAULT;
				break;
			}

			if ((error = copyin_nofault(&(sbp->flags), &sbflags,
			    SIZEOF_MEMBER(struct statsblob, flags))))
				goto unhold;

			INP_WLOCK_RECHECK(inp);
			error = stats_blob_snapshot(&sbp, outsbsz, tp->t_stats,
			    sbflags | SB_CLONE_USRDSTNOFAULT);
			INP_WUNLOCK(inp);
			sopt->sopt_valsize = outsbsz;
unhold:
			vm_page_unhold_pages(ma, nheld);
#else
			INP_WUNLOCK(inp);
			error = EOPNOTSUPP;
#endif /* !STATS */
			break;
			}
		case TCP_CONGESTION:
			len = strlcpy(buf, CC_ALGO(tp)->name, TCP_CA_NAME_MAX);
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, buf, len + 1);
			break;
		case TCP_MAXUNACKTIME:
		case TCP_KEEPIDLE:
		case TCP_KEEPINTVL:
		case TCP_KEEPINIT:
		case TCP_KEEPCNT:
			switch (sopt->sopt_name) {
			case TCP_MAXUNACKTIME:
				ui = TP_MAXUNACKTIME(tp) / hz;
				break;
			case TCP_KEEPIDLE:
				ui = TP_KEEPIDLE(tp) / hz;
				break;
			case TCP_KEEPINTVL:
				ui = TP_KEEPINTVL(tp) / hz;
				break;
			case TCP_KEEPINIT:
				ui = TP_KEEPINIT(tp) / hz;
				break;
			case TCP_KEEPCNT:
				ui = TP_KEEPCNT(tp);
				break;
			}
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, &ui, sizeof(ui));
			break;
There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.
It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.
To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).
There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.
I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.
The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.
Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren
#ifdef TCPPCAP
		case TCP_PCAP_OUT:
		case TCP_PCAP_IN:
			optval = tcp_pcap_get_sock_max(
			    (sopt->sopt_name == TCP_PCAP_OUT) ?
			    &(tp->t_outpkts) : &(tp->t_inpkts));
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, &optval, sizeof optval);
			break;
#endif

		case TCP_FASTOPEN:
			optval = tp->t_flags & TF_FASTOPEN;
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, &optval, sizeof optval);
			break;
#ifdef TCP_BLACKBOX
		case TCP_LOG:
			optval = tcp_get_bblog_state(tp);
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, &optval, sizeof(optval));
			break;
		case TCP_LOGBUF:
			/* tcp_log_getlogbuf() does INP_WUNLOCK(inp) */
			error = tcp_log_getlogbuf(sopt, tp);
			break;
		case TCP_LOGID:
			len = tcp_log_get_id(tp, buf);
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, buf, len + 1);
			break;
		case TCP_LOGDUMP:
		case TCP_LOGDUMPID:
			INP_WUNLOCK(inp);
			error = EINVAL;
			break;
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of sendfile,
sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid disabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
#endif
#ifdef KERN_TLS
		case TCP_TXTLS_MODE:
			error = ktls_get_tx_mode(so, &optval);
			INP_WUNLOCK(inp);
			if (error == 0)
				error = sooptcopyout(sopt, &optval,
				    sizeof(optval));
			break;
		case TCP_RXTLS_MODE:
			error = ktls_get_rx_mode(so, &optval);
			INP_WUNLOCK(inp);
			if (error == 0)
				error = sooptcopyout(sopt, &optval,
				    sizeof(optval));
			break;
#endif
		case TCP_LRD:
			optval = tp->t_flags & TF_LRD;
			INP_WUNLOCK(inp);
			error = sooptcopyout(sopt, &optval, sizeof optval);
			break;
		default:
			INP_WUNLOCK(inp);
			error = ENOPROTOOPT;
			break;
		}
		break;
	}
	return (error);
}
#undef INP_WLOCK_RECHECK
#undef INP_WLOCK_RECHECK_CLEANUP

/*
 * Initiate (or continue) disconnect.
 * If embryonic state, just send reset (once).
 * If in ``let data drain'' option and linger null, just drop.
 * Otherwise (hard), mark socket disconnecting and drop
 * current input data; switch states based on user close, and
 * send segment to peer (with FIN).
 */
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 11:36:36 -05:00
static void
tcp_disconnect(struct tcpcb *tp)
{
	struct inpcb *inp = tptoinpcb(tp);
	struct socket *so = tptosocket(tp);

	NET_EPOCH_ASSERT();
	INP_WLOCK_ASSERT(inp);

	/*
	 * Neither tcp_close() nor tcp_drop() should return NULL, as the
	 * socket is still open.
	 */
Fix some TCP fast open issues.
The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled calls accept(),
recv(), send(), and close() before the TCP-ACK segment has been received,
the TCP connection is just dropped and the reception of the TCP-ACK
segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled calls accept(), recv(),
send(), send(), and close() before the TCP-ACK segment has been received,
the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed
by close() the TCP connection is just dropped.
Reviewed by: jtl@, kbowling@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16485
2018-07-30 16:35:50 -04:00
	if (tp->t_state < TCPS_ESTABLISHED &&
	    !(tp->t_state > TCPS_LISTEN && IS_FASTOPEN(tp->t_flags))) {
		tp = tcp_close(tp);
		KASSERT(tp != NULL,
		    ("tcp_disconnect: tcp_close() returned NULL"));
	} else if ((so->so_options & SO_LINGER) && so->so_linger == 0) {
		tp = tcp_drop(tp, 0);
		KASSERT(tp != NULL,
		    ("tcp_disconnect: tcp_drop() returned NULL"));
	} else {
		soisdisconnecting(so);
		sbflush(&so->so_rcv);
		tcp_usrclosed(tp);
		if (!(inp->inp_flags & INP_DROPPED))
			/* Ignore stack's drop request, we are already at it. */
			(void)tcp_output_nodrop(tp);
	}
}

/*
 * User issued close, and wish to trail through shutdown states:
 * if never received SYN, just forget it.  If got a SYN from peer,
 * but haven't sent FIN, then go to FIN_WAIT_1 state to send peer a FIN.
 * If already got a FIN from peer, then almost done; go to LAST_ACK
 * state.  In all other cases, have already sent FIN to peer (e.g.
 * after PRU_SHUTDOWN), and just have to play tedious game waiting
 * for peer to send FIN or not respond to keep-alives, etc.
 * We can let the user exit from the close as soon as the FIN is acked.
 */
static void
tcp_usrclosed(struct tcpcb *tp)
{

	NET_EPOCH_ASSERT();
	INP_WLOCK_ASSERT(tptoinpcb(tp));

|
|
|
switch (tp->t_state) {
|
|
|
|
|
case TCPS_LISTEN:
|
2012-06-19 03:34:13 -04:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
|
tcp_offload_listen_stop(tp);
|
|
|
|
|
#endif
|
2015-09-15 16:04:30 -04:00
|
|
|
tcp_state_change(tp, TCPS_CLOSED);
|
2007-12-18 17:59:07 -05:00
|
|
|
/* FALLTHROUGH */
|
|
|
|
|
case TCPS_CLOSED:
|
1994-05-24 06:09:53 -04:00
|
|
|
tp = tcp_close(tp);
|
		/*
		 * tcp_close() should never return NULL here as the socket is
		 * still open.
		 */
		KASSERT(tp != NULL,
		    ("tcp_usrclosed: tcp_close() returned NULL"));
		break;

	case TCPS_SYN_SENT:
	case TCPS_SYN_RECEIVED:
		tp->t_flags |= TF_NEEDFIN;
		break;

	case TCPS_ESTABLISHED:
		tcp_state_change(tp, TCPS_FIN_WAIT_1);
		break;

	case TCPS_CLOSE_WAIT:
		tcp_state_change(tp, TCPS_LAST_ACK);
		break;
	}
	if (tp->t_acktime == 0)
		tp->t_acktime = ticks;
	if (tp->t_state >= TCPS_FIN_WAIT_2) {
		soisdisconnected(tptosocket(tp));
		/* Prevent the connection hanging in FIN_WAIT_2 forever. */
		if (tp->t_state == TCPS_FIN_WAIT_2) {
			int timeout;

			timeout = (tcp_fast_finwait2_recycle) ?
			    tcp_finwait2_timeout : TP_MAXIDLE(tp);
			tcp_timer_activate(tp, TT_2MSL, timeout);
		}
	}
}

#ifdef DDB
static void
db_print_indent(int indent)
{
	int i;

	for (i = 0; i < indent; i++)
		db_printf(" ");
}

static void
db_print_tstate(int t_state)
{

	switch (t_state) {
	case TCPS_CLOSED:
		db_printf("TCPS_CLOSED");
		return;

	case TCPS_LISTEN:
		db_printf("TCPS_LISTEN");
		return;

	case TCPS_SYN_SENT:
		db_printf("TCPS_SYN_SENT");
		return;

	case TCPS_SYN_RECEIVED:
		db_printf("TCPS_SYN_RECEIVED");
		return;

	case TCPS_ESTABLISHED:
		db_printf("TCPS_ESTABLISHED");
		return;

	case TCPS_CLOSE_WAIT:
		db_printf("TCPS_CLOSE_WAIT");
		return;

	case TCPS_FIN_WAIT_1:
		db_printf("TCPS_FIN_WAIT_1");
		return;

	case TCPS_CLOSING:
		db_printf("TCPS_CLOSING");
		return;

	case TCPS_LAST_ACK:
		db_printf("TCPS_LAST_ACK");
		return;

	case TCPS_FIN_WAIT_2:
		db_printf("TCPS_FIN_WAIT_2");
		return;

	case TCPS_TIME_WAIT:
		db_printf("TCPS_TIME_WAIT");
		return;

	default:
		db_printf("unknown");
		return;
	}
}
|
|
|
|
|
|
|
|
|
|
static void
|
|
|
|
|
db_print_tflags(u_int t_flags)
|
|
|
|
|
{
|
|
|
|
|
int comma;
|
|
|
|
|
|
|
|
|
|
comma = 0;
|
|
|
|
|
if (t_flags & TF_ACKNOW) {
|
|
|
|
|
db_printf("%sTF_ACKNOW", comma ? ", " : "");
|
|
|
|
|
comma = 1;
|
|
|
|
|
}
|
|
|
|
|
if (t_flags & TF_DELACK) {
|
|
|
|
|
db_printf("%sTF_DELACK", comma ? ", " : "");
|
|
|
|
|
comma = 1;
|
|
|
|
|
}
|
|
|
|
|
if (t_flags & TF_NODELAY) {
|
|
|
|
|
db_printf("%sTF_NODELAY", comma ? ", " : "");
|
|
|
|
|
comma = 1;
|
|
|
|
|
}
|
|
|
|
|
if (t_flags & TF_NOOPT) {
|
|
|
|
|
db_printf("%sTF_NOOPT", comma ? ", " : "");
|
|
|
|
|
comma = 1;
|
|
|
|
|
}
|
|
|
|
|
if (t_flags & TF_SENTFIN) {
|
|
|
|
|
db_printf("%sTF_SENTFIN", comma ? ", " : "");
|
|
|
|
|
comma = 1;
|
|
|
|
|
}
|
|
|
|
|
if (t_flags & TF_REQ_SCALE) {
|
|
|
|
|
db_printf("%sTF_REQ_SCALE", comma ? ", " : "");
|
|
|
|
|
comma = 1;
|
|
|
|
|
}
|
|
|
|
|
if (t_flags & TF_RCVD_SCALE) {
|
|
|
|
|
db_printf("%sTF_RECVD_SCALE", comma ? ", " : "");
|
|
|
|
|
comma = 1;
|
|
|
|
|
}
|
|
|
|
|
	if (t_flags & TF_REQ_TSTMP) {
		db_printf("%sTF_REQ_TSTMP", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_RCVD_TSTMP) {
		db_printf("%sTF_RCVD_TSTMP", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_SACK_PERMIT) {
		db_printf("%sTF_SACK_PERMIT", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_NEEDSYN) {
		db_printf("%sTF_NEEDSYN", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_NEEDFIN) {
		db_printf("%sTF_NEEDFIN", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_NOPUSH) {
		db_printf("%sTF_NOPUSH", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_PREVVALID) {
		db_printf("%sTF_PREVVALID", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_MORETOCOME) {
		db_printf("%sTF_MORETOCOME", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_SONOTCONN) {
		db_printf("%sTF_SONOTCONN", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_LASTIDLE) {
		db_printf("%sTF_LASTIDLE", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_RXWIN0SENT) {
		db_printf("%sTF_RXWIN0SENT", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_FASTRECOVERY) {
		db_printf("%sTF_FASTRECOVERY", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_CONGRECOVERY) {
		db_printf("%sTF_CONGRECOVERY", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_WASFRECOVERY) {
		db_printf("%sTF_WASFRECOVERY", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_WASCRECOVERY) {
		db_printf("%sTF_WASCRECOVERY", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_SIGNATURE) {
		db_printf("%sTF_SIGNATURE", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_FORCEDATA) {
		db_printf("%sTF_FORCEDATA", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_TSO) {
		db_printf("%sTF_TSO", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags & TF_FASTOPEN) {
		db_printf("%sTF_FASTOPEN", comma ? ", " : "");
		comma = 1;
	}
}
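As a hypothetical userland analogue (not part of this kernel file), the comma-separated flag-name pattern used by the printer above can be sketched with a lookup table and a buffer instead of `db_printf()`; the `FLAG_*` names and `describe_flags()` are stand-ins for illustration only:

```c
/*
 * Hypothetical userland sketch of the flag-printing pattern above:
 * for each set bit, append its name, prefixed by ", " for every
 * name after the first.
 */
#include <stdio.h>
#include <string.h>

#define FLAG_SENTFIN	0x01	/* stand-ins for the TF_* bits */
#define FLAG_NOPUSH	0x02
#define FLAG_TSO	0x04

const char *
describe_flags(char *buf, size_t len, unsigned flags)
{
	static const struct {
		unsigned bit;
		const char *name;
	} tab[] = {
		{ FLAG_SENTFIN, "FLAG_SENTFIN" },
		{ FLAG_NOPUSH, "FLAG_NOPUSH" },
		{ FLAG_TSO, "FLAG_TSO" },
	};
	size_t off = 0;
	int comma = 0;

	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(tab) / sizeof(tab[0]); i++) {
		if ((flags & tab[i].bit) && off < len) {
			/* ", " before every name except the first. */
			off += snprintf(buf + off, len - off, "%s%s",
			    comma ? ", " : "", tab[i].name);
			comma = 1;
		}
	}
	return (buf);
}
```

Replacing the table with the `TF_*` bits and the buffer writes with `db_printf()` recovers the shape of the kernel code.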

static void
db_print_tflags2(u_int t_flags2)
{
	int comma;

	comma = 0;
	if (t_flags2 & TF2_PLPMTU_BLACKHOLE) {
		db_printf("%sTF2_PLPMTU_BLACKHOLE", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags2 & TF2_PLPMTU_PMTUD) {
		db_printf("%sTF2_PLPMTU_PMTUD", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags2 & TF2_PLPMTU_MAXSEGSNT) {
		db_printf("%sTF2_PLPMTU_MAXSEGSNT", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags2 & TF2_LOG_AUTO) {
		db_printf("%sTF2_LOG_AUTO", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags2 & TF2_DROP_AF_DATA) {
		db_printf("%sTF2_DROP_AF_DATA", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags2 & TF2_ECN_PERMIT) {
		db_printf("%sTF2_ECN_PERMIT", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags2 & TF2_ECN_SND_CWR) {
		db_printf("%sTF2_ECN_SND_CWR", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags2 & TF2_ECN_SND_ECE) {
		db_printf("%sTF2_ECN_SND_ECE", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags2 & TF2_ACE_PERMIT) {
		db_printf("%sTF2_ACE_PERMIT", comma ? ", " : "");
		comma = 1;
	}
	if (t_flags2 & TF2_FBYTES_COMPLETE) {
		db_printf("%sTF2_FBYTES_COMPLETE", comma ? ", " : "");
		comma = 1;
	}
}

static void
db_print_toobflags(char t_oobflags)
{
	int comma;

	comma = 0;
	if (t_oobflags & TCPOOB_HAVEDATA) {
		db_printf("%sTCPOOB_HAVEDATA", comma ? ", " : "");
		comma = 1;
	}
	if (t_oobflags & TCPOOB_HADDATA) {
		db_printf("%sTCPOOB_HADDATA", comma ? ", " : "");
		comma = 1;
	}
}

static void
db_print_tcpcb(struct tcpcb *tp, const char *name, int indent)
{

	db_print_indent(indent);
	db_printf("%s at %p\n", name, tp);

	indent += 2;

	db_print_indent(indent);
	db_printf("t_segq first: %p t_segqlen: %d t_dupacks: %d\n",
	    TAILQ_FIRST(&tp->t_segq), tp->t_segqlen, tp->t_dupacks);

	db_print_indent(indent);
	db_printf("t_callout: %p t_timers: %p\n",
	    &tp->t_callout, &tp->t_timers);

	db_print_indent(indent);
	db_printf("t_state: %d (", tp->t_state);
	db_print_tstate(tp->t_state);
	db_printf(")\n");

	db_print_indent(indent);
	db_printf("t_flags: 0x%x (", tp->t_flags);
	db_print_tflags(tp->t_flags);
	db_printf(")\n");

	db_print_indent(indent);
	db_printf("t_flags2: 0x%x (", tp->t_flags2);
	db_print_tflags2(tp->t_flags2);
	db_printf(")\n");

	db_print_indent(indent);
	db_printf("snd_una: 0x%08x snd_max: 0x%08x snd_nxt: 0x%08x\n",
	    tp->snd_una, tp->snd_max, tp->snd_nxt);

	db_print_indent(indent);
	db_printf("snd_up: 0x%08x snd_wl1: 0x%08x snd_wl2: 0x%08x\n",
	    tp->snd_up, tp->snd_wl1, tp->snd_wl2);

	db_print_indent(indent);
	db_printf("iss: 0x%08x irs: 0x%08x rcv_nxt: 0x%08x\n",
	    tp->iss, tp->irs, tp->rcv_nxt);

	db_print_indent(indent);
	db_printf("rcv_adv: 0x%08x rcv_wnd: %u rcv_up: 0x%08x\n",
	    tp->rcv_adv, tp->rcv_wnd, tp->rcv_up);

	db_print_indent(indent);
	db_printf("snd_wnd: %u snd_cwnd: %u\n",
	    tp->snd_wnd, tp->snd_cwnd);

	db_print_indent(indent);
	db_printf("snd_ssthresh: %u snd_recover: "
	    "0x%08x\n", tp->snd_ssthresh, tp->snd_recover);

	db_print_indent(indent);
	db_printf("t_rcvtime: %u t_starttime: %u\n",
	    tp->t_rcvtime, tp->t_starttime);

	db_print_indent(indent);
	db_printf("t_rtttime: %u t_rtseq: 0x%08x\n",
	    tp->t_rtttime, tp->t_rtseq);

	db_print_indent(indent);
	db_printf("t_rxtcur: %d t_maxseg: %u t_srtt: %d\n",
	    tp->t_rxtcur, tp->t_maxseg, tp->t_srtt);

	db_print_indent(indent);
	db_printf("t_rttvar: %d t_rxtshift: %d t_rttmin: %u\n",
	    tp->t_rttvar, tp->t_rxtshift, tp->t_rttmin);

	db_print_indent(indent);
	db_printf("t_rttupdated: %u max_sndwnd: %u t_softerror: %d\n",
	    tp->t_rttupdated, tp->max_sndwnd, tp->t_softerror);

	db_print_indent(indent);
	db_printf("t_oobflags: 0x%x (", tp->t_oobflags);
	db_print_toobflags(tp->t_oobflags);
	db_printf(") t_iobc: 0x%02x\n", tp->t_iobc);

	db_print_indent(indent);
	db_printf("snd_scale: %u rcv_scale: %u request_r_scale: %u\n",
	    tp->snd_scale, tp->rcv_scale, tp->request_r_scale);

	db_print_indent(indent);
	db_printf("ts_recent: %u ts_recent_age: %u\n",
	    tp->ts_recent, tp->ts_recent_age);

	db_print_indent(indent);
	db_printf("ts_offset: %u last_ack_sent: 0x%08x snd_cwnd_prev: "
	    "%u\n", tp->ts_offset, tp->last_ack_sent, tp->snd_cwnd_prev);

	db_print_indent(indent);
	db_printf("snd_ssthresh_prev: %u snd_recover_prev: 0x%08x "
	    "t_badrxtwin: %u\n", tp->snd_ssthresh_prev,
	    tp->snd_recover_prev, tp->t_badrxtwin);

	db_print_indent(indent);
	db_printf("snd_numholes: %d snd_holes first: %p\n",
	    tp->snd_numholes, TAILQ_FIRST(&tp->snd_holes));

	db_print_indent(indent);
	db_printf("snd_fack: 0x%08x rcv_numsacks: %d\n",
	    tp->snd_fack, tp->rcv_numsacks);

	/* Skip sackblks, sackhint. */

	db_print_indent(indent);
	db_printf("t_rttlow: %d rfbuf_ts: %u rfbuf_cnt: %d\n",
	    tp->t_rttlow, tp->rfbuf_ts, tp->rfbuf_cnt);
}

DB_SHOW_COMMAND(tcpcb, db_show_tcpcb)
{
	struct tcpcb *tp;

	if (!have_addr) {
		db_printf("usage: show tcpcb <addr>\n");
		return;
	}
	tp = (struct tcpcb *)addr;

	db_print_tcpcb(tp, "tcpcb", 0);
}
#endif
|