When this socket option is enabled, relatively large contiguous
buffers are allocated and used to receive data from the remote
connection. When data is received a wrapper M_EXT mbuf is queued to
the socket's receive buffer. This reduces the length of the linked
list of received mbufs and allows consumers to consume receive data in
larger chunks.
To minimize reprogramming the page pods in the adapter, receive
buffers for a given connection are recycled. When a buffer has been
fully consumed by the receiver and freed, the buffer is placed on a
per-connection free buffers list.
The size of the receive buffers defaults to 256k and can be set via
the hw.cxgbe.toe.ddp_rcvbuf_len sysctl. The
hw.cxgbe.toe.ddp_rcvbuf_cache sysctl (defaults to 4) determines the
maximum number of free buffers cached per connection. Note that this
limit does not apply to "in-flight" receive buffers that are
associated with mbufs in the socket's receive buffer.
Co-authored-by: Navdeep Parhar <np@FreeBSD.org>
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44001
(cherry picked from commit eba13bbc37)
Use a separate state for when a request to set RX_QUIESCE has been
sent but the resulting TCB reply has not been received. In
particular, this correctly handles the case where data has been
received and queued in the receive queue before the quiesce request
takes effect.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44435
(cherry picked from commit 9978c6289d)
Most ULP modes in cxgbe's TOE are enabled on the fly when a protocol
is needed (e.g. ULP_MODE_ISCSI is enabled by cxgbei when offloading a
connection using iSCSI, and ULP_MODE_TLS is enabled when RX TLS keys
are programmed for a TOE connection). The one exception to this is
ULP_MODE_TCPDDP.
Currently the cxgbe driver enables ULP_MODE_TCPDDP when a TOE
connection is first created. However, since DDP connections cannot be
converted to other connection types, this requires some special
handling in the driver. For example, iSCSI daemons use the SO_NO_DDP
socket option to ensure TOE connections use ULP_MODE_NONE so they can
be converted to ULP_MODE_ISCSI. Similarly, using TLS receive offload
(ULP_MODE_TLS) requires disabling TCP DDP for new connections by
default.
This commit changes cxgbe to instead switch a connection from
ULP_MODE_NONE to ULP_MODE_TCPDDP when a connection first attempts to
use TCP DDP via aio_read(2). This permits connections to always start
as ULP_MODE_NONE and switch to a protocol-specific mode as needed.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43670
(cherry picked from commit a5a965d759)
Similar to dcfddc8dc0, replace the
simpler, inlined version with the full version.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D41690
(cherry picked from commit 897e564361)
In particular, the kernel RPC layer used by the NFS client never
invokes pru_rcvd since it always reads data from the socket upcall
via MSG_SOCALLBCK which avoids calling pru_rcvd. As a result, on an
NFS client connection managed by t4_tom, RX credits were never
returned to the TOE connection to open the TCP window resulting in
connection hangs.
To fix, expand the set of conditions in do_rx_data where RX credits
are returned to match those in t4_rcvd_locked by calling the function
directly.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D41688
(cherry picked from commit dcfddc8dc0)
This change adds struct tcp_info fields corresponding to the following
struct tcpcb ones:
- snd_una
- snd_max
- rcv_numsacks
- rcv_adv
- dupacks
Note that while both tcp_fill_info() and fill_tcp_info_from_tcb() are
extended accordingly, no counterpart of rcv_numsacks is available in
the cxgbe(4) TOE PCB, though.
Sponsored by: NetApp, Inc. (originally)
This function actually only ever reads from the TCP PCB. Consequently,
also make the pointer to its TCP PCB parameter const.
Sponsored by: NetApp, Inc. (originally)
The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.
Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix
Summary:
Keep TOEDEV() macro for backwards compatibility, and add a SETTOEDEV()
macro to complement with the new accessors.
Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D38199
For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.
The most import one is the inpcb itself. With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.
The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.
One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.
Large part of the change was done with sed script:
s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g
Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.
Differential revision: https://reviews.freebsd.org/D37127
Rather than requiring a socket to be created as a TLS socket from the
get go, switch a TOE socket from "plain" TOE to TLS mode when a
receive key is added to the socket.
The firmware is only able to switch a "plain" TOE connection to TLS
mode if the head of the pending socket data is the start of a TLS
record, so the connection is migrated to TLS mode as a multi-step
process.
When TOE TLS RX is enabled, the associated connection's receive side
is frozen via a flag in the TCB. The state of the socket buffer is
then examined to determine if the pending data in the socket buffer
ends on a TLS record boundary. If so, the connection is migrated to
TLS mode and unfrozen. Otherwise, the connection is unfrozen
temporarily until more data arrives. Once more data arrives, the
receive queue is frozen again and rechecked. This continues until the
connection is paused at a record boundary. Any records received
before TLS mode is enabled are decrypted as software records.
Note that this removes the 'rx_tls_ports' sysctl. TOE TLS offload for
receive is now enabled automatically on existing TOE connections when
using a KTLS-aware SSL library just as it was previously enabled
automatically for TLS transmit. This also enables TLS offload for TOE
connections which enable TLS after passing initial data in the clear
(e.g. STARTTLS with SMTP).
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37351
The previous commit lost an implicit struct socket * cast. Use an
inline function instead as the macro is already rather long.
Fixes: e1401f7579 cxgbe: use standard sototcpcb() accessor macro to get socket's tcpcb
Sponsored by: Chelsio Communications
Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193a, this commit shall not cause any functional changes.
Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.
Differential revision: https://reviews.freebsd.org/D36400
o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d1), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee5).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232
Remove a few more remnants from the old pre-KTLS support and instead
assume that each work request sends a single TLS record.
Sponsored by: Chelsio Communications
Since c67f3b8b78 the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it. In 74a68313b5 macros that access
this mutex directly were added. Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.
This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.
This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally. However, it allows operation of a
protocol that doesn't use them.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35152
The final transmit and receive queue indices need to be positive
values. However, since txq_idx and rxq_idx are signed (to permit
using -1 to as a marker for uninitialized values), using %= with
another integer type (vi->nofld[tr]xq) yielded a sign-extended modulus
value. This resulted in negative queue indices and a buffer underrun
when arc4random() returned a value with the sign bit set. Use a
temporary unsigned variable to hold the "raw" queue index to force
unsigned modulus.
This worked previously because the modulus was previously applied
directly to the return value of arc4random() which is unsigned before
the result was assigned to txq_idx and rxq_idx.
Discussed with: np
Fixes: db28d4a0cd cxgbe/t4_tom: Support for round-robin selection of offload queues.
Sponsored by: Chelsio Communications
A COP (Connection Offload Policy) rule can now specify that the tx
and/or rx queue for a new tid should be selected in a round-robin
manner. There is no change in default behavior.
Reviewed by: jhb@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D34921
- Add icl_pdu_append_bio and icl_pdu_get_bio methods.
- Add new page pod routines for allocating and writing page pods for
unmapped bio requests. Use these new routines for setting up DDP
for iSCSI tasks with a SCSI I/O CCB which uses CAM_DATA_BIO.
- When ICL_NOCOPY is used to append data from an unmapped I/O request
to a PDU, construct unmapped mbufs from the relevant pages backing
the struct bio. This also requires changes in the t4_push_pdus path
to support unmapped mbufs.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D34383
Otherwise we end up copying one uninitialized byte into the socket
buffer.
Reported by: KMSAN
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33953
The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection
when they do output on it. The default stack never does this, thus
existing framework expects tcp_output() always to return locked and
valid tcpcb.
Provide KPI extension to satisfy demands of advanced stacks. If the
output method returns negative error code, it means that caller must
call tcp_drop().
In tcp_var() provide three inline methods to call tcp_output():
- tcp_output() is a drop-in replacement for the default stack, so that
default stack can continue using it internally without modifications.
For advanced stacks it would perform tcp_drop() and unlock and report
that with negative error code.
- tcp_output_unlock() handles the negative code and always converts
it to positive and always unlocks.
- tcp_output_nodrop() just calls the method and leaves the responsibility
to drop on the caller.
Sweep over the advanced stacks and use new KPI instead of using HPTS
delayed drop queue for that.
Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33370
Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference. Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.
Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.
The sorele() macro now uses refcount_release_if_not_last() try to drop
the socket reference without locking the socket. If that shortcut
fails, it locks the socket and calls sorele_locked().
Reviewed by: kib, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32741
For TCP DDP, handle_ddp_close() needs to see the pre-FIN rcv_nxt to
determine how much data was placed in the local buffer before the FIN
was received. The changes in d59f1c49e2 broke this by updating
rcv_nxt before calling handle_ddp_close().
Fixes: d59f1c49e2 cxgbe tom: Permit rcv_nxt mismatches on FIN for iSCSI connections on T6.
Sponsored by: Chelsio Communications
This is similar to the fixes in 141fe2dcee. One difference is that
TOE sockets do not change states (listen vs non-listen) once created,
so no lock is needed for SOLISTENING().
Sponsored by: Chelsio Communications