Commit graph

669 commits

Author SHA1 Message Date
Konstantin Belousov
7acae33bc9 mlx5 ipsec: for tx, enable SN autoincrement whenever ESN is enabled
Sponsored by:	Nvidia networking
2025-08-20 11:49:50 +03:00
Eric van Gyzen
5c5bb958fc mlx5: plug theoretical leak in vxlan rules
Plug a theoretical memory/refcount leak when adding a vxlan rule.
This is not currently an actual leak, but it could become one.

PR:		287945
Reviewed by:	kib
Sponsored by:	Dell Inc.
Differential Revision: https://reviews.freebsd.org/D51883
2025-08-13 21:47:14 -04:00
Konstantin Belousov
72c9ad9331 mlx5en ipsec offload: copy xform_history to the ipsec_accel_in_tag
Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, slavash
Sponsored by:	Nvidia networking
2025-07-17 12:36:26 +03:00
Ariel Ehrenberg
d6d66936c4 mlx5en: fix TLS Rx hardware offload initialization
The TLS RX context had the tcp sequence number of next TLS record set
in resync_tcp_sn parameter instead of in next_record_tcp_sn parameter
during hardware initialization.  This prevented the hardware from
synchronizing with the TLS stream, and caused TLS offload to remain
inactive.  Set next_record_tcp_sn to the next TCP sequence number and
resync_tcp_sn to zero to enable proper TLS record boundary detection
and activate hardware offload.

Reviewed by:	kib, slavash
Sponsored by:	NVidia networking
MFC after:	1 week
2025-07-17 02:32:27 +03:00
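
The field assignment described in d6d66936c4 can be pictured with a minimal sketch. The struct below is a hypothetical stand-in, not the driver's real context layout; only the two field names come from the commit message.

```c
#include <stdint.h>

/* Hypothetical stand-in for the TLS RX hardware context (not the real layout). */
struct sketch_tls_rx_ctx {
	uint32_t next_record_tcp_sn;	/* TCP sequence number of the next TLS record */
	uint32_t resync_tcp_sn;		/* left at zero during initialization */
};

/* Initialize the context the way the commit message describes the fix. */
static void
sketch_tls_rx_ctx_init(struct sketch_tls_rx_ctx *ctx, uint32_t next_tcp_sn)
{
	ctx->next_record_tcp_sn = next_tcp_sn;
	ctx->resync_tcp_sn = 0;
}
```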
Ariel Ehrenberg
b6b3743fa2 mlx5en: add driver tls status string method for rx sessions
When collecting TLS information, the kernel calls the driver to get the
driver/hw TLS state. The driver queries the hw for its tracking and
authentication states and dumps them into the driver state buffer. This
requires sleeping while waiting for the hw response.

Reviewed by:	kib
Sponsored by:	NVidia networking
2025-07-10 17:42:27 +03:00
Konstantin Belousov
cdd8129216 mlx5_en: wait_for_completion_timeout() takes jiffies
Sponsored by:	Nvidia networking
2025-07-10 17:42:27 +03:00
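
cdd8129216 has no body, so the following is only an illustration of the unit its title refers to: a timeout expressed in ticks (jiffies) rather than in milliseconds. The tick rate and the helper name are assumptions made for this sketch, not values taken from the driver.

```c
/* Assumed tick rate for illustration only; the real rate is a kernel setting. */
#define SKETCH_HZ 100u

/* Convert milliseconds to ticks, rounding up so that a short nonzero
 * timeout never becomes zero ticks. */
static unsigned long
sketch_msecs_to_ticks(unsigned int ms)
{
	return (((unsigned long)ms * SKETCH_HZ + 999u) / 1000u);
}
```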
Andrew Gallatin
20e15e905c mlx5: Decrease FW init timeout from 120 seconds to 5 seconds
When encountering a failed NIC, the mlx5 driver will wait up to 120
secs for the firmware to respond.  This timeout is absurdly huge, and
leads to boot times of 40 minutes to over an hour on our servers when a
NIC fails.  This is because the driver will attempt to attach to the
failed NIC multiple times (once for each driver loaded after mlx5),
and wait 2 minutes on each attempt.  This happens because the mlx5
driver is still the best match for the device.  This delay then
triggers watchdog timeouts in our environment, rendering servers
with a failed NIC entirely unbootable without manual intervention.

Note that FW_INIT_WARN_MESSAGE_INTERVAL must also be decreased, as
it must be less than the init timeout.

Reviewed by: kib (initial version, before reducing warn interval)
Sponsored by: Netflix
2025-06-29 16:51:50 -04:00
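
The arithmetic behind 20e15e905c can be sketched as follows. The attempt count used in the usage note and the warn interval value are hypothetical; the commit states only the 120 second and 5 second timeouts and that the warn interval must be smaller than the timeout.

```c
#define SKETCH_FW_INIT_TIMEOUT_SECS		5
#define SKETCH_FW_INIT_WARN_INTERVAL_SECS	2	/* hypothetical value */

/* The constraint noted in the commit message, checked at compile time. */
_Static_assert(SKETCH_FW_INIT_WARN_INTERVAL_SECS < SKETCH_FW_INIT_TIMEOUT_SECS,
    "warn interval must be shorter than the init timeout");

/* Total time spent waiting when every attach attempt runs into the timeout. */
static unsigned int
sketch_total_wait_secs(unsigned int attach_attempts, unsigned int timeout_secs)
{
	return (attach_attempts * timeout_secs);
}
```

Assuming 20 attach attempts, the old timeout gives sketch_total_wait_secs(20, 120), which is 2400 seconds (40 minutes), while the new one gives 100 seconds.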
Konstantin Belousov
901256f6ea mlx5: jiffies is unsigned long
Sponsored by:	NVidia networking
Differential revision:	https://reviews.freebsd.org/D48878
2025-04-29 13:53:40 +00:00
John Baldwin
dcb2a1ae46 <net/sff8472.h>: Conditionally export table of ID names
Only export the array of ID names if either _WANT_SFF_8024_ID or
_WANT_SFF_8472_ID is defined.  Exporting them unconditionally can
trigger unused variable warnings if a consumer doesn't use the array.

Reviewed by:	olce, bz, brooks
Differential Revision:	https://reviews.freebsd.org/D49955
2025-04-28 13:06:07 -04:00
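
The opt-in export pattern from dcb2a1ae46 might look roughly like this. The macro, enum, and table names are hypothetical and are not the contents of <net/sff8472.h>; in a real header the consumer, not the header, would define the opt-in macro.

```c
/* Hypothetical opt-in macro; a consumer defines it before including the header. */
#define WANT_SKETCH_ID_NAMES

enum { SKETCH_ID_UNKNOWN, SKETCH_ID_SFP, SKETCH_ID_QSFP, SKETCH_ID_COUNT };

#ifdef WANT_SKETCH_ID_NAMES
/* Only emitted on request, so other consumers see no unused variable. */
static const char *sketch_id_names[SKETCH_ID_COUNT] = {
	"Unknown", "SFP", "QSFP"
};
#endif

/* Look up a name, falling back to "?" when the table is absent or id is bad. */
static const char *
sketch_id_name(int id)
{
#ifdef WANT_SKETCH_ID_NAMES
	if (id >= 0 && id < SKETCH_ID_COUNT)
		return (sketch_id_names[id]);
#endif
	(void)id;
	return ("?");
}
```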
Ariel Ehrenberg
89e0e3814e mlx5en: Use connector type instead of cable type for media detection
Replace cable type detection with connector type for more accurate media
type selection. The connector type is queried directly from the PTYS
register and provides more reliable information about the physical port
type compared to cable type.

Reviewed by:	slavash
Sponsored by:	NVidia networking
MFC after:	1 week
2025-04-09 07:55:27 +03:00
Konstantin Belousov
f0adc907fc mlx5en: sync channel close with the rq completion processing
Without the wait, mlx5e_destroy_rq() might free mbuf that is passed up
to the network stack on receive in mlx5e_poll_rx_cq().

Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-31 21:59:50 +03:00
Konstantin Belousov
480fc5b8e5 mlx5en: bump MLX5E_MAX_BUSDMA_RX_SEGS
This is needed to accommodate more data segments in wqes for 64K receive
mbuf chains.

Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 18:02:17 +02:00
Konstantin Belousov
016f40466a mlx5en: fix rq->wqe_sz usage
Define it as the size of the single data segment in wqe.

Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 18:01:59 +02:00
Konstantin Belousov
c2987d7876 mlx5: bump the max LRO packet size
The belief is that the 7*MCLBYTES limit was set to not hit the segment
limit for wqe busdma tag.  But with the current mbuf allocator it is not
possible, and even if it was, the corresponding wqe fill would simply
fail.

Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 18:01:36 +02:00
Konstantin Belousov
89491b1edb mlx5en: stop arbitrarily limiting max wqe size
Now that the driver accepts s/g receive buffers, there is no
sense in trying to use pre-existing mbuf clusters sizes.  The only
possible optimization is to use full page size if wqe size is greater
than MCLBYTES.

Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 18:01:17 +02:00
Konstantin Belousov
bc10238492 mlx5: overwrite only the echo reply timestamp from the last packet in LRO
Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 18:01:05 +02:00
Konstantin Belousov
7560ed3a6b mlx5: assert CQE structure size
Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 18:00:53 +02:00
Konstantin Belousov
903996760d mlx5: correct the predicate asserted in __predict_true()
Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 18:00:42 +02:00
Konstantin Belousov
efe9a3996e mlx5: recalculate tcp checksum for ipv6 hw lro coalesced packet
Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 18:00:27 +02:00
Konstantin Belousov
3eb6d4b4a2 mlx5: recalculate tcp checksum for ipv4 hw lro coalesced packet
Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 18:00:09 +02:00
Konstantin Belousov
dd1bd0ec5c mlx5_en: correct recalculation of the ipv4 checksum for hw lro packet
The call to in_cksum_skip() did not skip the ethernet header.

Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 17:59:50 +02:00
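
Why the skip offset matters in dd1bd0ec5c can be shown with a self-contained one's-complement checksum. This is a sketch, not the kernel's in_cksum_skip(): the function name, its signature, and the constant are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETHER_HDR_LEN 14

/* Internet checksum over len bytes of buf, starting at byte offset skip. */
static uint16_t
sketch_cksum_skip(const uint8_t *buf, size_t len, size_t skip)
{
	uint32_t sum = 0;
	size_t i;

	/* Add up big-endian 16-bit words. */
	for (i = skip; i + 1 < skip + len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	/* A trailing odd byte is padded with zero on the right. */
	if (len & 1)
		sum += (uint32_t)buf[skip + len - 1] << 8;
	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)~sum);
}
```

Summing a valid IPv4 header from offset SKETCH_ETHER_HDR_LEN yields zero, while starting at offset 0 mixes the ethernet header into the sum and generally does not.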
Konstantin Belousov
c3555174fd mlx5en: follow PRM for setting the max hw lro segment size
If the NIC is capable, just pass the full packet size, including L2/L3
headers, as the segment size.  Otherwise, decrement the number of
strides by 1 to leave space for L2/IP headers, as it was done before.
But do the arithmetic on the segment number instead of the full packet
size.

Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 17:59:27 +02:00
Konstantin Belousov
93e70e3a94 mlx5en: explain why interface needs to be reopened on hw lro change
Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 17:59:12 +02:00
Konstantin Belousov
02fe38b921 mlx5en: make the hw lro control dynamic
Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 17:58:53 +02:00
Konstantin Belousov
9807157363 mlx5core: add mlx5_core_modify_tir()
Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 17:58:35 +02:00
Konstantin Belousov
816f27e848 mlx5en: control hw LRO with the driver conf sysctl, leaving IFCAP_LRO to sw
Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 17:58:12 +02:00
Konstantin Belousov
bbac54b820 mlx5en: make conf.hw_lro sysctl r/w
This alone does not make hw lro configurable by sysctl, it only removes
unneeded complications for users to access it.

Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>, Slava Shwartsman <slavash@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-13 17:57:54 +02:00
Slava Shwartsman
7008b9fab5 mlx5: Fix BlueField-4 device description
BlueField-4 will not be based on ConnectX-8. Remove the wrong description.

Sponsored by:   NVidia networking
MFC after:      1 week
2025-03-12 00:47:26 +02:00
Slava Shwartsman
85af37e159 mlx5en: Fix domain set usage in TLS tag import functions
Use the correct device pointer to obtain the domain set for memory
allocation. Previously, the functions were incorrectly using the arg
parameter directly instead of accessing mlx5_core_dev.

Signed-off-by: Slava Shwartsman <slavash@nvidia.com>

Sponsored by:	NVidia networking
MFC after:	1 week
2025-03-04 06:58:07 +02:00
Konstantin Belousov
1fbce7deef mlx5 ipsec: return EOPNOTSUPP for unsupported SAs instead of EINVAL
The ipsec offload infra requires the EOPNOTSUPP error from driver to
understand that the SA is valid but offload cannot be performed.

Sponsored by:	NVidia networking
2025-02-13 12:32:32 +02:00
Konstantin Belousov
2e794b7733 mlx5: add synthetic error for MLX5_CMD_OP_QUERY_FLOW_COUNTER when device is down
Sponsored by:	NVidia networking
2025-02-13 12:32:32 +02:00
Konstantin Belousov
4c2795340e mlx5 ipsec: fix typo in the message
Sponsored by:	NVidia networking
2025-02-09 02:19:32 +02:00
Andrew Gallatin
36fdc42c6a mlx5en: Fix SIOCSIFCAPNV
In 4cc5d081d8, a change was introduced that manipulated
drv_ioctl_data->reqcap using IFCAP2 bits.  This was noticed
when creating a mixed lagg with mce0 and ixl0 caused the
interfaces' txcsum caps to be disabled.

Fixes: 4cc5d081d8
Reviewed by: glebius
Sponsored by: Netflix
MFC After: 7 days
2025-01-30 20:57:35 -05:00
Ariel Ehrenberg
080f68d0ab mlx5_core: Add steering support for IPsec with IPv6
ipv6 flow tables were not connected to previous FS tables.
Created an additional table to serve as IPsec RX root.
This table has 2 rules for redirecting the received packets
to ipv4/ipv6 based on the IP family in the packet header.

Sponsored by:	   NVidia networking
2025-01-07 02:53:37 +02:00
Slava Shwartsman
b762b199af mlx5: Eliminate the use of mlx5_rule_fwd_action
Driver defined all flow context actions in MLX5_FLOW_CONTEXT_ACTION_*,
no need to duplicate them with mlx5_rule_fwd_action.

Sponsored by:   NVidia networking
MFC after:      1 week
2024-12-19 01:59:42 +02:00
Ariel Ehrenberg
2fb2c03512 mlx5_core: fix "no space" error on sriov enablement
Change the POOL_NEXT_SIZE define from 0 to BIT(30).  The define is used
to request the largest available flow table, so zero makes no sense for
it: many places in the driver pass zero explicitly, expecting the
smallest possible table size, and because of this define they
unknowingly allocate the biggest one instead.

Sponsored by:	NVidia networking
2024-12-16 00:27:53 +02:00
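
The sentinel collision fixed by 2fb2c03512 can be sketched like this. Only the BIT(30) sentinel value comes from the commit message; the minimum and maximum sizes and every name below are hypothetical.

```c
#include <stdint.h>

#define SKETCH_BIT(n)		(1u << (n))
#define SKETCH_POOL_NEXT_SIZE	SKETCH_BIT(30)	/* sentinel: largest available */
#define SKETCH_MIN_TABLE_SIZE	1u		/* hypothetical smallest size */
#define SKETCH_MAX_AVAILABLE	SKETCH_BIT(16)	/* hypothetical pool maximum */

/* With a nonzero sentinel, a request of 0 can mean "smallest" again. */
static uint32_t
sketch_pick_table_size(uint32_t requested)
{
	if (requested == SKETCH_POOL_NEXT_SIZE)
		return (SKETCH_MAX_AVAILABLE);
	if (requested == 0)
		return (SKETCH_MIN_TABLE_SIZE);
	return (requested);
}
```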
Ariel Ehrenberg
29a9d7c6ce mlx5_core: fix panic on sriov enablement
Align the code of fdb steering with flow steering core
and add missing parts in namespace initialization and
in prio logic

PR:	281714
Sponsored by:	NVidia networking
2024-12-16 00:27:31 +02:00
Richard Scheffenegger
0fc7bdc978 tcp: extend the use of the th_flags accessor function
Formally, there are 12 bits for TCP header flags.
Use the accessor functions in more (kernel) places.

No functional change.

Reviewed By: cc, #transport, cy, glebius, #iflib, kbowling
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D47063
2024-11-29 09:48:23 +01:00
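
The reason 0fc7bdc978 favors accessors is that 12 flag bits do not fit in a single byte. A minimal sketch, using a simplified hypothetical struct rather than the real TCP header definition:

```c
#include <stdint.h>

/* Hypothetical, simplified view of the header bytes that carry the flags. */
struct sketch_tcphdr {
	uint8_t th_x2;		/* low 4 bits hold the upper flag bits */
	uint8_t th_flags;	/* the lower 8 flag bits */
};

/* Combine both fields into one 12-bit flags value. */
static uint16_t
sketch_get_flags(const struct sketch_tcphdr *th)
{
	return ((uint16_t)(((th->th_x2 & 0x0f) << 8) | th->th_flags));
}

/* Split a 12-bit flags value back across both fields. */
static void
sketch_set_flags(struct sketch_tcphdr *th, uint16_t flags)
{
	th->th_x2 = (uint8_t)((flags >> 8) & 0x0f);
	th->th_flags = (uint8_t)(flags & 0xff);
}
```

Code that reads th_flags directly sees only the lower eight bits, which is what the accessors avoid.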
Konstantin Belousov
4cc5d081d8 mlx5en: only allow toggling offload caps if they are supported
Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2024-11-26 14:34:34 +02:00
Konstantin Belousov
cca0dc49e0 mlx5en: move runtime capabilities checks into helper functions
For TLS TX/RX, ratelimit, and IPSEC offload caps.

Reviewed by:	Ariel Ehrenberg <aehrenberg@nvidia.com>
Sponsored by:	NVidia networking
MFC after:	1 week
2024-11-26 14:34:34 +02:00
Gleb Smirnoff
67f9307907 mlx5e tls: use non-sleeping malloc flag as it was intended
Reviewed by:	gallatin
Fixes:		81b38bce07
2024-11-25 10:46:13 -08:00
Ariel Ehrenberg
253a1fa16b mlx5: Fix handling of port_module_event
Remove the array of port module status and instead save module status
and module number.

At boot, for each PCI function, the driver gets an event from fw about
module status. The event contains the module number and module status,
and the driver stores both.  When a user (ifconfig) asks for module
information, for each PCI function the driver first queries fw for the
module number of the current PCI function, then compares it to the
module number stored earlier.  If they match and the module status is
"plugged and enabled", the driver queries fw for the EPROM information
of that module number and returns it to the caller.

In principle fw could have determined the required module number of the
current PCI function itself, but fw is not implemented this way. The
current design of PRM/FW is that MCIA register handling is only aware
of modules, not of the PCI function->module connections.  FW is designed
to take the module number written to MCIA and write/read the content
to/from the associated module's EPROM.

So, based on current FW design, we must supply the module num so fw
can find the corresponding I2C interface of the module to write/read.

Sponsored by:	NVidia networking
MFC after:	1 week
2024-11-23 12:59:26 +02:00
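
The bookkeeping described in 253a1fa16b can be sketched as below. All type and function names are hypothetical; the firmware queries are represented only by the module number passed in.

```c
#include <stdbool.h>
#include <stdint.h>

enum sketch_module_status {
	SKETCH_MODULE_UNPLUGGED,
	SKETCH_MODULE_PLUGGED_ENABLED
};

/* Per PCI function state: one module number and status, not an array. */
struct sketch_port_state {
	uint8_t module_num;
	enum sketch_module_status status;
};

/* Record what the firmware event reported. */
static void
sketch_on_module_event(struct sketch_port_state *st, uint8_t module_num,
    enum sketch_module_status status)
{
	st->module_num = module_num;
	st->status = status;
}

/* Read the EPROM only if fw's module number matches the stored one and
 * the module is plugged and enabled. */
static bool
sketch_may_read_eprom(const struct sketch_port_state *st, uint8_t fw_module_num)
{
	return (st->module_num == fw_module_num &&
	    st->status == SKETCH_MODULE_PLUGGED_ENABLED);
}
```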
Konstantin Belousov
0d38b0bc8f mlx5en: fix the sign of mlx5e_tls_st_init() error, convert from Linux to BSD
Sponsored by:	NVidia networking
MFC after:	1 week
2024-11-23 12:09:50 +02:00
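
0d38b0bc8f has no body. As background for its title: Linux-derived code conventionally returns negative errno values, while FreeBSD kernel code returns positive ones. A minimal sketch of the conversion, with a hypothetical helper name that does not appear in the driver:

```c
#include <errno.h>

/* Turn a Linux-style return (0 or negative errno) into a BSD-style one
 * (0 or positive errno). */
static int
sketch_linux_to_bsd_error(int linux_ret)
{
	return (linux_ret < 0 ? -linux_ret : linux_ret);
}
```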
Konstantin Belousov
64bf5a431c mlx5_en: style function prototype
Sponsored by:	NVidia networking
MFC after:	2 weeks
2024-11-23 12:01:50 +02:00
Andrew Gallatin
81b38bce07 mlx5e tls: Ensure all allocated tags have a hw context associated
Ensure all allocated tags have a hardware context associated.
The hardware context allocation is moved into the zone import
routine, as suggested by kib.  This is safe because these zone
allocations are always done in a sleepable context.

I have removed the now pointless num_resources tracking,
and added sysctls / tunables to control UMA zone limits
for these tls tags, as well as a tunable to let the
driver pre-allocate tags at boot.

MFC after:	2 weeks
2024-11-23 12:01:50 +02:00
Konstantin Belousov
de7a92756f mlx5en: improve reporting of kernel TLS, IPSEC offload, and ratelimit caps
Only ever set the capabilities bits if kernel options are enabled.
Check for hardware capabilities before setting software bits.

Sponsored by:	NVidia networking
MFC after:	1 week
2024-11-14 00:56:11 +02:00
Andrew Gallatin
49597c3e84 mlx5e: Use M_WAITOK when allocating TLS tags
Now that it is clear we're in a sleepable context, use
M_WAITOK when allocating TLS tags.

Suggested by: kib
Sponsored by: Netflix
2024-10-23 15:56:14 -04:00
Andrew Gallatin
81dbc22ce8 mlx5e: Immediately initialize TLS send tags
Under massive connection thrashing (web server restarting), we see
long periods where the web server blocks when enabling ktls offload
when NIC ktls offload is enabled.

It turns out the driver uses a single-threaded linux work queue to
serialize the commands that must be sent to the nic to allocate and
free tls resources. When freeing sessions, this work is handled
asynchronously. However, when allocating sessions, the work is handled
synchronously and the driver waits for the work to complete before
returning. When under massive connection thrashing, the work queue is
first filled by TLS sessions closing. Then when new sessions arrive,
the web server enables kTLS and blocks while the tens or hundreds of
thousands of session closes queued up are processed by the NIC.

Rather than using the work queue to open a TLS session on the NIC,
switch to doing the open directly. This allows us to cut in front of
all those sessions that are waiting to close, and minimize the amount
of time the web server blocks. The risk is that the NIC may be out of
resources because it has not processed all of those session frees. So
if we fail to open a session directly, we fall back to using the work
queue.

Differential Revision: https://reviews.freebsd.org/D47260
Sponsored by: Netflix
Reviewed by: kib
2024-10-23 15:16:19 -04:00
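
The open-directly-then-fall-back strategy from 81dbc22ce8 can be modeled with a toy NIC. Every name here is hypothetical, and the wait on the work queue is not modeled; the sketch shows only the control flow.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the NIC; none of these names come from the driver. */
struct sketch_nic {
	int free_contexts;	/* hardware TLS contexts available right now */
	int queued_opens;	/* opens deferred to the serialized work queue */
};

/* Fast path: try to open the session immediately. */
static int
sketch_open_direct(struct sketch_nic *nic)
{
	if (nic->free_contexts == 0)
		return (ENOMEM);	/* pending frees not yet processed */
	nic->free_contexts--;
	return (0);
}

/* Slow path: queue the open behind the pending closes. */
static void
sketch_open_queued(struct sketch_nic *nic)
{
	nic->queued_opens++;
}

/* Returns true when the direct open succeeded, false when it fell back. */
static bool
sketch_open_session(struct sketch_nic *nic)
{
	if (sketch_open_direct(nic) == 0)
		return (true);
	sketch_open_queued(nic);
	return (false);
}
```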
Konstantin Belousov
8e5b07dd08 mlx5_ipsec: add enough #ifdef IPSEC_OFFLOAD to make LINT_NOIP compilable
Reported by:	kp
Sponsored by:	NVidia networking
Fixes:	2851aafe96
2024-10-10 16:18:11 +03:00
Konstantin Belousov
2851aafe96 mlx5 ipsec_offload: ensure that driver does not dereference dead sahindex
Take the sahtree rlock and check for the DEAD SA state before validating
and filling the SA xfrm attributes.

Sponsored by:	NVidia networking
2024-10-10 12:55:45 +03:00