This gives a more uniform API for send tag life cycle management.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27000
Each TLS send tag in mlx5 contains a nested rate limit send tag.
Previously, the driver was calling internal functions to manage the
nested tag. Calling free methods directly instead of m_snd_tag_rele()
leaked send tag references and references on the ifp. Changes to use
the ifp methods for the nested tag for other methods are more cosmetic
but do simplify the code.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26996
Send tags are refcounted and if_snd_tag_free() is called by
m_snd_tag_rele() when the last reference is dropped on a send tag.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26995
In r348254, if_snd_tag_alloc() routines were changed to bump the ifp
refcount via m_snd_tag_init(). This function wasn't in the tree at
the time and wasn't updated for the new semantics, so was still doing
a separate bump after if_snd_tag_alloc() returned.
Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26999
r350501 added the 'st' parameter, but did not pass it down to
if_snd_tag_alloc().
Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26997
instead of mount_nullfs(8).
Obviously you'd need to force mount(8) to not call
mount_nullfs(8) to make use of it.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26934
more readable. While here, add linux_check_errtbl() function to make
sure we don't leave holes.
No objections: emaste (earlier version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26972
struct nameidata mixes caller arguments, internal state and output, which
can be quite error prone.
Recent addition of valdiating ni_resflags uncovered a caller which could
repeatedly call namei, effectively operating on partially populated state.
Add bare minimium validation this does not happen. The real fix would
decouple aforementioned state.
Reported by: pho
Tested by: pho (different variant)
- Add a new send tag type for a send tag that supports both rate
limiting (packet pacing) and TLS offload (mostly similar to D22669
but adds a separate structure when allocating the new tag type).
- When allocating a send tag for TLS offload, check to see if the
connection already has a pacing rate. If so, allocate a tag that
supports both rate limiting and TLS offload rather than a plain TLS
offload tag.
- When setting an initial rate on an existing ifnet KTLS connection,
set the rate in the TCP control block inp and then reset the TLS
send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
send tag. This allocates the TLS send tag asynchronously from a
task queue, so the TLS rate limit tag alloc is always sleepable.
- When modifying a rate on a connection using KTLS, look for a TLS
send tag. If the send tag is only a plain TLS send tag, assume we
failed to allocate a TLS ratelimit tag (either during the
TCP_TXTLS_ENABLE socket option, or during the send tag reset
triggered by ktls_output_eagain) and ignore the new rate. If the
send tag is a ratelimit TLS send tag, change the rate on the TLS tag
and leave the inp tag alone.
- Lock the inp lock when setting sb_tls_info for a socket send buffer
so that the routines in tcp_ratelimit can safely dereference the
pointer without needing to grab the socket buffer lock.
- Add an IFCAP_TXTLS_RTLMT capability flag and associated
administrative controls in ifconfig(8). TLS rate limit tags are
only allocated if this capability is enabled. Note that TLS offload
(whether unlimited or rate limited) always requires IFCAP_TXTLS[46].
Reviewed by: gallatin, hselasky
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26691
The calling process's process group can change between PROC_UNLOCK(p)
and PGRP_LOCK(pg) in tty_wait_background(), e.g. by a setpgid() call
from another process. If that happens, the signal is not sent to the
calling process, even if the prior checks determine that one should be
sent. Re-check that the process group hasn't changed after acquiring
the pgrp lock, and if it has, redo the checks.
PR: 250701
Submitted by: Jakub Piecuch <j.piecuch96@gmail.com>
MFC after: 2 weeks
Code was supposed to call callout_reset_sbt_on() rather than
callout_reset_sbt(). This resulted into passing a "cpu" value
to a "flag" argument. A recipe for subtle errors.
PR: 248652
Reported by: sg@efficientip.com
MFC with: r367093
Foundation copyrights, approved by emaste@. It does not include
files which carry other people's copyrights; if you're one
of those people, feel free to make similar change.
Reviewed by: emaste, imp, gbe (manpages)
Differential Revision: https://reviews.freebsd.org/D26980
The way netmap TX is handled in iflib when TX interrupts are not
used (IFC_NETMAP_TX_IRQ not set) has some issues:
- The netmap_tx_irq() function gets called by iflib_timer(), which
gets scheduled with tick granularity (hz). This is not frequent
enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end
result is that the transmitting netmap application is not woken
up fast enough to saturate the link with small packets.
- The iflib_timer() functions also calls isc_txd_credits_update()
to ask for more TX completion updates. However, this violates
the netmap requirement that only txsync can access the TX queue
for datapath operations. Only netmap_tx_irq() may be called out
of the txsync context.
This change introduces per-tx-queue netmap timers, using microsecond
granularity to ensure that netmap_tx_irq() can be called often enough
to allow for maximum packet rate. The timer routine simply calls
netmap_tx_irq() to wake up the netmap application. The latter will
wake up and call txsync to collect TX completion updates.
This change brings back line rate speed with small packets for ixgbe.
For the time being, timer expiration is hardcoded to 90 microseconds,
in order to avoid introducing a new sysctl.
We may eventually implement an adaptive expiration period or use another
deferred work mechanism in place of timers.
Also, fix the timers usage to make sure that each queue is serviced
by a different CPU.
PR: 248652
Reported by: sg@efficientip.com
MFC after: 2 weeks
All vnodes allocated by UMA are present on the global list used by
vnlru. getnewvnode modifies the state of the vnode (most notably
altering v_holdcnt) but never locks it. Moreover filesystems also
modify it in arbitrary manners sometimes before taking the vnode
lock or adding any other indicator that the vnode can be used.
Picking up such a vnode by vnlru would be problematic.
To that end there are 2 fixes:
- vlrureclaim, not recycling v_holdcnt == 0 vnodes, takes the
interlock and verifies that v_mount has been set. It is an
invariant that the vnode lock is held by that point, providing
the necessary serialisation against locking after vhold.
- vnlru_free_locked, only wanting to free v_holdcnt == 0 vnodes,
now makes sure to only transition the count 0->1 and newly allocated
vnodes start with v_holdcnt == VHOLD_NO_SMR. getnewvnode will only
transition VHOLD_NO_SMR->1 once more making the hold fail
Tested by: pho
function checks that the mutex lock is owned.
This fixes 'devctl disable re0' operation.
Sponsored by: Innovate DSbD
Differential Revision: https://reviews.freebsd.org/D26904
PCPU_GET(curpmap) expands to multiple instructions on arm64, and if the
current thread is migrated in between execution of those instructions, a
stale value may be used in the assertion condition.
Diagnosed by: mmel
Reported by: mmel, Bob Prohaska <fbsd@www.zefox.net>
Submitted by: alc
MFC after: 1 week
- remove setting of register value which is not used until the next value is
set
- Use the L2_SHIFT constant when setting up L2 superpages
Submitted by: Antonin Houska <ah AT melesmeles DOT cz>
Nothing implements this in the tree. Remove the ioctl and the
conversion to the geom atttribute stuff.
This was introduced in r94287 in 2002 and was retired in r113390
2003. It appeared in FreeBSD 5.0, but no other releases. This is a
vestige that was missed at the time and overlooked until now. No
compat is provided for this reason. And there's no implementation of
it today. And it was never part of a release from a stable branch.
Reviewed by: phk@
Differential Revision: https://reviews.freebsd.org/D26967
Version 0.2 of the SBI specification [1] marked the existing SBI
functions as "legacy" in order to move to a newer calling convention. It
also introduced a set of replacement extensions for some of the legacy
functionality. In particular, the TIME, IPI, and RFENCE extensions
implement and extend the semantics of their legacy counterparts, while
conforming to the newer version of the spec.
Update our SBI code to use the new replacement extensions when
available, and fall back to the legacy ones. These will eventually be
dropped, when support for version 0.2 is ubiquitous.
[1] https://github.com/riscv/riscv-sbi-doc/blob/master/riscv-sbi.adoc
Submitted by: Danjel Q. <danq1222@gmail.com>
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D26953
S-mode software has write access to the SIP.SSIP bit, so instead of
making a second round-trip through the SBI we can clear it ourselves.
The SBI spec has deprecated this function for this exactly this reason.
Submitted by: Danjel Q. <danq1222@gmail.com
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D26952
Linux execve() gets audited as AUE_EXECVE as well, we should also interpret
the return from this correctly for the same reasoning as in r367002.
MFC with: r367002
The kernel will never map the first page, so any symbols in that
range cannot refer to addresses. Some third-party assembly files
define internal constants which appear in their symbol table.
Avoiding the lookup for those symbols avoids replacing small offsets
with those symbols during disassembly.
Reported by: Anton Rang <rang%acm.org>
Reviewed by: Anton Rang <rang%acm.org>, markj
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26895
Without the 'car limit' enabled (before this), running sequential ZFS scrub
on HDD without command queuing support, I've measured latency on concurrent
random reads reaching 4 seconds (surprised that not more). Enabling this
reduced the latency to 65 milliseconds, while scrub still doing ~180MB/s.
For disks with command queuing this does not make much difference (if any),
since most time all the requests are queued down to the disk or HBA, leaving
nothing in the queue to sort. And even if something does not fit, staying on
the queue, it is likely not for long. To not limit sorting in such bursty
scenarios I've added batched counter zeroing when the queue is getting empty.
The internal scheduler of the SAS HDD I was testing seems to be even more
loyal to random I/O, reducing the scrub speed to ~120MB/s. So in case
somebody worried this is limit is too strict -- it actually looks relaxed.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
SAM-3 specification introduced concept of Task Priority, that was renamed
to Command Priority in SAM-4, and supported by all modern SCSI transports.
It provides 15 levels of relative priorities: 1 - highest, 15 - lowest and
0 - default. SAT specification for SATA devices translates priorities 1-3
into NCQ high priority.
This change adds new "priority" field into empty spots of struct ccb_scsiio
and struct ccb_accept_tio of CAM and struct ctl_scsiio of CTL. Respective
support is added into iscsi(4), isp(4), mpr(4), mps(4) and ocs_fc(4) drivers
for both initiator and where applicable target roles. Minimal support was
added to CTL to receive the priority value from different frontends, pass it
between HA controllers and report in few places.
This patch does not add consumers of this functionality, so nothing should
really change yet, since the field is still set to 0 (default) on initiator
and not actively used on target. Those are to be implemented separately.
I've confirmed priority working on WD Red SATA disks connected via mpr(4)
and properly transferred to CTL target via iscsi(4), isp(4) and ocs_fc(4).
While there, added missing tag_action support to ocs_fc(4) initiator role.
MFC after: 1 month
Relnotes: yes
Sponsored by: iXsystems, Inc.
ocs_scsi_recv_cmd() receives the flags after ocs_get_flags_fcp_cmd(),
which translates them from FCP_TASK_ATTR_* to OCS_SCSI_CMD_*. As result
non-SIMPLE requests turned into HEAD or ORDERED depending on direction.
MFC after: 2 weeks
over various major releases. Superblock check hashes were added for
the 12 release and cylinder-group and inode check hashes will appear
in the 13 release.
When a disk with a UFS filesystem is writably mounted, the kernel
clears the feature flags for anything that it does not support. For
example, if a UFS disk from a 12-stable kernel is mounted on an
11-stable system, the 11-stable kernel will clear the flag in the
filesystem superblock that indicates that superblock check-hashs
are being maintained. Thus if the disk is later moved back to a
12-stable system, the 12-stable system will know to ignore its
incorrect check-hash.
If the only filesystem modification done on the earlier kernel is
to run a utility such as growfs(8) that modifies the superblock but
neither updates the check-hash nor clears the feature flag indicating
that it does not support the check-hash, the disk will fail to mount
if it is moved back to its original newer kernel.
This patch moves the code that clears the filesystem feature flags
from the mount code (ffs_mountfs()) to the code that reads the
superblock (ffs_sbget()). As ffs_sbget() is used by the kernel mount
code and is imported into libufs(3), all the filesystem utilities
will now also clear these flags when they make modifications to the
filesystem.
As suggested by John Baldwin, fsck_ffs(8) has been changed to accept
and repair bad superblock check-hashes rather than refusing to run.
This change allows fsck to recover filesystems that have been impacted
by utilities older than those created after this change and is a
sensible thing to do in any event.
Reported by: John Baldwin (jhb@)
MFC after: 2 weeks
Sponsored by: Netflix
The age of the intel compiler support is so old as to be
uninteresting. No recent recports of intel compiler support have been
received. Remove all the special case workarounds for the Intel
compiler. Should there be interest in supporting the compiler, contact
me and I'll work with people to make it happen, though I suspect these
instances are more likely to be in the way than to be helpful.
Reviewed by: cem, emaste, vangyzen, dim
Differential Revision: https://reviews.freebsd.org/D26817