locally generated SCTP packets sent over IPv4. This make
the behaviour consistent with IPv6.
Reviewed by: ae@, bz@, jtl@
Approved by: re (kib@)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17406
Discussing with Benjamin Herrenschmidt, OPAL_INT_GET_XIRR masks the
returned priority, so must be resumed before more interrupts can be
handled at this priority. Since there are only two priorities used in
FreeBSD, we know that the previous priority in an EOI will always be
0xff (lowest priority).
Reviewed by: nwhitehorn
Approved by: re(rgrimes)
Differential Revision: https://reviews.freebsd.org/D17361
Summary:
Discussing with Benjamin Herrenschmidt, MSIs, and edge-triggered
interrupts in general, must not be masked in XICS and XIVE, else
subsequent interrupts may be ignored.
Testing locally on my Talos II (single CPU, 18-core POWER9), NVMe now
works with MSI, improving read throughput by ~70% (900MB/s -> 1.67GB/s,
with 64MB block size) over INTx interrupts, and snd_hda(4) now will
actually play music with MSI. Previously, snd_hda(4) would not receive
interrupts, timing out, and declaring the channels dead.
This has also been tested by Kevin Bowling, and others, with great
success. Kevin reported NVMe unusable on his Talos II prior to this
patch.
Reviewed by: nwhitehorn, kbowling
Approved by: re(rgrimes)
Differential Revision: https://reviews.freebsd.org/D17356
It's not supposed to be legal for two jails to contain the same IP address,
unless both jails contain only that one address. This is the behavior
documented in jail(8), and is there to prevent confusion when multiple
jails are listening on IADDR_ANY.
VIMAGE jails (now the default for GENERIC kernels) test this correctly,
but non-VIMAGE jails have been performing an incomplete test when nested
jails are used.
Approved by: re@ (kib@)
MFC after: 5 days
When using a vlan with igb and the vlanhwcsum option, any mbufs which
already had the TCP, UDP, or SCTP checksum calculated and therefore don't
have the CSUM_[IP|IP6]_[TCP|UDP|SCTP] bits set in the csum_flags field would
have the L4 checksum corrupted by the hardware.
This was caused by the driver setting E1000_TXD_POPTS_TXSM any time a
checksum bit was set OR a vlan tag was present.
The patched driver only sets E1000_TXD_POPTS_TXSM when an offload is
requested.
PR: 231416
Reported by: pi
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17404
rep stos has a high startup time even on modern microarchitectures like
Skylake. Intel optimization manuals discuss how for small sizes it is
beneficial to go for streaming stores. Since those cannot be used without
extra penalty in the kernel I investigated performance impact of just
regular movs.
The patch below implements a very simple scheme: a 32-byte loop followed
by filling in the remainder of at most 31 bytes. It has a 256 breaking
point on which it falls back to rep stos. It provides a significant win
over the current primitive on several machines I tested (both Intel and
AMD). A 64-byte loop did not provide any benefit even for multiple of 64
sizes.
See the review for benchmark data.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17398
When getting the number of bytes to checksum make sure to convert the UDP
length to host byte order when the entire header is not in the first mbuf.
Reviewed by: jtl, tuexen, ae
Approved by: re (gjb), jtl (mentor)
Differential Revision: https://reviews.freebsd.org/D17357
Refactor sample ring buffer ring handling to make it more robust to
long running callchain collection handling
r338112 introduced a (now fixed) regression that exposed a number of race
conditions within the management of the sample buffers. This
simplifies the handling and moves the decision to overwrite a
callchain sample that has taken too long out of the NMI in to the
hardlock handler. With this change the problem no longer shows up as a
ring corruption but as the code spending all of its time in callchain
collection.
- Makes the producer / consumer index incrementing monotonic, making it
easier (for me at least) to reason about.
- Moves the decision to overwrite a sample from NMI context to interrupt
context where we can enforce serialization.
- Puts a time limit on waiting to collect a user callchain - putting a
bound on head-of-line blocking causing samples to be dropped
- Removes the flush routine which was previously needed to purge
dangling references to the pmc from the sample buffers but now is only
a source of a race condition on unload.
Previously one could lock up or crash HEAD by running:
pmcstat -S inst_retired.any_p -T and then hitting ^C
After this change it is no longer possible.
PR: 231793
Reviewed by: markj@
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D17011
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.
Results in mmap speed up and a 70% increase in brk calls / second.
Reviewed by: alc@, markj@, kib@
Approved by: re (delphij@)
Differential Revision: https://reviews.freebsd.org/D16273
With the new route cache feature udp_notify() will modify the inp when it
needs to invalidate the route cache. Ensure that we hold a write lock on
the inp before calling the function to ensure that multiple threads don't
race while trying to invalidate the cache (which previously lead to a page
fault).
Differential Revision: https://reviews.freebsd.org/D17246
Reviewed by: sbruno, bz, karels
Sponsored by: Dell EMC Isilon
Approved by: re (gjb)
This change is a no-op in terms of semantics, but has a side effect
of removing a perfectly useless nop sled for CPUs with ERMS.
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.
The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other member which requires no translation.
Reviewed by: kib
Approved by: re (rgrimes, gjb)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Review: https://reviews.freebsd.org/D17388
shutdown() to wakeup another thread blocked on a stream listen socket.
This code is failing, while it used to work on FreeBSD 10 and still
works on Linux.
It seems reasonable to add another exception to support something users are
actually doing, which used to work on FreeBSD 10, and still works on Linux.
And, it seems like it should be acceptable to POSIX, as we still return
ENOTCONN.
This patch is different to what had been committed to stable/11, since
code around listening sockets is different. Patch in D15019 is written
by jtl@, slightly modified by me.
PR: 227259
Obtained from: jtl
Approved by: re (kib)
Differential Revision: D15019
This caused microcode to be updated only on the BSP if hyperthreading
was disabled, typically resulting in a hang or reset.
Approved by: re (kib)
Sponsored by: The FreeBSD Foundation
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.
The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other members which require no translation.
Reviewed by: kib (prior version), jhb
Approved by: re (rgrimes)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Review: https://reviews.freebsd.org/D17378
using an application trying to use a v4mapped destination address on a
kernel without INET support or on a v6only socket.
Catch this case and prevent the packet from going anywhere;
else, without the KASSERT() armed, a v4mapped destination
address might go out on the wire or other undefined behaviour
might happen, while with the KASSERT() we panic.
PR: 231728
Reported by: Jeremy Faulkner (gldisater gmail.com)
Approved by: re (kib)
system-call entry and whenever audit arguments or return values are
captured:
1. Expose a single global, audit_syscalls_enabled, which controls
whether the audit framework is entered, rather than exposing
components of the policy -- e.g., if the trail is enabled,
suspended, etc.
2. Introduce a new function audit_syscalls_enabled_update(), which is
called to update audit_syscalls_enabled whenever an aspect of the
policy changes, so that the value can be updated.
3. Remove a check of trail enablement/suspension from audit_new() --
at the point where this function has been entered, we believe that
system-call auditing is already in force, or we wouldn't get here,
so simply proceed to more expensive policy checks.
4. Use an audit-provided global, audit_dtrace_enabled, rather than a
dtaudit-provided global, to provide policy indicating whether
dtaudit would like system calls to be audited.
5. Do some minor cosmetic renaming to clarify what various variables
are for.
These changes collectively arrange it so that traditional audit
(trail, pipes) or the DTrace audit provider can enable system-call
probes without the other configured. Otherwise, dtaudit cannot
capture system-call data without auditd(8) started.
Reviewed by: gnn
Sponsored by: DARPA, AFRL
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17348
In the probe case for SCSI SMR Host Aware or Most Managed drives, be sure
to free allocated memory.
sys/cam/scsi/scsi_da.c:
In dadone_probezone(), free the data pointer before returning.
MFC after: 3 days
Sponsored by: Spectra Logic
Approved by: re (kib)
Otherwise (iter % ds->ds_cnt) is not guaranteed to lie in the range
[0, MAXMEMDOM).
Reported by: pho
Reviewed by: kib
Approved by: re (rgrimes)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17374
Tested with ifunc resolvers in the kernel and module with calls from
kernel to kernel, module to kernel, and module to module.
Reviewed by: kib (previous version)
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17370
Belatedly add a comment to the amd64 pmap explaining why we initialize
the kernel pmap's resident page count.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17377
Such data may later be unmapped. This occurs, for example, when a
loader-provided microcode update file is discarded.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17340
The initial raise in r336519 wasn't enough for using big resolution
(1920 x 1200 for example). Raise it again.
Reported by: bob prohaska <fbsd@www.zefox.net>
Tested by: bob prohaska <fbsd@www.zefox.net>
Approved by: re (gjb@)
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.
Add support to FreeBSD for empty NUMA domains by:
- creating empty memory domains when parsing the SRAT table,
rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
Thanks to Jeff for suggesting this strategy.
Reviewed by: alc, markj
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D1683
This removes two assignments for the flags field being done
twice and adds one, which was missing.
Thanks to Felix Weinrank for reporting the issue he found
by using fuzz testing of the userland stack.
Approved by: re (kib@)
MFC after: 1 week
INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic
it is possible, that the code is running in net_epoch_preempt section,
and INP_INFO_UNLOCK_ASSERT() is very strict assertion for such case.
PR: 231428
Reviewed by: mmacy, tuexen
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17335
arguments wrong in r339020.
PR: 231625
Reported by: Yuri Pankov (yuripv yuripv.net)
Reviewed by: cem, Yuri Pankov (yuripv yuripv.net)
Approved by: re (kib)
Pointyhat to: bz (a rather big one for this one)
sctp_process_cmsgs_for_init() and sctp_findassociation_cmsgs()
similar to sctp_find_cmsg() to improve consistency and avoid
the signed/unsigned issues in sctp_process_cmsgs_for_init()
and sctp_findassociation_cmsgs().
Thanks to andrew@ for reporting the problem he found using
syzcaller.
Approved by: re (kib@)
MFC after: 1 week
sending UDP encapsulated SCTP packets.
This is consistent with the behaviour that when such packets are received,
the corresponding UDP stats counter (udps_ipackets) is incremented.
Thanks to Peter Lei for making me aware of this inconsistency.
Approved by: re (kib@)
MFC after: 1 week
is done via using ifconfig, which uses a SIOCSIFMTU ioctl() command, or
doing it using a TUNSIFINFO/TAPSIFINFO ioctl() command.
Without this patch, for IPv6 the new MTU is not used when creating routes.
Especially, when initiating TCP connections after increasing the MTU,
the old MTU is still used to compute the MSS.
Thanks to ae@ and bz@ for helping to improve the patch.
Reviewed by: ae@, bz@
Approved by: re (kib@)
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D17180
This is mostly a cosmetic change except that obsolete system calls are
assigned meaningful names in the names arrays which means that using
tools like kdump or truss against binaries invoking these system calls
will print out the name instead of the number. The script I use to
generate the XML list of syscalls for GDB also ignores UNIMPL but not
OBSOL entries. In general UNIMPL should only be used to reserve
placeholders for system calls that have never been implemented while
system calls that existed at one time in FreeBSD but were removed
should be marked OBSOL instead.
Reviewed by: brooks, kib, imp
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17344
after the file mapping was wired.
if a wired map entry is backed by vnode and the file is truncated,
corresponding pages are invalidated. vm_fault_copy_entry() should be
aware of it and allow for invalid pages past end of file. Also, such
pages should be not mapped into userspace. If userspace accesses the
truncated part of the mapping later, it gets a signal, there is no way
kernel can prevent the page fault.
Reported by: andrew using syzkaller
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323