These invocations were directly calling enkbd_diag(), rather than
indirecting back through kbdd_diag/kbdsw. While they're functionally
equivalent, invoking kbdd_diag where feasible (i.e. not in a diag
implementation) makes it easier to visually identify locking needs in these
other drivers.
Don't hold the scheduler lock while doing context switches. Instead we
unlock after selecting the new thread and switch within a spinlock
section leaving interrupts and preemption disabled to prevent local
concurrency. This means that mi_switch() is entered with the thread
locked but returns without. This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on
a blocked lock in switch.
This change has not been made to 4BSD but in principle it would be
more straightforward.
Discussed with: markj
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22778
Eliminate recursion from most thread_lock consumers. Return from
sched_add() without the thread_lock held. This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks. This will eventually allow for lockless remote adds.
Discussed with: kib
Reviewed by: jhb
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22626
Within command completion processing the callback function may access
the DMAed data buffer. Synchronize it before use, not after.
This allows NVMe disks to be used on non-DMA-coherent arm64 systems.
MFC after: 3 weeks
This #ifdef is misleading as there are actually no user-serviceable parts
inside and, as far as I can tell, there is no pollution leading from
userland to this header. Furthermore, it becomes a slight nuisance when
attempting to move things around in this header.
an exclusive object lock.
Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy. This may be
done inconsistently and requires the object lock which is not always
convenient.
Instead, track when swap space is present. The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.
Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.
Discussed with: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22654
Parse out the VSEC. If the user invokes a second -c command line option,
do a hex dump of the vendor data.
Reviewed by: imp
MFC after: 3 days
Sponsored by: Intel
Differential Revision: http://reviews.freebsd.org/D22808
CPL_TX_PKT_XT disables the internal parser on the chip and instead
relies on the driver to provide the exact length of the L2 and L3
headers. This allows hw checksumming and TSO to be used with L2 and
L3 encapsulations that the chip doesn't understand directly.
Note that netmap tx still uses the old CPL as it never uses the hw
to generate the checksum on tx.
Reviewed by: jhb@
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D22788
Delay the attachment of children, when requested, until after interrupts are
running. This is often needed to allow children to run transactions on i2c or
spi busses. It's a common enough idiom that it will be useful to have its own
wrapper.
Reviewed by: ian
Differential Revision: https://reviews.freebsd.org/D21465
While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed as int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.
Differential Revision: https://reviews.freebsd.org/D20999
detach work and return the error. Especially don't call iicbus_reset()
since the most likely cause of failing to detach children is that one
of them has IO in progress.
This trims the boot time a bit more for AWS and other platforms that have nvme
drives. There's no reason to do this inline. This has been in my tree a while,
but IIRC I talked to Jim Harris about this at one of our face to face meetings.
MFC After: 2 weeks
Instead of first detaching the children and then deleting them,
use the device_delete_children function, which does all of that.
MFC after: 1 month
Suggested by: ian
The driver used to always add the mmc device as its child even
if no card was detected. Add a function that detects whether the
card is present and attaches or detaches the mmc device accordingly.
The function is called either on attach (as the interrupt won't have
fired) or from one of two taskqueues. The first taskqueue calls the
function directly when the sdcard was present and is now removed; the other
delays the attach a bit when we didn't have a card and now have one.
This is mostly based on comments in the sdhci driver, which describe
a situation where the CD pin is detected before the other pins are connected.
MFC after: 1 month
This method will disable the regulators and clocks and assert the reset of
the module. It will also detach its children (the mmc device) and release
its resources.
While here enable the regulators on attach as we need them to power up
the sdcard or emmc.
MFC after: 1 month
The children of the bus need to do IO on the bus to probe for hardware
presence. Doing IO means timing the bus states using sbinuptime(), and
that requires working timecounters, which are not initialized until after
device attachment has completed.
PR: 242526
bus_get/set_resource methods are implemented in child device (iicbus).
As their implementation with bus_generic_rl_get/set calls do not
recurse up the tree, the versions in ig4 are never called.
Suggested by: jhb
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.
This change merely adds the structure and updates references to atomic
state fields. No functional change intended.
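The snapshot-and-cmpset pattern mentioned above can be sketched roughly as follows. This is only a toy model: the field layout, the names, and the lock-based AtomicWord stand-in are all assumptions made for illustration and do not reflect the kernel's actual definitions.

```python
import threading

def pack(flags, queue, act_count):
    """Pack three fields into one 32-bit word (layout is hypothetical)."""
    return (flags & 0xFFFF) | ((queue & 0xFF) << 16) | ((act_count & 0xFF) << 24)

def unpack(word):
    """Split a 32-bit word back into (flags, queue, act_count)."""
    return word & 0xFFFF, (word >> 16) & 0xFF, (word >> 24) & 0xFF

class AtomicWord:
    """Stand-in for a 32-bit atomic; the lock only emulates atomicity."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def cmpset(self, old, new):
        """Store new only if the current value still equals old."""
        with self._lock:
            if self._value != old:
                return False
            self._value = new
            return True

def set_queue(state, new_queue):
    """Update only the queue field using a snapshot-and-cmpset loop."""
    while True:
        old = state.load()                  # snapshot in a local variable
        flags, _, act_count = unpack(old)
        if state.cmpset(old, pack(flags, new_queue, act_count)):
            return                          # nobody raced with us
```

Because the whole state fits in one word, a failed cmpset simply means another updater got there first, and the loop retries against a fresh snapshot.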
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
TX_PKTS2 is more efficient within the firmware and this improves netmap
Tx by a few Mpps in some common scenarios.
MFC after: 1 week
Sponsored by: Chelsio Communications
These were obtained from the Chelsio Unified Wire v3.12.0.1 beta
release.
Note that the firmwares are not uuencoded any more.
MFH: 1 month
Sponsored by: Chelsio Communications
The datasheets for these chips claim the maximum is 921,600, but testing
shows these two higher rates also work (but no rates above 921,600 other
than these two work; these represent dividing the base baud clock by 3 and 2
respectively).
As we do for many other laptops, put the headphone jack and speakers in
the same association by default so that the generic sound device
automatically switches between them.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
of the sensor hardware. Part of the polling process involves signalling
the chip then waiting 20 milliseconds. This was being done with DELAY(),
which is a pretty rude thing to do in a callout. Now a taskqueue_thread
task is scheduled to do the polling, and because sleeping is allowed in
the task context, pause_sbt() replaces DELAY() for the 20ms wait.
This change enables the use of OpenFirmware Console (ofwcons), even when VGA is
available, allowing early kernel messages to be seen, which is important in case
of crashes before VGA console initialization.
This is especially useful in virtualized environments, where the user/developer
doesn't have full control of the virtualization engine (e.g. OpenStack).
The old behavior is preserved by default and, in order to use ofwcons, a few
tunables that have been introduced need to be set:
- hw.ofwfb.disable=1 - disable OFW FrameBuffer device
- machdep.ofw.mtx_spin=1 - change PPC OFW mutex to SPIN type, to match kernel
  console's mutex type
- debug.quiesce_ofw=0 - don't call OFW quiesce, needed to keep ofwcons I/O
  working
More details can be found at differential revision D20640.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20640
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
changed the sysctl format for the temperature from "I" to "IK", and
correspondingly changed the units from integer degrees C to decikelvin.
For access via sysctl(8) the output will be the same except that now
decimal fractions will be shown when available.
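As a rough illustration of the unit change, the conversion between tenths of a degree Celsius and decikelvin can be sketched as below. The function names and the truncated integer offset are assumptions made for this sketch, not taken from the driver.

```python
# 0 degrees C is 273.15 K, i.e. 2731.5 dK. This sketch truncates the offset
# to an integer (an assumption, not necessarily what the driver does).
ZERO_C_IN_DK = 2731

def tenths_c_to_dk(tenths_c):
    """Convert tenths of a degree Celsius to decikelvin."""
    return tenths_c + ZERO_C_IN_DK

def dk_to_c(dk):
    """Convert decikelvin back to degrees Celsius, keeping one decimal."""
    return (dk - ZERO_C_IN_DK) / 10
```

For example, a reading of 23.5 C would be passed as 235 tenths and stored as 2966 dK.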
Previously the driver supported the DHT11 sensor. Now it supports
DHT11, DHT12, DHT21, DHT22, AM3201, AM3202.
All these chips are similar, differing primarily in supported temperature
and humidity ranges and accuracy (and, presumably, cost). There are two
basic data formats reported by the various chips, and it is possible to
figure out at runtime which format to use for decoding the data based on
the range of values in a single byte of the humidity measurement (which
is detailed in a comment block, so I won't recapitulate it here).
functions to handle the sysctls, they all just access simple readonly
integer variables. There's no need to track the oids of the ones we add,
since the teardown is done by newbus code, not the driver itself.
Also remove the DDB code, because it just provides access to the same data
that the sysctls already provide.
At the end of a read cycle, set the gpio pin to INPUT rather than OUTPUT.
The state of the single-wire "bus" when idle should be high; setting the
pin to input allows the external pullup to pull the line high. Setting it
to output (and leaving it driving low) was leading to a good read cycle followed
by one that would fail, and it just continued like that forever, effectively
reading the sensor once every 10 seconds instead of 5.
In the attach function, do an initial read from the device before registering
the sysctls for accessing the last-read values, to prevent reading spurious
values for the first 5 seconds after the driver attaches.
Do a callout_drain() in the detach function to prevent crashes after
unloading the module.
of gpio devices by using kenv to add hints for a new device and then do
'devctl rescan gpiobus4' to make the new device(s) attach.
It's not particularly easy to detect whether the 'at' hint has been deleted
for a child device that's currently attached, so this doesn't handle that.
But the user can use devctl commands to manually detach an existing device.
Don't needlessly pass around qpair pointers when the tracker knows what
qpair it's on. This will simplify code and make it easier to split
submission and completion queues in the future.
Signed-off-by: John Meneghini <johnm@netapp.com>
Uses two GPIO pins as MDC (clock) and MDIO (bidirectional I/O), relies
on mii_bitbang.
Tested on SG-3200 where the PHY for one of the ports is wired independently
of the SoC MDIO bus.
Sponsored by: Rubicon Communications, LLC (Netgate)
ConnectX-6 DX.
Currently TLS v1.2 and v1.3 with AES 128/256 crypto over TCP/IP (v4
and v6) is supported.
A per PCI device UMA zone is used to manage the memory of the send
tags. To optimize performance some crypto contexts may be cached by
the UMA zone, until the UMA zone finishes the memory of the given send
tag.
An asynchronous task is used to manage setup of the send tags towards the
firmware. Most importantly setting the AES 128/256 bit pre-shared keys
for the crypto context.
Updating the state of the AES crypto engine and encrypting data, is
all done in the fast path. Each send tag tracks the TCP sequence
number in order to detect non-contiguous blocks of data, which may
require a dump of prior unencrypted data, to restore the crypto state
prior to wire transmission.
Statistics counters have been added to count the amount of TLS data
transmitted in total, and the amount of TLS data which has been dumped
prior to transmission. When non-contiguous TCP sequence numbers are
detected, the software needs to dump the beginning of the current TLS
record up until the point of retransmission. All TLS counters utilize
the counter(9) API.
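The sequence tracking described above can be sketched roughly as follows. The class and field names are hypothetical, and the sketch only models the bookkeeping (contiguity check and the two byte counters), not the driver or firmware interface.

```python
class SendTagTracker:
    """Toy model of a send tag that tracks the next expected TCP sequence."""

    def __init__(self, start_seq):
        self.expected_seq = start_seq & 0xFFFFFFFF
        self.tls_bytes = 0      # total TLS data handed to the wire
        self.dumped_bytes = 0   # bytes replayed only to rebuild crypto state

    def transmit(self, seq, length, record_start_seq):
        """Account for one segment and return the number of bytes to dump.

        record_start_seq is the TCP sequence number at which the current
        TLS record begins.
        """
        dump = 0
        if seq != self.expected_seq:
            # Non-contiguous data: replay the TLS record from its start up
            # to the retransmission point (32-bit wraparound arithmetic).
            dump = (seq - record_start_seq) & 0xFFFFFFFF
            self.dumped_bytes += dump
        self.tls_bytes += length
        self.expected_seq = (seq + length) & 0xFFFFFFFF
        return dump
```

In-order segments cost nothing extra; only a segment whose sequence number differs from the expected one triggers a dump of the record prefix.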
In order to enable hardware TLS offload the following sysctls must be set:
kern.ipc.mb_use_ext_pgs=1
kern.ipc.tls.ifnet.permitted=1
kern.ipc.tls.enable=1
Sponsored by: Mellanox Technologies
The hardware offload is primarily targeted for TLS v1.2 and v1.3,
using AES 128/256 bit pre-shared keys. This patch adds all the needed
hardware structures, capabilities and firmware commands.
Sponsored by: Mellanox Technologies
o Remove All Rights Reserved from my notices
o imp@FreeBSD.org everywhere
o regularize punctuation, eliminate date ranges
o Make sure that it's clear that I don't claim All Rights reserved by listing
All Rights Reserved on same line as other copyright holders (but not
me). Other such holders are also listed last where it's clear.
With the ratification of the Berne Convention in 2000, it became obsolete.
I have removed that phrase and the "(c)" only from files without copyright
claims by other parties. There are 2 files (pci.c, pci_private.h) that are
also claimed by Michael Smith <msmith@freebsd.org> and by BSDi, which have
therefore not been included in this commit.
When all member nations of the Buenos Aires Convention adopted the Berne
Convention, the phrase "All rights reserved" became unnecessary to assert
copyright. Remove it from files under my copyright.
There are 2 files (pci.c, pci_private.h) that also bear msmith's
and BSDi's copyright. I have left them unchanged for now, since I do not
know whether they (or the legal successor in case of BSDi) would agree.
If we boot with hw.ncpu=X (available on arm and arm64 at least) we
shouldn't attach the cpufreq driver as cf_set_method will try to get
the cpuid and it doesn't exist.
This solves cpufreq panicking on RockChip RK3399 when booting with
hw.ncpu=4
MFC after: 1 week
This was purely automatically massaged... some parts are still imperfect,
but this is close enough to make it more readable/easy to work on.
Unfortunately the vt/syscons/kdb situation slightly complicates changes to
tty locking, so some work will need to be done to remediate that.
tightening constraints on busy as a precursor to lockless page lookup and
should largely be a NOP for these cases.
Reviewed by: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22611
struct gpio_pin. It turns out these two sets of flags are completely
unrelated to each other.
Also, update the comment for GPIO_ACTIVE_LOW to reflect the fact that it
does get set, somewhat unobviously, by code that parses FDT data. The bits
from the FDT cell containing flags are just copied to gpiobus_pin.flags, so
there's never any obvious reference to the symbol GPIO_ACTIVE_LOW being
stored into the flags field.
FDT bindings document for gpio-i2c devices.
Using the gpio_pin_* functions to acquire/release/manipulate gpio pins
removes the constraint that both gpio pins must belong to the same gpio
controller/bank, and that the gpioiic instance must be a child of gpiobus.
Removing those constraints allows the driver to be fully compatible with
the modern dts bindings for a gpio bitbanged i2c bus.
For hinted attachment, the two gpio pins still must be on the same gpiobus,
and the device instance must be a child of that bus. This preserves
compatibility for existing installations that use gpioiic(4) with hints.
that they can be used by drivers on non-FDT-configured systems. Only the
functions related to acquiring pins by parsing FDT data remain in
ofw_gpiobus. Also, add two new functions for acquiring gpio pins based on
child device_t and index, or on the bus device_t and pin number. And
finally, defer reserving pins for gpiobus children until they acquire the
pin, rather than reserving them as soon as the child is added (before it's
even known whether the child will attach).
This will allow drivers configured with hints (or any other mechanism) to
use the same code as drivers configured via FDT data. Until now, a hinted
driver and an FDT driver had to be two completely different sets of code,
because hinted drivers could only use gpiobus calls to manipulate pins,
while fdt-configured drivers could not use that API (due to not always being
children of the bus that owns the pins) and had to use the newer
gpio_pin_xxxx() functions. Now drivers can be written in the more
traditional form, where most of the code is shared and only the resource
acquisition code at attachment time changes.
As part of my journey to make it easy to determine what's relying on tty
bits, remove a couple more. Some of these just outright didn't need it,
while others did rely on <sys/tty.h> pollution for mutex headers.
improvements, the ECN bits need to be exposed to the TCP SYNcache.
This change is a minimal modification to the function headers, without any
functional change intended.
Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22436
in ofw_gpiobus_probe() return BUS_PROBE_DEFAULT rather than 0; we are not
the only possible driver to handle this device, we're just slightly better
than the base gpiobus (which probes at BUS_PROBE_GENERIC).
In the time since this code was first written, the gpio controller bindings
acquired the concept of a "hog" node which could be used to preset one or
more gpio pins as input or output at a specified level. This change doesn't
fully implement the hogging concept, it just filters out hog nodes when
instantiating child devices by scanning for child nodes in the fdt data.
The whole concept of having child nodes under the controller node is not
supported by the standard bindings, and appears to be a freebsd extension,
probably left over from the days when we had no support for cross-tree
phandle references in the fdt data.
I have no good explanation why it happens, but I found that in B2B mode
at least Xeon v4 NTB leaks accesses to its configuration memory at BAR0
originated from the link side to its host side. DMAR predictably blocks
those, making access to remote scratchpad registers in B2B mode impossible.
This change creates identity mapping in DMAR covering the BAR0 addresses,
making the NTB work fine with DMAR enabled. It seems like allowing single
4KB range at 32KB offset may be enough, but I don't see a reason to be so
specific.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
This was inherited from iwlwifi, which drives devices supported by both
iwn(4) and iwm(4) in FreeBSD. In iwm(4) _mvm is meaningless, so remove
it. OpenBSD made the same change a long time ago. No functional change
intended.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
should try in order to link up with the peer.
Various FEC variables within the driver can now have multiple bits set
instead of being powers of 2. 0 and -1 in the user knobs still mean no
FEC and auto (driver decides) respectively for backward compatibility,
but no-FEC and auto now have their own bits in the internal
representation. There is a new bit that can be set to request the FEC
recommended by the cable/transceiver module.
Add sysctls to display link related capabilities of the local side as
well as the link partner.
Note that all this needs a new firmware and the documentation for the
driver FEC knobs will be updated after that firmware is added to the
driver.
MFC after: 1 week
Sponsored by: Chelsio Communications
Also, Giant isn't required to busy / unbusy a device, so drop that too while I'm
here. It's not done elsewhere in the tree and in the future will likely be
handled by a node lock to ensure consistency. Leave Giant in place for attach
and removing children, as that's actually still needed, even if imperfect.
Remove stale comment about contigmalloc taking Giant and calling w/o the lock
held. Neither of these is still true.
Move the locking back into the ioctl handler. This "fixes" the race where we have
a hot plug event just after the dropping of Giant in pci_find_dbsf, assuming the
driver doesn't then call anything that drops and picks up Giant again... It's a
little safer since I don't think it does, but we lack the tools to know for
sure.
When we get a device departed message from the firmware, we send a TARGET_RESET
to the device to let the firmware know we're done and as part of the recovery
process. This will abort all the commands. While the documentation says the IOC
is responsible for writing the completion message for all the commands pending
with an aborted status, we sometimes have queued commands for the target that
haven't been completed so are in the INQUEUE state. So, when we later complete
the pending CCB as aborted, these commands are freed and we hit the "state not
busy" panic.
Elsewhere where we dequeue commands, we move the state to BUSY from INQUEUE. Do
that here as well. In talking to Ken, Scott and Justin, they recommended a
series of tests to see if this is 100% safe. Those tests are ongoing, but
preliminary tests suggest this is safe as we see no duplicate completions when
we hit this case at work. We have a machine that has a dodgy power supply which
usually doesn't apply power to a few drives, but sometimes does when the machine
is under heavy load so we get a rash of the connect / disconnect messages over
half an hour. Without this change, we'd see state not busy panic. With this
change, the drives just annoyingly come and go without affecting the rest of the
machine, but without a complete error injection test suite, it's hard to know if
all edge cases are now covered or not.
Discussed with: scottl, ken, gibbs
This allows the driver to be updated for the next firmware without
waiting for it to be released.
MFC after: 2 weeks
Sponsored by: Chelsio Communications
The /dev/pci device doesn't need GIANT, per se. However, one routine
that it calls, pci_find_dbsf implicitly does. It walks a list that can
change when PCI scans a new bus. With hotplug, this means we could
have a race with that scanning. To prevent that, take out Giant around
scanning the list.
However, given that we have places in the tree that drop giant, if
held when we call into them, the whole use of Giant to protect newbus
may be less effective than we desire, so add a comment about why we're
taking it out, and we'll address the issue when we lock newbus with
something other than Giant.
The internal datastructures do not need to be visible outside of
random_harvestq, and this helps ensure they are not misused.
No functional change.
Approved by: csprng(delphij, markm)
Differential Revision: https://reviews.freebsd.org/D22485
There's no need to dynamically populate them; the SYSCTL_ macros take care
of load/unload appropriately already (and random_harvestq is 'standard' and
cannot be unloaded anyway).
Approved by: csprng(delphij, markm)
Differential Revision: https://reviews.freebsd.org/D22484
Break random_harvestq_prime up into some logical subroutines. The goal
is that it becomes easier to add other early entropy sources.
While here, drop pre-12.0 compatibility logic. loader default configuration
should preload the file as expected since 12.0.
Approved by: csprng(delphij, markm)
Differential Revision: https://reviews.freebsd.org/D22482
On x86 platforms with the intrinsic, rdrand is a deterministic bit generator
(AES-CTR) seeded from an entropic source. On x86 platforms with rdseed, it
is something closer to the upstream entropic source. (There is more nuance;
a block diagram is provided in [1].)
On devices with rdrand and without rdseed, there is no good intrinsic for
accessing the good entropic source directly. However, the DRBG is guaranteed
to reseed every 8 kB on these platforms. As a conservative option, on such
hardware we can read an extra 7.99kB samples every time we want a sample
from an independent seed.
As one can imagine, this drastically slows the effective read rate of
RDRAND (a factor of 1024 on amd64 and 2048 on ia32). Microbenchmarks on AMD
Zen (has RDSEED) show an RDRAND rate of 25 MB/s and Intel Haswell (no
RDSEED) show RDRAND of 170 MB/s. This would reduce the read rate on Haswell
to ~170 kB/s (at 100% CPU). random(4)'s harvestq thread periodically
"feeds" from pure sources in amounts of 128-1024 bytes. On Haswell,
enabling this feature increases the CPU time of RDRAND in each "feed" from
approximately 0.7-6 µs to 0.7-6 ms.
Because there is some performance penalty to this more conservative option,
a knob is provided to enable the change. The change does not affect
platforms with RDSEED.
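The slowdown figures quoted above can be reproduced with a short back-of-the-envelope sketch. It assumes that "8 kB" means 8192 bytes, that "MB" means 10^6 bytes, and that one RDRAND sample is 8 bytes on amd64 and 4 bytes on ia32; the function names are purely illustrative.

```python
RESEED_INTERVAL = 8 * 1024   # bytes between guaranteed DRBG reseeds

def slowdown_factor(word_size):
    """Reads needed per independently seeded sample of word_size bytes."""
    return RESEED_INTERVAL // word_size

def feed_time_us(nbytes, rate_mb_per_s, factor=1):
    """Microseconds needed to gather nbytes at rate_mb_per_s (10^6 B/s)."""
    return nbytes * factor / rate_mb_per_s

if __name__ == "__main__":
    factor = slowdown_factor(8)                      # amd64
    print("amd64 factor:", factor)
    print("effective kB/s:", 170e6 / factor / 1e3)   # Haswell, 170 MB/s
    for n in (128, 1024):
        print(n, "bytes:", feed_time_us(n, 170), "us ->",
              feed_time_us(n, 170, factor) / 1000, "ms")
```

Under these assumptions a 128-1024 byte feed on Haswell goes from roughly 0.75-6 microseconds to roughly 0.77-6.2 milliseconds, in line with the estimate in the text.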
[1]: https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide#inpage-nav-4-2
Approved by: csprng(delphij, markm)
Differential Revision: https://reviews.freebsd.org/D22455
This adds support for ifnet (NIC) KTLS using Chelsio T6 adapters.
Unlike the TOE-based KTLS in r353328, NIC TLS works with non-TOE
connections.
NIC KTLS on T6 is not able to use the normal TSO (LSO) path to segment
the encrypted TLS frames output by the crypto engine. Instead, the
TOE is placed into a special setup to permit "dummy" connections to be
associated with regular sockets using KTLS. This permits using the
TOE to segment the encrypted TLS records. However, this approach does
have some limitations:
1) Regular TOE sockets cannot be used when the TOE is in this special
mode. One can use either TOE and TOE-based KTLS or NIC KTLS, but
not both at the same time.
2) In NIC KTLS mode, the TOE is only able to accept a per-connection
timestamp offset that varies in the upper 4 bits. Put another way,
only connections whose timestamp offset has the 28 lower bits
cleared can use NIC KTLS and generate correct timestamps. The
driver will refuse to enable NIC KTLS on connections with a
timestamp offset with any of the lower 28 bits set. To use NIC
KTLS, users can either disable TCP timestamps by setting the
net.inet.tcp.rfc1323 sysctl to 0, or apply a local patch to the
tcp_new_ts_offset() function to clear the lower 28 bits of the
generated offset.
3) Because the TCP segmentation relies on fields mirrored in a TCB in
the TOE, not all fields in a TCP packet can be sent in the TCP
segments generated from a TLS record. Specifically, for packets
containing TCP options other than timestamps, the driver will
inject an "empty" TCP packet holding the requested options (e.g. a
SACK scoreboard) along with the segments from the TLS record.
These empty TCP packets are counted by the
dev.cc.N.txq.M.kern_tls_options sysctls.
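The timestamp-offset restriction in item 2 amounts to a simple mask test. The following is a minimal sketch of that check and of the kind of masking a local patch might apply; the function names are hypothetical and the code is not taken from the driver or from tcp_new_ts_offset().

```python
LOW28_MASK = (1 << 28) - 1   # lower 28 bits of the 32-bit timestamp offset

def ts_offset_allows_nic_ktls(offset):
    """True if only the upper 4 bits of the 32-bit offset may be non-zero."""
    return (offset & LOW28_MASK) == 0

def clear_low28(offset):
    """Return the 32-bit offset with its lower 28 bits cleared."""
    return offset & 0xFFFFFFFF & ~LOW28_MASK
```

An offset such as 0xA0000000 passes the check, while 0xA0000001 does not until its low bits are cleared.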
Unlike TOE TLS which is able to buffer encrypted TLS records in
on-card memory to handle retransmits, NIC KTLS must re-encrypt TLS
records for retransmit requests as well as non-retransmit requests
that do not include the start of a TLS record but do include the
trailer. The T6 NIC KTLS code tries to optimize some of the cases for
requests to transmit partial TLS records. In particular it attempts
to minimize sending "waste" bytes that have to be given as input to
the crypto engine but are not needed on the wire to satisfy mbufs sent
from the TCP stack down to the driver.
TCP packets for TLS requests are broken down into the following
classes (with associated counters):
- Mbufs that send an entire TLS record in full do not have any waste
bytes (dev.cc.N.txq.M.kern_tls_full).
- Mbufs that send a short TLS record that ends before the end of the
trailer (dev.cc.N.txq.M.kern_tls_short). For sockets using AES-CBC,
the encryption must always start at the beginning, so if the mbuf
starts at an offset into the TLS record, the offset bytes will be
"waste" bytes. For sockets using AES-GCM, the encryption can start
at the 16 byte block before the starting offset capping the waste at
15 bytes.
- Mbufs that send a partial TLS record that has a non-zero starting
offset but ends at the end of the trailer
(dev.cc.N.txq.M.kern_tls_partial). In order to compute the
authentication hash stored in the trailer, the entire TLS record
must be sent as input to the crypto engine, so the bytes before the
offset are always "waste" bytes.
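The three classes and their waste bytes can be sketched as below. This is a simplified model, assuming a request covers bytes [start, end) of a record of record_len bytes and that AES blocks fall on multiples of 16 from the same origin as the offset; the function names and string labels are hypothetical.

```python
def classify(start, end, record_len):
    """Classify a request covering bytes [start, end) of a TLS record."""
    if start == 0 and end == record_len:
        return "full"
    if end < record_len:
        return "short"
    return "partial"   # non-zero start, ends at the end of the trailer

def waste_bytes(start, end, record_len, cipher):
    """Bytes fed to the crypto engine that are not needed on the wire."""
    kind = classify(start, end, record_len)
    if kind == "full":
        return 0
    if kind == "short" and cipher == "gcm":
        # Encryption can begin at the enclosing 16-byte block, so the
        # waste is at most 15 bytes.
        return start % 16
    # CBC short records and all partial records are fed from the start.
    return start
```

For instance, a short request starting at offset 40 wastes 40 bytes with AES-CBC but only 8 with AES-GCM in this model.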
In addition, other per-txq sysctls are provided:
- dev.cc.N.txq.M.kern_tls_cbc: Count of sockets sent via this txq
using AES-CBC.
- dev.cc.N.txq.M.kern_tls_gcm: Count of sockets sent via this txq
using AES-GCM.
- dev.cc.N.txq.M.kern_tls_fin: Count of empty FIN-only packets sent to
compensate for the TOE engine not being able to set FIN on the last
segment of a TLS record if the TLS record mbuf had FIN set.
- dev.cc.N.txq.M.kern_tls_records: Count of TLS records sent via this
txq including full, short, and partial records.
- dev.cc.N.txq.M.kern_tls_octets: Count of non-waste bytes (TLS header
and payload) sent for TLS record requests.
- dev.cc.N.txq.M.kern_tls_waste: Count of waste bytes sent for TLS
record requests.
To enable NIC KTLS with T6, set the following tunables prior to
loading the cxgbe(4) driver:
hw.cxgbe.config_file=kern_tls
hw.cxgbe.kern_tls=1
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21962
than effectively doing scatter/gather IO with a pair of iic_msgs that direct
the controller to do a single transfer with no bus STOP/START between the
two buffers. It turns out we have multiple i2c hardware drivers that don't
honor the NOSTOP and NOSTART flags; sometimes they just try to do the
transfers anyway, creating confusing failures or leading to corrupted data.
It is clearer to me to return success/error (true/false) instead of some
retry count linked to the inline assembly implementation.
No functional change.
Approved by: core(csprng) => csprng(markm)
Differential Revision: https://reviews.freebsd.org/D22454
A SIM-private field is used for that.
The pointer can be useful when examining a state of a queued ccb.
E.g., a ccb on a da_softc.pending_ccbs.
MFC after: 2 weeks
PLX NTB sends translated DMA requests not only from itself, but from all
slots and functions of its bus. By default DMAR blocks those additional
requests.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Use the KPI to tweak MSRs in mitigation code.
Reviewed by: markj, scottl
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22431
This CVE has already been announced in FreeBSD SA-19:26.mcu.
Mitigation for TAA involves either turning off TSX or turning on the
VERW mitigation used for MDS. Some CPUs will also be self-mitigating
for TAA and require no software workaround.
Control knobs are:
machdep.mitigations.taa.enable:
    0 - no software mitigation is enabled
    1 - attempt to disable TSX
    2 - use the VERW mitigation
    3 - automatically select the mitigation based on processor
        features.
machdep.mitigations.taa.state:
    inactive       - no mitigation is active/enabled
    TSX disable    - TSX is disabled in the bare metal CPU as well as
                     any virtualized CPUs
    VERW           - VERW instruction clears CPU buffers
    not vulnerable - The CPU has identified itself as not being
                     vulnerable
Nothing in the base FreeBSD system uses TSX. However, the instructions
are straight-forward to add to custom applications and require no kernel
support, so the mitigation is provided for users with untrusted
applications and tenants.
Reviewed by: emaste, imp, kib, scottph
Sponsored by: Intel
Differential Revision: 22374
I've noticed that sometimes, with DMAR enabled, the initial write from the
device to this address is somehow delayed, triggering an assertion because
the zero default is invalid.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
- Deduce allowed address range for bus_dma(9) from the hardware version.
Different versions (CPU generations) have different documented limits.
- Remove the difference between address ranges for src/dst and crc. At least
the docs for a few recent generations of CPUs do not mention anything like
that, while older ones are already limited by the limits above.
- Remove address assertions from arguments. While I do not think
addresses outside the allowed ranges should realistically happen there due to
the platform's physical address limitations, there is now bus_dma(9) to
make sure of that, preferably via IOMMU.
- Since crc now has the same address range as src/dst, remove crc_dmamap,
reusing dst2_dmamap instead.
Discussed with: cem
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This driver allows the use of the paravirt SCSI controller
in VMware products like ESXi. The pvscsi driver provides a
substantial performance improvement in block devices versus
the emulated mpt and mps SCSI/SAS controllers.
Error handling in this driver has not been extensively tested
yet.
Submitted by: vbhakta@vmware.com
Relnotes: yes
Sponsored by: VMware, Panzura
Differential Revision: D18613
ccr(4) and TLS support in cxgbe(4) construct key contexts used by the
crypto engine in the T6. This consolidates some duplicated code for
helper functions used to build key contexts.
Reviewed by: np
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D22156
Disable the use of executable 2M page mappings in EPT-format page
tables on affected CPUs. For bhyve virtual machines, this effectively
disables all use of superpage mappings on affected CPUs. The
vm.pmap.allow_2m_x_ept sysctl can be set to override the default and
enable mappings on affected CPUs.
Alternate approaches have been suggested, but at present we do not
believe the complexity is warranted for typical bhyve use cases.
Reviewed by: alc, emaste, markj, scottl
Security: CVE-2018-12207
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21884
struct nvdimm_label_index is dynamically sized, with the `free`
bitfield expanding to hold `slot_cnt` entries. Fix a few places
where we were treating the struct as though it had a fixed size.
Reviewed by: cem
Approved by: scottl (mentor)
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D22253
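To illustrate why a fixed sizeof() is wrong for such a structure, here is a minimal sketch assuming a simplified, hypothetical layout (struct label_index is a stand-in, not the real struct nvdimm_label_index, and the on-media format may add padding rules not modelled here). The size is the fixed header plus one bit per slot, rounded up to whole bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for struct nvdimm_label_index. */
struct label_index {
	uint32_t slot_cnt;	/* number of label slots */
	uint8_t  free[];	/* one bit per slot, sized at run time */
};

/* Bytes needed for a bitfield of nbits bits, rounded up. */
static size_t
bitfield_bytes(uint32_t nbits)
{
	return (((size_t)nbits + 7) / 8);
}

/* Fixed header plus the run-time-sized free bitfield. */
static size_t
label_index_size(uint32_t slot_cnt)
{
	return (offsetof(struct label_index, free) +
	    bitfield_bytes(slot_cnt));
}
```

Code that copies or checksums the index would use label_index_size(slot_cnt) wherever it previously used sizeof().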
Apply the same user accessible filter to namespaces as is applied
to full-SPA devices. Also, explicitly filter out control region
SPAs which don't expose the nvdimm data area.
Reviewed by: cem
Approved by: scottl (mentor)
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21987
Previously ntb_transport(4) required at least 6 scratchpad registers,
plus 2 more for each additional memory window. That is too much for some
configurations, where several drivers have to share resources of the same
NTB hardware. This patch introduces a new compact version of the protocol,
requiring only 3 scratchpad registers, plus one more for each additional
memory window. The optimization is based on the fact that none of the
version, number of windows or number of queue pairs really needs more than
one byte, and window sizes of 4GB are not very useful now. The new protocol
is activated automatically when the configuration is low on scratchpad
registers, or it can be activated explicitly with a loader tunable.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
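The register arithmetic above can be sketched as follows. The two counting functions follow directly from the numbers in the message; the bit layout in pack_params() is purely hypothetical and only shows that three one-byte fields fit in a single 32-bit scratchpad, it is not the protocol's actual encoding.

```c
#include <stdint.h>

/* Scratchpads needed by the original protocol for nmw >= 1 windows. */
static unsigned
spads_classic(unsigned nmw)
{
	return (6 + 2 * (nmw - 1));
}

/* Scratchpads needed by the compact protocol for nmw >= 1 windows. */
static unsigned
spads_compact(unsigned nmw)
{
	return (3 + (nmw - 1));
}

/*
 * Hypothetical layout: version in bits 0-7, number of windows in
 * bits 8-15, number of queue pairs in bits 16-23.
 */
static uint32_t
pack_params(uint8_t version, uint8_t nmw, uint8_t nqp)
{
	return ((uint32_t)version | ((uint32_t)nmw << 8) |
	    ((uint32_t)nqp << 16));
}

static uint8_t
unpack_version(uint32_t reg)
{
	return (reg & 0xff);
}

static uint8_t
unpack_nmw(uint32_t reg)
{
	return ((reg >> 8) & 0xff);
}

static uint8_t
unpack_nqp(uint32_t reg)
{
	return ((reg >> 16) & 0xff);
}
```

With three memory windows, for example, the original protocol needs 10 scratchpads while the compact one needs 5.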
When enabled, the Address Lookup Table (A-LUT) allows specifying a separate
translation for each 1/128th or 1/256th of the BAR2. Previously it was
used only to limit effective window size by blocking access through some
of the A-LUT elements. This change allows A-LUT elements to also point to
different memory locations, providing upper layers with several (up to 128)
independent memory windows. A-LUT hardware allows even more flexible
configurations than this, but the NTB KPI has no way to manage that now.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This kind of clock node represents a temporary placeholder for clocks
defined later in the boot process. Also, these are necessary to break
circular dependencies occasionally occurring in complex clock graphs.
MFC after: 3 weeks
TVSENSE may not be ready by the time t4_fw_initialize returns and the
firmware returns 0 if the driver asks for the Vdd before the sensor is
ready.
MFC after: 1 week
Sponsored by: Chelsio Communications
This is what iwlwifi seems to do, and the previous behaviour triggered
firmware panics during transmit on a 9560.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Though we don't otherwise use firmware's offload capabilities, we need
to set this flag when the MAC header's size isn't a multiple of four.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
- Configure the scheduler only for the management queue.
- Fix a bug when enabling the scheduler: the queues are specified using a
bitmask.
- Fix style in the area.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
This is the multiqueue receive code required for 9000-series chips.
Note that we still only configure a single RX queue for now. Multiqueue
support will require MSI-X configuration and a scheme for managing a
global pool of RX buffers.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
For now iwm only ever uses queue 0 and the management queue, but my 9560
raises a software error interrupt during initialization if this flag is
not set. iwlwifi sets it for all 7000- and 8000-series hardware, so we
might as well do it unconditionally.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The firmware for 9000-series and newer devices has a different receive
API which supports multiple queues.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Match such chips using the device ID. We should really be checking the
subdevice as well, since a smaller number of 9460 and 9560 devices
actually belong to a new series of devices and require different
firmware, but that will require some extra logic in iwm_attach().
Submitted by: lwhsu, Guo Wen Jun <blockk2000@gmail.com>
MFC after: 2 weeks
Convert existing device family checks to avoid assuming that the device
family is always one of IWM_DEVICE_FAMILY_7000 or _8000.
Submitted by: lwhsu, Guo Wen Jun <blockk2000@gmail.com>
MFC after: 2 weeks
Only perform the call when a qfull bit transitions. While here, avoid
assignments in declarations in iwm_mvm_rx_tx_cmd().
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
This ensures that the driver softc reflects device capabilities as early
as possible, for use by device initialization code that is conditional
on certain capabilities.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Also ensure that the htole* macros are applied correctly when specifying
the segment length and upper address bits. No functional change
intended (unless you use iwm(4) on a big-endian machine).
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
- amd_intr() does not account for the offset (0x200) in the counter
MSR address and ends up accessing invalid regions while reading
counter value after the 4th counter (0xC001000[8,9,..]) and
erroneously updates the counter values for counters [1-4].
- amd_intr() should only check core pmcs for interrupts since
other types of pmcs (L3,DF) cannot generate interrupts.
- fix pmc NMIs being ignored due to NMI latency on newer AMD processors
Note that this fixes a kernel panic due to GPFs accessing MSRs on
higher core count AMD CPUs (seen on both Rome 7502P and
Threadripper 2990WX 32-core CPUs)
Discussed with: markj
Submitted by: Shreyank Amartya
Differential Revision: https://reviews.freebsd.org/D21553
The Microchip LAN7430 is a PCIe 10/100/1000 Ethernet MAC with integrated
PHY, and the LAN7431 is a MAC with RGMII interface.
To be connected to the build after further testing and review.
Committing now so that changes like r354345 (adding a common
ETHER_IS_ZERO macro) will update this driver too.
Submitted by: Gerald ND Aryeetey <aryeeteygerald_rogers.com>
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20079
The pages stored in the ksyms object are not pageable. Moreover, this
obviates the need to set OBJ_NOSPLIT.
Reviewed by: alc, kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22229
This function will call the regnode_check_voltage method for a given regulator
and check if the desired voltage is reachable by it.
Also add a default method that checks the std_param, which should be enough
for most regulators, and add it as the method for axp*, rk805 and fixed regulators.
Reviewed by: mmel
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D22260
It is reported that those VFs share their RSS configuration with PF and,
thus, they cannot be configured independently.
Also:
- add missing opt_rss.h to if_ixv.c, otherwise RSS kernel option could
not be seen
- do not enable IXGBE_FEATURE_RSS on the older VFs
- set flowid / hash type to M_HASHTYPE_NONE or M_HASHTYPE_OPAQUE_HASH
(based on what the hardware reports) if IXGBE_FEATURE_RSS is not set
Reviewed by: nobody
MFC after: 4 weeks
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D21705
Some places in network code may need to verify that an ethernet address
is not the 'zero' address. Provide a standard macro ETHER_IS_ZERO for
this purpose, similar to the ETHER_IS_BROADCAST macro already available.
This patch also removes previous ETHER_IS_ZERO definitions in several
USB ethernet drivers, in favor of this centrally-located macro.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21240
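For illustration, a macro of this kind can be sketched as below. This is an approximation written for this note; the definition that actually lives in net/ethernet.h may differ in detail, and mac_is_usable() is a hypothetical caller.

```c
#include <stdint.h>

#define	ETHER_ADDR_LEN	6

/* Sketch: true when all six octets of the address are zero. */
#define	ETHER_IS_ZERO(addr) \
	(((addr)[0] | (addr)[1] | (addr)[2] | \
	  (addr)[3] | (addr)[4] | (addr)[5]) == 0x00)

/* Hypothetical caller: reject the all-zero address. */
static int
mac_is_usable(const uint8_t addr[ETHER_ADDR_LEN])
{
	return (!ETHER_IS_ZERO(addr));
}
```

ORing the octets together avoids a loop or a memcmp() against a zero buffer, which is why drivers tended to carry their own copies of this test.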
A couple of drivers and one place in if.c use ETH_ADDR_LEN, even though
net/ethernet.h provides an equivalent ETHER_ADDR_LEN definition.
Cleanup all of the locations which refer to ETH_ADDR_LEN to use the
standard ETHER_ADDR_LEN instead.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, jpaetzel@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21239
Handle error bits of INTR_STAT and TX_ABORT registers.
Move interrupt clearing from interrupt handler to polling loop to get
common execution path with polled mode.
Do not clear interrupts with reading of IG4_REG_CLR_INTR register as
interrupts, triggered during the period from reg_read(IG4_REG_INTR_STAT)
to reg_read(IG4_REG_CLR_INTR) will be missed.
Instead, read each IG4_REG_CLR_* register separately.
INTR_STAT register exposes more useful information than STA register does,
e.g. it exposes error and I2C bus STOP conditions. Make it a main source
of I2C transfer state.
In this mode DATA_CMD register reads and writes are performed in
TX/RX FIFO-sized bursts to increase I2C bus utilization.
That reduces read time from 60us to 30us per byte when the read data fits
into the RX FIFO buffer in FAST speed mode in my setup.
IC clock rates vary between different controller models so we have
to adjust timing registers in each case individually. Borrow interesting
constants and formulas from Intel specs, i2c-designware and lpss_intel
drivers and apply them to FreeBSD supported controller models.
Implement fetching of timing data via ACPI methods execution if available.
After recent ig4 changes cyapa driver can be attached before timers
initialization is completed. Start polling thread from config_intrhook
to avoid busy loops in that case.
as the driver is fully functional on a cold boot through utilization of
polled mode.
As a side effect, ig4 children probe and attach methods can be called
earlier in the boot sequence, so now it is up to the child drivers
to wait for a kernel initialization completion if it is required.
If controller is allocated with IIC_NOWAIT option ig4 enables polled mode
for a period of allocation that makes possible to start I2C transfers
from the contexts where sleeping is not allowed e.g. from ithreads or
callouts.
Currently ig4 internally depends on its own interrupts and uses mtx_sleep()
to wait for them. That means it cannot be used from any context where
sleeping is disallowed, e.g. on cold boot, from DDB/KDB, from other device
drivers' interrupt handlers and so on.
This change replaces sleeps with busy loops in cold boot and DDB cases.
Setting the IG4_REG_RX_TL register to 1 was actually generating an
interrupt after 2 bytes were available in the Rx fifo. We need to set the
register to 0 to get an interrupt for 1 byte already.
Obtained from: DragonflyBSD (02f0bf2)
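The threshold behaviour described above can be modelled in a few lines. This is a model of the semantics implied by the message (the interrupt asserts once the FIFO holds more bytes than the threshold value), not driver code; rx_full_asserted() is a hypothetical name.

```c
#include <stdbool.h>

/*
 * Model: the RX-full interrupt asserts when the receive FIFO holds
 * more than rx_tl bytes, i.e. at least rx_tl + 1.
 */
static bool
rx_full_asserted(unsigned fifo_level, unsigned rx_tl)
{
	return (fifo_level > rx_tl);
}
```

With rx_tl = 1 a single received byte does not raise the interrupt, whereas with rx_tl = 0 it does.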
Now io_lock is used as a condition variable to synchronize the active process
with the interrupt handler. It is not used for tasks other than waiting for
an interrupt and passing parameters to and from its handler.
Specs show no dedicated interrupt firing on disable of the controller.
Remove io lock acquisitions around set_controller() calls as they are
not needed anymore.
There is no need to read all controller's RX FIFO data to clear RX_FULL
bit in interrupt handler as interrupts are masked permanently since
previous commit.
This avoids possible interrupt storms, depending on the state of the I2C
controller before the driver attached.
During attaching this clears the interrupt mask.
Revert r338215 as this change makes it no-op.
Obtained from: DragonflyBSD (d7c8555)
Fail the attach on controller startup errors. For some reason the
Dell XPS 13 says there is an I2C controller, but the controller appears
to be permanently disabled and will refuse to enable.
Obtained from: DragonflyBSD (509820b)
They share common device driver code with different bus attachments
This commit starts a bunch of changes which have the following properties:
Reviewed by: imp (previous version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D22016
The valectl(4) program is used to manage vale(4) switches.
Add it to the system commands so that it can be used right away.
This program was previously called vale-ctl, and stored in
tools/tools/netmap
Reviewed by: hrs, bcr, lwhsu, kevans
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22146
In this release the netmap support was introduced.
Moreover, it is also now possible to use the LLQ mode of the driver on
the arm64 AWS instances (A1 type).
Differential Revision: https://reviews.freebsd.org/D21938
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
In NETMAP mode not all queues need to be allocated to NETMAP. Some of
them could be left to the kernel. Configuration is managed by the flags
nr_mode and nr_pending_mode provided per each NETMAP kring.
The ENA driver checks those flags and performs proper ring initialization.
Differential Revision: https://reviews.freebsd.org/D21937
Submitted by: Rafal Kozik <rk@semihalf.com>
Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Two new tables are added to ena_tx_buffer structure:
* netmap_map_seg stores DMA mapping structures,
* netmap_buf_idx stores buff indexes taken from the slots.
When Tx resources are being set, the new mapping structures are created
and netmap Tx rings are being reset.
When Tx resources are being released, used netmap bufs are unmapped from
DMA and then mapping structures are destroyed.
When a Tx interrupt occurs, ena_netmap_tx_irq is called.
The ena_netmap_txsync callback signals that there are new packets which
should be transmitted.
First, it fills ena_netmap_ctx. Then it performs two actions:
* ena_netmap_tx_frames moves packets from netmap ring to NIC,
* ena_netmap_tx_cleanup restores buffers from NIC and gives them back
to the userspace app.
0 is returned in case of Tx error that could be handled by the driver.
ena_netmap_tx_frames checks if there are packets ready for transmission.
Then, for each of them, ena_netmap_tx_frame is called. If an error occurs,
transmitting is stopped, but if the error was caused by the HW ring being
full, information about that is not propagated to the userspace app.
When all packets are ready, doorbell is written to NIC and netmap ring
state is updated.
Parsing of one packet is done by the ena_netmap_tx_frame function.
First, it checks if number of slots does not exceed NIC limit. Invalid
packets are being dropped and the error is propagated to the upper
layer. As each netmap buffer has equal size, which is typically greater
than 2KiB, there shouldn't be any packets which contain too many slots.
Then, the ena_com_tx_ctx structure is being filled. As netmap does not
support any hardware offloads, ena_com_tx_meta structure is set to zero.
After that, ena_netmap_map_slots maps all memory slots for DMA.
If the device works in the LLQ mode, the push header is determined
by checking if the header fits within the first slot.
If so, that portion of data is copied directly from the slot.
Otherwise, the data is copied to the intermediate buffer.
First slots are treated the same as the others, because DMA mapping
has no impact on LLQ mode. The index of each netmap buffer is taken from
the slot and stored in the netmap_buf_idx array. In case of a mapping error,
memory is unmapped and packets are put back to the netmap ring.
ena_netmap_tx_cleanup performs out of order cleanup of sent buffers.
First, req_id is taken and validated. As validate_tx_req_id from
ena.c is specific to kernel mbufs, another implementation is provided.
Each req_id is cleaned up by the ena_netmap_tx_clean_one function. Buffers
are unmapped from DMA and put back to the netmap ring. In the end,
the state of the netmap and NIC rings is updated.
Differential Revision: https://reviews.freebsd.org/D21936
Submitted by: Rafal Kozik <rk@semihalf.com>
Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Most of the code used for Rx ring initialization can be reused in NETMAP.
A reset of the NETMAP ring and a new alloc method were added. The driver
decides whether to use kernel mbufs or NETMAP slots based on the
IFCAP_NETMAP flag. This allows reuse of ena_refill_rx_bufs, which provides
proper handling of Rx out-of-order completion.
ena_netmap_alloc_rx_slot takes exactly the same arguments as
ena_alloc_rx_mbuf, but instead of allocating one mbuf it takes one slot
from the NETMAP ring. The proper netmap_ring is found based on the queue
id. As NETMAP provides the "partial opening" feature, not all of the rings
are available; an unused one points to an invalid ring. If there is an
available slot, it is taken from the ring. Its buffer is mapped to DMA and
its index is stored in ena_rx_buffer field in ena_rx_buffer structure. Then
ena_buf is filled with addresses and the ring state is updated.
Cleanup is handled by ena_netmap_free_rx_slot. It unmaps DMA and returns
buffer to ring. As we could not return more bufs than we have taken and
we should not override occupied slots, buf_index should be 0. It is
being checked by assertion.
ena_netmap_rxsync callback puts received packets back to NETMAP ring and
passes them to user space by updating ring pointers. First it fills
ena_netmap_ctx.
Then it performs two actions:
* ena_netmap_rx_frames moves received frames from NIC to NETMAP ring,
* ena_netmap_rx_cleanup fills NIC ring with slots released by userspace
app.
In case of Rx error that could be handled by NIC driver (for example by
performing reset) rx sync should return 0.
ena_netmap_rx_frames first checks if NETMAP ring is in consistent
state and then in the loop receives new frames. When all available
frames are taken nr_hwtail is updated.
Receiving one frame is handled by ena_netmap_rx_frame. If no error
occurs, each descriptor is loaded by the ena_netmap_rx_load_desc function.
If a packet takes more than one segment, the NS_MOREFRAG flag must be set
in all but the last slot. In case of a wrong req_id, the packet is removed
from the NETMAP ring. If the packet is received successfully, counters are
updated.
Refilling of the NIC ring is performed by the ena_netmap_rx_cleanup function.
It calculates the number of available slots and calls ena_refill_rx_bufs
with the proper number.
Differential Revision: https://reviews.freebsd.org/D21935
Submitted by: Rafal Kozik <rk@semihalf.com>
Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
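The multi-segment rule above (the fragment flag set in every slot except the last) can be sketched as follows. struct sketch_slot and SKETCH_NS_MOREFRAG are hypothetical stand-ins for netmap's slot structure and NS_MOREFRAG; the flag value is illustrative and the authoritative definition lives in net/netmap.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag value; see net/netmap.h for the real NS_MOREFRAG. */
#define	SKETCH_NS_MOREFRAG	0x0020

/* Hypothetical, reduced version of a netmap slot. */
struct sketch_slot {
	uint16_t len;
	uint16_t flags;
};

/*
 * Mark every slot of an nsegs-slot packet except the last one as
 * "more fragments follow"; the last slot has the flag cleared.
 */
static void
mark_fragments(struct sketch_slot *slots, size_t nsegs)
{
	size_t i;

	for (i = 0; i < nsegs; i++) {
		if (i + 1 < nsegs)
			slots[i].flags |= SKETCH_NS_MOREFRAG;
		else
			slots[i].flags &= (uint16_t)~SKETCH_NS_MOREFRAG;
	}
}
```

A single-segment packet therefore ends up with the flag cleared in its only slot.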
Mock implementation of NETMAP routines is located in ena_netmap.c/.h
files. All code is protected under the DEV_NETMAP macro. Makefile was
updated with files and flag.
As the ENA driver provides its own implementations of (un)likely, they must
be undefined before including NETMAP headers.
The ena_netmap_attach function is called at the end of NIC attach. It fills
a structure with NIC configuration and callbacks, then provides it to
netmap_attach. Similarly, netmap_detach is called during ena_detach.
Three callbacks are used.
nm_register is implemented by ena_netmap_reg. It is called when a user
space application opens or closes the NIC in NETMAP mode. The current action
is recognized based on the onoff parameter: true means on and false off. As
NIC rings need to be reconfigured, ena_down and ena_up are reused.
When user space application wants to receive new packets from NIC
nm_rxsync is called, and when there are new packets ready for Tx
nm_txsync is called.
Differential Revision: https://reviews.freebsd.org/D21934
Submitted by: Rafal Kozik <rk@semihalf.com>
Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Move Rx/Tx routines to separate file.
Some functions:
* ena_restore_device,
* ena_destroy_device,
* ena_up,
* ena_down,
* ena_refill_rx_bufs
could be reused in upcoming netmap code in the driver. To make it
possible, they were moved to ena.h header.
Differential Revision: https://reviews.freebsd.org/D21933
Submitted by: Rafal Kozik <rk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
When the ENA_FLAG_DEVICE_RUNNING flag is disabled, the AENQ handlers
aren't executed. To fix that, the watchdog timestamp should be updated
just before enabling the watchdog.
Timer service was always being enabled, even if the device wasn't up
before the reset. That shouldn't happen, as the timer service is being
executed only for working interface.
Differential Revision: https://reviews.freebsd.org/D21932
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
As pmap_change_attr() is public on arm64 since r351131, it can be
used on arm64 to map a memory range with the write-combining
attribute.
This requires the driver to use the generic VM_MEMATTR_WRITE_COMBINING flag
instead of the x86-specific PAT_WRITE_COMBINING.
Differential Revision: https://reviews.freebsd.org/D21931
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
First, SCL low timeout is set to 25 milliseconds by default as opposed
to 1 millisecond before. The new value is based on the SMBus
specification. The timeout can be changed on a per bus basis using
dev.iicbb.N.scl_low_timeout sysctl.
The driver uses DELAY to wait for high SCL up to 1 millisecond, then it
switches to pause_sbt(SBT_1MS) for the rest of the timeout.
While here I made a number of other changes. 'udelay' that's used for
timing clock and data signals is now calculated based on the requested
bus frequency (dev.iicbus.N.frequency) instead of being hardcoded to 10
microseconds. The calculations are done in such a fashion that the
default bus frequency of 100000 is converted to udelay of 10 us. This
is for backward compatibility. The actual frequency will be less than a
quarter (I think) of the requested frequency.
Also, I added detection of stuck low SCL in a few places. Previously,
the code would just carry on after the SCL low timeout and that might
potentially lead to misinterpreted bits.
Finally, I fixed several style issues near the code that I changed.
Many more are still remaining.
Tested by accessing HTU21 temperature and humidity sensor in this setup:
superio0: <Nuvoton NCT5104D/NCT6102D/NCT6106D (rev. B+)> at port 0x2e-0x2f on isa0
gpio1: <Nuvoton GPIO controller> at GPIO ldn 0x07 on superio0
pcib0: allocated type 4 (0x220-0x226) for rid 0 of gpio1
gpiobus1: <GPIO bus> on gpio1
gpioiic0: <GPIO I2C bit-banging driver> at pins 14-15 on gpiobus1
gpioiic0: SCL pin: 14, SDA pin: 15
iicbb0: <I2C bit-banging driver> on gpioiic0
iicbus0: <Philips I2C bus> on iicbb0 master-only
iic0: <I2C generic I/O> on iicbus0
Discussed with: ian, imp
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D22109
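As a rough illustration of the frequency-to-delay mapping described above, the sketch below uses one simple formula that satisfies the stated property (100000 Hz maps to 10 us). The driver's actual calculation is not given in the message and may differ; freq_to_udelay(), the zero-frequency fallback and the 1 us floor are assumptions made for this example.

```c
/*
 * Hypothetical conversion from a requested bus frequency in Hz to a
 * per-step delay in microseconds.
 */
static unsigned
freq_to_udelay(unsigned freq_hz)
{
	unsigned udelay;

	/* Assumption: fall back to the old hardcoded 10 us. */
	if (freq_hz == 0)
		return (10);
	udelay = 1000000 / freq_hz;
	/* Assumption: never go below 1 us. */
	return (udelay > 0 ? udelay : 1);
}
```

Since several such delays make up one bit time, the achieved bus frequency ends up well below the requested one, as the message notes.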
I limited potentially infinite timings by 960 us based on a footnote on
page 38 of Maxim Integrated Application Note 937, Book of iButton
Standards: "In order not to mask interrupt signalling by other devices
on the 1–Wire bus, tRSTL + tR should always be less than 960 us."
MFC after: 3 weeks
Previously we used the minimal value of 1 us and it was really tight.
Application Note 3829 has a table describing recommended t_rec values
for various bus voltages, temperature conditions and numbers of slave
devices. The new value decreases the maximum possible data rate from
16.3 Kbit/s to 13.3 Kbit/s, but it allows for up to four slaves on a
3.3V bus (under room temperature).
References:
- Maxim Integrated Application Note 3829
Determining the Recovery Time for Multiple-Slave 1-Wire(R) Networks
- Maxim Integrated Application Note 937
Book of iButton Standards
Discussed with: imp (D22108)
MFC after: 3 weeks
After r353292, netmap generic adapter on if_vlan interfaces panics on
asserting the NET_EPOCH. In more detail, this happens when
nm_os_generic_xmit_frame() is called, that is in the generic txsync
routine.
Fix the issue by entering the NET_EPOCH during the generic txsync.
We amortize the cost of entering/exiting over a whole batch of
transmissions.
PR: 241489
Reported by: Aleksandr Fedorov <aleksandr.fedorov@itglobal.com>
Previously the code used sbttous() before microseconds comparison in one
place, sbttons() and nanoseconds in another, division by SBT_1US and
microseconds in yet another.
Now the code consistently uses multiplication by SBT_1US to convert
microseconds to sbintime_t before comparing them with periods between
calls to sbinuptime(). This is fast, this is precise enough (below
0.03%) and the periods defined by the protocol cannot overflow.
Reviewed by: imp (D22108)
MFC after: 2 weeks
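The precision claim above can be checked with a small worked example. The type and constants below are written to mirror the sbintime_t definitions in sys/time.h as I understand them (32.32 fixed point); us_to_sbt_fast() and us_to_sbt_exact() are hypothetical helper names. Because SBT_1US truncates 4294.967... to 4294, the multiplication undershoots by about 0.0225%, which is within the 0.03% bound quoted.

```c
#include <stdint.h>

typedef int64_t sbintime_t;

/* One second and one microsecond in 32.32 fixed point. */
#define	SBT_1S	((sbintime_t)1 << 32)
#define	SBT_1US	(SBT_1S / 1000000)

/* Fast conversion: a single multiplication, slightly undershoots. */
static sbintime_t
us_to_sbt_fast(int64_t us)
{
	return (us * SBT_1US);
}

/* Reference conversion: multiply first, then divide. */
static sbintime_t
us_to_sbt_exact(int64_t us)
{
	return (us * SBT_1S / 1000000);
}
```

For the 960 us limit mentioned elsewhere in this log, the fast conversion yields 4122240 against a reference of 4123168, a shortfall of 928 units.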
The lock is used only for start / stop signaling.
It is used only for 'flags' field and the related condition variable.
This change is a follow-up to r354067, it was suggested by Warner in
D22107.
Suggested by: imp
MFC after: 1 week
This is similar to what is done around other calls that lead to
own_command_wait() that can sleep.
Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22107
Some controllers cannot preset future output value while the pin is in
input mode. This adds a fallback for those controllers. The new code
assumes that a controller reports an error in that case.
For example, all hardware supported by nctgpio behaves in that way.
This is a temporary measure. In the future we will use
GPIO_PIN_PRESET_LOW / GPIO_PIN_PRESET_HIGH to preset the output either
in hardware, if supported, or in software (e.g., in
gpiobus_pin_setflags).
While here, I extracted common functionality of gpioiic_set{sda,scl} and
gpioiic_get{sda,scl} to gpioiic_setpin and gpioiic_getpin respectively.
MFC after: 2 weeks
The object does not provide anonymous memory.
Reported by: kib
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22123
This method checks whether boot_on or always_on is set to 1 and, if so,
tries to enable the regulator.
The binding docs aren't clear on what to do, but Linux enables the regulator
if any of those properties is set, so we want to do the same.
The function first checks the status to see if the regulator is
already enabled, then gets the voltage to check that it is in an acceptable
range, and then enables it.
This will be called either from the regnode_init method (if it's needed by the platform)
or by a SYSINIT at SI_SUB_LAST.
Reviewed by: mmel
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22106
NIC KTLS will add a new TLS send tag type in cxgbe(4) that is a
distinct tag from a ratelimit tag. To support this, refactor
cxgbe_snd_tag to be a simple send tag with a type and convert the
existing ratelimit tag to a new cxgbe_rate_tag structure.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D22072
Previously the table was allocated on first use by TOE and the
ratelimit code. The forthcoming NIC KTLS code also uses this table.
Allocate it unconditionally during attach to simplify consumers.
Reviewed by: np
Differential Revision: https://reviews.freebsd.org/D22028
This change consists of two parts.
First, nctgpio now supports hardware access via an I/O port window if
it's configured by firmware. For instance, PC Engines firmware
v4.10.0.2 does that. This is faster than going through the Super I/O
configuration registers.
Second, nctgpio now caches values of bits that it controls. For
example, the driver does not need to access the hardware to determine if
a pin is an output or an input, or a state of an output. Also, the
driver makes use of the fact that the hardware preserves an output state
of a pin across a switch to the input mode and back.
With this change I am able to use the 1-Wire bus over nctgpio whereas
previously the driver introduced too much latency to be compliant with
the relatively strict protocol timings.
superio0: <Nuvoton NCT5104D/NCT6102D/NCT6106D (rev. B+)> at port 0x2e-0x2f on isa0
gpio1: <Nuvoton GPIO controller> at GPIO ldn 0x07 on superio0
pcib0: allocated type 4 (0x220-0x226) for rid 0 of gpio1
gpiobus1: <GPIO bus> on gpio1
owc0: <GPIO attached one-wire bus> at pin 4 on gpiobus1
ow0: <1 Wire Bus> on owc0
ow0: romid 28:b2:9e:45:92:10:02:34: no driver
ow_temp0: <Advanced One Wire Temperature> romid 28:b2:9e:45:92:10:02:34 on ow0
MFC after: 4 weeks
This driver seems to have a bug. The bug was carefully saved during
conversion. In the al_eth_mac_table_unicast_add() the argument 'addr',
which is the actual address is unused. So, the function is called as
many times as we have addresses, but with the exactly same argument
list. This doesn't make any sense, but was preserved.
- In em_msix_link(), properly handle IGB-class devices after the iflib(4)
conversion again by only setting EM_MSIX_LINK for the EM-class 82574
and by re-arming link interrupts unconditionally, i. e. not only in
case of spurious interrupts. This fixes the interface link state change
detection for the IGB-class. [1]
- In em_if_update_admin_status(), only re-arm the link state change
interrupt for 82574 and also only if such a device uses MSI-X, i. e.
takes advantage of autoclearing. In case of INTx and MSI as well as
for LEM- and IGB-class devices, re-arming isn't appropriate here and
setting EM_MSIX_LINK isn't either.
While at it, consistently take advantage of the hw variable.
PR: 236724 [1]
Differential Revision: https://reviews.freebsd.org/D21924
This patch is part of an effort to make bhyve networking (in particular TCP)
faster. The key strategy to enhance TCP throughput is to let the whole packet
datapath work with TSO/LRO packets (up to 64KB each), so that the per-packet
overhead is amortized over a large number of bytes.
This capability is supported in the guest by means of the vtnet(4) driver,
which is able to handle TSO/LRO packets leveraging the virtio-net header
(see struct virtio_net_hdr and struct virtio_net_hdr_mrg_rxbuf).
A bhyve VM exchanges packets with the host through a network backend,
which can be vale(4) or if_tap(4).
While vale(4) supports TSO/LRO packets, if_tap(4) does not.
This patch extends if_tap(4) with the ability to understand the virtio-net
header, so that a tapX interface can process TSO/LRO packets.
A couple of ioctl commands have been added to configure and probe the
virtio-net header. Once the virtio-net header is set, the tapX interface
acquires all the IFCAP capabilities necessary for TSO/LRO.
Reviewed by: kevans
Differential Revision: https://reviews.freebsd.org/D21263
They're formatted into the device name like unit numbers, anyway; store the
number in mda_unit => si_drv0 like dev2unit() expects.
No functional change intended.
Sponsored by: Dell EMC Isilon
The sentinel value for "use the rest of the region," -1, isn't zero modulo
PAGE_SIZE. Relax the check to permit the intended special value.
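A minimal sketch of the relaxed check, assuming a 4096-byte page. The names sketch_size_ok, SKETCH_PAGE_SIZE and SKETCH_SIZE_REST are hypothetical and are not taken from the driver:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u			/* assumed page size */
#define SKETCH_SIZE_REST ((uint64_t)-1)		/* hypothetical "rest of region" sentinel */

/* The sentinel is all ones, so it can never be a multiple of the page size. */
static_assert(SKETCH_SIZE_REST % SKETCH_PAGE_SIZE != 0, "sentinel is unaligned");

/* Accept page-aligned sizes, plus the one special value that is not aligned. */
static bool
sketch_size_ok(uint64_t size)
{
	if (size == SKETCH_SIZE_REST)
		return (true);
	return (size % SKETCH_PAGE_SIZE == 0);
}
```

Testing for the sentinel before the alignment test keeps the alignment rule unchanged for every ordinary size.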
X-MFC-With: r353110
Sponsored by: Dell EMC Isilon
Follow-up to incomplete pedantic change in r353691 by actually fixing the
default implementation to match the interface type. Mea culpa.
X-MFC-With: r353691, r339754
After r339754, the additional interface parameter was accidentally left out
of the default acpi_generic_id_probe implementation. Apparently this does
not cause any real problems, so this fix is mostly stylistic.
No functional change intended.
X-MFC-With: r339754
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport. It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).
It is mostly a verbatim code lift from netdump(4). Netdump(4) remains
the only consumer (until the rest of this patch series lands).
The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as
much as is protocol-independent, remains in debugnet.c. The
separation is not perfect and future improvement is welcome. Supporting
INET6 is a long-term goal.
Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.
The only functional change here is the mbuf allocation / tracking. Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time. If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.
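The high-water mark tracking described above could look roughly like the following userland sketch. The structure and function names are hypothetical, and the real code tracks mbuf pool parameters in its own way:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record of the largest parameters seen so far. */
struct sketch_pool {
	int hwm_nmbufs;		/* high-water mark: number of mbufs */
	int hwm_clsize;		/* high-water mark: cluster size */
	int reallocs;		/* how often the pool was re-allocated */
};

/*
 * Called when a NIC reports link activation.  Returns true if the
 * pre-panic allocation had to grow, false if it was left alone.
 */
static bool
sketch_link_up(struct sketch_pool *p, int nmbufs, int clsize)
{
	assert(nmbufs >= 0 && clsize >= 0);
	if (nmbufs <= p->hwm_nmbufs && clsize <= p->hwm_clsize)
		return (false);
	if (nmbufs > p->hwm_nmbufs)
		p->hwm_nmbufs = nmbufs;
	if (clsize > p->hwm_clsize)
		p->hwm_clsize = clsize;
	p->reallocs++;		/* stands in for the actual re-allocation */
	return (true);
}
```

In this sketch a NIC with smaller requirements never shrinks the pool, so the allocation always covers the most demanding interface seen so far.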
No other functional change intended.
Reviewed by: markj (earlier version)
Some discussion with: emaste, jhb
Objection from: marius
Differential Revision: https://reviews.freebsd.org/D21421
r259680 added support to vt(4) for printing double-width characters.
Remove the comment that claims no support.
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
This change applies some suggestions by delphij from D21979.
A write-only variable is removed.
There is a diagnostic message if the driver does not recognize the chip.
A chained if-statement is converted to a switch.
MFC after: 3 weeks
From Jake:
When updating the device statistics, report whether or not we have
received any pause frames to the iflib stack. This allows the iflib
stack to avoid generating a Tx hang message while the device is paused.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21870
From Jake:
Notify the iflib stack of whether we received any pause frames during
the timer window. This allows the stack to avoid reporting a Tx hang due
to the device being paused.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21869
From Jake:
The e1000 driver sets the iflib shared context isc_pause_frames value to
the number of received xoff frames. This is done so that the iflib
watchdog timer won't trigger a Tx Hang due to pause frames.
Unfortunately, the function simply sets it to the value of the xoffrxc
counter. Once the device has received a single XOFF packet, the driver
always reports that we received pause frames. This will prevent the Tx
hang detection entirely from that point on.
Fix this by assigning isc_pause_frames to a non-zero value if we
received any XOFF packets in the last timer interval.
We could attempt to calculate the total number of received packets by
doing a subtraction, but the iflib stack only seems to check if
isc_pause_frames is non-zero.
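A sketch of the idea, assuming the driver keeps a running total of received XOFF frames. The structure and helper below are hypothetical stand-ins, not the e1000 code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-adapter statistics. */
struct sketch_stats {
	uint64_t xoffrxc;	/* running total of received XOFF frames */
	uint64_t prev_xoffrxc;	/* total seen at the previous timer tick */
};

/*
 * Return non-zero only if new XOFF frames arrived since the last call,
 * which is the value that would be stored in isc_pause_frames.
 */
static int
sketch_pause_frames(struct sketch_stats *s)
{
	int seen;

	assert(s->xoffrxc >= s->prev_xoffrxc);	/* a running total never shrinks */
	seen = (s->xoffrxc != s->prev_xoffrxc);
	s->prev_xoffrxc = s->xoffrxc;
	return (seen);
}
```

Reporting the raw total instead would stay non-zero forever after the first XOFF frame, which is the behavior this change removes.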
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21868
the slot is flagged as 'embedded'.
The features related to embedded and shared slots were added in v3.0 of
the sdhci spec. Hardware prior to v3 sometimes supported 1.8v on non-
removable devices in embedded systems, but had no way to indicate that
via the standard sdhci registers (instead they use out of band metadata
such as FDT data).
This change adds the controller specification version to the check for
whether to filter out the 1.8v selection. On older hardware, the 1.8v
option is allowed to remain. On 3.0 or later it still requires the
embedded-slot flag to remain.
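The resulting rule can be sketched as a single predicate. The enum and function names are hypothetical, and the real check operates on the driver's own capability and version fields:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ordering of controller specification versions. */
enum sketch_spec { SKETCH_SPEC_100, SKETCH_SPEC_200, SKETCH_SPEC_300 };

static_assert(SKETCH_SPEC_200 < SKETCH_SPEC_300, "versions must be ordered");

/*
 * Decide whether the 1.8v selection survives filtering: pre-3.0
 * controllers keep it, 3.0 or later keep it only for embedded slots.
 */
static bool
sketch_keep_1v8(enum sketch_spec ver, bool embedded)
{
	return (ver < SKETCH_SPEC_300 || embedded);
}
```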
This is part of the fix for PR 241301 (eMMC not detected on Beaglebone).
Changes to the sdhci_ti driver are also needed for a full fix.
PR: 241301
This allows removing a bunch of low-level code.
Also, superio(4) provides safer interaction with other drivers
that work with Super I/O configuration registers.
Tested only on PCengines APU2:
superio0: <Nuvoton NCT5104D/NCT6102D/NCT6106D (rev. B+)> at port 0x2e-0x2f on isa0
wbwd0: <Nuvoton NCT6102 (0xc4/0x53) Watchdog Timer> at WDT ldn 0x08 on superio0
The watchdog output is incorrectly wired on that system and the watchdog
does not really do its job, but the pulse can be seen with a signal
analyzer.
Reviewed by: delphij, bcr (man page)
MFC after: 19 days
Differential Revision: https://reviews.freebsd.org/D21979
This is where it logically belongs.
The change allows dropping a bunch of low-level code.
Reviewed by: gonzo
MFC after: 19 days
Differential Revision: https://reviews.freebsd.org/D21980
From Zach:
Intel documentation indicates that backplane X550EM_X KR devices do not
support Energy Efficient Ethernet. Prior to this patch, X552 devices
(device ID 0x15AB) will crash the system when transitioning EEE state
via sysctl.
Signed-off-by: Zach Vargas <zvargas@xes-inc.com>
PR: 240320
Submitted by: Zach Vargas <zvargas@xes-inc.com>
Reviewed by: erj@
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21673
Rescan a PCI bus when the ACPI_NOTIFY_BUS_CHECK event is posted to a
PCI bus.
Reviewed by: scottl
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D21948
Install ACPI notify handlers on PCI devices with an _EJ0 method. This
handler is invoked when devices are added or removed.
- When an ACPI_NOTIFY_DEVICE_CHECK event posts, rescan the parent bus
device. Note that strictly speaking we only need to rescan the
specified device, but BUS_RESCAN is what is available, so we rescan
the entire bus.
- When an ACPI_NOTIFY_EJECT_REQUEST event posts, detach the device
associated with the ACPI handle, invoke the _EJ0 method, and then
delete the device.
Eventually this might be changed to vector notify events to devd in
userspace where devctl can be used instead to permit more complex
actions such as graceful unmounting of filesystems.
Tested by: cperciva
Reviewed by: cperciva, imp, scottl
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21948
Ignore only ENOENT (no DTS properties found) and ENODEV (driver not
present) non-zero return values from ext_resources.
Reviewed by: manu
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22043
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594
This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548
(resets, regulators, clocks) are not available.
Rely on system initialization done by the bootloader in that case.
This fixes operation on Terasic DE10-Pro (an Intel Stratix 10
development kit).
Sponsored by: DARPA, AFRL
Refactor nvdimm_spa_memattr() routine and callers to just save the value at
initialization and use the value directly. The reference value from NFIT,
MemoryMapping, is read only once, so the associated memattr could never
change.
No functional change.
Sponsored by: Dell EMC Isilon
starting at the max. domain, and then work down. Then existing FreeBSD
drivers will attach. Interrupt routing from the VMD MSI-X to the NVME
drive is not well known, so any interrupt is sent to all children that
register.
VROC used Intel meta data so graid(8) works with it. However, graid(8)
supports RAID 0,1,10 for read and write. I have some early code to
support writes with RAID 5. Note that RAID 5 can have life issues
with SSDs since it can cause write amplification from updating the parity
data.
Hot plug support needs a change to skip the following check to work:
if (pcib_request_feature(dev, PCI_FEATURE_HP) != 0) {
in sys/dev/pci/pci_pci.c.
Looked at by: imp, rpokala, bcr
Differential Revision: https://reviews.freebsd.org/D21383
This ensures the clip task won't race with t4_destroy_clip_table.
While here, make some mutex destroys unconditional since attach always
initializes them.
Reviewed by: np
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21952
If the bootloader enabled DMA we need to fully reset the DMA controller,
otherwise we might have some stale data in it that provokes weird
behavior.
MFC after: 1 week
This adds a TOE hook to allocate a KTLS session. It also recognizes
TLS mbufs in the socket buffer and sends those to the NIC using a TLS
work request to encrypt the record before segmenting it.
TOE TLS support must be enabled via the dev.t6nex.<N>.tls sysctl in
addition to enabling KTLS.
Reviewed by: np, gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21891
The PCI block in the adapter requires this field to be set to a valid
queue ID. It is not clear why it did not fail on all machines, but
the effect was that crypto operations reading input data via DMA
failed with an internal PCI read error on machines with 128G or more
of RAM.
Reported by: gallatin
Reviewed by: np
MFC after: 3 days
Sponsored by: Chelsio Communications
As noted by the commit message, callouts are now persistent
and should not be in the auto-zero section of the RQ's and SQ's.
This fixes an assert when using the TX completion event
factor feature with mlx5en(4).
Found by: gallatin@
MFC after: 3 days
Sponsored by: Mellanox Technologies
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.
Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.
Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D21882
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance, mutex covered areas were
as small as possible, and so the epoch covered areas became small too.
However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point in minimising epoch covered
areas for the sake of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.
Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.
On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().
This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.
Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.
This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.
Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111
nvdimm_e820 is a newbus pseudo driver that looks for "legacy" e820 PRAM
spans and creates ordinary-looking SPA devfs nodes for them
(/dev/nvdimm_spaN).
As these legacy regions lack real NFIT SPA regions and namespace
definitions, they must be administratively sliced up externally using
device.hints. This is similar in purpose to the Linux memmap= mechanism.
It is assumed that systems with working NFIT tables will not have any use
for this driver, and that that will be the prevailing style going forward,
so if there are no explicit hints provided, this driver does not
automatically create any devices.
Reviewed by: kib (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21885
Create an attachment file for the existing ACPI attachment, and create a
new FDT attachment for the generic-ehci driver.
Submitted by: andrew (Original version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19389
Regression introduced in r343629 when malloc result was renamed from spa to
spa_mapping and the 'spa' name was instead used to iterate a table, but the
free() target was not updated.
Reviewed by: kib, scottph
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21871
The PRM suggests random 0 - 10ms to prevent multiple waiters on the same
interval in order to avoid starvation.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
Before attempting to initialize the command interface we must wait till
the fw_initializing bit is clear.
If we fail to meet this condition the hardware will drop our
configuration, specifically the descriptors page address. This scenario
can happen when the firmware is still executing an FLR flow and did not
finish yet so the driver needs to wait for that to finish.
Linux commits:
6c780a0267b8
b8a92577f4be.
MFC after: 3 days
Sponsored by: Mellanox Technologies
in mlx5en(4) after r348254.
The unlimited send tags are shared among multiple connections and are
not allocated per send tag allocation request. Only increment the refcount.
MFC after: 3 days
Sponsored by: Mellanox Technologies
It may happen during link down that the running state may be observed
non-zero in the transmit routine, right before the running state is
cleared. This may end up using a destroyed mutex.
Make all channel mutexes and callouts persistent.
Preserve receive and send queue statistics during link toggle.
MFC after: 3 days
Sponsored by: Mellanox Technologies
in mlx5core. The EEPROM information is not only a property of the
mlx5en(4) driver.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
The following sysctls are added:
dev.mce.N.conf.qos.cable_length
dev.mce.N.conf.qos.buffers_size
dev.mce.N.conf.qos.buffers_prio
Submitted by: kib@
MFC after: 3 days
Sponsored by: Mellanox Technologies
All prints in mlx5core should use one of the macros:
mlx5_core_err/dbg/warn
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
If the health counter fails to increment, it indicates bad device health.
If the syndrome indicated by firmware is 0x0, this indicates that
firmware is unable to respond to initialization segment reads.
Add a proper print for this case.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
MPFS is a logical switch in the Mellanox device which forwards packets
based on a hardware driven L2 address table, to one or more physical-
or virtual- functions. The physical- or virtual- function is required
to tell the MPFS by using the MPFS firmware commands, which unicast
MAC addresses it is requesting from the physical port's traffic.
Broadcast and multicast traffic however, is copied to all listening
physical- and virtual- functions and does not need a rule in the MPFS
switching table.
Linux commit: eeb66cdb682678bfd1f02a4547e3649b38ffea7e
MFC after: 3 days
Sponsored by: Mellanox Technologies
Add the 512 bytes limit of RDMA READ and the size of remote address to the max
SGE calculation.
Submitted by: slavash@
Linux commit: 288c01b746aa
MFC after: 3 days
Sponsored by: Mellanox Technologies
Currently only suspend requests are acknowledged by writing an empty
string back to the xenstore control node, but poweroff or reboot
requests are not acknowledged and FreeBSD simply proceeds to perform
the desired action.
Fix this by acknowledging all requests, and remove the suspend specific
ack done in the handler.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
As of r347221 the iflib legacy interrupt mode setup assumes that drivers
perform both receive and transmit processing from the interrupt handler.
This assumption is invalid in the vmxnet3 driver, so introduce the
IFLIB_SINGLE_IRQ_RX_ONLY flag to make iflib avoid tx processing in the
interrupt handler.
PR: 239118
Reported and tested by: Juraj Lutter <otis@sk.freebsd.org>
Obtained from: marius
Reviewed by: gallatin
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21831
r347183 bumped GEOM classes to SI_ORDER_SECOND to resolve a race between
them and the initialization of devsoftc.mtx in devinit, but missed this
dependency on g_flashmap that may now lose the race against GEOM
classes/g_init.
There's a great comment that describes the situation that has also been
updated with the new ordering of GEOM classes.
Reported by: bdragon
MFC after: 4 days
On Rockchip boards it seems that the values in the DTS
are not enough for resetting the chip. I don't know if
the values are really incorrect, if DELAY is not precise
enough, or if the Rockchip GPIO driver has some "lag" of some
kind.
For now just add more delay.
No functional change intended.
The intent is to add a "legacy" e820 pmem newbus bus for nvdimm device in a
subsequent revision, and it's a little more clear if the parent buses get
independent source files.
Quite a lot of ACPI-specific logic is left in nvdimm.c; disentangling that
is a much larger change (and probably not especially useful).
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21813
Those functions are used by kernel, and we can't check all possible argument
errors in production kernel. Plus according to docs many of those errors
are checked by hardware. Assertions should just help with code debugging.
MFC after: 2 weeks
TUNABLE_INT_FETCH is a macro around getenv_int(), and it returns
0 on failure or 1 on success; we can use that to decide
which background color to use.
Convert all remaining references to that field to "ref_count" and update
comments accordingly. No functional change intended.
Reviewed by: alc, kib
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21768
Instead of predicting the MSI-X bar index based on the device's MAC
type, read it from the device's PCI configuration instead.
PR: 239704
Submitted by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>
Reviewed by: erj@
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21547
- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
- Allocate most of queue pair memory from the domain it is bound to.
- Bind callouts to the same CPUs as queue pair to avoid migrations.
- Do not assign queue pairs to each SMT thread. It just wasted
resources and increased lock congestion.
- Remove fixed multiplier of CPUs per queue pair, spread them evenly.
This allows using more queue pairs in some hardware configurations.
- If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.
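One way to spread CPUs evenly over queue pairs without a fixed multiplier is plain integer scaling, sketched below. The function name and parameters are hypothetical, and this is an illustration of the idea rather than the driver's mapping:

```c
#include <assert.h>

/*
 * Map a CPU index to a queue pair index so that ncpus CPUs are spread
 * as evenly as integer division allows over nqpairs queue pairs.
 */
static unsigned
sketch_cpu_to_qpair(unsigned cpu, unsigned ncpus, unsigned nqpairs)
{
	assert(ncpus > 0 && cpu < ncpus);
	assert(nqpairs > 0 && nqpairs <= ncpus);
	return (cpu * nqpairs / ncpus);
}
```

With 8 CPUs and 3 queue pairs this gives groups of 3, 3 and 2 CPUs, and with as many queue pairs as CPUs it degenerates to a one-to-one mapping.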
MFC after: 1 month
Sponsored by: iXsystems, Inc.
above 1Kbyte. It looks like some XHCI(4) controllers do not
support splitting a USB control transfer using a link TRB. The
next NORMAL TRB after the link TRB is simply failing with XHCI error
code 4. The quirk ensures we allocate a 64Kbyte buffer so that the
data stage TRB is not broken with a link TRB.
Found at: EuroBSDcon 2019
MFC after: 1 week
Sponsored by: Mellanox Technologies
libusb. This is useful for speeding up large data transfers while reducing
the interrupt rate.
Found at: EuroBSDcon 2019
MFC after: 1 week
Sponsored by: Mellanox Technologies
Allocate ioat->ring memory from the device domain.
Schedule ioat->poll_timer to the first CPU of the device domain.
According to the pcm-numa tool from the intel-pcm port, this reduces the number
of remote DRAM accesses while copying data by 75%. And unless it is noise,
I've noticed some speed improvement when copying data to another domain.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
If there is an attempt to switch from a process-owned VT to a closed VT,
then vt(4) first requests the process to release its VT and only then
realizes that the target VT is closed and, so, the switch is not
possible. So, the driver does not actually do any switch, but at the
same time the owning process is not notified about that and it does not
re-acquire the VT.
This change adds an early check for the target VT state, so that the
switch can be refused before the process coordination dance.
On top of that, the code now checks for a failure of vt_window_switch()
and calls vt_window_postswitch() for the current VT if it is in the
process mode.
Test Plan:
- configure VT1 - VT8 (ttyv0 - ttyv7) to be text consoles (run getty)
- configure VT9 (ttyv8) to run an X server
- make sure that the X server configuration allows VT switching
- leave VT10 - VT12 unconfigured
- while in the X server press Ctrl+Alt+F10
- without the patch, observe strange screen content and problems with
keyboard input
- with the patch, observe that nothing happens
The problem has been observed and the fix has been tested with an nVidia
graphics card and the proprietary nvidia driver.
Not sure if that matters.
Reviewed by: ray
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D21704
BERI stands for Bluespec Extensible RISC Implementation, based on MIPS.
BERI has not implemented standard MIPS performance monitoring counters,
instead it provides statistical counters.
BERI statcounters have several limitations:
- They can't be written
- They don't support start/stop operation
- No hardware interrupt is provided on a counter overflow.
So make it a module separate from hwpmc_mips and support process/system
counting mode only.
Sponsored by: DARPA, AFRL
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
Since the TX interrupt is generated when THRE is set, waiting for TEMT to be
set means waiting for a full character transmission time. At low speeds that
may take a while, burning CPU time while holding the sc_hwmtx lock, which is
also congested.
This is partial revert of r317659.
PR: 240121
MFC after: 2 weeks
Errors are communicated between the i2c controller layer and upper layers
(iicbus and slave device drivers) using a set of IIC_Exxxxxx constants which
effectively define a private number space separate from (and having values
that conflict with) the system errno number space. Sometimes it is necessary
to report a plain old system error (especially EINTR) from the controller or
bus layer and have that value make it back across the syscall interface
intact.
I initially considered replicating a few "crucial" errno values with similar
names and new numbers, e.g., IIC_EINTR, IIC_ERESTART, etc. It seemed like
that had the potential to grow over time until many of the errno names were
duplicated into the IIC_Exxxxx space.
So instead, this defines a mechanism to "encode" an errno into the IIC_Exxxx
space by setting the high bit and putting the errno into the lower-order
bits; a new errno2iic() function does this. The existing iic2errno()
recognizes the encoded values and extracts the original errno out of the
encoded value. An interesting wrinkle occurs with the pseudo-error values
such as ERESTART -- they already have the high bit set, and turning it off
would be the wrong thing to do. Instead, iic2errno() recognizes that lots of
high bits are on (i.e., it's a negative number near to zero) and just
returns that value as-is.
Thus, existing drivers continue to work without needing any changes, and
there is now a way to return errno values from the lower layers. The first
use of that is in iicbus_poll() which does mtx_sleep() with the PCATCH flag,
and needs to return the errno from that up the call chain.
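The encoding scheme can be illustrated with a userland sketch. The sketch_ names and the exact bit masks are assumptions made for illustration, and the mapping of ordinary IIC_Exxxx codes to errno values is left out:

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(int) == 4, "sketch assumes a 32-bit int");

#define SKETCH_IIC_ERRNO 0x80000000u	/* hypothetical marker: the high bit */

/* Encode an errno; negative pseudo-errors already have the high bit set. */
static int
sketch_errno2iic(int err)
{
	if (err < 0)
		return (err);
	return ((int)((uint32_t)err | SKETCH_IIC_ERRNO));
}

/* Decode a status value produced by sketch_errno2iic(). */
static int
sketch_iic2errno(int status)
{
	uint32_t u = (uint32_t)status;

	if ((u & SKETCH_IIC_ERRNO) == 0)
		return (status);	/* plain IIC_Exxxx code, mapping omitted */
	if ((u & 0x7fff0000u) != 0)
		return (status);	/* lots of high bits: negative pseudo-error */
	return ((int)(u & ~SKETCH_IIC_ERRNO));	/* strip the marker bit */
}
```

An encoded small positive errno has only the marker bit set among the upper bits, while a value such as -1 has all of them set, which is how the two cases are told apart here.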
Differential Revision: https://reviews.freebsd.org/D20975
PSCI code to use it.
This interface will also be used by Intel Stratix 10 platform.
This was not tested on arm due to lack of PSCI-enabled arm hardware
lying around.
Reviewed by: andrew
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D21439
Execution of "Soft reset" command (IG4_REG_RESETS_SKL) at controller init
stage sets SDA_HOLD register value to 0x0001 which is often too low for
normal operation.
Set SDA_HOLD back to 28 after reset to restore controller functionality.
PR: 240339
Reported by: imp, GregV, et al.
MFC after: 3 days
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.
Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546
races with page busy state. The object lock is still used as an interlock
to ensure that the identity stays valid. Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.
Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21255
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficient as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
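A rough userland model of such a counter, using C11 atomics. The flag values and function names are hypothetical and only illustrate how a flag bit can block new references atomically:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_REF_BLOCKED 0x40000000u	/* hypothetical: new references blocked */
#define SKETCH_REF_OBJECT  0x80000000u	/* hypothetical: the object's reference */
#define SKETCH_REF_COUNT(c) ((c) & ~(SKETCH_REF_BLOCKED | SKETCH_REF_OBJECT))

/* Take a reference unless the blocked flag is set; never sleeps. */
static bool
sketch_ref_try_acquire(_Atomic uint32_t *ref)
{
	uint32_t old = atomic_load(ref);

	do {
		if ((old & SKETCH_REF_BLOCKED) != 0)
			return (false);
		/* On failure the CAS reloads old, so the flag is rechecked. */
	} while (!atomic_compare_exchange_weak(ref, &old, old + 1));
	return (true);
}

/* Block new references, as would be done while removing mappings. */
static void
sketch_ref_block(_Atomic uint32_t *ref)
{
	atomic_fetch_or(ref, SKETCH_REF_BLOCKED);
}
```

Because the flag and the count live in the same word, a single compare-and-swap both checks the flag and bumps the count, so no separate lock is needed for this step.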
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accommodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
this to 2k to prevent them from being truncated and ignored. It
appears to be a sanity check only, but bumping it to 2k allows both of
my iic hid devices to be parsed and the second one to work...
The iicdev_writeto() function basically does scatter-gather IO by filling
in a pair of iic_msg structs to write the register address then the data
from different locations but with a single bus START/xfer/STOP sequence.
It turns out several low-level i2c controller drivers do not honor the
IIC_NOSTART flag, so the second piece of the write gets a new START on
the bus, and that confuses the ads111x chips which expect a continuous
write of 3 bytes to set a register.
A proper fix for this is to track down all the misbehaving controller
drivers and fix them. For now this change makes this driver work again.
Also, disable the comparator by default; it's not used for anything.
The previous logic would start a measurement, and then pause_sbt() for the
averaging time currently configured in the chip. After waiting that long,
the code would blindly read the measurement register and return its value.
The problem is that the chip's idea of averaging time is based on its
internal free-running 1MHz oscillator, which may be running at a wildly
different rate than the kernel clock. If the chip's internal timer was
running slower than the kernel clock, we'd end up grabbing a stale result
from an old measurement.
The driver now still uses pause_sbt() to yield the cpu while waiting for
the measurement to complete, but after sleeping it checks the chip's status
register to ensure the measurement engine is idle. If it's not, the driver
uses a retry loop to wait a bit (5% of the original wait time) then check
again for completion.
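The sleep-then-verify loop can be sketched against a simulated chip. Everything here is hypothetical, including the retry limit, and the simulated sleep only accumulates time instead of calling pause_sbt():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated chip: reports busy for a number of status polls. */
struct sketch_chip {
	int busy_polls;		/* status polls that still report "busy" */
	int64_t slept_us;	/* total simulated sleep time */
};

static void
sketch_sleep(struct sketch_chip *c, int64_t us)
{
	c->slept_us += us;	/* stands in for yielding the cpu */
}

static bool
sketch_idle(struct sketch_chip *c)
{
	if (c->busy_polls > 0) {
		c->busy_polls--;
		return (false);
	}
	return (true);
}

/* Sleep for the nominal time, then poll in steps of 5% of that time. */
static int
sketch_wait_done(struct sketch_chip *c, int64_t wait_us, int max_retries)
{
	int64_t retry_us = wait_us / 20;
	int i;

	assert(wait_us > 0);
	sketch_sleep(c, wait_us);
	for (i = 0; !sketch_idle(c); i++) {
		if (i == max_retries)
			return (-1);	/* gave up: chip never went idle */
		sketch_sleep(c, retry_us);
	}
	return (0);
}
```

The status check after sleeping is what protects against a slow chip oscillator; the retry limit is an addition of this sketch so that the loop cannot spin forever.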
The NVMe standard (1.4) states
>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ... For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.
However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.
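As an illustration of the computation being cached, here is a sketch of the
doorbell offset formula from the NVMe specification (doorbells start at 1000h
and are spaced 4 << CAP.DSTRD bytes apart). The struct and function names are
hypothetical and are not the ones used in the driver.

```c
#include <stdint.h>

#define	DOORBELL_BASE	0x1000u	/* offset of the first doorbell register */

/* Byte offset of the submission queue tail doorbell for queue qid. */
static inline uint32_t
sq_tail_doorbell_off(uint32_t qid, uint32_t dstrd)
{
	return (DOORBELL_BASE + (2 * qid) * (4u << dstrd));
}

/* Byte offset of the completion queue head doorbell for queue qid. */
static inline uint32_t
cq_head_doorbell_off(uint32_t qid, uint32_t dstrd)
{
	return (DOORBELL_BASE + (2 * qid + 1) * (4u << dstrd));
}

/* Hypothetical per-queue cache: the I/O path loads these, no math. */
struct qpair_doorbells {
	uint32_t sq_tdbl_off;
	uint32_t cq_hdbl_off;
};

/* Compute both offsets once, when the queue pair is set up. */
static inline void
qpair_doorbells_init(struct qpair_doorbells *db, uint32_t qid, uint32_t dstrd)
{
	db->sq_tdbl_off = sq_tail_doorbell_off(qid, dstrd);
	db->cq_hdbl_off = cq_head_doorbell_off(qid, dstrd);
}
```

With a stride of 0 the doorbells are 4 bytes apart; with a stride of 1 they
are 8 bytes apart, which is the hardware case mentioned above.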
MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514
When we suspend, we need to properly shut down the NVMe controller. The
controller may go into D3 state (or may have the power removed), and
to properly flush the metadata to non-volatile RAM, we must complete a
normal shutdown. This consists of deleting the I/O queues and setting
the shutdown bit. We have to do some extra stuff to make sure we
reset the software state of the queues as well.
On resume, we have to reset the card twice, for reasons described in
the attach function. Once we've done that, we can restart the card. If
any of this fails, we'll fail the NVMe card, just like we do when a
reset fails.
Set is_resetting for the duration of the suspend / resume. This keeps
the reset taskqueue from running a concurrent reset, and also is
needed to prevent any hw completions from queueing more I/O to the
card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get
that from the global state of the ctrlr. Wait for any pending reset to
finish. All queued I/O will get sent to the hardware as part of
nvme_ctrlr_start(), though the upper layers shouldn't send any
down. Disabling the qpairs is the other failsafe to ensure all I/O is
queued.
Rename nvme_ctrlr_destroy_qpairs to nvme_ctrlr_delete_qpairs to avoid
confusion with all the other destroy functions. It just removes the
queues in hardware, while the other _destroy_ functions tear down
driver data structures.
Split parts of the hardware reset function up so that I can
do part of the reset in suspend. Split out the software disabling
of the qpairs into nvme_ctrlr_disable_qpairs.
Finally, fix a couple of spelling errors in comments related to
this.
Relnotes: Yes
MFC After: 1 week
Reviewed by: scottl@ (prior version)
Differential Revision: https://reviews.freebsd.org/D21493
polling within a second. Panic if we don't. All the commands that use this
interface should typically complete within a few tens to hundreds of
microseconds. Panic rather than return ETIMEDOUT because if the command somehow
does later complete, it will randomly corrupt memory. Also, it helps to get a
traceback from where the unexpected failure happens, rather than an infinite
loop.
dump support code, move the while loop into an inline function. These aren't
done in the fast path, so if the compiler chooses to not inline, any performance
hit is tiny.
polled interface. Normally this would have the potential to corrupt stack memory
because the completion routines would run after we return. In this case,
however, we're doing a dump so it's safe for reasons explained in the comment.
The returned error number may be EINTR or ERESTART depending on
whether or not the signal is supposed to interrupt the system call.
Reported and tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
This fixes a "timed sleep before timers are working" panic seen
while attaching jedec_dimm(4) instances too early in the boot.
Submitted by: ian
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D21452
Remove now-redundant items from toepcb and synq_entry and the code to
support them.
Let the driver calculate tx_align, rx_coalesce, and sndbuf by default.
Reviewed by: jhb@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21387
descriptor. The per-tid tx credits are in demand during active Tx and
it's best not to use too many just for payload.
Sponsored by: Chelsio Communications
According to ACPI 6.3 specification:
The OS sets this bit to 1 if it supports PCI Segment Groups as defined
by the _SEG object, and access to the configuration space of devices
in PCI Segment Groups as described by this specification. Otherwise,
the OS sets this bit to 0.
As far as I can see, we have supported both of those as PCI domains for quite a while.
MFC after: 2 months
According to my tests and errata to several generations of Intel CPUs,
PCIe hot-plug command completion reporting is not a very reliable thing.
At least on my Supermicro X11DPi-NT board I never saw it reported.
Before this change the timeout code detached devices and tried to disable
the slot, which in my case resulted in the hot-plugged device being detached
just a second after it was successfully detected and attached. This
change removes that, so on timeout it just prints the error and
continues operation. Linux does the same.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
The netmap_pt.c module has become obsolete after
the refactoring that added netmap_kloop.c.
Remove it and unlink it from the build system.
MFC after: 1 week
While it worked with the kernel, it wasn't working with the loader.
It failed to handle dependencies correctly. The reason for that is
that we never created a nvme module with the DRIVER_MODULE, but
instead a nvme_pci and nvme_ahci module. Create a real nvme module
that nvd can be dependent on so it can import the nvme symbols it
needs from there.
Arguably, nvd should just be a simple child of nvme, but transitioning
to that (and winning that argument given why it was done this way) is
beyond the scope of this change.
Reviewed by: jhb@
Differential Revision: https://reviews.freebsd.org/D21382
queues, don't try to tear them down in the ctrlr_destroy
path. Otherwise, we dereference queue structures that are NULL and we
trap.
This fix is incomplete: we leak IRQ and MSI resources when this
happens. That's preferable to a crash but still should be fixed.
load and people who pull in nvme/nvd from modules can't load nvd.ko
since it depends on nvme, not nvme_foo. The duplicate doesn't matter
since kldxref properly handles that case.
Turn off bus master after we detach the device (to match the prior
order). Release MSI after we're done detaching and have turned off
all the interrupts. Otherwise this may cause problems as other threads
race nvme_detach. This more closely matches the old order.
Reviewed by: mav@
After r351243 when ALTQ was enabled in the kernel, the inline functions
in ifq.h would not have full type information as if_var.h was not
included.
Given usb_ethernet.h already includes all the various headers (which)
is the cause of the problem here, add if_var.h to it. This fixes the
builds again.
Reported by: CI system, e.g. FreeBSD-head-aarch64-LINT
Intel has created RST and many laptops from vendors like Lenovo and Asus. It's a
mechanism for creating multiple boot devices under windows. It effectively hides
the nvme drive inside of the ahci controller. The details are supposed to be a
trade secret. However, there's a reverse engineered Linux driver, and this
implements similar operations to allow nvme drives to attach. The ahci driver
attaches nvme children that proxy the remapped resources to the child. nvme_ahci
is just like nvme_pci, except it doesn't do the PCI specific things. That's
moved into ahci where appropriate.
When the nvme drive is remapped, MSI-X interrupts aren't forwarded (the Linux
driver doesn't know how to use this either). INTx interrupts are used
instead. This is suboptimal, but usually sufficient for the laptops these parts
are in.
This is based loosely on https://www.spinics.net/lists/linux-ide/msg53364.html
submitted, but not accepted by, Linux. It was written by Dan Williams. These
changes were written from scratch by Olivier Houchard.
Submitted by: cognet@ (Olivier Houchard)
Nvme drives can be attached in a number of different ways. Separate out the PCI
attachment so that we can have other attachment types, like ahci and various
types of NVMeoF.
Submitted by: cognet@
If a device is unplugged from the system (CSTS register reads return
0xffffffff), it makes no sense to send any more recovery requests or
expect any responses back. If there is a detach call in such a state,
just stop all activity and free resources. If there is no detach
call (hot-plug is not supported), rely on normal timeout handling,
but when it triggers a controller reset, do not wait for the impossible
and quickly report failure.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This fixes possible double call of fail_fn, for example on hot removal.
It also allows ctrlr_fn to safely return NULL cookie in case of failure
and not get useless ns_fn or fail_fn call with NULL cookie later.
MFC after: 2 weeks
Otherwise the mutex needs to be dropped when copying out the midistat
sbuf, leading to a race which allows one to read kernel memory beyond
the end of the sbuf buffer.
Reported and tested by: pho
Security: CVE-2019-5612
order to have struct mii_data available. However, it only really needs
a forward declaration of struct mii_data for use in pointer form for
the return type of a function prototype.
Custom kernel configurations that have usb and fdt enabled, but no miibus,
end up with compilation failures because miibus_if.h will not get
generated.
Due to the above, the following changes have been made to usb_ethernet.h:
* remove the inclusion of mii headers
* forward-declare struct mii_data
* include net/ifq.h to satisfy the need for complete struct ifqueue
Reviewed by: ian
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D21293
Return an empty string when the location is unknown instead of the
string "unknown". This ensures that all location entries are of
the form key=val.
Suggested by: imp
Approved by: jhb (mentor)
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21326
Move fast entropy source registration to the earlier
SI_SUB_RANDOM:SI_ORDER_FOURTH and move random_harvestq_prime after that.
Relocate the registration routines out of the much later randomdev module
and into random_harvestq.
This is necessary for the fast random sources to actually register before we
perform random_harvestq_prime() early in the kernel boot.
No functional change.
Reviewed by: delphij, markjm
Approved by: secteam(delphij)
Differential Revision: https://reviews.freebsd.org/D21308
If a simple multifunction device also provides a syscon interface, its
children should be able to consume it. Due to this:
- declare the corresponding method in the syscon interface
- implement it in the simple multifunction device driver
MFC after: 1 week
NTB Tool driver is meant for testing NTB hardware driver functionalities,
such as doorbell interrupts, link events, scratchpad registers and memory
windows. This is a port of ntb_tool driver from Linux. It has been
verified on top of AMD and PLX NTB HW drivers.
Submitted by: Arpan Palit <arpan.palit@amd.com>
Cleaned up by: mav
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D18819
It is unused, the ABI was broken in r322969, and it is broken by design
(more than MDNPAD md devices can exist and there is no way to retrieve
them with this interface).
mdconfig(8) was converted to use libgeom to obtain this information
in r157160 and any other consumers of MDIOCLIST should likewise be
converted.
Reviewed by: emaste
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18936
This adds a safety net for the case of a misconfigured NTB with too big a
memory window, for which we may be unable to allocate a memory buffer,
and which does not make much sense for a network interface. While there,
fix the code to really work with an asymmetric window size setup.
This makes the driver just print a warning message on boot instead of hanging
if too large a memory window is configured.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This restores parity with the AMD NTB driver. However, without any drivers
supporting more than one peer, and the respective KPI modification to pass
the peer index to most of the calls, this addition is pretty useless for now.
MFC after: 2 weeks
No functional change.
Add a verbose comment giving an example side-by-side comparison between the
prior and Concurrent modes of Fortuna, and why one should believe they
produce the same result.
The intent is to flip this on by default prior to 13.0, so testing is
encouraged. To enable, add the following to loader.conf:
kern.random.fortuna.concurrent_read="1"
The intent is also to flip the default blockcipher to the faster Chacha-20
prior to 13.0, so testing of that mode of operation is also appreciated.
To enable, add the following to loader.conf:
kern.random.use_chacha20_cipher="1"
Approved by: secteam(implicit)
Namespace Optimal I/O Boundary field added in NVMe 1.3 and Namespace
Preferred Write Granularity added in 1.4 allow upper layers to align
I/Os for improved SSD performance and endurance.
I don't have hardware reporting those yet, but NPWG could probably be
reported by bhyve.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
With support for multiple namespaces and multiple ports in NVMe there is
now a need for reliable unique namespace identification similar to SCSI.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
We used the aw_clk_nm clock for clocks with only one divider factor
and used a fake multiplier factor. This cannot work properly as we
end up writing the "fake" factor to the register (and so always set
the LSB to 1).
Create a new clock for those.
The reason for not using the clk_div clock is because those clocks are
a bit special. Since they are (almost) all related to video we also need
to set the parent clock (the main PLL) to a frequency that they can support.
As the main PLL has a minimum frequency that it can support, we need to
be able to set the main PLL to a multiple of the desired frequency.
Let's say you want a 71MHz pixel clock (typical for a 1280x800 display)
and the main PLL cannot go under 192MHz; you need to set it to 3 times the
desired frequency and set the divider to 3 on the hdmi clock.
So this also introduces the CLK_SET_ROUND_MULTIPLE flag, which allows for this
kind of scenario.
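The arithmetic in the example above can be sketched as follows. This is only
an illustration of picking the smallest usable multiple; round_multiple() is a
hypothetical helper, not the code behind CLK_SET_ROUND_MULTIPLE.

```c
#include <stdint.h>

/*
 * Return the smallest integer k >= 1 such that k * target_hz is at least
 * parent_min_hz.  The parent PLL is then set to k * target_hz and the
 * child divider to k.  target_hz must be non-zero.
 */
static uint64_t
round_multiple(uint64_t target_hz, uint64_t parent_min_hz)
{
	uint64_t k;

	k = (parent_min_hz + target_hz - 1) / target_hz;	/* ceiling */
	return (k == 0 ? 1 : k);
}
```

For a 71MHz target and a 192MHz PLL minimum this gives 3, so the PLL runs at
213MHz and the hdmi clock divides by 3.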
Similar to what was done for device_printfs in r347229.
Convert g_print_bio() to a thin shim around g_format_bio(), which acts on an
sbuf; documented in g_bio.9.
Reviewed by: markj
Discussed with: rlibby
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21165
with Communication Device Class Ethernet Emulation Model (CDC EEM).
The driver supports both the device, and host side operation; there
is a new USB template (#11) for the former.
This enables communication with virtual USB NIC provided by iLO 5,
as found in new HPE Proliant servers.
Reviewed by: hselasky
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Hewlett Packard Enterprise
Some hardware needs more than 32, bump this value.
We cannot use the _alloc form of getencprop as this function is called
too early in the boot before pmap is initialized and we only have
2k of stack when cninit is called.
Discussed with: ian
- Check for an invalid device (vendor is invalid) before reading the
header type register when examining function 0 of a possible device.
- When iterating over functions of a device, reject any device whose
16-bit vendor is invalid rather than requiring the full 32-bit
vendor+device to be all 1's. In practice the latter check is
probably fine, but checking the vendor is what the PCI spec
recommends.
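A small sketch of the difference between the two checks, assuming the usual
layout of the first configuration dword (vendor ID in the low 16 bits, device
ID in the high 16 bits). The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define	VENDOR_INVALID	0xffffu	/* vendor ID read back from an empty slot */

/* New check: only the 16-bit vendor ID decides. */
static inline bool
vendor_is_valid(uint32_t devvendor)
{
	return ((devvendor & 0xffffu) != VENDOR_INVALID);
}

/* Old check: reject only if the whole 32-bit dword is all 1's. */
static inline bool
devvendor_not_all_ones(uint32_t devvendor)
{
	return (devvendor != 0xffffffffu);
}
```

The two checks differ only for values such as 0x1234ffff, which the old check
accepts and the vendor-only check rejects.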
Reviewed by: imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D21147
RT2860_WCID_MAX is supposed to describe the max STA index for wcid2ni, and
was instead being used as the size -- off-by-one.
rt2860_drain_stats_fifo was range-checking wcid only after potentially
accessing out-of-bounds.
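Both problems follow a common pattern, sketched below with hypothetical names
and an assumed value for the maximum index. This is not the driver's code.

```c
#include <stddef.h>

#define	WCID_MAX	253	/* assumed value: largest valid index, not a count */

struct sta_table {
	void *wcid2ni[WCID_MAX + 1];	/* size is max index + 1 */
};

/* Range-check the index before touching the array, not after. */
static void *
sta_lookup(struct sta_table *t, unsigned int wcid)
{
	if (wcid > WCID_MAX)
		return (NULL);
	return (t->wcid2ni[wcid]);
}
```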
Submitted by: Augustin Cavalier <waddlesplash@gmail.com> (basically)
Obtained from: Haiku (58d16d9fe2d5a209cf22823359a8407d138e1a87)
Differential Revision: 3 days
NVMe reservations are quite similar to SCSI persistent reservations and
can be used in clustered setups with shared multiport storage.
MFC after: 10 days
Relnotes: yes
Sponsored by: iXsystems, Inc.
Instances of the device can be configured using hints or FDT data.
Interfaces to reconfigure the chip and extract voltage measurements from
it are available via sysctl(8).
PowerPC, and possibly other architectures, use different address ranges for
PCI space vs physical address space, which is only mapped at resource
activation time, when the BAR gets written. The DRM kernel modules do not
activate the rman resources, so as not to waste KVA, instead only mapping
parts of the PCI memory at a time. This introduces a
BUS_TRANSLATE_RESOURCE() method, implemented in the Open Firmware/FDT PCI
driver, to perform this necessary translation without activating the
resource.
In addition to system KPI changes, LinuxKPI is updated to handle a
big-endian host, by adding proper endian swaps to the I/O functions.
Submitted by: mmacy
Reported by: hselasky
Differential Revision: https://reviews.freebsd.org/D21096
In particular: Changed Namespace List, Commands Supported and Effects,
Reservation Notification, Sanitize Status.
Add a few new arguments to `nvmecontrol log` subcommand.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
While very useful by itself, it also makes `nvmecontrol` not depend on
parsing hardcoded device names, which in turn makes it simple to take
nvdX (and potentially any other) device names as arguments.
Also, the added IOCTL bypass from nvdX to the respective nvmeYnsZ makes them
interchangeable for management purposes.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
an updated rack depend on having access to the new
ratelimit api in this commit.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D20953
with an eventual goal to convert all legacy zlib callers to the new zlib
version:
* Move generic zlib shims that are not specific to zlib 1.0.4 to
sys/dev/zlib.
* Connect new zlib (1.2.11) to the zlib kernel module, currently built
with Z_SOLO.
* Prefix the legacy zlib (1.0.4) with 'zlib104_' namespace.
* Convert sys/opencrypto/cryptodeflate.c to use new zlib.
* Remove bundled zlib 1.2.3 from ZFS and adapt it to new zlib and make
it depend on the zlib module.
* Fix Z_SOLO build of new zlib.
PR: 229763
Submitted by: Yoshihiro Ota <ota j email ne jp>
Reviewed by: markm (sys/dev/zlib/zlib_kmod.c)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D19706
Terasic DE10-Pro (an Intel Stratix 10 GX/SX FPGA Development Kit).
The Altera EMAC is an instance of Synopsys DesignWare Gigabit MAC.
This driver sets correct clock range for MDIO interface on Intel Stratix 10
platform.
This is required due to lack of support for clock manager device for
this platform that could tell us the clock frequency value for ethernet
clock domain.
Sponsored by: DARPA, AFRL
Substitute driver-defined IS_P2ALIGNED() with EFX_IS_P2ALIGNED()
defined in libefx.
Add type argument and cast value and alignment to one specified type.
Reported by: Andrea Valsania <andrea.valsania at answervad.it>
Reviewed by: philip
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D21076
Substitute driver-defined P2ALIGN() with EFX_P2ALIGN() defined in
libefx.
Cast value and alignment to one specified type to guarantee result
correctness.
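To illustrate one way mixed operand types can give a wrong result, here is a
sketch with two hypothetical macros. They are not the libefx definitions, and
the example assumes a 32-bit unsigned int.

```c
#include <stdint.h>

/* Untyped form: the mask -(align) is computed in the type of align. */
#define	NAIVE_P2ALIGN(x, align)		((x) & -(align))

/* Typed form: value and alignment are converted to one type first. */
#define	TYPED_P2ALIGN(type, x, align)	((type)(x) & -(type)(align))
```

With a 64-bit address and a 32-bit alignment, the untyped mask is zero-extended
and clears the upper 32 bits of the address; the typed form keeps them.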
Reported by: Andrea Valsania <andrea.valsania at answervad.it>
Reviewed by: philip
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D21075
Substitute driver-defined P2ROUNDUP() with EFX_P2ROUNDUP()
defined in libefx.
Cast value and alignment to one specified type to guarantee result
correctness.
Reported by: Andrea Valsania <andrea.valsania at answervad.it>
Reviewed by: philip
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D21074
We want to allocate a contiguous memory block anywhere in memory, but
expressed this as having to be between 0 and 0xffffffff. This limits us
on 64-bit machines, and outright breaks on machines where memory is
mapped above that address range.
Allow the full address range to be used for this allocation.
Sponsored by: Axiado
The timeout field in the CAPS register is defined to be 8 bits, so its type was
uint8_t. We recently started adding 1 to it to cope with rogue devices that
listed 0 timeout time (which is impossible). However, in so doing, other devices
that list 0xff (for a 2 minute timeout) were broken when adding 1
overflowed. Widen the type to be uint32_t like its source register to avoid the
issue.
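The wraparound is easy to reproduce in isolation. The function names below
are hypothetical; only the types matter.

```c
#include <stdint.h>

/* 8-bit storage: 0xff + 1 wraps around to 0. */
static inline uint8_t
timeout_plus_one_narrow(uint8_t to)
{
	return ((uint8_t)(to + 1));
}

/* 32-bit storage: 0xff + 1 is 256, as intended. */
static inline uint32_t
timeout_plus_one_wide(uint8_t to)
{
	return ((uint32_t)to + 1);
}
```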
Reported by: bapt@
- Wrong order of casting and bit shift meant that enabling and disabling
queues didn't work properly for queue numbers larger than 32. Use literals
with the right suffix instead.
- TX ring tail address was not updated during reinitialization of TX
structures. It could block sending traffic.
- Also remove unused variables 'eims' and 'active_queues'.
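A sketch of the shift fix from the first item, using hypothetical helper
names. Writing (uint64_t)(1 << q) performs the shift in int before the cast,
which is undefined for q >= 32; a 64-bit literal makes the shift itself 64
bits wide.

```c
#include <stdint.h>

/* The ULL suffix makes the shift happen in 64 bits. */
static inline uint64_t
queue_bit(unsigned int q)
{
	return (1ULL << q);
}

static inline uint64_t
queue_enable(uint64_t mask, unsigned int q)
{
	return (mask | queue_bit(q));
}

static inline uint64_t
queue_disable(uint64_t mask, unsigned int q)
{
	return (mask & ~queue_bit(q));
}
```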
Submitted by: Krzysztof Galazka <krzysztof.galazka@intel.com>
Reviewed by: erj@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D20826
o Add experimental IOMMU support to the xDMA framework
The BERI IOMMU device is part of the CHERI device-model project [1]. It
translates memory addresses for various BERI peripherals modelled in
software. It accepts the FreeBSD/mips64 page directory format and manages
the BERI TLB.
1. https://github.com/CTSRD-CHERI/device-model
Sponsored by: DARPA, AFRL
features offered by the chips.
For 2127 and 2129 chips, fix the detection of when chip-init is needed. The
chip config needs to be reset whenever power was lost, but the logic was
wrong for 212x chips (it only worked for 8523). Now the "oscillator
stopped" bit rather than the power manager mode is used to detect startup
after powerfail.
For all chips, disable the clock output pin.
For chips that have a timestamp/tamper-monitor feature, turn off monitoring
of the timestamp trigger pin.
The 8523, 2127, and 2129 chips have a "power manager" feature that offers
several options. We've been using the default mode which enables
everything. Now the code sets the power manager options to
- direct-switch (when Vdd < Vbat, without extra threshold check)
- no battery monitor
- no external powerfail monitor
This reduces the current draw while running on battery from 1930nA to 880nA,
which should roughly double the lifespan of the battery under load.
Because battery checking is a nice thing to have, the code now does a check
at startup, and then once a day after that, instead of checking continuously
(but only actually reporting at startup). The battery check is now done by
setting the power manager back to default mode, sleeping briefly while it
makes a voltage measurement, then switching back to power-saving mode.
While we print failure messages on the console, sometimes logs are lost or
overwhelmed. Keeping a count of how many times we've failed retriable commands
helps get a magnitude of the problem.
Retried commands can indicate a performance degradation of an nvme drive. Keep
track of the number of retries and report it out via sysctl, just like number of
commands and interrupts.
Also convert it to a bool. While the rest of the driver isn't yet bool clean,
this will help.
Reviewed by: cem@
Differential Revision: https://reviews.freebsd.org/D20988
The nvme drive dumps only the most relevant details about a command when it
fails. However, there are times this is not sufficient (such as debugging weird
issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1
in loader.conf will enable more complete debugging information about each
command that fails.
Reviewed by: rpokala
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20988
These macros make places where we extract these easier to read. The shift and
mask stuff is also a bit tedious and error prone. Start with the CAP_LO and
CAP_HI registers since their scope is somewhat constrained. This is a style
change only, no functional changes.
Reviewed by: chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20979
This affects the detection of 24-hour vs AM/PM mode... the ampm bit is in a
different location on 2127 and 2129 chips compared to other nxp rtc chips.
I noticed the 2127 case wasn't being handled correctly when I accidentally
misconfigured my system by claiming my PCF2129 was a 2127.
with various laptops using hdaa(4) sound devices. We don't seem to know
the "correct" configurations for these devices and the defaults are far
superior, e.g. they work if you don't nuke the default configs.
PR: 200526
Differential Revision: https://reviews.freebsd.org/D17772
Neither the 1.3 nor the 1.4 standard says this number is 1-based, but adding 1
costs little and copes with those NVMe drives that report '0' in this field
cheaply. This is consistent with what the Linux driver does as well.
This fixes the following panic on powerpc:
pci_get_vendor failed for pcib1 on bus ofwbus0, error = 2
PR: 238730
Reported by: Dennis Clarke <dclarke@blastwave.org>
Tested by: Dennis Clarke <dclarke@blastwave.org>
MFC after: 2 weeks
on PCx2129 chips too.
The datasheet for the PCx2129 chips says that there is only a watchdog
timer, no countdown timer. It turns out the countdown timer hardware is
there and works just the same as it does on a PCx2127 chip, except that you
can't use it to trigger an interrupt or toggle an output pin. We don't need
interrupts or output pins, we only need to read the timer register to get
sub-second resolution. So start treating the 2129 chips the same as 2127.
An obscure footnote in the datasheets for the PCx2127, PCx2129, and
PCF8523 rtc chips states that the chips do not support i2c repeat-start
operations. When the driver was originally written and tested, the i2c
bus on that system also didn't support repeat-start and just quietly
turned repeat-start operations into a stop-then-start, making it appear
that the nxprtc driver was working properly.
The repeat-start situation only comes up on reads, so instead of using
the standard iicdev_readfrom(), use a local nxprtc_readfrom(), which is
just a cut-and-pasted copy of iicdev_readfrom(), modified to send two
separate start-data-stop sequences instead of using repeat-start.
The driver used to log any non-zero cause and when running with a single
line interrupt it would spam the console/logs with reports of interrupts
that are of no interest to anyone.
MFC after: 1 week
Sponsored by: Chelsio Communications
Enable this for the NovAtel OEMv2 GPS receiver.
Not fixed: The receiver shows up as "<Interface 0>" in the device
tree, because that is literally what the descriptor-string is.
Reviewed by: hselasky@
When a command is finished running, we must transition it from INQUEUE
to busy state. We were failing to do that, so we hit a panic when the
commands were freed. This only affects mpr, mps already did similar
things. Now both the polling and interrupt paths properly set BUSY as
appropriate.
This device cannot cross a 4GB boundary with DMA. Removing the
boundary in r346386 resulted in low frequency memory corruption on
machines with isci(4) controllers.
Submitted by: gallatin@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20910
Add VMBus protocol version 4.0 and 5.0 to support Windows 10 and newer HyperV hosts.
For VMBus 4.0 and newer HyperV, the netvsc gpadl teardown must be done after vmbus close.
Submitted by: whu
MFC after: 2 weeks
Sponsored by: Microsoft
r348164 added code to iicbus_request_bus/iicbus_release_bus to automatically
call device_busy()/device_unbusy() as part of acquiring exclusive use of the
bus (so modules can't be unloaded while the bus is exclusively owned and/or
IO is in progress). That broke the ability to do i2c IO from a slave device
probe method, because the slave isn't attached yet, so calling device_busy()
triggers a sanity-check panic for trying to busy a non-attached device.
Now we check whether the device status is < DS_ATTACHING, and if so we busy
the iicbus rather than the slave device. I think this leaves a small window
where a module could be unloaded while probing is in progress. But I think
that's true of all devices, and probably should be fixed by introducing a
DS_PROBING state for devices, and handling that at various points in the
newbus code.
Eliminate the TIMEDOUT state. This state really conveyed two different
concepts: I timed out during recovery (and my command got put on the
recovery queue), and I timed out during discovery (which doesn't).
Separate those two concepts into two flags. Use the TIMEDOUT flag to
fail requests as timed out. Use the on queue flag to remove them from
the queue.
In mps_intr_locked for MPI2_RPY_DESCRIPT_FLAGS_ADDRESS_REPLY message
type, when completing commands, ignore the ones that are not in state
INQUEUE. They were already completed as part of the recovery
process. When we complete them twice, we wind up with entries on the
free queue that are marked as busy, triggering asserts.
Reviewed by: scottl (earlier version, just for mpr)
Differential Revision: https://reviews.freebsd.org/D20785
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
table VCTRL registers.
Unconditionally program the MSI-X vector control Mask field for MSI-X
table entries without regard for Mask's previous value. Some devices
return all zeros on reads of the VCTRL registers, which would cause us
to skip disabling interrupts. This fixes the Samsung SM961/PM961 SSDs,
which return zero starting from offset 0x3084 within the memory
region specified by BAR0, even when they are active MSI-X vectors.
The Illumos kernel writes these unconditionally to 0 or 1. However,
section 6.8.2.9 of the PCI Local Bus 3.0 spec (dated Feb 3, 2004)
states for bits 31::01:
"After reset, the state of these bits must be 0. However, for
potential future use, software must preserve the value of
these reserved bits when modifying the value of other Vector
Control bits. If software modifies the value of these reserved
bits, the result is undefined."
so we always set or clear the Mask bit, but otherwise preserve the
old value.
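A minimal sketch of that update rule, assuming the Mask bit is bit 0 of the
Vector Control dword as the spec text above implies. The macro and function
names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define	VCTRL_MASK_BIT	0x1u	/* bit 0 of Vector Control */

/*
 * Always produce the requested Mask state, whatever the old Mask bit was,
 * while carrying the reserved bits 31:1 over unchanged.
 */
static inline uint32_t
vctrl_update(uint32_t old, bool mask)
{
	return (mask ? (old | VCTRL_MASK_BIT) : (old & ~VCTRL_MASK_BIT));
}
```

The caller would write the returned value back unconditionally instead of
skipping the write when the value read already looks right.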
PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713
Reviewed By: imp, jhb
Submitted by: Ka Ho Ng
MFC After: 1 week
Differential Revision: https://reviews.freebsd.org/D20873
While at it fix an invalid memory access issue when attaching external
USB HUBs, which are not mapped by ACPI, due to missing status check
when calling AcpiGetObjectInfo() from acpi_usb_hub_port_probe_cb().
Sponsored by: Mellanox Technologies
When the system has no graphical console, such as bhyve in common
configurations, ignore kern.vt.splash_cpu, instead of panicking
on INVARIANTS kernels.
Reviewed by: cem dumbbell
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20877
Print the adapter name rather than the address of the adapter
to avoid kernel address leakage.
PR: Bug 238642
Submitted by: Fuqian Huang <huangfq.daxian@gmail.com>
Reviewed by: vmaffione
MFC after: 1 week
All MMCBR bridges have to implement all the MMCBR variables. This
implements them for everybody that currently doesn't.
A common routine for this should be written.
XCHAN_CAP_BOUNCE.
The only application that uses bounce buffering for now is the Government
Furnished Equipment (GFE) P2's dma core (AXIDMA) with its own dedicated
cacheless bounce buffer.
Sponsored by: DARPA, AFRL
Otherwise there is a window where they may be rescheduled. This
typically manifested as a page fault shortly after unloading if_iwm.ko.
Close the race by draining callouts after calling iwm_stop_device(),
which is also what Dragonfly does.
Change whitespace to reduce gratuitous diffs with Dragonfly.
Reported and tested by: seanc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Previously the TOE code used its own custom unmapped mbufs via
EXT_FLAG_VENDOR1. The old version always wired the entire AIO request
buffer first for the duration of the AIO operation and constructed
multiple mbufs which used the wired buffer as an external buffer.
The new version determines how much room is available in the socket
buffer and only wires the pages needed for the available room building
chains of M_NOMAP mbufs. This means that a large AIO write will now
limit the amount of wired memory it uses to the size of the socket
buffer.
Reviewed by: gallatin, np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D20839
that node is also compatible with syscon. For instance,
Rockchip RK3399's GRF (General Register Files) is compatible
with simple-mfd as well as syscon and has devices like
usb2-phy, emmc-phy and pcie-phy etc. under it.
Reviewed by: manu
This patch is the driver for NTB hardware in AMD SoCs (ported from Linux)
and enables the NTB infrastructure like Doorbells, Scratchpads and Memory
window in AMD SoC. This driver has been validated using ntb_transport and
if_ntb driver already available in FreeBSD.
Submitted by: Rajesh Kumar <rajesh1.kumar@amd.com>
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D18774
otherwise we panic.
dwmmc doesn't handle VCCQ (voltage for the IO line of the SD/eMMC) or
TIMING.
Add the needed accessor in the {read,write}_ivar functions.
Reviewed by: imp (previous version)
This patch fixes 2 panics. The first one is due to the current VNET not
being set in the emulated adapter transmission path. The second one
is caused by the M_PKTHDR flag not being set when preallocated mbufs
are recycled in the transmit path.
Submitted by: aleksandr.fedorov@itglobal.com
Reviewed by: vmaffione
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20824
The goal of this driver is to consolidate information about SuperIO chips
and to provide for peaceful coexistence of drivers that need to access
SuperIO configuration registers.
While SuperIO chips can host various functions most of them are
discoverable and accessible without any knowledge of the SuperIO.
Examples are: keyboard and mouse controllers, UARTs, floppy disk
controllers. SuperIO-s also provide non-standard functions such as
GPIO, watchdog timers and hardware monitoring. Such functions do
require drivers with a knowledge of a specific SuperIO.
At this time the driver supports a number of ITE and Nuvoton (fka
Winbond) SuperIO chips.
There is a single driver for all devices. So, I have not done the usual
split between the hardware driver and the bus functionality. Although,
superio does act as a bus for devices that represent known non-standard
functions of a SuperIO chip. The bus provides enumeration of child
devices based on the hardcoded knowledge of such functions. The
knowledge was extracted from datasheets and other drivers.
As there is a single driver, I have not defined a kobj interface for it.
So, its interface is currently made of simple functions.
I think that we can add the flexibility (and complications) when we actually
need it.
I am planning to convert nctgpio and wbwd to superio bus very soon.
Also, I am working on itwd driver (watchdog in ITE SuperIO-s).
Additionally, there is ithwm driver based on the reverted sensors
import, but I am not sure how to integrate it given that we still lack
any sensors interface.
Discussed with: imp, jhb
MFC after: 7 weeks
Differential Revision: https://reviews.freebsd.org/D8175
That is, instead of the current GPIO00 - GPIO15 the names will be GPIO00
- GPIO07, GPIO10 - GPIO17. The first digit is a GPIO "bank" / group
number and the second one is a pin number within the bank. Alternative
view is that the pin names are changed from decimal numbering scheme to
octal one (as there are 8 pins per bank).
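
As a rough illustration of the naming scheme above, the name can be derived
from a flat pin index by splitting it into a bank digit and a pin digit.
gpio_pin_name() and PINS_PER_BANK are hypothetical names for this sketch and
are not taken from the driver.

```c
#include <stdio.h>

#define PINS_PER_BANK	8	/* 8 pins per bank, as stated above */

/*
 * Format the name of a pin given its flat index: the first digit is the
 * bank number and the second digit is the pin number within the bank.
 */
static void
gpio_pin_name(unsigned pin, char *buf, size_t len)
{
	snprintf(buf, len, "GPIO%u%u", pin / PINS_PER_BANK,
	    pin % PINS_PER_BANK);
}
```

With this scheme flat index 7 maps to GPIO07 and flat index 8 maps to
GPIO10, which is the same as printing the index in octal with two digits.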
Discussed with: cem, gonzo
MFC after: 2 weeks
With more ports, some of the registers are shifted a bit to accommodate.
This switch also adds two high speed Serdes/SGMII interfaces (2.5 Gb/s).
Sponsored by: Rubicon Communications, LLC (Netgate)
Since cxgbe(4) uses sglist instead of bus_dma, this required updates
to the code that generates scatter/gather lists for packets. Also,
unmapped mbufs are always sent via DMA and never as immediate data in
the payload of a work request.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: np
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
address before returning it to the user. Some of the least significant
bits have special meaning and should be masked away.
Discussed with: kib@
MFC after: 3 days
Sponsored by: Mellanox Technologies
handle_ddp_close.
This eliminates a bad race where an aio_ddp_requeue that happened to run
after handle_ddp_close could bump up the active count.
Discussed with: jhb@
MFC after: 3 days
Sponsored by: Chelsio Communications
t_maxseg was changed in r293284 to not have any adjustment for TCP
timestamps. t4_tom inadvertently went back to pre-r293284 semantics
in r332506.
Sponsored by: Chelsio Communications
This fixes (userspace) console on the Marvell MACCHIATObin in ACPI mode with
latest TianoCore EDK2 firmware.
Submitted by: Greg V <greg@unrelenting.technology>
Reviewed by: mw, bcran
Differential Revision: https://reviews.freebsd.org/D20765
Previously, the aiotx task relied on the aio jobs in the queue to hold
a reference on the socket. However, when the last job is completed,
there is nothing left to hold a reference to the socket buffer lock
used to check if the queue is empty. In addition, if the last job on
the queue is cancelled, the task can run with no queued jobs holding a
reference to the socket buffer lock the task uses to notice the queue
is empty.
Fix these races by holding an explicit reference on the socket when
the task is queued and dropping that reference when the task
completes.
Reviewed by: np
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D20539
The format to use depends on hardware configuration (synthesis-time),
so make it compile-time kernel option.
Extended format allows DMA engine to operate with 64-bit memory addresses.
Sponsored by: DARPA, AFRL
"pin_list" allows to specify child pins as a list of pin numbers.
Existing hint "pins" serves the same purpose but with a 32-bit wide bit
mask. One problem with that is that a controller can have more than 32
pins. One example is amdgpio. Also, a list of numbers is a little bit
more human friendly than a matching bit mask. As a side note, it seems
that in FDT pins are typically specified by their numbers as well.
This commit also adds accessors for instance variables (IVARs) that
define the child pins. My primary goal is to allow a child to be
configured programmatically rather than via hints (assuming that FDT is
not supported on a platform). Also, while a child should not care about
specific pin numbers that are allocated to it, it could be interested in
how many were actually assigned to it.
While there, I removed "flags" instance variable. It was unused.
Reviewed by: mizhka
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20459
Use it to indicate whether the page may be safely freed following
its removal from the object. Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.
This is a step towards an implementation of an atomic reference counter
for each physical page structure.
Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20758
"fdt" is removed from the driver module name as the driver does not
require FDT and can work very well on hints based systems.
A module dependency is added for gpiobus. Without that owc cannot
resolve symbols in gpiobus if both are loaded as kernel modules.
Finally, a driver module version is added.
Reviewed by: imp
MFC after: 11 days
NANDFS has been broken for years. Remove it. The NAND drivers that
remain are for ancient parts that are no longer relevant. They are
polled, have terrible performance and just for ancient arm
hardware. NAND parts have evolved significantly from this early work
and little to none of it would be relevant should someone need to
update to support raw nand. This code has been off by default for
years and has violated the vnode protocol leading to panics since it
was committed.
Numerous posts to arch@ and other locations have found no actual users
for this software.
Relnotes: Yes
No Objection From: arch@
Differential Revision: https://reviews.freebsd.org/D20745
Since SES specs do not define mechanism to map enclosure slots to SATA
disks, the AHCI EM code I wrote many years ago appeared quite useless,
which always bugged me. I was thinking whether it was a good idea, but
if LSI HBAs do that, why shouldn't I?
This change introduces simple non-standard mechanism for the mapping
into both AHCI EM and SES code, that makes AHCI EM on capable controllers
(most of Intel's) a first-class SES citizen, allowing it to report disk
physical path to GEOM, show devices inserted into each enclosure slot in
`sesutil map` and `getencstat`, control locate and fault LEDs for specific
devices with `sesutil locate adaX on` and `sesutil fault adaX on`, etc.
I've successfully tested this on Supermicro X10DRH-i motherboard connected
with sideband cable of its S-SATA Mini-SAS connector to SAS815TQ backplane.
It can indicate with LEDs Locate, Fault and Rebuild/Remap SES statuses for
each disk identical to real SES of Supermicro SAS2 backplanes.
MFC after: 2 weeks
Until r349278, bhyve presented a seg_max to the guest that was too large.
Detect this case and clamp it to the virtqueue size. Otherwise, we would
fail the "too many segments to enqueue" assertion in virtqueue_enqueue().
I hit this by running a guest with a MAXPHYS of 256 KB.
Reviewed by: bryanv cem
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20703
Previously nandsim_chip_status returned EINVAL iff both of user-provided
chip->ctrl_num and chip->num were out of bounds. If only one failed the
bounds check arbitrary memory would be read and returned.
The NAND framework is not built by default, nandsim is not intended for
production use (it is a simulator), and the nandsim device has root-only
permissions.
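
The difference between the old and new bounds check can be sketched as
below. The function names and the MAX_CTRL / MAX_CHIP limits are
hypothetical and chosen only for illustration; the actual nandsim code and
limits are not reproduced here.

```c
#include <errno.h>

#define MAX_CTRL	4	/* hypothetical number of controllers */
#define MAX_CHIP	4	/* hypothetical number of chips per controller */

/* Old behaviour: rejects the request only if BOTH indices are out of range. */
static int
chip_index_check_buggy(unsigned ctrl_num, unsigned num)
{
	if (ctrl_num >= MAX_CTRL && num >= MAX_CHIP)
		return (EINVAL);
	return (0);
}

/* Fixed behaviour: rejects the request if EITHER index is out of range. */
static int
chip_index_check_fixed(unsigned ctrl_num, unsigned num)
{
	if (ctrl_num >= MAX_CTRL || num >= MAX_CHIP)
		return (EINVAL);
	return (0);
}
```

With the buggy form, a request such as (ctrl_num = 9, num = 0) passes the
check and would go on to index past the end of the controller array.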
admbugs: 827
Reported by: Daniel Hodson of elttam
MFC after: 3 days
Security: kernel information leak or DoS
Sponsored by: The FreeBSD Foundation
In r349154, random device reads of size < 16 bytes (AES block size) were
accidentally broken to loop forever. Correct the loop condition for small
reads.
Reported by: pho
Reviewed by: delphij
Approved by: secteam(delphij)
Differential Revision: https://reviews.freebsd.org/D20686
This adds the ACPI device path to devinfo(8) output and
shows the values of _UPC (USB port capabilities) and _PLD (physical location
of device) when hw.usb.debug >= 1.
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D20630
Add experimental feature to increase concurrency in Fortuna. As this
diverges slightly from canonical Fortuna, and due to the security
sensitivity of random(4), it is off by default. To enable it, set the
tunable kern.random.fortuna.concurrent_read="1". The rest of this commit
message describes the behavior when enabled.
Readers continue to update shared Fortuna state under global mutex, as they
do in the status quo implementation of the algorithm, but shift the actual
PRF generation out from under the global lock. This massively reduces the
CPU time readers spend holding the global lock, allowing for increased
concurrency on SMP systems and less bullying of the harvestq kthread.
It is somewhat of a deviation from FS&K. I think the primary difference is
that the specific sequence of AES keys will differ if READ_RANDOM_UIO is
accessed concurrently (as the 2nd thread to take the mutex will no longer
receive a key derived from rekeying the first thread). However, I believe
the goals of rekeying AES are maintained: trivially, we continue to rekey
every 1MB for the statistical property; and each consumer gets a
forward-secret, independent AES key for their PRF.
Since Chacha doesn't need to rekey for sequences of any length, this change
makes no difference to the sequence of Chacha keys and PRF generated when
Chacha is used in place of AES.
On a GENERIC 4-thread VM (so, INVARIANTS/WITNESS, numbers not necessarily
representative), 3x concurrent AES performance jumped from ~55 MiB/s per
thread to ~197 MB/s per thread. Concurrent Chacha20 at 3 threads went from
roughly ~113 MB/s per thread to ~430 MB/s per thread.
Prior to this change, the system was extremely unresponsive with 3-4
concurrent random readers; each thread had high variance in latency and
throughput, depending on who got lucky and won the lock. "rand_harvestq"
thread CPU use was high (double digits), seemingly due to spinning on the
global lock.
After the change, concurrent random readers and the system in general are
much more responsive, and rand_harvestq CPU use dropped to basically zero.
Tests are added to the devrandom suite to ensure the uint128_add64 primitive
utilized by unlocked read functions behaves to specification.
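
For readers unfamiliar with the primitive, a 128-bit counter incremented by
a 64-bit value can be sketched as below. This is not the kernel's
uint128_add64 implementation; struct u128 and u128_add64() are hypothetical
stand-ins that only show the carry handling such tests would exercise.

```c
#include <stdint.h>

/* A 128-bit unsigned value stored as two 64-bit halves. */
struct u128 {
	uint64_t lo;
	uint64_t hi;
};

/*
 * Add a 64-bit value to a 128-bit counter in place.  Unsigned overflow of
 * the low half wraps around, which is detected by the sum being smaller
 * than the addend; in that case a carry is propagated into the high half.
 */
static void
u128_add64(struct u128 *a, uint64_t b)
{
	a->lo += b;
	if (a->lo < b)
		a->hi++;
}
```

The interesting boundary cases are the low half wrapping (carry into the
high half) and both halves wrapping (the counter returns to a small value).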
Reviewed by: markm
Approved by: secteam(delphij)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20313
rename the source to gsb_crc32.c.
This is a prerequisite of unifying kernel zlib instances.
PR: 229763
Submitted by: Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision: https://reviews.freebsd.org/D20193
names. I.e., everything related to pwm now goes in /dev/pwm. This will
make it easier for userland tools to turn an unqualified name into a fully
qualified pathname, whether it's the base pwmcX.Y name or a label name.
At a basic level, remove assumptions about the underlying algorithm (such as
output block size and reseeding requirements) from the algorithm-independent
logic in randomdev.c. Chacha20 does not have many of the restrictions that
AES-ICM does as a PRF (Pseudo-Random Function), because it has a cipher
block size of 512 bits. The motivation is that by generalizing the API,
Chacha is not penalized by the limitations of AES.
In READ_RANDOM_UIO, first attempt to NOWAIT allocate a large enough buffer
for the entire user request, or the maximal input we'll accept between
signal checking, whichever is smaller. The idea is that the implementation
of any randomdev algorithm is then free to divide up large requests in
whatever fashion it sees fit.
As part of this, two responsibilities from the "algorithm-generic" randomdev
code are pushed down into the Fortuna ra_read implementation (and any other
future or out-of-tree ra_read implementations):
1. If an algorithm needs to rekey every N bytes, it is responsible for
handling that in ra_read(). (I.e., Fortuna's 1MB rekey interval for AES
block generation.)
2. If an algorithm uses a block cipher that doesn't tolerate partial-block
requests (again, e.g., AES), it is also responsible for handling that in
ra_read().
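
The two responsibilities listed above can be sketched as a loop that splits
a request at the rekey interval and rounds partial blocks up. This is a
simplified model under stated assumptions, not the Fortuna ra_read code:
sketch_read(), REKEY_INTERVAL and BLOCK_SIZE are hypothetical names, and no
actual key or cipher handling is shown.

```c
#include <stddef.h>

#define REKEY_INTERVAL	((size_t)1 << 20)	/* 1MB, as in the text */
#define BLOCK_SIZE	16			/* AES block size in bytes */

/*
 * Model the generation of nbytes of output.  Returns the number of rekeys
 * performed and stores the number of cipher blocks generated.
 */
static unsigned
sketch_read(size_t nbytes, size_t *blocks_generated)
{
	size_t since_rekey = 0, room, chunk;
	unsigned rekeys = 0;

	*blocks_generated = 0;
	while (nbytes > 0) {
		/* 1. Never generate past the rekey boundary in one go. */
		room = REKEY_INTERVAL - since_rekey;
		chunk = nbytes < room ? nbytes : room;

		/* 2. Round a partial block up; only chunk bytes are used. */
		*blocks_generated += (chunk + BLOCK_SIZE - 1) / BLOCK_SIZE;

		since_rekey += chunk;
		nbytes -= chunk;
		if (since_rekey == REKEY_INTERVAL) {
			rekeys++;
			since_rekey = 0;
		}
	}
	return (rekeys);
}
```

Keeping both concerns inside the algorithm's read routine is what lets the
generic layer hand it arbitrary byte counts.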
Several APIs are changed from u_int buffer length to the more canonical
size_t. Several APIs are changed from taking a blockcount to a bytecount,
to permit PRFs like Chacha20 to directly generate quantities of output that
are not multiples of RANDOM_BLOCKSIZE (AES block size).
The Fortuna algorithm is changed to NOT rekey every 1MiB when in Chacha20
mode (kern.random.use_chacha20_cipher="1"). This is explicitly supported by
the math in FS&K §9.4 (Ferguson, Schneier, and Kohno; "Cryptography
Engineering"), as well as by their conclusion: "If we had a block cipher
with a 256-bit [or greater] block size, then the collisions would not
have been an issue at all."
For now, continue to break up reads into PAGE_SIZE chunks, as they were
before. So, no functional change, mostly.
Reviewed by: markm
Approved by: secteam(delphij)
Differential Revision: https://reviews.freebsd.org/D20312
Add some basic regression tests to verify behavior of both uint128
implementations at typical boundary conditions, to run on all architectures.
Test uint128 increment behavior of Chacha in keystream mode, as used by
'kern.random.use_chacha20_cipher=1' (r344913) to verify assumptions at edge
cases. These assumptions are critical to the safety of using Chacha as a
PRF in Fortuna (as implemented).
(Chacha's use in arc4random is safe regardless of these tests, as it is
limited to far less than 4 billion blocks of output in that API.)
Reviewed by: markm
Approved by: secteam(gordon)
Differential Revision: https://reviews.freebsd.org/D20392
Previously, there was a pwmc instance for each instance of pwm hardware
regardless of how many pwm channels that hardware supported. Now there
will be a pwmc instance for each channel when the hardware supports
multiple channels. With a separate instance for each channel, we can have
"named channels" in userland by making devfs alias entries in /dev/pwm.
These changes add support for ivars to pwmbus, and use an ivar to track the
channel number for each child. It also adds support for hinted children.
In pwmc, the driver checks for a label hint, and if present, it's used to
create an alias for the cdev in /dev/pwm. It's not anticipated that hints
will be heavily used, but it's easy to do and allows quick ad-hoc creation
of named channels from userland by using kenv to create hint.pwmc.N.label=
hints. Upcoming changes will add FDT support, and most labels will
probably be specified that way.
is nothing left in the file that relates to pwmbus at all. It just contains
prototypes for the functions implemented in dev/pwm/ofw_pwm.c, so name it
accordingly and fix the include protect wrappers to match.
A new pwmbus.h will be coming along in a future commit.
The pwm and pwmbus interfaces were nearly identical, this merges them into a
single pwmbus interface. The pwmbus driver now implements the pwmbus
interface by simply passing all calls through to its parent (the hardware
driver). The channel_count method moves from pwm to pwmbus, and the
get_bus method is deleted (just no longer needed).
The net effect is that the interface for doing pwm stuff is now the same
regardless of whether you're a child of pwmbus, or some random driver
elsewhere in the hierarchy that is bypassing the pwmbus layer and is talking
directly to the hardware driver via cross-hierarchy connections established
using fdt data.
The pwmc driver is now a child of pwmbus, instead of being its sibling
(that's why the get_bus method is no longer needed; pwmc now gets the
device_t of the bus using device_get_parent()).
ioctl definitions and related datatypes that allow userland control of pwm
hardware via the pwmc device. The new name and location better reflects its
association with a single device driver.
part of ciss_detach. It's a left-over debug that isn't needed and also
discloses a kernel address. Only root could provoke as part of a
devctl or kldunload.
Submitted by: Fuqian Huang
MFC After: 1 week
asserted. Some development boards for example will reset on DTR,
and some radio interfaces will transmit on RTS.
This patch allows "stty -f /dev/ttyu9.init -rtsdtr" to prevent
RTS and DTR from being asserted on open(), allowing these devices
to be used without problems.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D20031
between each byte either sent or received). However, most transitions
actually complete in 2-3 microseconds.
By polling the status register with a delay of 4us with exponential
backoff, the performance of most IPMI operations is significantly
improved:
- A BMC update on a Supermicro x9 or x11 motherboard goes from ~1 hour
to ~6-8 minutes.
- An ipmitool sensor list time improves by a factor of 4.
Testing showed no significant improvements on a modern server by using
a lower delay.
The changes should also generally reduce the total amount of CPU or
I/O bandwidth used for a given IPMI operation.
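
The polling strategy described above can be modelled with a small
simulation. This is a sketch only: struct sim_dev, poll_backoff() and the
POLL_MAX_US cap are hypothetical, and the simulated clock stands in for a
real status register read and a real microsecond delay.

```c
#include <stdbool.h>

#define POLL_START_US	4	/* initial delay, as in the text */
#define POLL_MAX_US	1024	/* hypothetical cap on the backoff */

/* Simulated device: becomes ready once ready_at_us microseconds elapse. */
struct sim_dev {
	unsigned now_us;
	unsigned ready_at_us;
};

static bool
sim_ready(const struct sim_dev *d)
{
	return (d->now_us >= d->ready_at_us);
}

static void
sim_delay(struct sim_dev *d, unsigned us)
{
	d->now_us += us;
}

/*
 * Poll until the device is ready, doubling the delay after each miss.
 * Returns the number of status polls performed, or -1 on timeout.
 */
static int
poll_backoff(struct sim_dev *d, unsigned timeout_us)
{
	unsigned step = POLL_START_US;
	int polls = 1;

	while (!sim_ready(d)) {
		if (d->now_us >= timeout_us)
			return (-1);
		sim_delay(d, step);
		if (step < POLL_MAX_US)
			step *= 2;
		polls++;
	}
	return (polls);
}
```

A transition that completes in 2-3 microseconds is seen on the second poll,
while a slow transition costs only a handful of polls rather than a long
fixed sleep per byte.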
Submitted by: Loic Prylli <lprylli@netflix.com>
Reviewed by: jhb
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20527
detection pins to the Marvell Xenon SDHCI controller.
These features are enabled by the 'vqmmc-supply' and 'cd-gpios' properties in the
DTS.
This fixes the SD Card detection on espressobin.
Sponsored by: Rubicon Communications, LLC (Netgate)
Enable synaptics and elantech touchpads, as well as IBM/Lenovo TrackPoints
by default, instead of having users find and toggle a loader tunable.
This makes things like two finger scroll and other modern features work out
of the box with X. By enabling these settings by default, we get a better
desktop experience in X, since xserver and evdev can make use of the more
advanced synaptics and elantech features.
Reviewed by: imp, wulf, 0mp
Approved by: imp
Sponsored by: B3 Init (zeising)
Differential Revision: https://reviews.freebsd.org/D20507
Add strict checks for unused bit states in Elantech trackpoint packet
parser to filter out spurious events produced by some hardware which
are detected as trackpoint packets. See comment on r328191 for example.
Tested by: Andrey Kosachenko <andrey.kosachenko@gmail.com>
Sign bits for X and Y motion data were taken from wrong places.
PR: 238291
Reported by: Andrey Kosachenko <andrey.kosachenko@gmail.com>
Tested by: Andrey Kosachenko <andrey.kosachenko@gmail.com>
MFC after: 2 weeks
Add a CAM-Newbus SDIO support module. This work provides a newbus
infrastructure for device drivers wanting to use SDIO. On the lower end
while it is connected by newbus to SDHCI, it talks CAM using the MMCCAM
framework to get to it.
This also duplicates the usbdevs framework to equally create sdiodev
header files with #defines for "vendors" and "products".
Submitted by: kibab (initial work, see https://reviews.freebsd.org/D12467)
Reviewed by: kibab, imp (comments on earlier version)
MFC after: 6 weeks
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19749
Currently slot_printf() uses two printf() calls to print the
device-slot name, and actual message. When other printf()s are
ongoing in parallel this can lead to interleaved message on the console,
which is especially unhelpful for debugging or error messages.
Take a hit on the stack and vsnprintf() the message to the buffer.
This way it can be printed along with the device-slot name in one go
avoiding console gibberish.
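
The approach can be sketched in userland C as below. This is an
illustration, not the driver function: slot_printf_sketch() is a
hypothetical name, the 128-byte buffer size is an assumption, and the
"name-slotN: " prefix format is only assumed to resemble the real one.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Format the caller's message into a stack buffer first, then emit the
 * prefix and the message with a single output call so that concurrent
 * writers cannot interleave text between the two parts.
 */
static int
slot_printf_sketch(FILE *out, const char *dev, int slot,
    const char *fmt, ...)
{
	char msg[128];	/* the "hit on the stack"; long messages truncate */
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(msg, sizeof(msg), fmt, ap);
	va_end(ap);

	return (fprintf(out, "%s-slot%d: %s", dev, slot, msg));
}
```

The trade-off is a fixed upper bound on message length in exchange for one
atomic-looking line per call.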
Reviewed by: marius
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19747
Add cam_sim_alloc_dev() as a wrapper to cam_sim_alloc() which takes
a device_t instead of the unit_number (which we can derive from the
dev again).
Add device_t sim_dev to struct cam_sim. It will be used to pass through
the bus for cases when both sides of CAM speak newbus already and we want
to link them (yet make the calls through CAM for now).
SDIO will be the first consumer of this. For that make use of
cam_sim_alloc_dev() in sdhci under MMCCAM.
This will also allow people to start iterating more on the idea
to newbus-ify CAM without changing 50+ device drivers from the start.
Also to be clear there are callers to cam_sim_alloc() which do not
have a device_t (e.g., XPT) or provide their own unit number so we cannot
simply switch the KPI entirely.
Submitted by: kibab (original idea, see https://reviews.freebsd.org/D12467)
Reviewed by: imp, chuck
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19746
Differentiate between PCI Express Endpoint devices and Root Complex
Integrated Endpoints in the nda driver. The Link Status and Capability
registers are not valid for Integrated Endpoints and should not be
displayed. The bhyve emulated NVMe device will advertise as being an
Integrated Endpoint.
Reviewed by: imp
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D20282
These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone. But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place. No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20470
This fixes a panic in Espressobin when gpioregulator fails to allocate the
GPIO pin (the GPIO controller is not there).
Sponsored by: Rubicon Communications, LLC (Netgate)
Provide the acpi handle path as the location string for the nvdimm
children of the nvdimm_root device.
Reviewed by: kib
Approved by: jhb (mentor)
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D20528
Prior to this commit, if PCIEM_SLOT_STA_ABP and PCIEM_SLOT_STA_PDC are
asserted simultaneously, FreeBSD sets a 5 second "hardware going away" timer
and then processes the "presence detect" change. In the (physically
challenging) case that someone presses the "attention button" and inserts
a new PCIe device at exactly the same moment, this results in FreeBSD
recognizing that the device is present, attaching it, and then detaching it
5 seconds later.
On EC2 "bare metal" hardware this is the precise sequence of events which
takes place when a new EBS volume is attached; virtual machines have no
difficulty effecting physically implausible simultaneity.
This patch changes the handling of PCIEM_SLOT_STA_ABP to only detach a
device if the presence of a device was detected *before* the interrupt
which reports the Attention Button push.
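
The decision described above can be reduced to a small predicate. This is a
sketch under assumptions, not the code from the patch: the function name is
hypothetical, the SLOT_STA_* values are illustrative stand-ins for the
PCIEM_SLOT_STA_* constants, and "present_before_intr" stands for whatever
state the driver recorded before the interrupt.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative Slot Status bits; stand-ins for PCIEM_SLOT_STA_*. */
#define SLOT_STA_ABP	0x0001u	/* Attention Button Pressed */
#define SLOT_STA_PDC	0x0008u	/* Presence Detect Changed */

/*
 * Decide whether an Attention Button push should start the "hardware
 * going away" timer.  A push only counts as a detach request if a device
 * was already present before this interrupt; a push that arrives together
 * with a fresh insertion is not treated as a request to remove it.
 */
static bool
abp_should_schedule_detach(uint16_t slot_sta, bool present_before_intr)
{
	if ((slot_sta & SLOT_STA_ABP) == 0)
		return (false);
	return (present_before_intr);
}
```

In the EC2 case both bits arrive in one interrupt with no device previously
present, so no detach is scheduled and the new device stays attached.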
Reported by: Matt Wilson
Reviewed by: jhb
MFC after: 1 week
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D20499
if a USB transfer is cancelled that we need to fake a completion event.
Implement missing support in ugen_fs_copy_out() to handle this.
This fixes issues with webcamd(8) and firefox.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Register MODULE_PNP_INFO for virtio devices using the newbus PNP information
provided by the previous commit. Matching can be quite simple; existing
probe routines only matched on bus (implicit) and device_type. The same
matching criteria are retained exactly, but are now also available to
devmatch(8).
Reviewed by: bryanv, markj; imp (earlier version)
Differential Revision: https://reviews.freebsd.org/D20407
Expose the same fields and widths from both virtio buses, even though they
don't quite line up; several virtio drivers can attach to both buses,
and sharing a PNP info table for both seems more convenient.
In practice, I doubt any virtio driver really needs to match on anything
other than bus and device_type (eliminating the unused entries for
vtmmio), and also in practice device_type is << 2^16 (so far, values
range from 1 to 20). So it might be fine to only expose a 16-bit
device_type for PNP purposes. On the other hand, I don't see much harm
in overkill here.
Reviewed by: bryanv, markj (earlier version)
Differential Revision: https://reviews.freebsd.org/D20406
random(4) masks unregistered entropy sources. Prior to this revision,
virtio_random(4) did not correctly register a random_source and did not
function as a source of entropy.
Random source registration for loadable pure sources requires registering a
poll callback, which is invoked periodically by random(4)'s harvestq
kthread. The periodic poll makes virtio_random(4)'s periodic entropy
collection redundant, so this revision removes the callout.
The current random source API is somewhat limiting, so simply fail to attach
any virtio_random devices if one is already registered as a source. This
scenario is expected to be uncommon.
While here, handle the possibility of short reads from the hypervisor random
device gracefully / correctly. It is not clear why a hypervisor would
return a short read or if it is allowed by spec, but we may as well handle
it.
Reviewed by: bryanv (earlier version), markm
Security: yes (note: many other "pure" random sources remain broken)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20419
This change adds natural scrolling for trackpads when two finger scroll is
enabled (mouse and trackpoint are not affected).
Depending on the trackpad model, it can be activated by setting the
hw.psm.synaptics.natural_scroll or hw.psm.elantech.natural_scroll sysctl
to 1.
The evdev protocol is not affected by this change either. Tune the userland
client, e.g. libinput, to enable natural scrolling in that case.
Submitted by: nyan_myuji.xyz
Reviewed by: wulf
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20447
completions are not in a consistent state. Cope with the different
places the normal I/O completion polling thread can be interrupted and
then re-entered during a kernel panic + dump.
Reviewed by: jhb and markj (both prior versions)
Differential Revision: https://reviews.freebsd.org/D20478
When starting a command also print the opcode and flags.
More consistently print flags as hex.
Use slot_printf rather than printf in one case.
MFC after: 6 weeks
Reviewed by: marius, kibab, imp
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19748
receive sockbuf's high water mark.
Calculate rx credits on the spot instead of tracking sbused/sb_cc and
rx_credits in the toepcb. The previous method worked when the high
water mark changed due to SB_AUTOSIZE but not when it was adjusted
directly (for example, by the soreserve in nfsrvd_addsock).
This fixes a connection hang while running iozone over an NFS mounted
share where nfsd's TCP sockets are being handled by t4_tom.
MFC after: 3 days
Sponsored by: Chelsio Communications
I introduced an obvious compiler error in r346282, so this change fixes
that.
Unfortunately, RANDOM_LOADABLE isn't covered by our existing tinderbox, and
it seems like there were existing latent linking problems. I believe these
were introduced on accident in r338324 during reduction of the boolean
expression(s) adjacent to randomdev.c and hash.c. It seems the
RANDOM_LOADABLE build breakage has gone unnoticed for nine months.
This change correctly annotates randomdev.c and hash.c with !random_loadable
to match the pre-r338324 logic; and additionally updates the HWRNG drivers
in MD 'files.*', which depend on random_device symbols, with
!random_loadable (it is invalid for the kernel to depend on symbols from a
module).
(The expression for both randomdev.c and hash.c was the same, prior to
r338324: "optional random random_yarrow | random !random_yarrow
!random_loadable". I.e., "random && (yarrow || !loadable)." When Yarrow
was removed ("yarrow := False"), the expression was incorrectly reduced to
"optional random" when it should have retained "random && !loadable".)
Additionally, I discovered that virtio_random was missing a MODULE_DEPEND on
random_device, which breaks kld load/link of the driver on RANDOM_LOADABLE
kernels. Address that issue as well.
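
The reduction of the files expression discussed in the parenthetical above
can be checked mechanically. The sketch below models the three config
options as booleans; the function names are hypothetical and the code only
mirrors the logic quoted in the text, not the config(8) machinery.

```c
#include <stdbool.h>

/* "optional random random_yarrow | random !random_yarrow !random_loadable" */
static bool
build_before(bool random, bool yarrow, bool loadable)
{
	return ((random && yarrow) || (random && !yarrow && !loadable));
}

/* The incorrect reduction: "optional random". */
static bool
build_reduced_wrong(bool random)
{
	return (random);
}

/* The intended reduction with yarrow fixed to false. */
static bool
build_reduced_right(bool random, bool loadable)
{
	return (random && !loadable);
}
```

With yarrow set to false, build_before() and build_reduced_right() agree on
every input, while build_reduced_wrong() differs exactly when random and
loadable are both set, which is the RANDOM_LOADABLE breakage.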
PR: 238223
Reported by: Eir Nym <eirnym AT gmail.com>
Reviewed by: delphij, markm
Approved by: secteam(delphij)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20466
misleading indentation. This is found by gcc -Wmisleading-indentation
Approved by: erj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20428
it hasn't been initialized.
This fixes a bug in r346570 that could cause a panic when servicing
TCP_INFO for offloaded connections.
MFC after: 3 days
Sponsored by: Chelsio Communications
ENAv2 introduces many new features, bug fixes and improvements.
Main new features are LLQ (Low Latency Queues) and independent queues
reconfiguration using sysctl commands.
The year in copyright notice was updated to 2019.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
To make debugging easier, the reset is triggered and the reset reason is
set only the first time a reset is requested. This ensures that the
first reset reason is not overwritten by later ones.
Also, trigger a reset when an invalid Tx request ID is encountered.
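The "first reason wins" rule can be sketched as follows. This is a
portable illustration using C11 atomics, not the driver's code; the
structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reset-related part of the adapter state. */
struct reset_state_sketch {
	atomic_bool pending;	/* a reset has already been requested */
	int reason;		/* reason recorded by the first requester */
};

void
reset_state_init(struct reset_state_sketch *rs)
{
	atomic_init(&rs->pending, false);
	rs->reason = 0;
}

/*
 * Request a reset. Only the caller that flips 'pending' from false to
 * true records its reason; later callers leave the reason untouched.
 * Returns true if this call was the one that triggered the reset.
 */
bool
trigger_reset_once(struct reset_state_sketch *rs, int reason)
{
	bool expected = false;

	if (!atomic_compare_exchange_strong(&rs->pending, &expected, true))
		return (false);
	rs->reason = reason;
	return (true);
}
```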
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
If the call to ena_up() in ena_restore_device() fails, the next use of
`ifconfig up` will cause a NULL pointer dereference.
This patch adds checks to prevent that.
Submitted by: Rafal Kozik <rk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Some messages were missing a trailing newline and traces did not behave
consistently. To fix that, every trace and printout now ends its string
with a newline character, which improves readability.
Submitted by: Rafal Kozik <rk@semihalf.com>
Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
If the packet headers are split across multiple segments of the mbuf
chain, the previous version of ena_tx_csum, which assumed that all
headers lie in the first mbuf, would fail to map the header properties
to the meta descriptor.
That would break Tx checksum offload and led to memory corruption. It
could even crash the system.
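A general way to read headers that may span several segments is to walk
the chain instead of dereferencing the first buffer. The sketch below
uses a simplified, hypothetical segment type rather than struct mbuf and
is not the code from this change; in the kernel, m_copydata(9) provides
this kind of copy for real mbuf chains.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical stand-in for one buffer in an mbuf chain. */
struct seg {
	const unsigned char *data;
	size_t len;
	const struct seg *next;
};

/*
 * Copy up to 'len' bytes starting at byte offset 'off' of the chain
 * into 'dst'. Returns the number of bytes actually copied, which is
 * smaller than 'len' if the chain ends early.
 */
size_t
chain_copy(const struct seg *s, size_t off, size_t len, unsigned char *dst)
{
	size_t copied = 0;

	/* Skip whole segments that lie entirely before the offset. */
	while (s != NULL && off >= s->len) {
		off -= s->len;
		s = s->next;
	}
	/* Copy from each remaining segment until 'len' bytes are gathered. */
	while (s != NULL && copied < len) {
		size_t n = s->len - off;

		if (n > len - copied)
			n = len - copied;
		memcpy(dst + copied, s->data + off, n);
		copied += n;
		off = 0;
		s = s->next;
	}
	return (copied);
}
```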
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
To align with the Linux driver and to handle ena_detach() better, the
reset now calls ena_device_restore() and ena_device_destroy().
ena_device_destroy() is also called from ena_detach(), which makes the
code more readable.
The watchdog is now activated after a reset only if it was active
before.
Additional checks were added to ensure that there is no race with the
link state change AENQ handler.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
As the ENA can have multiple states turned on/off, it is more convenient
to store them in a single bitfield instead of multiple boolean variables.
The FreeBSD bitset API was used for the bitfield implementation, as it
provides a flexible structure together with an API that also supports
atomic bitfield operations.
For better readability, the basic macros from the API were wrapped in
custom ENA_FLAG_* macros, which fill in the parameters common to all
calls.
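The wrapper idea can be sketched as below. Note that this illustration
swaps in C11 atomics on a single word instead of the FreeBSD bitset API,
so that it compiles outside the kernel; the SK_* names, the flag list,
and the structure are all hypothetical and do not reproduce the driver's
ENA_FLAG_* definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state bits; the real driver defines its own set. */
enum sketch_flag {
	SK_FLAG_DEVICE_RUNNING,
	SK_FLAG_LINK_UP,
	SK_FLAG_TRIGGER_RESET
};

struct sketch_adapter {
	atomic_uint flags;	/* one bit per enum sketch_flag value */
};

/* Wrappers that hide the parameters common to every call. */
#define SK_FLAG_SET(a, bit)						\
	((void)atomic_fetch_or(&(a)->flags, 1u << (bit)))
#define SK_FLAG_CLEAR(a, bit)						\
	((void)atomic_fetch_and(&(a)->flags, ~(1u << (bit))))
#define SK_FLAG_ISSET(a, bit)						\
	((atomic_load(&(a)->flags) & (1u << (bit))) != 0)

void
sketch_adapter_init(struct sketch_adapter *a)
{
	atomic_init(&a->flags, 0u);
}
```

Call sites then read as SK_FLAG_SET(adapter, SK_FLAG_LINK_UP) rather
than repeating the storage details at every use.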
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Before the patch, error handling did not release all resources and did
not issue a device reset if the reset task failed.
That could cause a memory leak and a device fault.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
The host info bdf field is an abbreviation for the bus, device, and
function of the PCI device to which the driver is attached.
The driver now fills in this information using the FreeBSD RID
resource.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
The new ENA HAL introduces an API that can determine on the Tx path
whether a doorbell is needed.
That way, it can tell the driver that it should ring the doorbell.
The old threshold value wasn't removed, as not all HW supports this
feature, so it was reworked to also work with the new API.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
The Rx ring size can be as high as 8k. Because of that, we want to cap
the cleanup threshold at a maximum value of 256.
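A minimal sketch of such a cap follows. Only the limit of 256 comes from
the text above; the function name and the assumption that the uncapped
threshold is the ring size divided by 8 are illustrative guesses, not
taken from the driver.

```c
#include <assert.h>

#define SKETCH_THRESH_DIVIDER	8	/* assumption, not from the source */
#define SKETCH_THRESH_MAX	256	/* cap stated in the commit message */

/* Scale the threshold with the ring size, but never exceed the cap. */
unsigned int
rx_cleanup_threshold(unsigned int ring_size)
{
	unsigned int thresh = ring_size / SKETCH_THRESH_DIVIDER;

	return (thresh < SKETCH_THRESH_MAX ? thresh : SKETCH_THRESH_MAX);
}
```

Under these assumptions an 8k ring hits the cap, while a 1k ring stays
below it.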
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
LLQ (Low Latency Queue) is a feature that allows pushing the header
directly to the device through PCI before DMA is even triggered.
It reduces latency because the device can start preparing the packet
before the payload is sent through DMA.
To speed up sending data through PCI, Write Combining is enabled, which
allows hardware to buffer data before sending it over PCI and so reduces
the number of PCI I/O operations.
ENAv2 uses a special descriptor for LLQ negotiation.
Currently, only the default configuration is supported.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Handle IO interrupts using a filter routine. That way, the main cleanup
task could be moved to a separate thread using taskqueue.
The deferred Rx cleanup task was removed, and now the cleanup task is
being called instead. That way, the Rx lock could be removed.
In addition, queue management (waking up and stopping the TX ring) was
added, so the TX cleanup task can be performed mostly lockless.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
The driver now supports per-adapter tuning of the buffer ring size and
the HW Rx ring size.
This is done through the sysctl node dev.ena.X.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
If the requested ID was out of range, the tx_info structure was NULL and
the function was trying to access the field of the NULL object.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
The if_detach was causing a crash if the MSI-x configuration in attach
failed. To prevent this, the ifnet is now configured at the end of the
attach function.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
In the rare case when ifconfig is called just before kldunload, it is
possible that the ena_up routine will be called after the queue locks
are released.
To prevent that, the ifp is detached before the last ena_down is called,
and the ifp is freed at the end of the function.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
The ENA driver needs at least 2 MSI-x vectors: one for the admin queue
and one for an IO queue pair. If there are not enough resources to
allocate more than one MSI-x vector, the device should not be attached.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
bus_alloc_resource_any() does not return an error value on failure.
When the call failed, no error value was propagated to the ena_up()
function.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
To prevent errors from assigning values from the DMA structure in case
of an error, zero the vaddr and paddr values upon failure.
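The pattern can be sketched in userland as follows. The structure, the
function, and the simulate_failure knob are hypothetical, and malloc()
stands in for the real DMA allocation; the point is only that a failed
call leaves no stale addresses behind for a caller to pick up.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a driver's DMA memory descriptor. */
struct dma_mem_sketch {
	void *vaddr;		/* kernel virtual address of the buffer */
	uint64_t paddr;		/* bus/physical address of the buffer */
};

/*
 * Allocate a buffer and fill in 'mem'. On failure both addresses are
 * zeroed, so a caller that copies them without checking the return
 * value sees NULL/0 instead of leftovers from an earlier allocation.
 */
int
dma_alloc_sketch(struct dma_mem_sketch *mem, size_t size, int simulate_failure)
{
	void *p = simulate_failure ? NULL : malloc(size);

	if (p == NULL) {
		mem->vaddr = NULL;
		mem->paddr = 0;
		return (ENOMEM);
	}
	mem->vaddr = p;
	/* Stand-in only: a real driver gets this from the DMA map load. */
	mem->paddr = (uint64_t)(uintptr_t)p;
	return (0);
}
```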
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
DMA in FreeBSD requires explicit synchronization. The ENA driver was
only doing PREREAD and PREWRITE synchronizations. The missing
bus_dmamap_sync() calls were added.
It is also required to synchronize the DMA engine before unloading a
DMA map.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
If the first MSI-x interrupt is never executed, the timer service will
detect that and trigger a device reset.
The check for missing Tx completions was reworked so that it also
checks for missing interrupts. The number of missing Tx completions can
be checked after the loop instead of on every iteration.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
The new ena_com allows the number of CPUs to be passed to the device in
the host info structure as a hint.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Information about a Tx error should be displayed only if packet
preparation failed due to an error other than out of memory.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Whenever the driver receives too many descriptors from the device, it
should trigger a device reset, as this indicates that the device is in
an invalid state.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Receive Side Scaling is an optional feature that can be enabled in the
kernel configuration by defining the RSS option.
The kernel uses a hash to store and find protocol control blocks, which
are kept in hash tables.
The kernel and NIC hash functions must be consistent; otherwise,
lookups fail.
To achieve this, the kernel provides an API to set the proper hash key
on the NIC.
As it is not possible to change the key for the virtual ENA NIC, this
driver cannot support the RSS function.
ENA is designed to work in virtual environments, so supporting a
hardware version of this card is unnecessary.
Submitted by: Rafal Kozik <rk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
The notification AENQ handler is responsible for handling requests from
the ENA device. The missing Tx threshold, Tx timeout, and keep alive
timeout can be set using hints from the aenq descriptor, which can be
delivered in the ENA admin notification.
The queue suspending and resuming tasks are not supported by the
driver.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
ENA_ADMIN_FATAL_ERROR and ENA_ADMIN_WARNING aenq groups were indicated
as supported, so unimplemented_aenq_handler() will print an error
message whenever an error occurs within the ENA admin context.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
As the ENA works only in a virtualized environment, the active media
is not specified. Instead, the active link type is set as unknown.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
A recent HAL change in preparation for ENAv2 support required minor
driver modifications.
ena_com_sq_empty_space() is not available in this ena-com, so it had to
be replaced with ena_com_free_desc().
Moreover, ena_com_admin_init() no longer takes a 3rd argument
indicating whether the spin lock should be initialized, so it was
removed.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
the code is not guarded by the if clause and has misleading indentation.
Approved by: scottl
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20427
Since drm2 removal, there has not been any consumer of the feature in the
tree. I am also unaware of any out-of-tree consumer.
More importantly, the feature has been broken from the very start, both
before and after r306589, because the ivar was set on a device that does
not support it and it was read from another device that also does not
support it.
A bus-wide no-stop flag cannot be implemented as an ivar as iicbus
attaches as a child of various drivers. Implementing the ivar in each
and every I2C driver is just impractical.
If we ever want to implement this feature properly, then probably the
easiest way to do it would be via a flag in the softc of iicbus.
In fact, we might have to do that in the stable branches if we want to
fix the code for them.
Reported by: ian (long time ago)
MFC after: 1 month (maybe)
X-MFC-note: cannot just merge the change, must keep drm2 happy
As far as I can see, different NICs in different configurations may have
different numbers of TX and RX queues. The code was assuming a 1:1
mapping between event queues (interrupts) and TX/RX queues. Since the
number of interrupts is set to the maximum of the TX and RX queue
counts, when those two are different, the system is doomed.
I have no documentation or deep knowledge about this hardware, so this
change is based on general observations and code reading. If some of my
guesses are wrong, please do better. I just confirmed HP NC550SFP NICs
are working now.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Prior to this revision, vtpci's BUS_READ_IVAR method on VIRTIO_IVAR_SUBVENDOR
accidentally returned the PCI subdevice.
The typo seems to have been introduced with the original commit adding
VIRTIO_IVAR_{{SUB,}DEVICE,{SUB,}VENDOR} to virtio_pci. The commit log and code
strongly suggest that the ivar was intended to return the subvendor rather than
the subdevice; it was likely just a copy/paste mistake.
Go ahead and rectify that.
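The shape of the fix can be sketched as below. The enum, structure, and
function are simplified, hypothetical stand-ins rather than the real
vtpci read_ivar method; the point is that each case must return the
field matching its ivar, and the subvendor case previously returned the
subdevice.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ivar selectors mirroring the four IDs discussed above. */
enum sketch_ivar {
	SK_IVAR_DEVICE,
	SK_IVAR_SUBDEVICE,
	SK_IVAR_VENDOR,
	SK_IVAR_SUBVENDOR
};

/* Hypothetical cache of the PCI identification registers. */
struct sketch_pci_ids {
	uint16_t vendor;
	uint16_t device;
	uint16_t subvendor;
	uint16_t subdevice;
};

/* Return 0 and store the requested ID, or -1 for an unknown ivar. */
int
sketch_read_ivar(const struct sketch_pci_ids *ids, enum sketch_ivar which,
    uintptr_t *result)
{
	switch (which) {
	case SK_IVAR_DEVICE:
		*result = ids->device;
		break;
	case SK_IVAR_SUBDEVICE:
		*result = ids->subdevice;
		break;
	case SK_IVAR_VENDOR:
		*result = ids->vendor;
		break;
	case SK_IVAR_SUBVENDOR:
		/* The copy/paste bug returned ids->subdevice here. */
		*result = ids->subvendor;
		break;
	default:
		return (-1);
	}
	return (0);
}
```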
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code allocating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
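The send tag life cycle described above (init takes the ifp reference,
the last release invokes the free routine and then drops the ifp
reference) can be sketched in userland as follows. All types and
functions here are hypothetical simplifications using C11 atomics; they
do not reproduce the signatures of m_snd_tag_init(), m_snd_tag_ref(), or
m_snd_tag_rele().

```c
#include <assert.h>
#include <stdatomic.h>

struct sketch_ifnet {
	atomic_uint refs;		/* references held on the interface */
};

struct sketch_snd_tag {
	struct sketch_ifnet *ifp;
	atomic_uint refcount;
	void (*free_cb)(struct sketch_snd_tag *);	/* driver free hook */
};

/* Counts free_cb invocations so the sketch can be exercised. */
int sketch_free_calls;

void
sketch_count_free(struct sketch_snd_tag *tag)
{
	(void)tag;
	sketch_free_calls++;
}

/* Initialize a tag with one reference and take a reference on the ifp. */
void
sketch_tag_init(struct sketch_snd_tag *tag, struct sketch_ifnet *ifp,
    void (*free_cb)(struct sketch_snd_tag *))
{
	atomic_fetch_add(&ifp->refs, 1);
	tag->ifp = ifp;
	tag->free_cb = free_cb;
	atomic_init(&tag->refcount, 1);
}

void
sketch_tag_ref(struct sketch_snd_tag *tag)
{
	atomic_fetch_add(&tag->refcount, 1);
}

/* Drop a reference; the last one frees the tag, then releases the ifp. */
void
sketch_tag_rele(struct sketch_snd_tag *tag)
{
	if (atomic_fetch_sub(&tag->refcount, 1) == 1) {
		struct sketch_ifnet *ifp = tag->ifp;

		tag->free_cb(tag);
		atomic_fetch_sub(&ifp->refs, 1);
	}
}
```

With this shape, an mbuf that still holds a reference keeps the tag (and
through it the ifp) alive after the socket side has dropped its own
reference, which is how the use after free races are avoided.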
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
The point of r345008 was to reset the Command Reference Number (CRN)
in some situations where a device stayed in the topology, but had
changed somehow.
This can include moving from a switch connection to a direct
connection or vice versa, or a device that temporarily goes away
and comes back. (e.g. moving to a different switch port)
There were a couple of bugs in that change:
- We were reporting that a device had not changed whenever the
Establish Image Pair bit was not set. That is not quite correct.
Instead, if the Establish Image Pair bit stays the same (set or
not), the device hasn't changed in that way.
- We weren't setting PRLI Word0 in the port database when a new
device arrived, so comparisons with the old value for the
Establish Image Pair bit weren't really possible. So, make sure
PRLI Word0 is set in the port database for new devices.
- We were resetting the CRN whenever the Establish Image Pair bit
was set for a device, even when the device had stayed the same
and the value of the bit hadn't changed. Now, only reset the
  CRN for devices that have changed, not devices that stayed the
same.
The result of all of this was that if we had a single FC device on
an FC port and it went away and came back, we would wind up
correctly resetting the CRN.
But, if we had multiple devices connected via a switch, and there
was any change in one or more of those devices, all of the devices
that stayed the same would also have their CRN values reset.
The result, from a user standpoint, is that the tape drives, etc.
would all start to time out commands and the initiator would send
aborts.
sys/dev/isp/isp.c:
In isp_pdb_add_update(), look at whether the Establish
Image Pair bit has changed as part of the check to
determine whether a device is still the same. This was
causing erroneous change notifications. Also, when
creating a new port database entry, initialize the
PRLI Word 0 values.
sys/dev/isp/isp_freebsd.c:
In isp_async(), in the changed/stayed case, instead of
looking at the Establish Image Pair bit to determine
whether to reset the CRN, look at the command value.
(Changed vs. Stayed.) Only reset the CRN for devices
that have changed.
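The two corrected decisions can be sketched as below. This is an
illustration, not the isp(4) code: the mask value and all names are
placeholders, and the real sameness check in isp_pdb_add_update() looks
at more than this one bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder mask for the Establish Image Pair bit in PRLI Word 0. */
#define SKETCH_PRLI_WD0_EST_IMAGE_PAIR	0x2000u

/* Hypothetical notification types, mirroring "Changed vs. Stayed". */
enum sketch_devcmd {
	SKETCH_DEV_CHANGED,
	SKETCH_DEV_STAYED
};

/*
 * The device is unchanged in this respect when the bit has the same
 * value before and after, whether that value is set or clear.
 */
bool
sketch_image_pair_unchanged(uint16_t old_wd0, uint16_t new_wd0)
{
	return (((old_wd0 ^ new_wd0) & SKETCH_PRLI_WD0_EST_IMAGE_PAIR) == 0);
}

/* Reset the CRN based on the notification type, not on the bit itself. */
bool
sketch_should_reset_crn(enum sketch_devcmd cmd)
{
	return (cmd == SKETCH_DEV_CHANGED);
}
```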
Sponsored by: Spectra Logic
MFC after: 3 days
That made, for example, gpioc -l output quite hard to read and parse.
Also, fix formatting of a nearby statement with too long lines.
MFC after: 2 weeks
Pull the responsibility for zeroing events, which is general to any
conceivable implementation of a random device algorithm, out of the
algorithm-specific Fortuna code and into the callers. Most callers
indirect through random_fortuna_process_event(), so add the logic there.
Most callers already explicitly bzeroed the events they provided, so the
logic in Fortuna was mostly redundant.
Add one missing bzero in randomdev_accumulate(). Also, remove a redundant
bzero in the same function -- randomdev_hash_finish() is obliged to bzero
the hash state.
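The division of responsibility can be sketched as follows: the
algorithm only consumes the event, and the common entry point wipes it
afterwards. The types and names are hypothetical, and a volatile store
loop stands in for the kernel's zeroing routine so that the sketch
builds in userland.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified harvest event. */
struct sketch_event {
	unsigned char entropy[8];
	unsigned int size;
};

typedef void (*sketch_alg_fn)(const struct sketch_event *);

/* Toy "algorithm" state so the sketch can be exercised. */
unsigned int sketch_pool;

/* Algorithm-specific part: consumes the event, does not zero it. */
void
sketch_alg_process(const struct sketch_event *ev)
{
	for (unsigned int i = 0; i < ev->size; i++)
		sketch_pool += ev->entropy[i];
}

/* Zero memory through a volatile pointer so the stores are not elided. */
void
sketch_zero(void *p, size_t n)
{
	volatile unsigned char *vp = p;

	while (n-- > 0)
		*vp++ = 0;
}

/* Common entry point: hand the event to the algorithm, then wipe it. */
void
sketch_process_event(sketch_alg_fn alg, struct sketch_event *ev)
{
	alg(ev);
	sketch_zero(ev, sizeof(*ev));
}
```

Any other algorithm plugged in through the function pointer gets the
same zeroing guarantee without having to implement it.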
Reviewed by: delphij
Approved by: secteam(delphij)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20318