Commit graph

151 commits

Author SHA1 Message Date
Gordon Bergling
6e8ab6715d nvmw(4): Fix a typo in a source code comment
- s/inaccessable/inaccessible/

MFC after:	3 days
2022-06-04 11:46:03 +02:00
Warner Losh
3740a8db13 nvme: Further refinements in Host Memory Buffer Sizing
Host Memory Buffer units are a mix. For those in the identify structure,
the size is in 4kiB chunks. For specifying the buffer description,
though, they are in terms of the drive's MPS. Add comments to this
effect and change PAGE_SIZE to ctrlr->page_size where needed, as well as
correct a mistaken use of NVME_HPS_UNITS in 214df80a9c as pointed out
by rpokala@ after the commit. No functional change is intended, as
page_size is still 4k which matches all current hosts' PAGE_SIZE, but to
support 16k pages on arm, we need to differentiate these two cases.

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D34871
2022-04-15 14:46:19 -06:00
Warner Losh
3086efe895 nvme: Remove NVME_MAX_XFER_SIZE, replace inline calculation
NVME_MAX_XFER_SIZE used to be a constant (back when MAXPHYS was a
constant) to denote the smaller of MAXPHYS or the largest PRP we could
encode with our prealloation scheme. However, it's no longer constant
since MAXPHYS varies at runtime. In addition, the actual maximum is now
based on the drive's currently in use page_size, which is also a runtime
expression. As such, remove the define and expand it inline in the one
place its used still in the tree.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34870
2022-04-15 14:46:18 -06:00
Warner Losh
3a468f2010 nvme: Use saved mps when initializing drive
Make sure we set the MPS we cached (currently the drives minimum mps) in
CC (Controller Configuration) when reinitializing the drive. It must
match the page_size that we're going to use. Also retire less specific
NVME_PAGE_SHIFT since it's now unused.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34869
2022-04-15 14:46:18 -06:00
Warner Losh
55412ef90a nvme: Rename min_page_size to page_size and save mps
The Memory Page Size sets the basic unit of operation for the drive. We
currently set this to the drive's minimum page size, but we could set it
to any page size the drive supports in the future. Replace min_page_size
(it's now unused for that purpose) with page_size to reflect this and
cache the MPS we want to use. Use NVME_MPS_SHIFT to compute page_size.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34868
2022-04-15 14:46:18 -06:00
Warner Losh
6e3deec8ca nvme: Base maximum data transfer size directly on MPSMIN in cap_hi
Calculate the maxmimum transfer size based on the MPSMIN we have in our
cached copy of cap_hi rather than using min_page_size in the controller.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34867
2022-04-15 14:46:18 -06:00
Warner Losh
214df80a9c nvme: new define for size of host memory buffer sizes
The nvme spec defines the various fields that specify sizes for host
memory buffers in terms of 4096 chunks. So, rather than use a bare 4096
here, use NVME_HMB_UNITS. This is explicitly not the host page size of
4096, nor the default memory page size (mps) of the NVMe drive, but its
own thing and needs its own define.

No functional change is intended, only the logical spelling of 4k.

Sponsored by:		Netflix
2022-04-08 23:05:25 -06:00
Warner Losh
6af6a52ee4 nvme: Save cap_lo and cap_hi
Save the capabilities for the drive.

Sponsored by:		Netflix
2022-03-31 21:12:38 -06:00
Warner Losh
a70b5660f3 nvme: MPS is a power of two, not a size / 8k
Setting MPS in the CC should be a power of 2 number (it specifies the
page size of the host is 2^(12+MPS)), so adjust the calcuation. There is
no functional change because we do not support any architecutres != 4k
pages (yet). Other changes are needed for architectures with 16k or 64k
pages, especially when the underlying NVMe drive doesn't support that
page size (Most drives support a range that's small, and many only
support 4k), but let's at least do this calculation correctly. 12 - 12
is just as much 0 as 4096 >> 13 is :)

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D34707
2022-03-31 21:12:38 -06:00
Warner Losh
83581511d9 nvme: Use adaptive spinning when polling for completion or state change
We only use nvme_completion_poll in the initialization path. The
commands they queue and wait for finish quickly as they involve no I/O
to the drive's media. These command take about 20-200 microsecnds
each. Set the wait time to 1us and then increase it by 1.5 each
successive iteration (max 1ms). This reduces initialization time by
80ms in cpervica's tests.

Use this same technique waiting for RDY state transitions. This saves
another 20ms. In total we're down from ~330ms to ~2ms.

Tested by:		cperciva
Sponsored by:		Netflix
Reviewed by:		mav
Differential Review:	https://reviews.freebsd.org/D32259
2021-10-01 19:17:55 -06:00
Warner Losh
4b3da659bf nvme: Only reset once on attach.
The FreeBSD nvme driver has reset the nvme controller twice on attach to
address a theoretical issue assuring the hardware is in a known
state. However, exierence has shown the second reset is unnecessary and
increases the time to boot. Eliminate the second reset. Should there be
a situation when you need a second reset (for buggy or at least somewhat
out of the mainstream hardware), the hardware option NVME_2X_RESET will
restore the old behavior. Document this in nvme(4).

If there's any trouble at all with this, I'll add a sysctl tunable to
control it.

Sponsored by:		Netflix
Reviewed by:		cperciva, mav
Differential Revision:	https://reviews.freebsd.org/D32241
2021-10-01 11:09:34 -06:00
Warner Losh
e5e26e4a24 nvme: Remove pause while resetting
After some study of the code and the standard, I think we can just drop
the pause(), unconditionally.  If we're not initialized, then there's
nothing to wait for from a software perspective.  If we are initialized,
then there might be outstanding I/O. If so, then the qpair 'recovery
state' will transition to WAITING in nvme_ctrlr_disable_qpairs, which
will ignore any interrupts for items that complete before we complete
the reset by setting cc.en=0.

If we go on to fail the controller, we'll cancel the outstanding I/O
transactions.  If we reset the controller, the hardware throws away
pending transactions and we retry all the pending I/O transactions. Any
transactions that happend to complete before cc.en=0 will have the same
effect in the end (doing the same transaction twice is just inefficient,
it won't affect the state of the device any differently than having done
it once).

The standard imposes no wait times here, so it isn't needed from that
perspective.

Unanswered Question: Do we may need to disable interrupts while we
disable in legacy mode since those are level-sensitive.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D32248
2021-10-01 11:09:05 -06:00
Warner Losh
77054a897f nvme: Explain a workaround a little better
The don't touch the mmio of the drive after we do a EN 1->0 transition
is only for a tiny number of dirves that have this unforunate issue.

Sponsored by:		Netflix
2021-10-01 10:56:10 -06:00
Warner Losh
a245627a4e nvme_ctrlr_enable: Small style nits
Rewrite the nested if's using the preferred FreeBSD style for branches
of ifs that return. NFC. Minor tweaks to the comments to better fit new
code layout.

Sponsored by:		Netflix
Reviewed by:		mav, chuck (prior rev, but comments rolled in)
Differential Revision:	https://reviews.freebsd.org/D32245
2021-10-01 10:56:10 -06:00
Warner Losh
26259f6ab9 nvme: Use MS_2_TICKS rather than rolling our own
Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D32246
2021-10-01 10:56:10 -06:00
Warner Losh
d5fca1dc1d nvme_ctrlr_enable: Remove unnecessary 5ms delays
Remove the 5ms delays after writing the administrative queue
registers. These delays are from the very earliest days of the driver
(they are in the first commit) and were most likely vestiges of the
Chatham NVMe prototype card that was used to create this driver. Many of
the workarounds necessary for it aren't necessary for standards
compliant cards. The original driver had other areas marked for Chatham,
but these were not. They are unneeded. There's three lines of supporting
evidence.

First, the NVMe standards make no mention of a delay time after these
registers are written. Second, the Linux driver doesn't have them, even
as an option. Third, all my nvme cards work w/o them.

To be safe, add a write barrier between setting up the admin queue and
enabling the controller.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D32247
2021-10-01 10:56:10 -06:00
Warner Losh
502dc84a8b nvme: Use shared timeout rather than timeout per transaction
Keep track of the approximate time commands are 'due' and the next
deadline for a command. twice a second, wake up to see if any commands
have entered timeout. If so, quiessce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, we go back to operations as
normal.

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D28583
2021-09-23 16:42:08 -06:00
Colin Percival
bad42df9bf Add some nvme initialization routines to TSLOG
About 335 ms of EC2 instance boot time is being spent here.
2021-09-05 12:48:43 -07:00
Alexander Motin
e3bdf3da76 nvme(4): Add MSI and single MSI-X support.
If we can't allocate more MSI-X vectors, accept using single shared.
If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but
accept single shared.  If still no luck, fall back to shared INTx.

This provides maximal flexibility in some limited scenarios.  For
example, vmd(4) does not support INTx and can handle only limited
number of MSI/MSI-X vectors without sharing.

MFC after:	1 week
2021-08-31 13:45:46 -04:00
Alexander Motin
31111372e6 nvme(4): Do not panic on admin queue construct error.
MFC after:	1 week
2021-08-30 20:38:23 -04:00
Warner Losh
f0f4712165 nvme: fix a race between failing the controller and failing requests
Part of the nvme recovery process for errors is to reset the
card. Sometimes, this results in failing the entire controller. When nda
is in use, we free the sim, which will sleep until all the I/O has
completed. However, with only one thread, the request fail task never
runs once the reset thread sleeps here. Create two threads to allow I/O
to fail until it's all processed and the reset task can proceed.

This is a temporary kludge until I can work out questions that arose
during the review, not least is what was the race that queueing to a
failure task solved. The original commit is vague and other error paths
in the same context do a direct failure. I'll investigate that more
completely before committing changing that to a direct failure. mav@
raised this issue during the review, but didn't otherwise object.

Multiple threads, though, solve the problem in the mean time until other
such means can be perfected.

Reviewed by:		jhb@
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D30366
2021-05-28 23:05:40 -06:00
Alexander Motin
4fbbe52365 nvme: Replace potentially long DELAY() with pause().
In some cases like broken hardware nvme(4) may wait minutes for
controller response before timeout.  Doing so in a tight spin loop
made whole system unresponsive.

Reviewed by:	imp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29309
Sponsored by:	iXsystems, Inc.
2021-03-17 10:35:49 -04:00
Warner Losh
8423f5d4c1 nvme: use config_intrhook_drain to avoid removable card races
nvme drives are configured early in boot. However, a number of the configuration
steps takes which take a while, so we defer those to a config intrhook that runs
before the root filesystem is mounted. At the same time, the PCI hot plug wakes
up and tests the status of the card. It may decide that the card has gone away
and deletes the child. As part of that process nvme_detach is called. If this
call happens after the config_intrhook starts to run, but before it is finished,
there's a race where we can tear down the device's soft state while the
config_intrhook is still using it. Use the new config_intrhook_drain to
disestablish the hook. Either it will be removed w/o running, or the routine
will wait for it to finish. This closes the race and allows safe hotplug at any
time, even very early in boot.

Sponsored by:		Netflix, Inc
Reviewed by:		jhb, mav
Differential Revision:	https://reviews.freebsd.org/D29006
2021-03-11 09:45:10 -07:00
Warner Losh
dd2516fc07 nvme: Make nvme_ctrlr_hw_reset static
nvme_ctrlr_hw_reset is no longer used outside of nvme_ctrlr.c, so
make it static. If we need to change this in the future we can.
2021-02-08 13:29:24 -07:00
Warner Losh
9600aa31aa nvme: use NVME_GONE rather than hard-coded 0xffffffff
Make it clearer that the value 0xfffffff is being used to detect the device is
gone. We use it other places in the driver for other meanings.
2021-02-08 13:08:48 -07:00
Alexander Motin
1770bae5f8 Remove aligment requirements for passthrough buffer.
After r368124 vmapbuf() should happily map misaligned maxphys-sized buffers
thanks to extra page added to pbuf_zone.
2020-11-29 00:57:19 +00:00
Alexander Motin
ac90f70d1e Increase nvme(4) maximum transfer size from 1MB to 2MB.
With 4KB page size the 2MB is the maximum we can address with one page PRP.
Going further would require chaining, that would add some more complexity.

On the other side, to reduce memory consumption, allocate the PRP memory
respecting maximum transfer size reported in the controller identify data.
Many of NVMe devices support much smaller values, starting from 128KB.
To do that we have to change the initialization sequence to pull the data
earlier, before setting up the I/O queue pairs.  The admin queue pair is
still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal,
since there is only one such queue with only 16 trackers.

Reviewed by:	imp
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-11-29 00:20:31 +00:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Alexander Motin
0bed3eabc5 Add PMRCAP printing and fix earlier CAP_HI.
MFC after:	3 days
2020-11-14 01:45:34 +00:00
Alexander Motin
46fbd8004f Fix panic if NVMe is detached before the intrhook call.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-11-12 20:20:43 +00:00
Alexander Motin
c44441f8fd Print NVMe controller capabilities in verbose dmesg.
Those values are not reported in controller identification, while sometimes
interesting for development and debugging.

MFC after:	1 week
2020-10-28 15:43:29 +00:00
Brooks Davis
44ca4575ea vmapbuf: don't smuggle address or length in buf
Instead, add arguments to vmapbuf.  Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf.  (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)

In no other situtation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.

Suggested by:	jhb
Reviewed by:	imp, jhb
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26784
2020-10-21 16:00:15 +00:00
Alexander Motin
915f019715 Use RTD3 Entry Latency value as shutdown timeout.
This field was not in specs when the driver was written, but now there
are SSDs with the reported latency of 10s, where hardcoded value of 5s
seems to be not enough sometimes, causing shutdown timeout messages.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-10-14 15:50:28 +00:00
David Bright
e32d47f32d Add an ioctl to get an NVMe device's maximum transfer size
Reviewed by:	imp, chuck
Obtained from:	Dell EMC Isilon
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26390
2020-09-21 15:41:47 +00:00
Mateusz Guzik
d87b31e159 nvme: clean up empty lines in .c and .h files 2020-09-01 22:03:10 +00:00
Warner Losh
881534f09c Use symbolic names for asych events
Rather than |= 0x300, define and use asyn event names for the name
space changes and the firmware activations that we're asking for.
2020-08-31 19:38:03 +00:00
Alexander Motin
701267ad19 Fix few panics on NVMe's timing out initialization requests.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-06-25 20:29:29 +00:00
Alexander Motin
ead7e10308 Make polled request timeout less invasive.
Instead of panic after one second of polling, make the normal timeout
handler to activate, reset the controller and abort the outstanding
requests.  If all of it won't happen within 10 seconds then something
in the driver is likely stuck bad and panic is the only way out.

In particular this fixed device hot unplug during execution of those
polled commands, allowing clean device detach instead of panic.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-06-18 19:16:03 +00:00
Alexander Motin
550d5d64fe Fix admin qpair leak if detached during initial reset.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-06-17 17:51:40 +00:00
Alexander Motin
92390644e3 Fix config_intrhook leak on initial reset failure.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-06-12 14:14:01 +00:00
David Bright
4053f8ac4d Fix various Coverity-detected errors in nvme driver
This fixes several Coverity-detected errors in the nvme driver.

CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975,
1403980

Reviewed by:	imp@, vangyzen@
MFC after:	5 days
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D24532
2020-05-02 20:47:58 +00:00
Warner Losh
4e6a434b6b Make sure that we get the sbuf resources we need.
Since we're calling sbuf_new with NOWAIT, make sure it can allocate a
buffer to use. Don't print anything if we can't get it.

Noticed by: rpokala
2020-04-30 00:43:11 +00:00
Warner Losh
244b805397 Generate a devctl event for interesting events
When we reset the controller, and when the controller tells us about a
critical warning, send an event.
2020-04-30 00:27:19 +00:00
Alexander Motin
b2cdfb72f4 Fix copy-paste bug in HMB free code.
MFC after:	2 weeks
X-MFC-with:	r356474
2020-01-08 18:26:23 +00:00
Alexander Motin
6de4e458fa Minor adjustments to r356474 and r356480.
Reported by:	jkim, imp
MFC after:	2 weeks
X-MFC-with:	r356474
2020-01-07 23:29:54 +00:00
Alexander Motin
1c7dd40e58 Increate HMB limit from 1% to 5%.
SSD capacity in laptops is growing faster then RAM size, so my original
guess seems too low on second thought.  Hopefully nobody will build large
array of those crappy SSDs.

MFC after:	2 weeks
X-MFC-with:	356474
2020-01-07 23:10:38 +00:00
Alexander Motin
67abaee9fc Add Host Memory Buffer support to nvme(4).
This allows cheapest DRAM-less NVMe SSDs to use some of host RAM (about
1MB per 1GB on the devices I have) for its metadata cache, significantly
improving random I/O performance.  Device reports minimal and preferable
size of the buffer.  The code limits it to 1% of physical RAM by default.
If the buffer can not be allocated or below minimal size, the device will
just have to work without it.

MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	iXsystems, Inc.
2020-01-07 21:17:11 +00:00
Warner Losh
7588c6cc36 Move to using bool instead of boolean_t
While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed into int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.

Differential Revision: https://reviews.freebsd.org/D20999
2019-12-13 18:35:48 +00:00
Warner Losh
66e5985084 Move reset to the interrutp processing stage
This trims the boot time a bit more for AWS and other platforms that have nvme
drives. There's no reason too do this inline. This has been in my tree a while,
but IIRC I talked to Jim Harris about this at one of our face to face meetings.

MFC After: 2 weeks
2019-12-11 22:51:02 +00:00
Alexander Motin
1eab19cbec Make nvme(4) driver some more NUMA aware.
- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
 - Allocate most of queue pair memory from the domain it is bound to.
 - Bind callouts to the same CPUs as queue pair to avoid migrations.
 - Do not assign queue pairs to each SMT thread.  It just wasted
resources and increased lock congestions.
 - Remove fixed multiplier of CPUs per queue pair, spread them even.
This allows to use more queue pairs in some hardware configurations.
 - If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-23 17:53:47 +00:00