Commit graph

354 commits

Author SHA1 Message Date
Alexander Motin
a69c096462 nvme: Print CRD, M and DNR status bits on errors.
It may help with some issues debugging.

MFC after:	1 week
2022-08-05 10:58:19 -04:00
Gordon Bergling
6e8ab6715d nvmw(4): Fix a typo in a source code comment
- s/inaccessable/inaccessible/

MFC after:	3 days
2022-06-04 11:46:03 +02:00
John Baldwin
1093caa1bb nvme: Remove unused devclass arguments to DRIVER_MODULE. 2022-05-06 15:46:55 -07:00
John Baldwin
82496a256f nvme: Use devclass_find to lookup the nvme devclass.
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D34995
2022-04-21 10:29:14 -07:00
Warner Losh
0fd4cd405b nvme: Use controller's page size instead of PAGE_SIZE to create qpair
When constructing qpair, use the controller's notion of page size rather
than the host's PAGE_SIZE. Currently, these are both 4k, but the arm 16k
page size support requires decoupling.

There's a "hidden" PAGE_SIZE in btoc, so we must change btoc(x) to
howmany(x, ctrlr->page_size) to properly count the number of pages (in
the drive's world view) are needed for various calculations.

With these changes, we the nvme driver operates at production level load
for both host 4k and host 16k page size.

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D34873
2022-04-15 14:46:19 -06:00
Warner Losh
c5ed67dc90 nvme: Prefer nvme_printf to printf when reporting formatting error
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D34872
2022-04-15 14:46:19 -06:00
Warner Losh
3740a8db13 nvme: Further refinements in Host Memory Buffer Sizing
Host Memory Buffer units are a mix. For those in the identify structure,
the size is in 4kiB chunks. For specifying the buffer description,
though, they are in terms of the drive's MPS. Add comments to this
effect and change PAGE_SIZE to ctrlr->page_size where needed, as well as
correct a mistaken use of NVME_HPS_UNITS in 214df80a9c as pointed out
by rpokala@ after the commit. No functional change is intended, as
page_size is still 4k which matches all current hosts' PAGE_SIZE, but to
support 16k pages on arm, we need to differentiate these two cases.

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D34871
2022-04-15 14:46:19 -06:00
Warner Losh
3086efe895 nvme: Remove NVME_MAX_XFER_SIZE, replace inline calculation
NVME_MAX_XFER_SIZE used to be a constant (back when MAXPHYS was a
constant) to denote the smaller of MAXPHYS or the largest PRP we could
encode with our prealloation scheme. However, it's no longer constant
since MAXPHYS varies at runtime. In addition, the actual maximum is now
based on the drive's currently in use page_size, which is also a runtime
expression. As such, remove the define and expand it inline in the one
place its used still in the tree.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34870
2022-04-15 14:46:18 -06:00
Warner Losh
3a468f2010 nvme: Use saved mps when initializing drive
Make sure we set the MPS we cached (currently the drives minimum mps) in
CC (Controller Configuration) when reinitializing the drive. It must
match the page_size that we're going to use. Also retire less specific
NVME_PAGE_SHIFT since it's now unused.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34869
2022-04-15 14:46:18 -06:00
Warner Losh
55412ef90a nvme: Rename min_page_size to page_size and save mps
The Memory Page Size sets the basic unit of operation for the drive. We
currently set this to the drive's minimum page size, but we could set it
to any page size the drive supports in the future. Replace min_page_size
(it's now unused for that purpose) with page_size to reflect this and
cache the MPS we want to use. Use NVME_MPS_SHIFT to compute page_size.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34868
2022-04-15 14:46:18 -06:00
Warner Losh
6e3deec8ca nvme: Base maximum data transfer size directly on MPSMIN in cap_hi
Calculate the maxmimum transfer size based on the MPSMIN we have in our
cached copy of cap_hi rather than using min_page_size in the controller.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34867
2022-04-15 14:46:18 -06:00
Warner Losh
a7218e7a6b nvme: Fix old intel alignment size
The intel raid stripe alignment parameter is based on CAP.MPSMIN, so use
that directly now that we have it available.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34866
2022-04-15 14:46:18 -06:00
Warner Losh
e66c1b5185 nvme: Define NVME_MPS_SHIFT
The memory page size (MPS) is expressed in terms of a 2^(number + 12)
and other items in the system inherit this. Create a define rather than
sprinkling 12 everywehere.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34865
2022-04-15 14:46:18 -06:00
Gordon Bergling
dfa01f4f98 nvme(4): Fix a typo in a source code comment
- s/is is/is/

MFC after:	3 days
2022-04-09 09:24:34 +02:00
Warner Losh
214df80a9c nvme: new define for size of host memory buffer sizes
The nvme spec defines the various fields that specify sizes for host
memory buffers in terms of 4096 chunks. So, rather than use a bare 4096
here, use NVME_HMB_UNITS. This is explicitly not the host page size of
4096, nor the default memory page size (mps) of the NVMe drive, but its
own thing and needs its own define.

No functional change is intended, only the logical spelling of 4k.

Sponsored by:		Netflix
2022-04-08 23:05:25 -06:00
Warner Losh
161fcf7994 nvme: Publish the drive's capabilities
Add cap_lo and cap_hi sysctl to each nvme drive. This publishes the raw
capabilities of the drive. Now we can only discover these with
bootverbose.

Sponsored by:		Netflix
2022-03-31 21:13:16 -06:00
Warner Losh
6af6a52ee4 nvme: Save cap_lo and cap_hi
Save the capabilities for the drive.

Sponsored by:		Netflix
2022-03-31 21:12:38 -06:00
Warner Losh
a70b5660f3 nvme: MPS is a power of two, not a size / 8k
Setting MPS in the CC should be a power of 2 number (it specifies the
page size of the host is 2^(12+MPS)), so adjust the calcuation. There is
no functional change because we do not support any architecutres != 4k
pages (yet). Other changes are needed for architectures with 16k or 64k
pages, especially when the underlying NVMe drive doesn't support that
page size (Most drives support a range that's small, and many only
support 4k), but let's at least do this calculation correctly. 12 - 12
is just as much 0 as 4096 >> 13 is :)

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D34707
2022-03-31 21:12:38 -06:00
Chuck Tuffli
c2318cf80a nvme: fix spelling of Namespace
Fix spelling of a macro definition.

Reviewed by:	mav, imp
Differential Revision:	https://reviews.freebsd.org/D34330
2022-02-21 10:34:46 -08:00
Chuck Tuffli
e71afa1202 nvme: Add OAES bit-field definitions
Create definitions for the Optional Asynchronous Events Supported (OAES)
values. Also adds a helper macro for the common use case of "mask and
shift". E.g.
    value = NVME_CTRLR_DATA_OAES_NS_ATTR_MASK << NVME_CTRLR_DATA_OAES_NS_ATTR_SHIFT;
becomes
    value = NVMEB(NVME_CTRLR_DATA_OAES_NS_ATTR);

Reviewed by:	mav, imp
Differential Revision:	https://reviews.freebsd.org/D34300
2022-02-21 10:34:14 -08:00
Alexander Motin
b3c9b6060f nvme: Do not rearm timeout for commands without one.
Admin queues almost always have several ASYNC_EVENT_REQUEST outstanding.
They have no timeouts, but their presence in qpair->outstanding_tr caused
useless timeout callout rearming twice a second.

While there, relax timeout callout period from 0.5s to 0.5-1s to improve
aggregation.  Command timeouts are measured in seconds, so we don't need
to be precise here.

Reviewed by:	imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D33781
2022-01-07 12:59:16 -05:00
Warner Losh
8f07932272 nvme_sim: Only report PCI related stats when we can
For AHCI attached devices, we report the location and identification
information of the AHCI controller that we're attached to. We also
don't reprot link speed in that case, since we can't get to the PCIe
config space registers to find that out.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D33287
2021-12-06 10:23:40 -07:00
Warner Losh
7cf8d63c88 nvme_ahci: Mark AHCI devices as such in the controller
Add a quirk to flag AHCI attachment to the controller. This is for any
of the strategies for attaching nvme devices as children of the AHCI
device for Intel's RAID devices. This also has a side effect of cleaning
up resource allocation from failed nvme_attach calls now.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D33285
2021-12-06 10:23:40 -07:00
Warner Losh
053f8ed6eb nvme: Move to a quirk for the Intel alignment data
Prior to NVMe 1.3, Intel produced a series of drives that had
performance alignment data in the vendor specific space since no
standard had been defined. Move testing the versions to a quick so the
NVMe NS code doesn't know about PCI device info.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D33284
2021-12-06 10:23:40 -07:00
Gordon Bergling
5f8ccf6515 nvme(4): Correct a typo in a sysctl description
- s/printting/printing/

MFC after:	3 days
2021-11-30 10:26:25 +01:00
Warner Losh
2ec165e3f0 nvme: Reduce traffic to the doorbell register
Reduce traffic to doorbell register when processing multiple completion
events at once. Only write it at the end of the loop after we've
processed everything (assuming we found at least one completion,
even if that completion wasn't valid).

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D32470
2021-10-14 08:44:37 -06:00
Warner Losh
18dc12bfd2 nvme: Restore hotplug warning
Restore hotplug warning in recovery state machine. No functional change
other than what message gets printed.

Sponsored by:		Netflix
2021-10-12 14:26:54 -06:00
Warner Losh
83581511d9 nvme: Use adaptive spinning when polling for completion or state change
We only use nvme_completion_poll in the initialization path. The
commands they queue and wait for finish quickly as they involve no I/O
to the drive's media. These command take about 20-200 microsecnds
each. Set the wait time to 1us and then increase it by 1.5 each
successive iteration (max 1ms). This reduces initialization time by
80ms in cpervica's tests.

Use this same technique waiting for RDY state transitions. This saves
another 20ms. In total we're down from ~330ms to ~2ms.

Tested by:		cperciva
Sponsored by:		Netflix
Reviewed by:		mav
Differential Review:	https://reviews.freebsd.org/D32259
2021-10-01 19:17:55 -06:00
Warner Losh
4b3da659bf nvme: Only reset once on attach.
The FreeBSD nvme driver has reset the nvme controller twice on attach to
address a theoretical issue assuring the hardware is in a known
state. However, exierence has shown the second reset is unnecessary and
increases the time to boot. Eliminate the second reset. Should there be
a situation when you need a second reset (for buggy or at least somewhat
out of the mainstream hardware), the hardware option NVME_2X_RESET will
restore the old behavior. Document this in nvme(4).

If there's any trouble at all with this, I'll add a sysctl tunable to
control it.

Sponsored by:		Netflix
Reviewed by:		cperciva, mav
Differential Revision:	https://reviews.freebsd.org/D32241
2021-10-01 11:09:34 -06:00
Warner Losh
e5e26e4a24 nvme: Remove pause while resetting
After some study of the code and the standard, I think we can just drop
the pause(), unconditionally.  If we're not initialized, then there's
nothing to wait for from a software perspective.  If we are initialized,
then there might be outstanding I/O. If so, then the qpair 'recovery
state' will transition to WAITING in nvme_ctrlr_disable_qpairs, which
will ignore any interrupts for items that complete before we complete
the reset by setting cc.en=0.

If we go on to fail the controller, we'll cancel the outstanding I/O
transactions.  If we reset the controller, the hardware throws away
pending transactions and we retry all the pending I/O transactions. Any
transactions that happend to complete before cc.en=0 will have the same
effect in the end (doing the same transaction twice is just inefficient,
it won't affect the state of the device any differently than having done
it once).

The standard imposes no wait times here, so it isn't needed from that
perspective.

Unanswered Question: Do we may need to disable interrupts while we
disable in legacy mode since those are level-sensitive.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D32248
2021-10-01 11:09:05 -06:00
Warner Losh
77054a897f nvme: Explain a workaround a little better
The don't touch the mmio of the drive after we do a EN 1->0 transition
is only for a tiny number of dirves that have this unforunate issue.

Sponsored by:		Netflix
2021-10-01 10:56:10 -06:00
Warner Losh
a245627a4e nvme_ctrlr_enable: Small style nits
Rewrite the nested if's using the preferred FreeBSD style for branches
of ifs that return. NFC. Minor tweaks to the comments to better fit new
code layout.

Sponsored by:		Netflix
Reviewed by:		mav, chuck (prior rev, but comments rolled in)
Differential Revision:	https://reviews.freebsd.org/D32245
2021-10-01 10:56:10 -06:00
Warner Losh
26259f6ab9 nvme: Use MS_2_TICKS rather than rolling our own
Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D32246
2021-10-01 10:56:10 -06:00
Warner Losh
d5fca1dc1d nvme_ctrlr_enable: Remove unnecessary 5ms delays
Remove the 5ms delays after writing the administrative queue
registers. These delays are from the very earliest days of the driver
(they are in the first commit) and were most likely vestiges of the
Chatham NVMe prototype card that was used to create this driver. Many of
the workarounds necessary for it aren't necessary for standards
compliant cards. The original driver had other areas marked for Chatham,
but these were not. They are unneeded. There's three lines of supporting
evidence.

First, the NVMe standards make no mention of a delay time after these
registers are written. Second, the Linux driver doesn't have them, even
as an option. Third, all my nvme cards work w/o them.

To be safe, add a write barrier between setting up the admin queue and
enabling the controller.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D32247
2021-10-01 10:56:10 -06:00
Warner Losh
36a87d0c6f nvme: Sanity check completion id
Make sure the completion ID is in the range of [0..num_trackers) since
the values past the end of the act_tr array are never going to be valid
trackers and will lead to pain and suffering if we try to dereference
them to get the tracker or to set the tracker back to NULL as we
complete the I/O.

Sponsored by:		Netflix
Reviewed by:		mav, chs, chuck
Differential Revision:	https://reviews.freebsd.org/D32088
2021-09-28 21:21:50 -06:00
Warner Losh
587aa25525 nvme: count number of ignored interrupts
Count the number of times we're asked to process completions, but that
we ignore because the state of the qpair isn't in RECOVERY_NONE.

Sponsored by:		Netflix
Reviewed by:		mav, chuck
Differential Revision:	https://reviews.freebsd.org/D32212
2021-09-28 21:18:00 -06:00
Warner Losh
7d5eebe0f4 nvme: Add sanity check for phase on startup.
The proper phase for the qpiar right after reset in the first interrupt
is 1. For it, make sure that we're not still in phase 0. This is an
illegal state to be processing interrupts and indicates that we've
failed to properly protect against a race between initializing our state
and processing interrupts. Modify stat resetting code so it resets the
number of interrpts to 1 instead of 0 so we don't trigger a false
positive panic.

Sponsored by:		Netflix
Reviewed by:		cperciva, mav (prior version)
Differential Revision:	https://reviews.freebsd.org/D32211
2021-09-28 21:18:00 -06:00
Warner Losh
fa81f3731d nvme: start qpair in state RECOVERY_WAITING
An interrupt happens on the admin queue right away after the reset, so
as soon as we enable interrupts, we'll get a call to our interrupt
handler. It is safe to ignore this interrupt if we're not yet
initialized, or	to process it if we are. If we are initialized,	we'll
see there's no completion records and return. If we're not, we'll
process	no completion records and return. Either way, nothing is
processed and nothing is lost.

Until we've completely setup the qpair, we need to avoid processing
completion records. Start the qpair in the waiting recovery state so we
return immediately when we try to process completions. The code already
sets it to 'NONE' when we're initialization is complete. It's safe to
defer completion processing here because we don't send any commands
before the initialization of the software state of the qpair is
complete. And even if we were to somehow send a command prior to that
completing, the completion record for that command would be processed
when we send commands to the admin qpair after we've setup the software
state. There's no good central point to add an assert for this last
condition.

This fixes an KASSERT "received completion for unknown cmd" panic on
boot.

Fixes:			502dc84a8b
Sponsored by:		Netflix
Reviewed by:		mav, cperciva, gallatin
Differential Revision:	https://reviews.freebsd.org/D32210
2021-09-28 21:16:19 -06:00
Warner Losh
502dc84a8b nvme: Use shared timeout rather than timeout per transaction
Keep track of the approximate time commands are 'due' and the next
deadline for a command. twice a second, wake up to see if any commands
have entered timeout. If so, quiessce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, we go back to operations as
normal.

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D28583
2021-09-23 16:42:08 -06:00
Warner Losh
4b977e6dda nvme/nda: Fail all nvme I/Os after controller fails
Once the controller has failed, fail all I/O w/o sending it to the
device. The reset of the nvme driver won't schedule any I/O to the
failed device, and the controller is in an indeterminate state and can't
accept I/O. Fail both at the top end of the sim and the bottom
end. Don't bother queueing up the I/O for failure in a different task.

Reviewed by:		chuck
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D31341
2021-09-17 16:09:21 -06:00
Colin Percival
bad42df9bf Add some nvme initialization routines to TSLOG
About 335 ms of EC2 instance boot time is being spent here.
2021-09-05 12:48:43 -07:00
Alexander Motin
e3bdf3da76 nvme(4): Add MSI and single MSI-X support.
If we can't allocate more MSI-X vectors, accept using single shared.
If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but
accept single shared.  If still no luck, fall back to shared INTx.

This provides maximal flexibility in some limited scenarios.  For
example, vmd(4) does not support INTx and can handle only limited
number of MSI/MSI-X vectors without sharing.

MFC after:	1 week
2021-08-31 13:45:46 -04:00
Alexander Motin
31111372e6 nvme(4): Do not panic on admin queue construct error.
MFC after:	1 week
2021-08-30 20:38:23 -04:00
Alexander Motin
b776de6796 Mark some sysctls as CTLFLAG_MPSAFE.
MFC after:	2 weeks
2021-08-10 20:44:27 -04:00
Warner Losh
fc9a084023 nvme: Enable interrupts after qpair fully constructed
To guard against the ill effects of a spurious interrupt during
construction (or one that was bogusly pending), enable interrupts after
the qpair is completely constructed. Otherwise, we can die with null
pointer dereferences in nvme_qpair_process_completions. This has been
observed in at least one pre-release NVMe drive where the MSIX interrupt
fired while the queue was being created, before we'd started the NVMe
controller card.

The alternative of only turning on the interrupts after the rest was
tried, but was insufficient to work around this bug and made the code
more complicated w/o benefit.

Reviewed by:		mav, chuck
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D31182
2021-07-15 16:17:23 -06:00
Alexander Motin
e3bcd07d83 nvme(4): Report NPWA before NPWG as stripesize.
New Samsung 980 SSDs report Namespace Preferred Write Alignment of
8 (4KB) and Namespace Preferred Write Granularity of 32 (16KB).
My quick tests show that 16KB is a minimal sequential write size
when the SSD reaches peak IOPS, so writing much less is very slow.
But writing slightly less or slightly more does not change much,
so it seems not so much a size granularity as minimum I/O size.

Thinking about different stripesize consumers:
 - Partition alignment should be based on NPWA by definition.
 - ZFS ashift in part of forcing alignment of all I/Os should also
be based on NPWA.  In part of forcing size granularity, if really
needed, it may be set to NPWG, but too big value can make ZFS too
space-inefficient, and the 16KB is actually the biggest supported
value there now.
 - ZFS recordsize/volblocksize could potentially be tuned up toward
NPWG to work as I/O size granularity, but enabled compression makes
it too fuzzy.  And those are normally user-configurable things.
 - ZFS I/O aggregation code could definitely use Optimal Write Size
value and may be NPWG, but we don't have fields in GEOM now to report
the minimal and optimal I/O sizes, and even maximal is not reported
outside GEOM DISK to be used by ZFS.

MFC after:	1 week
2021-07-05 23:13:15 -04:00
Warner Losh
aa0ab681ae nvme: coherently read status of completion records
Coherently read the phase bit of the status completion record. We loop
over the completion record array, looking for all the transactions in
the same phase that have been completed. In doing that, we have to be
careful to read the status field first, and if it indicates a complete
record, we need to read and process that record. Otherwise, the host
might be overtaken by device when reading this completion record,
leading to a mistaken belief that the record is in phase. This leads to
the code using old values and looking at an already completed entry, which
has no current tracker.

To work around this problem, we read the status and make sure it is in
phase, we then re-read the entire completion record guaranteeing it's
complete, valid, and consistent . In addition we resync the dmatag to
reflect changes since the prior loop for the bouncing dma case.

Reviewed by:		jrtc27@, chuck@
Found by:		jrtc27 (this fix is based in part on her D30995 fix)
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D31002
2021-07-02 16:05:19 -06:00
Warner Losh
fea3cf1d6d nvme: Fix alignment on nvme structures
Remove __packed from nvme_command, nvme_completion and
nvme_dsm_trim. Add super-alignment to nvme_completion since it's always
at least that aligned in hardware (and in our existing uses of it
embedded in structures). It generates better code in
nvme_qpair_process_completions on riscv64 because otherwise the ABI
assumes a 4-byte alignment, and the same on all other platforms.

Reviewed by:		jrtc27@, mav@, chuck@
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D31001
2021-07-02 16:05:19 -06:00
Warner Losh
80a75155e1 nvme: style nit
Put the { on the same line as the struct nvme_foo when we define these
structures. It's FreeBSD standard and these were inconsistent.

Sponsored by:		Netflix
2021-07-02 16:05:19 -06:00
Warner Losh
f0f4712165 nvme: fix a race between failing the controller and failing requests
Part of the nvme recovery process for errors is to reset the
card. Sometimes, this results in failing the entire controller. When nda
is in use, we free the sim, which will sleep until all the I/O has
completed. However, with only one thread, the request fail task never
runs once the reset thread sleeps here. Create two threads to allow I/O
to fail until it's all processed and the reset task can proceed.

This is a temporary kludge until I can work out questions that arose
during the review, not least is what was the race that queueing to a
failure task solved. The original commit is vague and other error paths
in the same context do a direct failure. I'll investigate that more
completely before committing changing that to a direct failure. mav@
raised this issue during the review, but didn't otherwise object.

Multiple threads, though, solve the problem in the mean time until other
such means can be perfected.

Reviewed by:		jhb@
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D30366
2021-05-28 23:05:40 -06:00