Certain invalid operations trigger hardware error conditions. Error
conditions that only halt one channel can be detected and recovered by
resetting the channel. Error conditions that halt the whole device are
generally not recoverable.
Add a sysctl to inject channel-fatal HW errors,
'dev.ioat.<N>.force_hw_error=1'.
When a halt due to a channel error is detected, ioat(4) blocks new
operations from being queued on the channel, completes any outstanding
operations with an error status, and resets the channel before allowing
new operations to be queued again.
Update ioat.4 to document error recovery; document blockfill introduced
in r290021 while we are here; document ioat_put_dmaengine() added in
r289907; document DMA_NO_WAIT added in r289982.
Sponsored by: EMC / Isilon Storage Division
whether an error is recoverable. Always re-dirty the buffer on errors
from write requests. The invalidation we used to do for errors not EIO
doesn't need to be done for a device that's really gone, since that's
done in a different path.
Reviewed by: mckusick@, kib@
mips24k/mips74k document that we need an explicit SYNC so to order
things correctly, even with access to uncachable memory.
We were doing calls to SYNC in the cache ops (inv, wbinv) but we
weren't doing it for uncachable memory.
When I ported this code from netbsd I was .. slightly mips74k greener.
I used writethrough because (a) it's what netbsd did, and (b) if I used
writethrough then things "didn't work."
Fast-forward a couple years, more MIPS hacking and a whole lot more
understanding of the bus APIs (the last few commits notwithstanding;
it's been a long week, ok?) and I have this working for arge,
argemdio, spi and ath. Hans has it working for USB. The ath barrier
code will come in a later commit.
This gets the routing throughput up from 220mbit -> 337mbit.
I'm sure the bridging throughput will be similarly improved.
Tested:
* QCA955x SoC, routing workload.
* use barriers in a slightly better fashion. You can blame this
glass of whiskey on putting barriers in the wrong spot. Grr adrian.
* steal/rewrite the mdio busy check from ag7100 from openwrt and
refactor the existing code out. This is .. more correct.
This seems to fix the boot-to-boot variation that I've been seeing
and it quietens the switch port status flapping.
Tested:
* QCA9558 SoC (AP135.)
Obtained from: Linux OpenWRT
This driver and the linux ag71xx driver both treat the transmit ring
as a circular linked list of descriptors. There's no "end" pointer
that is ever NULL - instead, it expects the MAC to hit a finished
descriptor (ARGE_DESC_EMPTY) and stop.
Now, since it's a circular buffer, we may end up with the hardware
hitting the beginning of our multi-descriptor frame before we've finished
setting it up. It then DMA's it in, starts sending it, and we finish
writing out the new descriptor. The hardware may then write its
completion for the next descriptor out; then we do, and when we next
read it it'll show up as "not done" and transmit completion stops.
This unfortunately manifests itself as the transmit queue always
being active and a massive TX interrupt storm. We need to actively
ACK packets back from the transmit engine and if we don't (eg because
we think the transmit isn't finished but it is) then the unit will
just keep generating interrupts.
I hit this finally with the below testing setup. This fixed it for me.
Strictly speaking I should put in a sync in between writing out all of
the descriptors and writing out that final descriptor.
Tested:
* QCA9558 SoC (AP135 reference board) w/ arge1 + vlans acting as a
router, and iperf -d (tcp, bidirectional traffic.)
Obtained from: Linux OpenWRT (ag71xx_main.c.)
The MIPS busdma sync operations currently are a big no-op on coherent memory.
This isn't strictly correct behaviour as we need a SYNC in here to ensure that
the writes have finished and are visible in main memory before the MMIO accesses
occur. This will have to be addressed in a later commit.
But, before that happens, let's at least do a flush here to make things
more "correct".
This is required for even remotely sensible behaviour on mips74k with
write-through memory enabled.
The mips74k programmers guide notes that reads can be re-ordered, even
uncached ones, so we need an explicit SYNC between them.
Yes, this is a case of a driver author actively doing a bus barrier
operation.
This ends up being necessary when the mips74k core is run in write-back
mode rather than write-through mode. That's coming in an upcoming
commit.
Tested:
* mips74k, QCA9558 SoC (AP135 reference board), arge<->arge interface
routing traffic tests.
send frames.
This matches the other check for space.
"enough" is a misnomer, for "reasons". The biggest reason is that
the TX ring is actually a circular linked list, with no head/tail pointers.
This is just a bit more headroom between head/tail so we have time to
schedule frames before we hit where the hardware is at.
Ideally this would be tunable and a little larger.
This flushes out the write to the system before anything continues.
The mips74k guide, chapter 3.3.3 (write gathering) notes that writes
can be buffered in FIFOs - even uncached ones - so we can't guarantee
the device has felt its effects. Now, since we're all lazy driver
authors and don't pepper read/write barriers everywhere, fake it here.
tested:
* mips74k - QCA9558 SoC (AP135 reference board)
self-documented, and eases addition of new ops.
For the similar reasons, eliminate UMTX_OP_MAX. nitems() handles the
only use of the symbol.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Fixes race condition observed under following circumstances:
1) I/O split on 128KB boundary with Intel NVMe controller.
Current Intel controllers produce better latency when
I/Os do not span a 128KB boundary - even if the I/O size
itself is less than 128KB.
2) Per-CPU I/O queues are enabled.
3) Child I/Os are submitted on different submission queues.
4) Interrupts for child I/O completions occur almost
simultaneously.
5) ithread for child I/O A increments bio_inbed, then
immediately is preempted (rendezvous IPI, higher priority
interrupt).
6) ithread for child I/O B increments bio_inbed, then completes
parent bio since all children are now completed.
7) parent bio is freed, and immediately reallocated for a VFS
or gpart bio (including setting bio_children to 1 and
clearing bio_driver1).
8) ithread for child I/O A resumes processing. bio_children
for what it thinks is the parent bio is set to 1, so it
thinks it needs to complete the parent bio.
Result is either calling a NULL callback function, or double freeing
the bio to its uma zone.
PR: 203746
Reported by: Drew Gallatin <gallatin@netflix.com>,
Marc Goroff <mgoroff@quorum.net>
Tested by: Drew Gallatin <gallatin@netflix.com>
MFC after: 3 days
Sponsored by: Intel
if they are not required for mounting rootfs. However, it's possible
that some setups try to mount them in mountcritlocal (ie from fstab).
Export the list of current root mount holds using a new sysctl,
vfs.root_mount_hold, and make mountcritlocal retry if "mount -a" fails
and the list is not empty.
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3709
status registers for every interrupt. Check a common host channel
status interrupt register first, then conditionally read the
individual host channel status registers.
Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after: 1 week
b_asize can be zero if the block is compressed into an empty block
(ZIO_COMPRESS_EMPTY) and the trim code asserts that meaningless
zero-sized trimming is not attempted.
The logic for calling trim_map_free() is extracted into a new function
l2arc_trim() to minimize code duplication.
PR: 203473
Reported by: Willem Jan Withagen <wjw@digiware.nl>
Tested by: Willem Jan Withagen <wjw@digiware.nl>
MFC after: 11 days
We can't use copyout because destination memory is userland address
in another process but we have reference to respective page so map
the page into kernel address space and copy fragments there
receives video memory address from VideoCore through property mailbox
channel. Older versions of firmware (and the one that is currently part
of sysutils/u-boot-rpi and sysutils/u-boot-rpi2) returned real physical
address, newer one returns VideoCore bus address, so we need to convert
it to actual physical address. this version works with both older and
newer interface.
point, e.g. on RaspberryPi 2 when control is passed from loader to kernel
it contains garbage. So we use cpsr as a base for new cpsr value: if we
have reached this point it means current value is OK
Reviewed by: andrew
When using route-to (or reply-to) pf sends the packet directly to the output
interface. If that interface doesn't support checksum offloading the checksum
has to be calculated in software.
That was already done in the IPv4 case, but not for the IPv6 case. As a result
we'd emit packets with pseudo-header checksums (i.e. incorrect checksums).
This issue was exposed by the changes in r289316 when pf stopped performing full
checksum calculations for all packets.
Submitted by: Luoqi Chen
MFC after: 1 week
pmap_change_attr must change the memory type of both the requested KVA
and the corresponding DMAP mappings (if such mappings exist), to satisfy
an Intel requirement that two or more mappings to the same physical
pages must have the same memory type.
However, not all kernel mapped pages have corresponding DMAP mappings --
for example, 64-bit BARs. Skip fixing up the DMAP for out-of-bounds
addresses.
Submitted by: Steve Wahl <steve_wahl@dell.com>
Reviewed by: alc, jhb
Sponsored by: Dell Compellent
Differential Revision: https://reviews.freebsd.org/D4030
When destroying a character device the si_devsw field is set to NULL
before all references are gone, to indicate the character device is
going away. This can cause a NULL-dereference fault inside physio().
The callers of physio() should own a thread reference on the cdev and
if si_devsw is seen as non-NULL, it is usable during the execution of
the function. Else an ENXIO error code is returned.
Reviewed by: kib
MFC after: 2 weeks
- Move all files related to the LinuxKPI into sys/compat/linuxkpi and
its subfolders.
- Update sys/conf/files and some Makefiles to use new file locations.
- Added description of COMPAT_LINUXKPI to sys/conf/NOTES which in turn
adds the LinuxKPI to all LINT builds.
- The LinuxKPI can be added to the kernel by setting the
COMPAT_LINUXKPI option. The OFED kernel option no longer builds the
LinuxKPI into the kernel. This was done to keep the build rules for
the LinuxKPI in sys/conf/files simple.
- Extend the LinuxKPI module to include support for USB by moving the
Linux USB compat from usb.ko to linuxkpi.ko.
- Bump the FreeBSD_version.
- A universe kernel build has been done.
Reviewed by: np @ (cxgb and cxgbe related changes only)
Sponsored by: Mellanox Technologies
AMD64 pmap assumes ranges will be in the DMAP, which isn't necessarily
true for NTB memory windows (especially 64-bit BARs).
Suggested by: pmap_change_attr_locked -> kassert_panic
Sponsored by: EMC / Isilon Storage Division
Allows DMA from/to arbitrary KVA or physical address. /dev/ioat_test
must be enabled by root and is only R/W root, so this is approximately
as dangerous as /dev/mem and /dev/kmem.
Sponsored by: EMC / Isilon Storage Division
suggested by RFC 6675.
Currently differnt places in the stack tries to guess this in suboptimal ways.
The main problem is that current calculations don't take sacked bytes into
account. Sacked bytes are the bytes receiver acked via SACK option. This is
suboptimal because it assumes that network has more outstanding (unacked) bytes
than the actual value and thus sends less data by setting congestion window
lower than what's possible which in turn may cause slower recovery from losses.
As an example, one of the current calculations looks something like this:
snd_nxt - snd_fack + sackhint.sack_bytes_rexmit
New proposal from RFC 6675 is:
snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit
which takes sacked bytes into account which is a new addition to the sackhint
struct. Only thing we are missing from RFC 6675 is isLost() i.e. segment being
considered lost and thus adjusting pipe based on that which makes this
calculation a bit on conservative side.
The approach is very simple. We already process each ack with sack info in
tcp_sack_doack() and extract sack blocks/holes out of it. We'd now also track
this new variable sacked_bytes which keeps track of total sacked bytes reported.
One downside to this approach is that we may get incorrect count of sacked_bytes
if the other end decides to drop sack info in the ack because of memory pressure
or some other reasons. But in this (not very likely) case also the pipe
calculation would be conservative which is okay as opposed to being aggressive
in sending packets into the network.
Next step is to use this more accurate pipe estimation to drive congestion
window adjustments.
In collaboration with: rrs
Reviewed by: jason_eggnet dot com, rrs
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D3971
For the most of chips (except anscient ones) port handlers have no relation
to port IDs. In such situation old code scanning first 125 handlers was
quite naive. Instead of doing that, send to chip single request to get full
list of port handlers available on specific virtual port and scan only them.
Old code had problems with case of several virtual ports enabled, when port
handlers allocated from global address space could easily go above 125.
This change was successfully tested on 23xx, 24xx and 25xx chips in loop
mode with 4 virtual initiator ports, each seing 50 virtual target ports.
This should make it easier to track down interrupt storms from arge.
Tested:
* AP135 (QCA955x) SoC - defaults to ARGE_DEBUG enabled
* Carambola2 (AR9331 SoC) - defaults to ARGE_DEBUG disabled
the dynamic linker copy them, but not relocate them at the new location.
This allows us to run sqlite3 without it crashing.
Sponsored by: ABT Systems Ltd
This call may be used when device cannot continue to operate normally
(e.g., throws firmware error, watchdog timer expires)
and need to be restarted.
Approved by: adrian (mentor)
Differential Revision: https://reviews.freebsd.org/D3998
window in number of segments on fly. It is set to 10 segments by default.
Remove net.inet.tcp.experimental.initcwnd10 which is now redundant. Also remove
the parent node net.inet.tcp.experimental as it's not needed anymore and also
because it was not well thought out.
Differential Revision: https://reviews.freebsd.org/D3858
In collaboration with: lstewart
Reviewed by: gnn (prev version), rwatson, allanjude, wblock (man page)
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Limelight Networks
* refactor out the rx filter and operating mode code into a separate
method.
* add some comments about what's left with setting the operating mode
based on what carl9170 does.
* comment out some init from otus_init_mac() - it's no longer needed as
it's always init'ed now.
* add debugging and a missing return around a failure to call m_get2() -
during monitor mode operation I found RXing of frames > 2k, which
fails allocation. I'm sure they're valid (it's configuring 11n RX and
receiving 11n frames even though the driver doesn't "do" 11n)
and may be A-MSDU; but allocations fail and we should handle that
gracefully.
Tested:
* UB82 reference NIC (AR9170 + AR9104 2x2 dual band NIC); STA and
monitor mode operation.
degradation (7%) for host host TCP connections over 10Gbps links,
even when there were no secuirty policies in place. There is no
change in performance on 1Gbps network links. Testing GENERIC vs.
GENERIC-NOIPSEC vs. GENERIC with this change shows that the new
code removes any overhead introduced by having IPSEC always in the
kernel.
Differential Revision: D3993
MFC after: 1 month
Sponsored by: Rubicon Communications (Netgate)
The IOAT hardware supports writing a 64-bit pattern to some destination
buffer. The same limitations on buffer length apply as for copy
operations. Throughput is a bit higher (probably because fill does not
have to spend bandwidth reading from a source in memory).
Support for testing Block Fill has been added to ioatcontrol(8) and the
ioat_test device. ioatcontrol(8) accepts the '-f' flag, which tests
Block Fill. (If the flag is omitted, the tool tests copy by default.)
The '-V' flag, in conjunction with '-f', verifies that buffers are
filled in the expected pattern.
Tested on: Broadwell DE (Xeon D-1500)
Sponsored by: EMC / Isilon Storage Division
Add generic hw descriptor struct and generic control flags struct, in
preparation for other kinds of IOAT operation.
Sponsored by: EMC / Isilon Storage Division
Now on 24xx and above chips it is really possible to simulate several
virtual FC ports with single physical one. For example, it allows to
configure several targets in ctl.conf, assign each of them to separate
virtual port, and let user to control access to them with switch zoning.
I still doubt that all problems are solved there, but at now it passes
at least basic tests.
The new load_ma implementation can cause dereferences when used with
certain drivers, back it out until the reason is found:
Fatal trap 12: page fault while in kernel mode
cpuid = 11; apic id = 03
fault virtual address = 0x30
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff808a2d22
stack pointer = 0x28:0xfffffe07cc737710
frame pointer = 0x28:0xfffffe07cc737790
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 13 (g_down)
trap number = 12
panic: page fault
cpuid = 11
KDB: stack backtrace:
#0 0xffffffff80641647 at kdb_backtrace+0x67
#1 0xffffffff80606762 at vpanic+0x182
#2 0xffffffff806067e3 at panic+0x43
#3 0xffffffff8084eef1 at trap_fatal+0x351
#4 0xffffffff8084f0e4 at trap_pfault+0x1e4
#5 0xffffffff8084e82f at trap+0x4bf
#6 0xffffffff80830d57 at calltrap+0x8
#7 0xffffffff8063beab at _bus_dmamap_load_ccb+0x1fb
#8 0xffffffff8063bc51 at bus_dmamap_load_ccb+0x91
#9 0xffffffff8042dcad at ata_dmaload+0x11d
#10 0xffffffff8042df7e at ata_begin_transaction+0x7e
#11 0xffffffff8042c18e at ataaction+0x9ce
#12 0xffffffff802a220f at xpt_run_devq+0x5bf
#13 0xffffffff802a17ad at xpt_action_default+0x94d
#14 0xffffffff802c0024 at adastart+0x8b4
#15 0xffffffff802a2e93 at xpt_run_allocq+0x193
#16 0xffffffff802c0735 at adastrategy+0xf5
#17 0xffffffff80554206 at g_disk_start+0x426
Uptime: 2m29s
Add a new flag for DMA operations, DMA_NO_WAIT. It behaves much like
other NOWAIT flags -- if queueing an operation would sleep, abort and
return NULL instead.
When growing the internal descriptor ring, the memory allocation is
performed outside of all locks. A lock-protected flag is used to avoid
duplicated work. Threads that cannot sleep and attempt to queue
operations when the descriptor ring is full allocate a larger ring with
M_NOWAIT, or bail if that fails.
ioat_reserve_space() could become an external API if is important to
callers that they have room for a sequence of operations, or that those
operations succeed each other directly in the hardware ring.
This patch splits the internal head index (->head) from the hardware's
head-of-chain (DMACOUNT) register (->hw_head). In the future, for
simplicity's sake, we could drop the 'ring' array entirely and just use
a linked list (with head and tail pointers rather than indices).
Suggested by: Witness
Sponsored by: EMC / Isilon Storage Division
Internal busses (thus ECAM access) should be mapped to
all values from 0 to 143.
Obtained from: Semihalf
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D3753
When one tries to allocate a resource with unspecified range,
read already configured BAR values (by UEFI or whatever).
This is necessary to make VNIC VFs working and to allow them to be
properly allocated.
Obtained from: Semihalf
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D3752
Assertion used here was invalid. If current thread helds any of locks,
we never want to recurse on them.
Obtained from: Semihalf
Submitted by: Bartosz Szczepanek <bsz@semihalf.com>
Differential revision: https://reviews.freebsd.org/D3903
Add e6000sw driver supporting Marvell 88E6352, 88E6172, 88E6176 switches.
It needs to be attached to mdio interface, exporting SMI access
functionality. e6000sw supports port-based VLAN configuration, per-port
media changing, accessing PHY and switch registers.
e6000sw attaches miibuses and PHY drivers as children. Instead of typical
tick as callout, kthread-based tick is used. This combined with SX locks
allows MDIO read/write calls to sleep. It is expected, because this
hardware requires long delays in SMI read/write procedures, which can not
be handled by busy-waiting.
Reviewed by: adrian
Obtained from: Semihalf
Submitted by: Bartosz Szczepanek <bsz@semihalf.com>
Differential revision: https://reviews.freebsd.org/D3902
This commit introduces support for etherswitch devices that utilize SMI as
a way of accessing its registers. SMI register is located in address space
of mge -- access to it was exported through MDIO interface.
Attachment functions were enhanced so as to ensure proper initialisation
in both cases: 1) PHYs attached directly to mge, 2) PHYs attached to
switch device and switch attached to mge. Attachment of etherswitch device
depends on dts entry with compatible="mrvl,sw" property. If none is found,
typical PHY attachment procedure follows.
In case of switch attached, PHYs' status and configuration is accessible
via etherswitchcfg, and ifconfig shows always-up, non-configurable mge
interfaces.
Due to the fact that there may be simultaneous accessess to SMI
registers (e.g. from PHY attached to one of mge instances and switch
to the other), SMI access interlock was added. It is SX lock,
because sleep ability is necessary -- busy-waiting would result
in poor performance due to long delays required by hardware.
Underlying switch driver is obliged to use sleepable locks as well.
Reviewed by: adrian
Obtained from: Semihalf
Submitted by: Bartosz Szczepanek <bsz@semihalf.com>
Differential revision: https://reviews.freebsd.org/D3900
is 0. Without this change it was sleeping for one tick. Maybe not a big
deal, but it makes share/dtrace/blocking script to report that.
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D3814
Sponsored by: Wheel Systems, http://wheelsystems.com
IPv4 packets (when it should return FALSE). It happens because PF_ANEQ() doesn't
stop if first 32 bits of IPv4 packets are equal and starts to check next 3*32
bits (like for IPv6 packet). Those bits containt some garbage and in result
PF_ANEQ() wrongly returns TRUE.
Fix: Check if packet is of AF_INET type and if it is then compare only first 32
bits of data.
PR: 204005
Submitted by: Miłosz Kaniewski
We need to reset the chancmp and chainaddr MMIO registers to bring the
device back to a working state.
Name the chanerr bits while we're here.
Sponsored by: EMC / Isilon Storage Division
We only need to borrow a mutex for the drain sleep and the 0->1
transition, so just reuse an existing one for now.
The wchan is arbitrary. Using refcount itself would have required
__DEVOLATILE(), so use the lock's address instead.
Different uses are tagged by kind, although we only do anything with
that information in INVARIANTS builds.
Sponsored by: EMC / Isilon Storage Division
Callers should have acquired this lock when they invoked ioat_acquire()
before issuing operations. Assert it is held.
Sponsored by: EMC / Isilon Storage Division
This is still the worst possible way to allocate memory if it will ever
be under pressure, but at least it won't deadlock.
Suggested by: WITNESS
Sponsored by: EMC / Isilon Storage Division
Pull out the timer callout delay into IOAT_INTR_TIMO and shorten it
considerably (5s -> 100ms). Single operations do not take 5-10 seconds
and when interrupts aren't working, waiting 100ms sucks a lot less than
5s.
Sponsored by: EMC / Isilon Storage Division
I couldn't test arge0->arge1 bridging, only arge0 VLAN bridging.
The DIR-825C1 only hooks up arge0 to the switch GMAC0 and so
you need to abuse VLANs to test.
Tested:
* DIR-825C1 (AR9344)
the temporary file to vers.c at the end of the script
The previous logic wrote out to vers.c multiple times, so the file
could be incorrectly interpreted as being completely written out
after one of the echo calls with recursive make, when in reality it
was only partially written.
Also, in the event the build was interrupted when creating vers.c
(small race window), it would have a leftover file that needed to
be cleaned up before resuming the build.
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
pager. It is enough to execute VOP_BMAP() once to obtain both the
disk block address for the requested page, and the before/after limits
for the contiguous run. The clipping of the vm_page_t array passed to
the vnode_pager_generic_getpages() and the disk address for the first
page in the clipped array can be deduced from the call results.
While there, remove some noise (like if (1) {...}) and adjust nearby
code.
Reviewed by: alc
Discussed with: glebius
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
ordered with the MFENCE instruction. Similar weak guarantees are also
specified by the AMD APM vol. 3 rev. 3.22. x86 pmap methods
pmap_invalidate_cache_range() and pmap_invalidate_cache_pages() braced
CLFLUSH loop with MFENCE both before and after the loop.
In the revision 56 of SDM, Intel stated that all existing
implementations of CLFLUSH are strict, CLFLUSH instructions execution
is ordered WRT other CLFLUSH and writes. Also, the strict behaviour
is made architectural.
A new instruction CLFLUSHOPT (which was documented for some time in
the Instruction Set Extensions Programming Reference) provides the
weak behaviour which was previously attributed to CLFLUSH.
Use CLFLUSHOPT when available. When CLFLUSH is used on Intel CPUs, do
not execute MFENCE before and after the flushing loop.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
is a dcache invalidate to point of coherency just like dcache_inv_poc(), but
a slightly different version specific to dma operations. Elaborate the
comment about how and why it's different.
Now 24xx and above chips support full 8-byte LUN address space.
Older FC chips may support up to 16K LUNs when firmware allows.
Tested in both initiator and target modes for 23xx, 24xx and 25xx.
per-map. The per-tag scheme is not safe, and a mutex can't be used to
protect it because the mapping routines can't sleep. Code brought in
from armv6 implementation.