opnsense-src

mirror of https://github.com/opnsense/src.git synced 2026-04-25 08:07:28 -04:00

Author	SHA1	Message	Date
Stephen J. Kiernan	e895e7fce7	Fix typo where opening brace was needed. Reported by: Michael Butler Reviewed by: sjg Approved by: sjg (mentor)	2017-02-13 18:52:26 +00:00
Stephen J. Kiernan	d2e6391342	For MD_PRELOAD type md(4) devices, if there is a file name in the preloaded meta-data, copy it into the softc structure. When returning md(4) device details to the caller, include the file name in any MD_PRELOAD type devices if it is set (first character is not NUL.) In mdconfig, for "preload" type md(4) devices, if there is file config available, print it in the file column of the output. Reviewed by: brooks Approved by: sjg (mentor) MFC after: 1 month Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D9529	2017-02-13 17:44:07 +00:00
Maxim Sobolev	c3d1c73fa9	For the MD_ROOT option don't inject /dev/md0 as root dev when ROOTDEVNAME is defined explicitly. It's kinda pointless and results in extra step in boot sequence which is not really needed, i.e.: md0: Embedded image 1331200 bytes at 0x8038b7b4 Trying to mount root from ufs:/dev/md0 []... Mounting from ufs:/dev/md0 failed with error 22. Trying to mount root from ufs:md0.uzip []... warning: no time-of-day clock registered, system time will not be set accurately start_init: trying /sbin/init	2016-03-09 19:36:25 +00:00
Adrian Chadd	f4c1f0b9eb	Fix MFS builds when both MD_ROOT_SIZE and MFS_IMAGE are specified MD_ROOT_SIZE and embed_mfs.sh were basically retired as part of https://reviews.freebsd.org/D2903 . However, when building a kernel with 'options MD_ROOT_SIZE' specified, this results in a non-working MFS, as within sys/dev/md/md.c we fall within the wrong # ifdef. This patch implements the following: * Allow kernels to be built without the MD_ROOT_SIZE option, which results in a kernel built as per D2903. * Allow kernels to be built with the MD_ROOT_SIZE option, which results in a kernel built similarly to the pre-D2903 way, with the following differences: * The MFS is now put in a separate section within the kernel (oldmfs, so it differs from the mfs section introduced by D2903). * embed_mfs.sh is changed, so it looks up the oldmfs section within the kernel, gets its size and offset, sees if the MFS will fit within the allocated oldmfs section and only if all is well does a dd of the MFS image into the kernel. Submitted by: Stanislav Galabov <sgalabov@gmail.com> Reviewed by: brooks, imp Differential Revision: https://reviews.freebsd.org/D5093	2016-02-02 07:02:51 +00:00
Gleb Smirnoff	b0cd20172d	A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES(). o With new KPI consumers can request contiguous ranges of pages, and unlike before, all pages will be kept busied on return, like it was done before with the 'reqpage' only. Now the reqpage goes away. With new interface it is easier to implement code protected from race conditions. Such arrayed requests for now should be preceded by a call to vm_pager_haspage() to make sure that request is possible. This could be improved later, making vm_pager_haspage() obsolete. Strengthening the promises on the business of the array of pages allows us to remove such hacks as swp_pager_free_nrpage() and vm_pager_free_nonreq(). o New KPI accepts two integer pointers that may optionally point at values for read ahead and read behind, that a pager may do, if it can. These pages are completely owned by pager, and not controlled by the caller. This shifts the UFS-specific readahead logic from vm_fault.c, which should be file system agnostic, into vnode_pager.c. It also removes one VOP_BMAP() request per hard fault. Discussed with: kib, alc, jeff, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-12-16 21:30:45 +00:00
Konstantin Belousov	d5f998ba70	In md(4) over vnode, correct handling of the unaligned unmapped io requests which page alignment + size is greater than MAXPHYS. Right now md(4) over vnode would use the physical buffer of the size MAXPHYS to map a data of size MAXPHYS + page offset of the user buffer. This typically corrupts next pbuf, or, if the pbuf used was the last pbuf in the map, the next page after the pbuf's map. Split request up to the size of io which fits into pbuf KVA with alignment, and retry if a part of the bio is left unprocessed. Reported by: Fabian Keil <fk@fabiankeil.de> Tested by: Fabian Keil, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-12-12 14:08:29 +00:00
Kenneth D. Merry	a9934668aa	Add asynchronous command support to the pass(4) driver, and the new camdd(8) utility. CCBs may be queued to the driver via the new CAMIOQUEUE ioctl, and completed CCBs may be retrieved via the CAMIOGET ioctl. User processes can use poll(2) or kevent(2) to get notification when I/O has completed. While the existing CAMIOCOMMAND blocking ioctl interface only supports user virtual data pointers in a CCB (generally only one per CCB), the new CAMIOQUEUE ioctl supports user virtual and physical address pointers, as well as user virtual and physical scatter/gather lists. This allows user applications to have more flexibility in their data handling operations. Kernel memory for data transferred via the queued interface is allocated from the zone allocator in MAXPHYS sized chunks, and user data is copied in and out. This is likely faster than the vmapbuf()/vunmapbuf() method used by the CAMIOCOMMAND ioctl in configurations with many processors (there are more TLB shootdowns caused by the mapping/unmapping operation) but may not be as fast as running with unmapped I/O. The new memory handling model for user requests also allows applications to send CCBs with request sizes that are larger than MAXPHYS. The pass(4) driver now limits queued requests to the I/O size listed by the SIM driver in the maxio field in the Path Inquiry (XPT_PATH_INQ) CCB. There are some things that would be good to add: 1. Come up with a way to do unmapped I/O on multiple buffers. Currently the unmapped I/O interface operates on a struct bio, which includes only one address and length. It would be nice to be able to send an unmapped scatter/gather list down to busdma. This would allow eliminating the copy we currently do for data. 2. Add an ioctl to list currently outstanding CCBs in the various queues. 3. Add an ioctl to cancel a request, or use the XPT_ABORT CCB to do that. 4. Test physical address support. 
Virtual pointers and scatter gather lists have been tested, but I have not yet tested physical addresses or scatter/gather lists. 5. Investigate multiple queue support. At the moment there is one queue of commands per pass(4) device. If multiple processes open the device, they will submit I/O into the same queue and get events for the same completions. This is probably the right model for most applications, but it is something that could be changed later on. Also, add a new utility, camdd(8) that uses the asynchronous pass(4) driver interface. This utility is intended to be a basic data transfer/copy utility, a simple benchmark utility, and an example of how to use the asynchronous pass(4) interface. It can copy data to and from pass(4) devices using any target queue depth, starting offset and blocksize for the input and output devices. It currently only supports SCSI devices, but could be easily extended to support ATA devices. It can also copy data to and from regular files, block devices, tape devices, pipes, stdin, and stdout. It does not support queueing multiple commands to any of those targets, since it uses the standard read(2)/write(2)/writev(2)/readv(2) system calls. The I/O is done by two threads, one for the reader and one for the writer. The reader thread sends completed read requests to the writer thread in strictly sequential order, even if they complete out of order. That could be modified later on for random I/O patterns or slightly out of order I/O. camdd(8) uses kqueue(2)/kevent(2) to get I/O completion events from the pass(4) driver and also to send request notifications internally. For pass(4) devices, camdd(8) uses a single buffer (CAM_DATA_VADDR) per CAM CCB on the reading side, and a scatter/gather list (CAM_DATA_SG) on the writing side. In addition to testing both interfaces, this makes any potential reblocking of I/O easier. 
No data is copied between the reader and the writer, but rather the reader's buffers are split into multiple I/O requests or combined into a single I/O request depending on the input and output blocksize. For the file I/O path, camdd(8) also uses a single buffer (read(2), write(2), pread(2) or pwrite(2)) on reads, and a scatter/gather list (readv(2), writev(2), preadv(2), pwritev(2)) on writes. Things that would be nice to do for camdd(8) eventually: 1. Add support for I/O pattern generation. Patterns like all zeros, all ones, LBA-based patterns, random patterns, etc. Right now you can always use /dev/zero, /dev/random, etc. 2. Add support for a "sink" mode, so we do only reads with no writes. Right now, you can use /dev/null. 3. Add support for automatic queue depth probing, so that we can figure out the right queue depth on the input and output side for maximum throughput. At the moment it defaults to 6. 4. Add support for SATA device passthrough I/O. 5. Add support for random LBAs and/or lengths on the input and output sides. 6. Track average per-I/O latency and busy time. The busy time and latency could also feed in to the automatic queue depth determination. sys/cam/scsi/scsi_pass.h: Define two new ioctls, CAMIOQUEUE and CAMIOGET, that queue and fetch asynchronous CAM CCBs respectively. Although these ioctls do not have a declared argument, they both take a union ccb pointer. If we declare a size here, the ioctl code in sys/kern/sys_generic.c will malloc and free a buffer for either the CCB or the CCB pointer (depending on how it is declared). Since we have to keep a copy of the CCB (which is fairly large) anyway, having the ioctl malloc and free a CCB for each call is wasteful. sys/cam/scsi/scsi_pass.c: Add asynchronous CCB support. Add two new ioctls, CAMIOQUEUE and CAMIOGET. CAMIOQUEUE adds a CCB to the incoming queue. 
The CCB is executed immediately (and moved to the active queue) if it is an immediate CCB, but otherwise it will be executed in passstart() when a CCB is available from the transport layer. When CCBs are completed (because they are immediate or passdone() if they are queued), they are put on the done queue. If we get the final close on the device before all pending I/O is complete, all active I/O is moved to the abandoned queue and we increment the peripheral reference count so that the peripheral driver instance doesn't go away before all pending I/O is done. The new passcreatezone() function is called on the first call to the CAMIOQUEUE ioctl on a given device to allocate the UMA zones for I/O requests and S/G list buffers. This may be good to move off to a taskqueue at some point. The new passmemsetup() function allocates memory and scatter/gather lists to hold the user's data, and copies in any data that needs to be written. For virtual pointers (CAM_DATA_VADDR), the kernel buffer is malloced from the new pass(4) driver malloc bucket. For virtual scatter/gather lists (CAM_DATA_SG), buffers are allocated from a new per-pass(9) UMA zone in MAXPHYS-sized chunks. Physical pointers are passed in unchanged. We have support for up to 16 scatter/gather segments (for the user and kernel S/G lists) in the default struct pass_io_req, so requests with longer S/G lists require an extra kernel malloc. The new passcopysglist() function copies a user scatter/gather list to a kernel scatter/gather list. The number of elements in each list may be different, but (obviously) the amount of data stored has to be identical. The new passmemdone() function copies data out for the CAM_DATA_VADDR and CAM_DATA_SG cases. The new passiocleanup() function restores data pointers in user CCBs and frees memory. Add new functions to support kqueue(2)/kevent(2): passreadfilt() tells kevent whether or not the done queue is empty. passkqfilter() adds a knote to our list. 
passreadfiltdetach() removes a knote from our list. Add a new function, passpoll(), for poll(2)/select(2) to use. Add devstat(9) support for the queued CCB path. sys/cam/ata/ata_da.c: Add support for the BIO_VLIST bio type. sys/cam/cam_ccb.h: Add a new enumeration for the xflags field in the CCB header. (This doesn't change the CCB header, just adds an enumeration to use.) sys/cam/cam_xpt.c: Add a new function, xpt_setup_ccb_flags(), that allows specifying CCB flags. sys/cam/cam_xpt.h: Add a prototype for xpt_setup_ccb_flags(). sys/cam/scsi/scsi_da.c: Add support for BIO_VLIST. sys/dev/md/md.c: Add BIO_VLIST support to md(4). sys/geom/geom_disk.c: Add BIO_VLIST support to the GEOM disk class. Re-factor the I/O size limiting code in g_disk_start() a bit. sys/kern/subr_bus_dma.c: Change _bus_dmamap_load_vlist() to take a starting offset and length. Add a new function, _bus_dmamap_load_pages(), that will load a list of physical pages starting at an offset. Update _bus_dmamap_load_bio() to allow loading BIO_VLIST bios. Allow unmapped I/O to start at an offset. sys/kern/subr_uio.c: Add two new functions, physcopyin_vlist() and physcopyout_vlist(). sys/pc98/include/bus.h: Guard kernel-only parts of the pc98 machine/bus.h header with #ifdef _KERNEL. This allows userland programs to include <machine/bus.h> to get the definition of bus_addr_t and bus_size_t. sys/sys/bio.h: Add a new bio flag, BIO_VLIST. sys/sys/uio.h: Add prototypes for physcopyin_vlist() and physcopyout_vlist(). share/man/man4/pass.4: Document the CAMIOQUEUE and CAMIOGET ioctls. usr.sbin/Makefile: Add camdd. usr.sbin/camdd/Makefile: Add a makefile for camdd(8). usr.sbin/camdd/camdd.8: Man page for camdd(8). usr.sbin/camdd/camdd.c: The new camdd(8) utility. Sponsored by: Spectra Logic MFC after: 1 week	2015-12-03 20:54:55 +00:00
Marcel Moolenaar	c68ea8a640	s/as/at/ in previous commit. Pointed out by: jmallett@	2015-08-13 19:12:55 +00:00
Marcel Moolenaar	cc787e3d0e	Change md(4) to use weak symbols as start, end and size for the embedded root disk. The embedded image is linked into the kernel in the .mfs section. Add rules and variables to kern.pre.mk and kern.post.mk that handle the linking of the image. First objcopy is used to generate an object file. Then, the object file is linked into the kernel. Submitted by: Steve Kiernan <stevek@juniper.net> Reviewed by: brooks@ Obtained from: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D2903	2015-08-13 15:16:34 +00:00
Andrey V. Elsukov	ec170744a7	Use g_conf_printf_escaped() to escape illegal symbols in file name. PR: 202289 MFC after: 1 week	2015-08-13 13:20:29 +00:00
Konstantin Belousov	5d9b4508fd	For md(4), posix shm(3) and tmpfs(5), free swap space used by paged in dirty page, which is written by the process. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-28 14:27:05 +00:00
Attilio Rao	0d8243cc34	vm_page_grab() and vm_pager_get_pages() can drop the vm_object lock, then threads can sleep on the pip condition. Avoid to deadlock such threads by correctly awakening the sleeping ones after the pip is finished. swapoff side of the bug can likely result in shutdown deadlocks. Sponsored by: EMC / Isilon Storage Division Reported by: pho, pluknet Tested by: pho	2014-03-19 01:13:42 +00:00
Konstantin Belousov	60b6e19785	Only assert the length of the passed bio in the mdstart_vnode() when the bio is unmapped, so we must map the bio pages into pbuf. This works around the geom classes which do not follow the MAXPHYS limit on the i/o size, since such classes do not know about unmapped bios either. Reported by: Paolo Pinto <paolo.pinto@netasq.com> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-12-10 20:52:31 +00:00
Edward Tomasz Napierala	bc2308d49e	Change comment to match code. Discussed with: thompsa Sponsored by: The FreeBSD Foundation	2013-12-04 09:48:52 +00:00
Edward Tomasz Napierala	0efd9bfd47	Add "null" backend to mdconfig(8). This does exactly what the name suggests, and is somewhat useful for benchmarking. MFC after: 1 month No objections from: kib Sponsored by: The FreeBSD Foundation	2013-12-04 07:38:23 +00:00
Alexander Motin	40ea77a036	Merge GEOM direct dispatch changes from the projects/camlock branch. When safety requirements are met, it allows to avoid passing I/O requests to GEOM g_up/g_down thread, executing them directly in the caller context. That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid several context switches per I/O. The defined now safety requirements are: - caller should not hold any locks and should be reenterable; - callee should not depend on GEOM dual-threaded concurrency semantics; - on the way down, if request is unmapped while callee doesn't support it, the context should be sleepable; - kernel thread stack usage should be below 50%. To keep compatibility with GEOM classes not meeting above requirements new provider and consumer flags added: - G_CF_DIRECT_SEND -- consumer code meets caller requirements (request); - G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done); - G_PF_DIRECT_SEND -- provider code meets caller requirements (done); - G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request). Capable GEOM class can set them, allowing direct dispatch in cases where it is safe. If any of requirements are not met, request is queued to g_up or g_down thread same as before. Such GEOM classes were reviewed and updated to support direct dispatch: CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE, VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL, MAP, FLASHMAP, etc). To declare direct completion capability disk(9) KPI got new flag equivalent to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION. da(4) and ada(4) disk drivers got it set now thanks to earlier CAM locking work. This change more than twice increases peak block storage performance on systems with many CPUs, together with earlier CAM locking changes reaching more than 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to 256 user-level threads). Sponsored by: iXsystems, Inc. 
MFC after: 2 months	2013-10-22 08:22:19 +00:00
Konstantin Belousov	1a42d14a80	Give the page allocations initiated by the swap-backed md(4) a higher priority. If the write is requested by a system daemon, sleeping there would starve resources and cause deadlock. Reported and tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-30 20:12:23 +00:00
Konstantin Belousov	5944de8ecd	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-22 07:39:53 +00:00
Attilio Rao	c7aebda8a1	The soft and hard busy mechanism rely on the vm object lock to work. Unify the 2 concept into a real, minimal, sxlock where the shared acquisition represent the soft busy and the exclusive acquisition represent the hard busy. The old VPO_WANTED mechanism becomes the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becomes an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavily changed this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl	2013-08-09 11:11:11 +00:00
Konstantin Belousov	537cc627d7	Fix the data corruption on the swap-backed md. Assign the rv variable a success code if the pager was not asked for the page. Using an error code from the previous processed page caused zeroing of the valid page, when e.g. the previous page was not available in the pager. Reported by: lstewart Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-05-24 09:48:42 +00:00
Konstantin Belousov	1ef76554fb	Do not declare that preloaded md(4) supports unmapped bio requests, it does not. Reported by: <mh@kernel32.de> Sponsored by: The FreeBSD Foundation	2013-04-02 19:39:31 +00:00
Konstantin Belousov	59ec9023ca	Support unmapped i/o for the md(4). The vnode-backed md(4) has to map the unmapped bio because VOP_READ() and VOP_WRITE() interfaces do not allow to pass unmapped requests to the filesystem. Vnode-backed md(4) uses pbufs instead of relying on the bio_transient_map, to avoid usual md deadlock. Sponsored by: The FreeBSD Foundation Tested by: pho, scottl	2013-03-19 14:53:23 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavily broken by this commit. Third party ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Jaakko Heinonen	341b240dc7	Print correct unit number when attaching preloaded memory disks. Retire now unused mdunits variable.	2012-11-21 17:05:57 +00:00
Jaakko Heinonen	734e78dfcb	Disallow attaching preloaded memory disks via ioctl. - The feature is dangerous because the kernel code didn't check validity of the memory address provided from user space. - It seems that mdconfig(8) never really supported attaching preloaded memory disks. - Preloaded memory disks are automatically attached during md(4) initialization. Thus there shouldn't be much use for the feature. PR: kern/169683 Discussed on: freebsd-hackers	2012-11-21 16:56:47 +00:00
Konstantin Belousov	e9f581ba31	Zero the newly allocated md(4) swap-backed page to prevent random kernel memory leakage to userspace. For the typical use, when a filesystem is put on the md disk, the change only results in CPU and memory bandwidth spent to zero the page, since filesystems make sure that user never see unwritten content. But if md disk is used as raw device by userspace, the garbage is exposed. Reported by: Paul Schenkeveld <freebsd@psconsult.nl> MFC after: 2 weeks	2012-11-08 03:17:41 +00:00
Marcel Moolenaar	22ff74b2f4	Add a MD_ROOT_FSTYPE kernel option. The option specifies the file system part for the MD_ROOT mount string. Hardcoding the file system type as "ufs" is too restrictive.	2012-11-03 21:20:55 +00:00
Konstantin Belousov	5050aa86cf	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho	2012-10-22 17:50:54 +00:00
Konstantin Belousov	1c771f9222	After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason to pull vm_param.h was removed. Other big dependency of vm_page.h on vm_param.h are PA_LOCK* definitions, which are only needed for in-kernel code, because modules use KBI-safe functions to lock the pages. Stop including vm_param.h into vm_page.h. Include vm_param.h explicitly for the kernel code which needs it. Suggested and reviewed by: alc MFC after: 2 weeks	2012-08-05 14:11:42 +00:00
Konstantin Belousov	2ddfc13d8d	Remove verbose unused commented out debugging printf. MFC after: 1 week Reviewed by: alc	2012-08-04 18:10:04 +00:00
Jaakko Heinonen	8cb51643e4	Disallow sectorsize larger than MAXPHYS and mediasize smaller than sectorsize. PR: 169947 Submitted by: Filip Palian (original version) Reviewed by: kib	2012-08-02 15:05:34 +00:00
Edward Tomasz Napierala	dc604f0cf6	Make it possible to resize md(4) devices. Reviewed by: kib Sponsored by: FreeBSD Foundation	2012-07-07 20:32:21 +00:00
Eitan Adler	3eb9ab5255	Document a large number of currently undocumented sysctls. While here fix some style(9) issues and reduce redundancy. PR: kern/155491 PR: kern/155490 PR: kern/155489 Submitted by: Galimov Albert <wtfcrap@mail.ru> Approved by: bde Reviewed by: jhb MFC after: 1 week	2011-12-13 00:38:50 +00:00
Andrey V. Elsukov	1f1928092d	Add information about MD_READONLY and MD_COMPRESS flags to the configuration dump. MFC after: 1 week	2011-10-31 10:53:27 +00:00
Andrey V. Elsukov	657bd8b132	Include sys/sbuf.h directly.	2011-07-11 05:19:28 +00:00
Matthew D Fleming	cfb00e5aa7	Move the ZERO_REGION_SIZE to a machine-dependent file, as on many architectures (i386, for example) the virtual memory space may be constrained enough that 2MB is a large chunk. Use 64K for arches other than amd64 and ia64, with special handling for sparc64 due to differing hardware. Also commit the comment changes to kmem_init_zero_region() that I missed due to not saving the file. (Darn the unfamiliar development environment). Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you see fit. Requested by: alc MFC after: 1 week MFC with: r221853	2011-05-13 19:35:01 +00:00
Matthew D Fleming	89cb2a19ec	Use a globally visible region of zeros for both /dev/zero and the md device. There are likely other kernel uses of "blob of zeros" that can be converted. Reviewed by: alc MFC after: 1 week	2011-05-13 18:48:00 +00:00
Dag-Erling Smørgrav	0abd21bdb8	Implement BIO_DELETE for vnode devices by simply overwriting the deleted sectors with all-zeroes. The zeroes come from a static buffer; null(4) uses a dynamic buffer for the same purpose (for /dev/zero). It might be a good idea to have a static, shared, read-only all-zeroes page somewhere in the kernel that md(4), null(4) and any other code that needs zeroes could use. Reviewed by: kib MFC after: 3 weeks	2011-04-29 21:18:41 +00:00
Marcel Moolenaar	8d5ac6c3cf	Use the preload_fetch_addr() and preload_fetch_size() convenience functions and only create the MD device when we have a non-zero pointer and size. Sponsored by: Juniper Networks	2011-02-09 19:31:10 +00:00
Konstantin Belousov	4a13a769dc	Add support for BIO_DELETE on swap-backed md(4). In the case of BIO_DELETE covering the whole page, free the page. Otherwise, clear the region and mark it clean. Not marking the page dirty could reinstantiate cleared data, but it is allowed by BIO_DELETE specification and saves unneeded write to swap. Reviewed by: alc Tested by: pho MFC after: 2 weeks	2011-01-27 16:10:25 +00:00
Konstantin Belousov	96410b9575	Bio shall not be accessed after g_io_deliver(9). Reported and tested by: pho Reviewed by: ae, phk MFC after: 1 week	2011-01-25 14:00:30 +00:00
Konstantin Belousov	007777f137	Add missed (). Noted by: alc MFC after: 3 days	2011-01-19 16:48:07 +00:00
Alan Cox	18a22f96e0	There is no point in calling vm_object_set_writeable_dirty() on an object that is definitively known to be swap backed since its only effects are on vnode-backed objects. Reviewed by: kib	2011-01-19 15:43:54 +00:00
Konstantin Belousov	d91e813c7b	Add reporting of GEOM::candelete BIO_GETATTR for md(4) and geom_disk(4). Non-zero value of attribute means that device supports BIO_DELETE. Suggested and reviewed by: pjd Tested by: pho MFC after: 1 week	2010-12-29 12:11:07 +00:00
Konstantin Belousov	c44d423ed8	Add sysctl vm.md_malloc_wait, non-zero value of which switches malloc-backed md(4) to using M_WAITOK malloc calls. M_NOWAITOK allocations may fail when enough memory could be freed, but not immediately. E.g. SU UFS becomes quite unhappy when metadata write return error, that would happen for failed malloc() call. Reported and tested by: pho MFC after: 1 week	2010-12-29 11:39:15 +00:00
Marcel Moolenaar	3d5c947d9d	Allow the MDIOCATTACH ioctl operation to originate from within the kernel. To protect against malicious software, we demand that the file name is at a particular location (i.e. appended to the mdio structure) for it to be treated as in-kernel.	2010-10-18 04:26:32 +00:00
Jaakko Heinonen	b42f40b8eb	- Remove some extra white space. - Wrap g_md_dumpconf() prototype to 80 columns.	2010-07-26 10:37:14 +00:00
Jaakko Heinonen	f4e7c5a894	Convert md(4) to use alloc_unr(9) and alloc_unr_specific(9) for unit number allocation. The old approach had some problems such as it allowed an overflow to occur in the unit number calculation. PR: kern/122288	2010-07-22 10:24:28 +00:00
Konstantin Belousov	d12fc952b7	Calculate nshift only once. Also noted by: avg MFC after: 1 week	2010-07-06 18:22:57 +00:00
Alan Cox	ecd5dd957d	Eliminate unnecessary page queues locking.	2010-06-15 18:37:31 +00:00
Konstantin Belousov	fc0c3802f0	Lock the page around vm_page_activate() and vm_page_deactivate() calls where it was missed. The wrapped fragments now protect wire_count with page lock. Reviewed by: alc	2010-05-03 20:31:13 +00:00
Edward Tomasz Napierala	5ed1eb2bb0	Fix panic on invalid 'mdconfig -at preload' usage. PR: kern/80136	2010-02-27 10:41:30 +00:00
Antoine Brodin	13e403fdea	(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument. Fix some wrong usages. Note: this does not affect generated binaries as this argument is not used. PR: 137213 Submitted by: Eygene Ryabinkin (initial version) MFC after: 1 month	2009-12-28 22:56:30 +00:00
Konstantin Belousov	3364c323e6	Implement global and per-uid accounting of the anonymous memory. Add rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved for the uid. The accounting information (charge) is associated with either map entry, or vm object backing the entry, assuming the object is the first one in the shadow chain and entry does not require COW. Charge is moved from entry to object on allocation of the object, e.g. during the mmap, assuming the object is allocated, or on the first page fault on the entry. It moves back to the entry on forks due to COW setup. The per-entry granularity of accounting makes the charge process fair for processes that change uid during lifetime, and decrements charge for proper uid when region is unmapped. The interface of vm_pager_allocate(9) is extended by adding struct ucred *, that is used to charge appropriate uid when allocation is performed by kernel, e.g. md(4). Several syscalls, among them is fork(2), may now return ENOMEM when global or per-uid limits are enforced. In collaboration with: pho Reviewed by: alc Approved by: re (kensmith)	2009-06-23 20:45:22 +00:00
Marcel Moolenaar	dbb95048da	Add cpu_flush_dcache() for use after non-DMA based I/O so that a possible future I-cache coherency operation can succeed. On ARM for example the L1 cache can be (is) virtually mapped, which means that any I/O that uses temporary mappings will not see the I-cache made coherent. On ia64 a similar behaviour has been observed. By flushing the D-cache, execution of binaries backed by md(4) and/or NFS work reliably. For Book-E (powerpc), execution over NFS exhibits SIGILL once in a while as well, though cpu_flush_dcache() hasn't been implemented yet. Doing an explicit D-cache flush as part of the non-DMA based I/O read operation eliminates the need to do it as part of the I-cache coherency operation itself and as such avoids pessimizing the DMA-based I/O read operations for which D-cache are already flushed/invalidated. It also allows future optimizations whereby the bcopy() followed by the D-cache flush can be integrated in a single operation, which could be implemented using on-chips DMA engines, by-passing the D-cache altogether.	2009-05-18 18:37:18 +00:00
John Baldwin	33fc362512	Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a filesystem supports additional operations using shared vnode locks. Currently this is used to enable shared locks for open() and close() of read-only file descriptors. - When an ISOPEN namei() request is performed with LOCKSHARED, use a shared vnode lock for the leaf vnode only if the mount point has the extended shared flag set. - Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but not O_CREAT. - Use a shared vnode lock around VOP_CLOSE() if the file was opened with O_RDONLY and the mountpoint has the extended shared flag set. - Adjust md(4) to upgrade the vnode lock on the vnode it gets back from vn_open() since it now may only have a shared vnode lock. - Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since FIFO's require exclusive vnode locks for their open() and close() routines. (My recent MPSAFE patches for UDF and cd9660 already included this change.) - Enable extended shared operations on UFS, cd9660, and UDF. Submitted by: ups Reviewed by: pjd (ZFS bits) MFC after: 1 month	2009-03-11 14:13:47 +00:00
Alan Cox	b72cca38a6	Remove unnecessary page queues locking around vm_page_wakeup(). (This change is applicable to RELENG_7 but not RELENG_6.) MFC after: 1 week	2009-02-22 02:50:31 +00:00
Edward Tomasz Napierala	a9ebb31183	Add the possibility to specify "-o force" with "mdconfig -du". Reviewed by: scottl Approved by: rwatson (mentor) Sponsored by: FreeBSD Foundation	2009-01-10 17:17:18 +00:00
Edward Tomasz Napierala	41c8b468e6	Fix forced mdconfig -du. E.g. the following would previously result in panic: mdconfig -af blah.img -o force mount /dev/md0 /mnt mdconfig -du 0 Reviewed by: scottl Approved by: rwatson (mentor) Sponsored by: FreeBSD Foundation	2008-12-16 20:59:27 +00:00
Attilio Rao	0359a12ead	Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread was always curthread and totally unuseful. Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>	2008-08-28 15:23:18 +00:00
Ed Schouten	06d425f92e	Remove the distinction between device minor and unit numbers. Even though we got rid of device major numbers some time ago, device drivers still need to provide unique device minor numbers to make_dev(). These numbers are only used inside the kernel. They are not related to device major and minor numbers which are visible in devfs. These are actually based on the inode number of the device. It would eventually be nice to remove minor numbers entirely, but we don't want to be too aggressive here. Because the 8-15 bits of the device number field (si_drv0) are still reserved for the major number, there is no 1:1 mapping of the device minor and unit numbers. Because this is now unused, remove the restrictions on these numbers. The MAXMAJOR definition was actually used for two purposes. It was used to convert both the userspace and kernelspace device numbers to their major/minor pair, which is why it is now named UMINORMASK. minor2unit() and unit2minor() have now become useless. Both minor() and dev2unit() now serve the same purpose. We should eventually remove some of them, at least turning them into macros. If devfs would become completely minor number unaware, we could consider using si_drv0 directly, just like si_drv1 and si_drv2. Approved by: philip (mentor)	2008-05-29 12:50:46 +00:00
Philip Paeps	3cf74e539b	Zero sc->vnode if mdsetcred() fails. This fixes the panic which happens when mdcreate_vnode() calls vn_close() and mddestroy() calls it again further down the error handling path. Reviewed by: kris, kib MFC after: 3 days	2008-02-28 18:31:54 +00:00
Attilio Rao	22db15c06f	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjunction with 'thread' argument passing which is always curthread. Remove the unuseful extra-argument and pass explicitly curthread to lower layer functions, when necessary. KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>	2008-01-13 14:44:15 +00:00
Attilio Rao	cb05b60a89	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modify makes the code cleaner and in particular remove an annoying dependence helping next lockmgr() cleanup. KPI results, obviously, changed. Manpage and FreeBSD_version will be updated through further commits. As a side note, would be valuable to say that next commits will address a similar cleanup about VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>	2008-01-10 01:10:58 +00:00
Maxim Sobolev	a03be42da6	Put back devstat support that was lost during GEOM transition. Initially, I've tried to move md(4) to use geom_disk class, like real disks do, but this requires major rework of some of the existing features such as configuration dumping for example. Therefore just putting devstat support directly into md(4) seems to be optimal solution. Now you can see md(4) stats in `systat -vm' again. MFC after: 2 weeks	2007-11-07 22:47:41 +00:00
Julian Elischer	3745c395ec	Rename the kthread_xxx (e.g. kthread_create()) calls to kproc_xxx as they actually make whole processes. This makes way for us to add REAL kthread_create() and friends that actually make threads. It turns out that most of these calls actually end up being moved back to the thread version when it's added. but we need to make this cosmetic change first. I'd LOVE to do this rename in 7.0 so that we can eventually MFC the new kthread_xxx() calls.	2007-10-20 23:23:23 +00:00
Jeff Roberson	982d11f836	Commit 14/14 of sched_lock decomposition. - Use thread_lock() rather than sched_lock for per-thread scheduling synchronization. - Use the per-process spinlock rather than the sched_lock for per-process scheduling synchronization. Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)	2007-06-05 00:00:57 +00:00
Konstantin Belousov	9e223287c0	Revert UF_OPENING workaround for CURRENT. Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation argument from being file descriptor index into the pointer to struct file. Proposed and reviewed by: jhb Reviewed by: daichi (unionfs) Approved by: re (kensmith)	2007-05-31 11:51:53 +00:00
Konstantin Belousov	3b7b5496a7	Resolve two deadlocks that could be caused by busy md device backed by vnode. Allow for md thread and the thread that owns lock on vnode backing the md device to do the write even when runningbufspace is exhausted. Tested by: Peter Holm Reviewed by: tegge MFC after: 2 weeks	2006-12-14 11:34:07 +00:00
Pawel Jakub Dawidek	a777323904	Style nits.	2006-11-01 18:59:06 +00:00
Pawel Jakub Dawidek	5541f25ec7	Fix md(4) panic which occurs when I/O request different than BIO_READ/BIO_WRITE is sent to vnode-backed provider (BIO_DELETE or BIO_FLUSH). Reported by: ceri Add support for BIO_FLUSH to vnode-backed md(4) devices based on VOP_FSYNC().	2006-11-01 18:56:18 +00:00
John Baldwin	a08d2e7fe1	- Conditionally acquire Giant in mdstart_vnode(), mdcreate_vnode(), and mddestroy() only if the file is from a non-MPSAFE VFS. - No longer unconditionally hold Giant in the md kthread for vnode-backed kthreads. - Improve the handling of the thread exit race when destroying an md device.	2006-03-28 21:25:11 +00:00
Wojciech A. Koszek	c27a895433	Teach md(4) and mdconfig(8) how to understand XML. Right now there won't be a problem with listing large number of md(4) devices. Either 'list' or 'query' mode uses XML. Additionally, new functionality was introduced. It's possible to pass multiple devices to -u: # ./mdconfig -l -u md0,md1 Approved by: cognet (mentor)	2006-03-26 23:21:11 +00:00
Luigi Rizzo	de64f22aa4	make sure that the start and end preloaded MFS markers are in contiguous strings, and that the compiler does not optimize them away because it thinks they are unused.	2006-01-31 13:35:30 +00:00
Pawel Jakub Dawidek	b322d85d53	Call NDFREE() only when vn_open() succeeded. MFC after: 3 days	2006-01-27 11:27:55 +00:00
Maxim Konovalov	6c3cd0e2f6	o Fix typos in the comments. Submitted by: Wojciech A. Koszek	2005-12-28 15:18:18 +00:00
Robert Watson	5bb84bc84b	Normalize a significant number of kernel malloc type names: - Prefer '_' to ' ', as it results in more easily parsed results in memory monitoring tools such as vmstat. - Remove punctuation that is incompatible with using memory type names as file names, such as '/' characters. - Disambiguate some collisions by adding subsystem prefixes to some memory types. - Generally prefer lower case to upper case. - If the same type is defined in multiple architecture directories, attempt to use the same name in additional cases. Not all instances were caught in this change, so more work is required to finish this conversion. Similar changes are required for UMA zone names.	2005-10-31 15:41:29 +00:00
Poul-Henning Kamp	947fc8de03	Make sure that the worker thread knows the type early enough to grab Giant for vnode backing. Found by: pho & tegge	2005-10-06 19:47:04 +00:00
Poul-Henning Kamp	9b00ca1961	Fix configuration locking in MD. Remove md_mtx. Remove GIANT from the mdctl device driver and avoid DROP_GIANT, PICKUP_GIANT and geom events since we can call into GEOM directly now. Pick up Giant around vn_close(). Apply an exclusive sx around mdctls ioctl and preloading to protect lists etc.. Don't initialize our lock (md_mtx or md_sx) from a SYSINIT when there is a perfectly good pair of _fini/_init functions to do it from. Prune any final fractional sector from the mediasize to keep GEOM happy. Cleanups: Unify MDIOVERSION check in (x)mdctlioctl() Add pointer to start() routine to softc to eliminate a switch{} Inline guts of mddetach(). Always pass error pointer to mdnew(), simplify implementation.	2005-09-19 06:55:27 +00:00
Poul-Henning Kamp	9fbea3e365	Do not destroy the queue mutex until the thread is done with it.	2005-09-11 12:35:32 +00:00
Pawel Jakub Dawidek	7ee3c044d0	- Add md_mtx lock to protect ID number and list of devices. - Always check mdnew() return value, as even in !autounit case kthread_create() can fail. Those two changes fix several panics provoked by simple stress test. Tested by: Kris The BugMagnet MFC after: 3 days	2005-08-31 19:45:11 +00:00
Christian S.J. Peron	8677689134	Ensure that file flags such as schg, sappnd (and others) are honored by md(4). Before this change, it was possible to by-pass these flags by creating memory disks which used a file as a backing store and writing to the device. This was discussed by the security team, and although this is problematic, it was decided that it was not critical as we never guarantee that root will be restricted. This change implements the following behavior changes: -If the user specifies the readonly flag, unset write operations before opening the file. If the FWRITE mask is unset, the device will be created with the MD_READONLY mask set. (readonly) -Add a check in g_md_access which checks to see if the MD_READONLY mask is set, if so return EROFS -Do not gracefully downgrade access modes without telling the user. Instead make the user specify their intentions for the device (assuming the file is read only). This seems like the more correct way to handle things. This is a RELENG_6 candidate. PR: kern/84635 Reviewed by: phk	2005-08-17 01:24:55 +00:00
Alan Cox	e340fc602b	Request a CPU private mapping from sf_buf_alloc(). If the swap-backed memory disk is larger than the number of available sf_bufs, this improves performance on SMPs by eliminating interprocessor TLB shootdowns. For example, with 6656 sf_bufs, the default on my test machine, and a 256MB swap-backed memory disk, I see the command "dd if=/dev/md0 of=/dev/null bs=64k" achieve ~489MB/sec with the default, shared mappings, and ~587MB/sec with CPU private mappings.	2005-02-13 21:51:50 +00:00
Poul-Henning Kamp	d9aaa28f63	Use MAXMINOR	2005-01-29 16:50:04 +00:00
Pawel Jakub Dawidek	1db17c6db2	- Don't destroy UMA zone on error in mdcreate_malloc(), because we need it in mddestroy() to properly free already allocated memory. This fixes a panic when we want to create too big memory backed device with preallocate memory (-o reserve). - Remove redundant { }. MFC after: 1 week	2005-01-22 19:56:03 +00:00
Poul-Henning Kamp	9d3a77c463	Add a couple of mtx_asserts() to try to narrow down the window on a bug repeatedly reported.	2005-01-22 19:08:50 +00:00
Warner Losh	098ca2bda9	Start each of the license/copyright comments with /*-, minor shuffle of lines	2005-01-06 01:43:34 +00:00
Alan Cox	c935314fae	Add needed synchronization to the error handling code that was introduced in revision 1.141. Lock assertion failures reported by: Kris Kennaway	2005-01-05 05:32:52 +00:00
John Baldwin	63710c4d35	Stop explicitly touching td_base_pri outside of the scheduler and simply set a thread's priority via sched_prio() when that is the desired action. The schedulers will start managing td_base_pri internally shortly.	2004-12-30 20:29:58 +00:00
Pawel Jakub Dawidek	88b5b78d59	Rewrite piece of code which I committed some time ago that allows to show file name for 'mdconfig -l -u <x>' command. This allows to preserve API/ABI compatibility with version 0 (that's why I changed version number back to 0) and will allow to merge this change to RELENG_5. MFC after: 5 days	2004-12-27 17:20:06 +00:00
Marcel Moolenaar	8b6fc67a49	Fix the MDIOCDETACH ioctl() for md(4). Now that the md_file field in the mdio structure is an array and not a pointer, we cannot test for it to be NULL. It never is. Instead, test for md_file[0] to be '\0'.	2004-11-13 05:00:12 +00:00
Pawel Jakub Dawidek	e3ed29a739	Be consistent and use 'if (error != 0)' instead of 'if (error)' everywhere.	2004-11-06 13:16:35 +00:00
Pawel Jakub Dawidek	61a6eb62ec	For file backed md(4) devices output their source file via 'mdconfig -l -u <unit>'. Bump version number, as this change breaks ABI/API.	2004-11-06 13:07:02 +00:00
Poul-Henning Kamp	3b66ad07db	Don't explicitly call g_waitidle(), it happens automagically now.	2004-10-23 20:50:06 +00:00
Brian Feldman	812851b6c9	Account for failure in vm_pager_allocate() or vm_pager_get_pages() in md(8). The former is generally not going to fail, but the latter can fail when the underlying swap device returns an error. There are still plenty of other places where vm_pager_get_pages() failing will lead directly to crashes, so it's a good idea to put your swap on RAID if you care enough to put any of your disks on RAID....	2004-10-12 04:47:16 +00:00
Pawel Jakub Dawidek	e4cdd0d4b5	Actually this order (unlock, wakeup) in this case is race-safe and can save us 2 context switches. Explained by: njl	2004-09-18 09:16:19 +00:00
Pawel Jakub Dawidek	b830359bc5	- Make md(4) 64-bit clean. After this change it should be possible to use very big md(4) devices. - Clean up and simplify the code a bit. - Use humanize_number(3) to print size of md(4) devices. - Add 't' suffix which stands for terabyte. - Make '-S' to really work with all types of devices. - Other minor changes.	2004-09-16 21:32:13 +00:00
Pawel Jakub Dawidek	fcd57fbe6f	There is no need to keep 'npage' value inside our softc structure, it is only used in one function. While doing so, change its type to vm_ooffset_t. We are still limited for swap-backed devices to 16TB on 32-bit architectures where PAGE_SIZE is 4096 bytes.	2004-09-16 20:38:11 +00:00
Pawel Jakub Dawidek	a8a58d03f6	- Do not use bio_pblkno as it is going away anyway. - Prefer bio_length than bio_bcount.	2004-09-16 19:42:17 +00:00
Pawel Jakub Dawidek	4b07ede4a7	First wakeup, then unlock.	2004-09-16 18:59:19 +00:00

285 commits