opnsense-src

mirror of https://github.com/opnsense/src.git synced 2026-04-28 09:37:08 -04:00

Author	SHA1	Message	Date
Gleb Smirnoff	6351b3857b	Plug route reference underleak that happens with FLOWTABLE after r297225. Submitted by: Mike Karels <mike karels.net>	2016-05-27 17:31:02 +00:00
Don Lewis	91336b403a	Import Dummynet AQM version 0.2.1 (CoDel, FQ-CoDel, PIE and FQ-PIE). Centre for Advanced Internet Architectures Implementing AQM in FreeBSD * Overview <http://caia.swin.edu.au/freebsd/aqm/index.html> * Articles, Papers and Presentations <http://caia.swin.edu.au/freebsd/aqm/papers.html> * Patches and Tools <http://caia.swin.edu.au/freebsd/aqm/downloads.html> Overview Recent years have seen a resurgence of interest in better managing the depth of bottleneck queues in routers, switches and other places that get congested. Solutions include transport protocol enhancements at the end-hosts (such as delay-based or hybrid congestion control schemes) and active queue management (AQM) schemes applied within bottleneck queues. The notion of AQM has been around since at least the late 1990s (e.g. RFC 2309). In recent years the proliferation of oversized buffers in all sorts of network devices (aka bufferbloat) has stimulated keen community interest in four new AQM schemes -- CoDel, FQ-CoDel, PIE and FQ-PIE. The IETF AQM working group is looking to document these schemes, and independent implementations are a corner-stone of the IETF's process for confirming the clarity of publicly available protocol descriptions. While significant development work on all three schemes has occurred in the Linux kernel, there is very little in FreeBSD. Project Goals This project began in late 2015, and aims to design and implement functionally-correct versions of CoDel, FQ-CoDel, PIE and FQ-PIE in FreeBSD (with code BSD-licensed as much as practical). We have chosen to do this as extensions to FreeBSD's ipfw/dummynet firewall and traffic shaper. 
Implementation of these AQM schemes in FreeBSD will: * Demonstrate whether the publicly available documentation is sufficient to enable independent, functionally equivalent implementations * Provide a broader suite of AQM options for sections of the networking community that rely on FreeBSD platforms Program Members: * Rasool Al Saadi (developer) * Grenville Armitage (project lead) Acknowledgements: This project has been made possible in part by a gift from the Comcast Innovation Fund. Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au> X-No objection: core MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D6388	2016-05-26 21:40:13 +00:00
John Baldwin	052a5418e8	Don't reuse the source mbuf in tcp_respond() if it is not writable. Not all mbufs passed up from device drivers are M_WRITABLE(). In particular, the Chelsio T4/T5 driver uses a feature called "buffer packing" to receive multiple frames in a single receive buffer. The mbufs for these frames all share the same external storage so are treated as read-only by the rest of the stack when multiple frames are in flight. Previously tcp_respond() would blindly overwrite read-only mbufs when INVARIANTS was disabled or panic with an assertion failure if INVARIANTS was enabled. Note that the new case is a bit of a mix of the two other cases in tcp_respond(). The TCP and IP headers must be copied explicitly into the new mbuf instead of being inherited (similar to the m == NULL case), but the addresses and ports must be swapped in the reply (similar to the m != NULL case). Reviewed by: glebius	2016-05-26 18:35:37 +00:00
Michael Tuexen	4f3b84b524	Make struct sctp_paddrthlds compliant to RFC 7829.	2016-05-26 11:38:26 +00:00
Hans Petter Selasky	fc271df341	Use optimised complexity safe sorting routine instead of the kernel's "qsort()". The kernel's "qsort()" routine can in worst case spend O(N*N) amount of comparisons before the input array is sorted. It can also recurse a significant amount of times using up the kernel's interrupt thread stack. The custom sorting routine takes advantage of that the sorting key is only 64 bits. Based on set and cleared bits in the sorting key it partitions the array until it is sorted. This process has a recursion limit of 64 times, due to the number of set and cleared bits which can occur. Compiled with -O2 the sorting routine was measured to use 64-bytes of stack. Multiplying this by 64 gives a maximum stack consumption of 4096 bytes for AMD64. The same applies to the execution time, that the array to be sorted will not be traversed more than 64 times. When serving roughly 80Gb/s with 80K TCP connections, the old method consisting of "qsort()" and "tcp_lro_mbuf_compare_header()" used 1.4% CPU, while the new "tcp_lro_sort()" used 1.1% for LRO related sorting as measured by Intel Vtune. The testing was done using a sysctl to toggle between "qsort()" and "tcp_lro_sort()". Differential Revision: https://reviews.freebsd.org/D6472 Sponsored by: Mellanox Technologies Tested by: Netflix Reviewed by: gallatin, rrs, sephe, transport	2016-05-26 11:10:31 +00:00
Michael Tuexen	f88d0cfe7a	When sending an ICMP response to an SCTP packet, * include the SCTP common header, if possible * include the first 8 bytes of the INIT chunk, if possible This provides the necessary information for the receiver of the ICMP packet to process it. MFC after: 1 week	2016-05-25 22:16:11 +00:00
Michael Tuexen	6d7270a580	Send an ICMP packet indicating destination unreachable/protocol unreachable if we don't handle the packet in the kernel and not in userspace. MFC after: 1 week	2016-05-25 15:54:21 +00:00
Michael Tuexen	ad2cbb09ef	Count packets as not being delivered only if they are neither processed by a kernel handler nor by a raw socket. MFC after: 1 week	2016-05-25 13:48:26 +00:00
Don Lewis	883054b4c3	Change net.inet.tcp.ecn.enable sysctl mib from a binary off/on control to a three way setting. 0 - Totally disable ECN. (no change) 1 - Enable ECN if incoming connections request it. Outgoing connections will request ECN. (no change from present != 0 setting) 2 - Enable ECN if incoming connections request it. Outgoing connections will not request ECN. Change the default value of net.inet.tcp.ecn.enable from 0 to 2. Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have similar capabilities. The actual values above match Linux, and the default matches the current Linux default. Reviewed by: eadler MFC after: 1 month MFH: yes Sponsored by: https://reviews.freebsd.org/D6386	2016-05-19 22:20:35 +00:00
Gleb Smirnoff	f59d975e10	Tiny refactor of r294869/r296881: use defines to mask the VNET() macro. Suggested by: bz	2016-05-17 23:14:17 +00:00
Randall Stewart	5105a92c49	This small change adopts the excellent suggestion for using named structures in the add of a new tcp-stack that came in late to me via email after the last commit. It also makes it so that a new stack may optionally get a callback during a retransmit timeout. This allows the new stack to clear specific state (think sack scoreboards or other such structures). Sponsored by: Netflix Inc. Differential Revision: http://reviews.freebsd.org/D6303	2016-05-17 09:53:22 +00:00
Andrey V. Elsukov	2685841b38	Make named objects set-aware. Now it is possible to create named objects with the same name in different sets. Add optional manage_sets() callback to objects rewriting framework. It is intended to implement handler for moving and swapping named object's sets. Add ipfw_obj_manage_sets() function that implements generic sets handler. Use new callback to implement sets support for lookup tables. External actions objects are global and they don't support sets. Modify eaction_findbyname() to reflect this. ipfw(8) now may fail to move rules or sets, because some named objects in target set may have conflicting names. Note that ipfw_obj_ntlv type was changed, but since lookup tables actually didn't support sets, this change is harmless. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2016-05-17 07:47:23 +00:00
Mark Johnston	565e7fd3bc	opt_kdtrace.h is not needed for SDT probes as of r258541.	2016-05-15 20:04:43 +00:00
Mark Johnston	e8800f3c2f	Fix a few style issues in the ICMP sysctl descriptions. MFC after: 1 week	2016-05-15 03:19:53 +00:00
Michael Tuexen	574679afe9	Fix a locking bug which only shows up on Mac OS X. MFC after: 1 week	2016-05-14 13:44:49 +00:00
Michael Tuexen	5f05199c19	Fix a bug introduced by the implementation of I-DATA support. There was the requirement that two structures are in sync, which is not valid anymore. Therefore don't rely on this in the code anymore. Thanks to Radek Malcic for reporting the issue. He found this when using the userland stack. MFC after: 1 week	2016-05-13 09:11:41 +00:00
Michael Tuexen	fd60718d17	Retire net.inet.sctp.strict_sacks and net.inet.sctp.strict_data_order sysctl's, since they were only there to interop with non-conformant implementations. This should not be a problem anymore.	2016-05-12 16:34:59 +00:00
Michael Tuexen	d88a626a1d	Enable SACK Immediately by default. This has been tested for a long time and is covered by RFC 7053. MFC after: 1 week	2016-05-12 15:48:08 +00:00
Michael Tuexen	e7f232a0db	Use a format string in snprintf() for consistency. This was reported by Radek Malcic when using the userland stack in combination with MinGW. MFC after: 1 week	2016-05-12 14:41:53 +00:00
Sepherosa Ziehau	e6ec45f869	tcp/syncache: Add comment for syncache_respond Suggested by: hiren, hps Reviewed by: sbruno Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D6148	2016-05-10 04:59:04 +00:00
Hiren Panchasara	7c375daa61	Add an option to use rfc6675 based pipe/inflight bytes calculation in htcp. Submitted by: Kevin Bowling <kevin.bowling@kev009.com> MFC after: 1 week Sponsored by: Limelight Networks	2016-05-09 19:19:03 +00:00
Michael Tuexen	a807fe2d83	Cleanup a comment. MFC after: 1 week	2016-05-09 16:35:05 +00:00
Pedro F. Giffuni	a4641f4eaa	sys/net*: minor spelling fixes. No functional change.	2016-05-03 18:05:43 +00:00
Sepherosa Ziehau	51e3c20d36	tcp/lro: Refactor the active list operation. Ease more work concerning active list, e.g. hash table etc. Reviewed by: gallatin, rrs (earlier version) Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D6137	2016-05-03 08:13:25 +00:00
Michael Tuexen	afd6748258	Undo a spell fix introduced in r298942, which breaks compilation.	2016-05-02 21:23:05 +00:00
Pedro F. Giffuni	cd0a4ff6a5	netinet/sctp*: minor spelling fixes in comments. No functional change. Reviewed by: tuexen	2016-05-02 20:56:11 +00:00
Michael Tuexen	ec70917ffa	When a client uses UDP encapsulation and lists IP addresses in the INIT chunk, enable UDP encapsulation for all those addresses. This helps clients using a userland stack to support multihoming if they are not behind a NAT. MFC after: 1 week	2016-05-01 21:48:55 +00:00
Michael Tuexen	7154bf4a41	Add the UDP encaps port as a parameter to sctp_add_remote_addr(). This is currently only a code change without any functional change. But this allows to set the remote encapsulation port in a more detailed way, which will be provided in a follow-up commit. MFC after: 1 week	2016-04-30 14:25:00 +00:00
Michael Tuexen	3c3f9e2a46	Don't assign, just compare...	2016-04-29 20:33:20 +00:00
Michael Tuexen	fd7af143e2	Add support for handling ICMP and ICMP6 messages sent in response to SCTP/UDP/IP and SCTP/UDP/IPv6 packets.	2016-04-29 20:22:01 +00:00
Sepherosa Ziehau	9340a8d5b9	tcp/syncache: Set flowid and hash type properly for SYN|ACK So the underlying drivers can use it to select the sending queue properly for SYN|ACK instead of rolling their own hash. Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D6120	2016-04-29 07:23:08 +00:00
Randall Stewart	abb901c5d7	Complete the UDP tunneling of ICMP msgs to those protocols interested in having tunneled UDP and finding out about the ICMP (tested by Michael Tuexen with SCTP.. soon to be using this feature). Differential Revision: http://reviews.freebsd.org/D5875	2016-04-28 15:53:10 +00:00
Randall Stewart	e5ad64562a	This cleans up the timers code in TCP to start using the new async_drain functionality. This has been tested in NF as well as by Verisign. Still to do in here is to remove all the old flags. They are currently left being maintained but probably are no longer needed. Sponsored by: Netflix Inc. Differential Revision: http://reviews.freebsd.org/D5924	2016-04-28 13:27:12 +00:00
Pedro F. Giffuni	a5b50fbc20	ipdivert: Remove unnecessary and incorrectly typed variable. In principle n is only used to carry a copy of ipi_count, which is unsigned, in the non-VIMAGE case, however ipi_count can be used directly so it is not needed at all. Removing it makes things look cleaner.	2016-04-28 02:46:08 +00:00
Sepherosa Ziehau	9b436b180c	tcp/lro: Fix more typo Noticed by: hiren MFC after: 1 week Sponsored by: Microsoft OSTC	2016-04-28 01:43:18 +00:00
Michael Tuexen	c09a15342a	Don't use the control argument after calling sctp_add_to_readq(). This breaks the userland stack. There should be no functional change for the FreeBSD kernel stack. While there, use consistent variable naming.	2016-04-27 18:58:47 +00:00
Sepherosa Ziehau	9e3db01282	tcp/lro: Fix typo. MFC after: 1 week Sponsored by: Microsoft OSTC	2016-04-27 09:40:55 +00:00
Conrad Meyer	2769d06203	in_lltable_alloc and in6 copy: Don't leak LLE in error path Fix a memory leak in error conditions introduced in r292978. Reported by: Coverity CIDs: 1347009, 1347010 Sponsored by: EMC / Isilon Storage Division	2016-04-26 23:13:48 +00:00
Conrad Meyer	bac5bedf44	tcp_usrreq: Free allocated buffer in relock case The disgusting macro INP_WLOCK_RECHECK may early-return. In tcp_default_ctloutput() the TCP_CCALGOOPT case allocates memory before invoking this macro, which may leak memory. Add a _CLEANUP variant that takes a code argument to perform variable cleanup in the early return path. Use it to free the 'pbuf' allocated in tcp_default_ctloutput(). I am not especially happy with this macro, but I reckon it's not any worse than INP_WLOCK_RECHECK already was. Reported by: Coverity CID: 1350286 Sponsored by: EMC / Isilon Storage Division	2016-04-26 23:02:18 +00:00
Michael Tuexen	7e372b1a40	Remove a function, which is not used anymore.	2016-04-23 09:15:58 +00:00
Jonathan T. Looney	b8c2cd15e9	Prevent underflows in tp->snd_wnd if the remote side ACKs more than tp->snd_wnd. This can happen, for example, when the remote side responds to a window probe by ACKing the one byte it contains. Differential Revision: https://reviews.freebsd.org/D5625 Reviewed by: hiren Obtained from: Juniper Networks (earlier version) MFC after: 2 weeks Sponsored by: Juniper Networks	2016-04-21 15:06:53 +00:00
Pedro F. Giffuni	63b6b7a74a	Indentation issues. Contract some lines leftover from r298310. Mea culpa.	2016-04-20 16:19:44 +00:00
Pedro F. Giffuni	02abd40029	kernel: use our nitems() macro when it is available through param.h. No functional change, only trivial cases are done in this sweep, Discussed in: freebsd-current	2016-04-19 23:48:27 +00:00
Michael Tuexen	b1deed45e6	Address issues found by the XCode code analyzer.	2016-04-18 20:16:41 +00:00
Michael Tuexen	f8ee69bf81	Fix signed/unsigned warnings.	2016-04-18 11:39:41 +00:00
Michael Tuexen	a39ddef038	Fix a warning about an unused variable.	2016-04-18 09:39:46 +00:00
Michael Tuexen	98d5fd976b	Put panic() calls under INVARIANTS.	2016-04-18 09:29:14 +00:00
Michael Tuexen	f2ea2a2d5f	Cleanup debug output.	2016-04-18 06:58:07 +00:00
Michael Tuexen	e187bac213	Don't use anonymous unions.	2016-04-18 06:38:53 +00:00
Michael Tuexen	24a9e1b53b	Remove a left-over debug printf().	2016-04-18 06:32:24 +00:00
Michael Tuexen	b9dd6a90b6	Fix the ICMP6 handling for SCTP. Keep the IPv4 code in sync. MFC after: 1 week	2016-04-16 21:34:49 +00:00
Pedro F. Giffuni	99d628d577	netinet: for pointers replace 0 with NULL. These are mostly cosmetic, no functional change. Found with devel/coccinelle. Reviewed by: ae, tuexen	2016-04-15 15:46:41 +00:00
Andrey V. Elsukov	2acdf79f53	Add External Actions KPI to ipfw(9). It allows implementing loadable kernel modules with new actions and without needing to modify kernel headers and ipfw(8). The module registers its action handler and keyword string, that will be used as action name. Using generic syntax user can add rules with this action. Also ipfw(8) can be easily modified to extend basic syntax for external actions, that become a part of the base system. Sample modules will be coming soon. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2016-04-14 22:51:23 +00:00
Michael Tuexen	4d6b853ad6	Allow the handling of ICMP messages sent in response to SCTP packets containing an INIT chunk. These need to be handled in case the peer does not support SCTP and returns an ICMP message indicating destination unreachable, protocol unreachable. MFC after: 1 week	2016-04-14 19:59:21 +00:00
Michael Tuexen	f77b842746	When delivering an ICMP packet to the ctlinput function, ensure that the outer IP header, the ICMP header, the inner IP header and the first n bytes are stored in contiguous memory. The ctlinput functions currently rely on this for n = 8. This fixes a bug in case the inner IP header had options. While there, remove the options from the outer header and provide a way to increase n to allow improved ICMP handling for SCTP. This will be added in another commit. MFC after: 1 week	2016-04-14 19:51:29 +00:00
Luiz Otavio O Souza	de89d74b70	Do not overwrite the dchg variable. It does not cause any real issues because the variable is overwritten only when the packet is forwarded (and the variable is not used anymore). Obtained from: pfSense MFC after: 2 weeks Sponsored by: Rubicon Communications (Netgate)	2016-04-14 18:57:30 +00:00
Michael Tuexen	08b9595770	Refactor the handling of ICMP/IPv4 packets for SCTP/IPv4. This cleans up the code and prepares upcoming handling of ICMP/IPv4 packets for SCTP/UDP/IPv4 packets. IPv6 changes will follow... MFC after: 3 days	2016-04-12 21:40:54 +00:00
Michael Tuexen	cf4476eb39	When processing an ICMP packet containing an SCTP packet, it is required to check the verification tag. However, this requires the verification tag to be not 0. Enforce this. For packets with a verification tag of 0, we need to check if it contains an INIT chunk and use the initiate tag for the validation. This will be a separate commit, since it touches also other code. MFC after: 1 week	2016-04-12 11:48:54 +00:00
Bjoern A. Zeeb	806929d514	Mfp: r296310,r296343 It looks like as with the safety belt of DELAY() fastened () we can completely tear down and free all memory for TCP (after r281599). () in theory a few ticks should be good enough to make sure the timers are all really gone. Could we use a better metric here and check a tcpcb count as an optimization? PR: 164763 Reviewed by: gnn, emaste MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5734	2016-04-09 12:05:23 +00:00
Bjoern A. Zeeb	8586a9635f	Mfp: r296260 The tcp_inpcb (pcbinfo) zone should be safe to destroy. PR: 164763 Reviewed by: gnn MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5732	2016-04-09 11:27:47 +00:00
Bjoern A. Zeeb	f254aeda60	Mfp: r296259 We attach the "counter" to the tcpcbs. Thus don't free the TCP Fastopen zone before the tcpcbs are gone, as otherwise the zone won't be empty. With that it should be safe to destroy the "tfo" zone without leaking the memory. PR: 164763 Reviewed by: gnn MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5731	2016-04-09 10:58:08 +00:00
Bjoern A. Zeeb	dc95d65555	Mfp: r296309 While there is no dependency interaction, stopping the timer before freeing the rest of the resources seems more natural and avoids it being scheduled an extra time when it is no longer needed. Reviewed by: gnn, emaste MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5733	2016-04-09 10:51:07 +00:00
Bjoern A. Zeeb	e18b26d377	Mfp: r296345 No need to keep type stability on raw sockets zone. We've also been running with a KASSERT since r222488 to make sure the ipi_count is 0 on destroy. PR: 164763 Reviewed by: gnn MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5735	2016-04-09 10:44:57 +00:00
Bjoern A. Zeeb	4c86b2bc13	Mfp: r296346 No reason identified to keep UMA_ZONE_NOFREE here. Reviewed by: gnn MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5736	2016-04-09 10:39:54 +00:00
Randall Stewart	9d18771f69	A couple of minor changes that I missed that Michael had done, most noted in these is the change to non-strict ordering for incoming data (this will make pkt-drill test 14 fail but it's expected).	2016-04-07 09:34:41 +00:00
Randall Stewart	44249214d3	This is work done by Michael Tuexen and myself at the IETF. This adds the new I-Data (Interleaved Data) message. This allows a user to be able to have complete freedom from Head Of Line blocking that was previously there due to the inability to send multiple large messages without the TSNs being in sequence. The code has been tested with Michael's various packet drill scripts as well as inter-networking between the IETF's location in Argentina and Germany.	2016-04-07 09:10:34 +00:00
Michael Tuexen	e2823e8570	Set the chunk id for ERROR chunks. This is work with rrs@. MFC after: 1 week	2016-04-01 20:38:15 +00:00
Sepherosa Ziehau	1ea448225c	tcp/lro: Change SLIST to LIST, so that removing an entry is O(1) This is kinda critical to the performance when the CPU is slow and network bandwidth is high, e.g. in the hypervisor. Reviewed by: rrs, gallatin, Dexuan Cui <decui microsoft com> Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D5765	2016-04-01 06:43:05 +00:00
Sepherosa Ziehau	6dd38b8716	tcp/lro: Use tcp_lro_flush_all in device drivers to avoid code duplication And factor out tcp_lro_rx_done, which deduplicates the same logic with netinet/tcp_lro.c Reviewed by: gallatin (1st version), hps, zbb, np, Dexuan Cui <decui microsoft com> Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D5725	2016-04-01 06:28:33 +00:00
George V. Neville-Neil	ce223fb715	Unbreak the RSS/PCBGROUP build.	2016-03-31 00:53:23 +00:00
Edward Tomasz Napierala	35030a5dd4	Remove some NULL checks for M_WAITOK allocations. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-03-29 13:56:59 +00:00
Michael Tuexen	a08b29253d	Don't allow the user to set a peer primary which is restricted and not pending. MFC after: 1 week	2016-03-28 19:32:13 +00:00
Michael Tuexen	76f8482a93	Restrict local addresses until they are acked by the peer. MFC after: 1 week	2016-03-28 19:31:10 +00:00
Michael Tuexen	5114dccbd4	Trigger sending of queued ASCONF chunks if outstanding ones are ACKED. MFC after: 1 week	2016-03-28 11:32:20 +00:00
Michael Tuexen	9a8e308861	Improve compilation on windows 64-bit (for the userland stack). MFC after: 1 week	2016-03-27 10:04:25 +00:00
Sepherosa Ziehau	489f0c3c17	tcp/lro: Return TCP_LRO_NO_ENTRIES if we are short of LRO entries. So that callers could react accordingly. Reviewed by: gallatin (no objection) MFC after: 1 week Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D5695	2016-03-25 02:54:13 +00:00
Bjoern A. Zeeb	4f321dbd1c	Fix compile errors after r297225: - properly V_irtualise variable access unbreaking VIMAGE kernels. - remove the volatile from the function return type to make architecture using gcc happy [-Wreturn-type] "type qualifiers ignored on function return type" I am not entirely happy with this solution putting the u_int there but it will do for now.	2016-03-24 11:40:10 +00:00
George V. Neville-Neil	84cc0778d0	FreeBSD previously provided route caching for TCP (and UDP). Re-add route caching for TCP, with some improvements. In particular, invalidate the route cache if a new route is added, which might be a better match. The cache is automatically invalidated if the old route is deleted. Submitted by: Mike Karels Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D4306	2016-03-24 07:54:56 +00:00
Michael Tuexen	ed65436366	Add const to several constants. Thanks to Nicholas Nethercote for providing the patch via https://bugzilla.mozilla.org/show_bug.cgi?id=1255655 MFC after: 1 week	2016-03-23 13:28:04 +00:00
Jonathan T. Looney	5d20f97461	to_flags is currently a 64-bit integer; however, we only use 7 bits. Furthermore, there is no reason this needs to be a 64-bit integer for the foreseeable future. Also, there is an inconsistency between to_flags and the mask in tcp_addoptions(). Before r195654, to_flags was a u_long and the mask in tcp_addoptions() was a u_int. r195654 changed to_flags to be a u_int64_t but left the mask in tcp_addoptions() as a u_int, meaning that these variables will only be the same width on platforms with 64-bit integers. Convert both to_flags and the mask in tcp_addoptions() to be explicitly 32-bit variables. This may save a few cycles on 32-bit platforms, and avoids unnecessarily mixing types. Differential Revision: https://reviews.freebsd.org/D5584 Reviewed by: hiren MFC after: 2 weeks Sponsored by: Juniper Networks	2016-03-22 15:55:17 +00:00
Hans Petter Selasky	d4d32b9fec	Fix kernel build after adding new sysctl asserts in r296933.	2016-03-16 10:42:24 +00:00
Gleb Smirnoff	bf840a1707	Redo r294869. The array of counters for TCP states doesn't belong to struct tcpstat, because the structure can be zeroed out by netstat(1) -z, and of course running connection counts shouldn't be touched. Place running connection counts into separate array, and provide separate read-only sysctl oid for it.	2016-03-15 00:15:10 +00:00
Gleb Smirnoff	2f06d2ab91	Comment fix: statistics are not read-only.	2016-03-14 18:06:59 +00:00
Bjoern A. Zeeb	19edab1711	Remove duplicate external declaration of tcprexmtthresh making gcc compiles barf.	2016-03-13 21:26:18 +00:00
John Baldwin	47cedcbd72	Use SI_SUB_LAST instead of SI_SUB_SMP as the "catch-all" subsystem. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D5515	2016-03-11 23:18:06 +00:00
Michael Tuexen	1fabc43e9f	Actually send an ASCONF chunk, not only queue one. MFC after: 3 days	2016-03-10 00:27:10 +00:00
Randall Stewart	ec64c84ddc	Fix a sneaky bug where we were missing an extern to get the rxt threshold.. and thus created our own defaulted to 0 :-( Sponsored by: Netflix Inc	2016-03-08 00:16:34 +00:00
Jonathan T. Looney	737d4f6c93	As reported on the transport@ and current@ mailing lists, the FreeBSD TCP stack is not compliant with RFC 7323, which requires that TCP stacks send a timestamp option on all packets (except, optionally, RSTs) after the session is established. This patch adds that support. It also adds a TCP signature option to the packet, if appropriate. PR: 206047 Differential Revision: https://reviews.freebsd.org/D4808 Reviewed by: hiren MFC after: 2 weeks Sponsored by: Juniper Networks	2016-03-07 15:00:34 +00:00
Jonathan T. Looney	9cbade8feb	Some cleanup in tcp_respond() in preparation for another change: - Reorder variables by size - Move initializer closer to where it is used - Remove unneeded variable Differential Revision: https://reviews.freebsd.org/D4808 Reviewed by: hiren MFC after: 2 weeks Sponsored by: Juniper Networks	2016-03-07 14:59:49 +00:00
George V. Neville-Neil	e79cb051d5	Fix dtrace probes (introduced in 287759): debug__input was used for output and drop; connect didn't always fire a user probe some probes were missing in fastpath Submitted by: Hannes Mehnert Sponsored by: REMS, EPSRC Differential Revision: https://reviews.freebsd.org/D5525	2016-03-03 17:46:38 +00:00
Bryan Drewery	6971a63795	Fix build after r29592.	2016-02-23 21:21:47 +00:00
Randall Stewart	6e0efc6a39	This fixes the fastpath code to have a better module initialization sequence when included in loader.conf. It also fixes it so that no matter if some one incorrectly specifies a load order, the lists and such will be initialized on demand at that time so no one can make that mistake. Reviewed by: hiren Differential Revision: D5189	2016-02-23 17:53:39 +00:00
Michael Tuexen	64a3a6304e	Use the SCTP level pointer, not the interface level. MFC after: 3 days	2016-02-19 11:25:18 +00:00
Michael Tuexen	861f6d1196	Add protection code. MFC after: 3 days CID: 748858	2016-02-18 21:33:10 +00:00
Michael Tuexen	fdc4c9d067	Add some protection code. CID: 1331893 MFC after: 3 days	2016-02-18 21:21:45 +00:00
Sepherosa Ziehau	7ae3d4bf54	tcp/lro: Allow drivers to set the TCP ACK/data segment aggregation limit ACK aggregation limit is append count based, while the TCP data segment aggregation limit is length based. Unless the network driver sets these two limits, it's an NO-OP. Reviewed by: adrian, gallatin (previous version), hselasky (previous version) Approved by: adrian (mentor) MFC after: 1 week Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D5185	2016-02-18 04:58:34 +00:00
Michael Tuexen	828318e155	Add protection code for issues reported by PVS / D5245. MFC after: 3 days	2016-02-17 18:12:38 +00:00
Michael Tuexen	815f806b82	Code cleanup which will silence a warning in PVS / D5245.	2016-02-17 18:04:22 +00:00
Michael Tuexen	7b0fd8f2af	Address a warning reported by D5245 / PVS. MFC after: 3 days	2016-02-17 17:52:46 +00:00
Michael Tuexen	467f0d55b4	Whitespace changes.	2016-02-16 20:33:18 +00:00
Michael Tuexen	2b1c7de4d8	Improve the teardown of the SCTP stack. Obtained from: bz@ MFC after: 1 week	2016-02-16 19:36:25 +00:00
Michael Tuexen	e51963a7bb	Loopback addresses are 127.0.0.0/8, not 127.0.0.1/32. MFC after: 1 week	2016-02-11 22:29:39 +00:00
Michael Tuexen	b028cf319e	Use 4 spaces instead of a tab.	2016-02-11 18:35:46 +00:00
Devin Teske	41c0ec9a16	Merge SVN r295220 (bz) from projects/vnet/ Fix a panic that occurs when a vnet interface is unavailable at the time the vnet jail referencing said interface is stopped. Sponsored by: FIS Global, Inc.	2016-02-11 17:07:19 +00:00
Hans Petter Selasky	3e9470b721	Use a pair of ifs when comparing the 32-bit flowid integers so that the sign bit doesn't cause an overflow. The overflow manifests itself as a sorting index wrap around in the middle of the sorted array, which is not a problem for the LRO code, but might be a problem for the logic inside qsort(). Reviewed by: gnn @ Sponsored by: Mellanox Technologies Differential Revision: https://reviews.freebsd.org/D5239	2016-02-11 10:03:50 +00:00
Gleb Smirnoff	b4b12e52fb	Garbage collect unused arguments of m_init().	2016-02-10 18:54:18 +00:00
Bjoern A. Zeeb	a5243af262	Code duplication but rib_head is special. Not found an easy way to go back and harmonize the use cases among RIB, IPFW, PF yet but it's also not the scope of this work. Prevents instant panics on teardown and frees the FIB bits again. Sponsored by: The FreeBSD Foundation	2016-02-03 21:56:51 +00:00
Bjoern A. Zeeb	2414e86439	MfH @r295202 Expect to see panics in routing code at least now.	2016-02-03 11:49:51 +00:00
Alfred Perlstein	7325dfbb59	Increase max allowed backlog for listen sockets from short to int. PR: 203922 Submitted by: White Knight <white_knight@2ch.net> MFC After: 4 weeks	2016-02-02 05:57:59 +00:00
Gleb Smirnoff	8ec07310fa	These files were getting sys/malloc.h and vm/uma.h with header pollution via sys/mbuf.h	2016-02-01 17:41:21 +00:00
Michael Tuexen	5322a0968e	Add missing parentheses. This was reported by ccaughie via GitHub for the userland stack. MFC after: 3 days	2016-01-30 17:32:46 +00:00
Michael Tuexen	3cf729a920	Update the path mtu when turning on/off UDP encapsulation for SCTP. MFC after: 3 days	2016-01-30 16:56:39 +00:00
Michael Tuexen	ca83f93c09	Don't allow a remote encapsulation port change during the SCTP restart procedure. MFC after: 3 days	2016-01-30 12:58:38 +00:00
Michael Tuexen	4edd31fc71	Don't change the remote UDP encapsulation port for SCTP packets containing an INIT chunk. MFC after: 3 days	2016-01-30 11:10:22 +00:00
Michael Tuexen	843d04a89e	Ignore peer addresses in a consistent way also when checking for new addresses during restart. If this is not done, restart doesn't work when the local socket is IPv4 only and the peer uses IPv4 and IPv6 addresses. MFC after: 3 days.	2016-01-30 10:39:05 +00:00
Michael Tuexen	a4cab32319	Remove debug output which was committed by accident. Thanks to Oliver Pinter for reporting. MFC after: 3 days X-MFC with: r294995	2016-01-28 23:12:12 +00:00
Michael Tuexen	79b67faaf6	Always look in the TCP pool. This fixes issues with a restarting peer when the listening 1-to-1 style socket is closed. MFC after: 3 days	2016-01-28 16:05:46 +00:00
Gleb Smirnoff	4644fda3f7	Rename netinet/tcp_cc.h to netinet/cc/cc.h. Discussed with: lstewart	2016-01-27 17:59:39 +00:00
Gleb Smirnoff	af6fef3abb	Fix issues with TCP_CONGESTION handling after r294540: o Return back the buf[TCP_CA_NAME_MAX] for TCP_CONGESTION, for TCP_CCALGOOPT use dynamically allocated *pbuf. o For SOPT_SET TCP_CONGESTION do NULL terminating of string taking from userland. o For SOPT_SET TCP_CONGESTION do the search for the algorithm keeping the inpcb lock. o For SOPT_GET TCP_CONGESTION first strlcpy() the name holding the inpcb lock into temporary buffer, then copyout. Together with: lstewart	2016-01-27 07:34:00 +00:00
Gleb Smirnoff	75dd79d937	Grab a snap amount of TCP connections in syncache from tcpstat.	2016-01-27 00:48:05 +00:00
Gleb Smirnoff	57a78e3bae	Augment struct tcpstat with tcps_states[], which is used for book-keeping the amount of TCP connections by state. Provides a cheap way to get connection count without traversing the whole pcb list. Sponsored by: Netflix	2016-01-27 00:45:46 +00:00
Gleb Smirnoff	d17d4c6b2a	Provide TCPSTAT_DEC() and TCPSTAT_FETCH() macros.	2016-01-27 00:20:07 +00:00
Hiren Panchasara	0645c6049d	Persist timers TCPTV_PERSMIN and TCPTV_PERSMAX are hardcoded with 5 seconds and 60 seconds, respectively. Turn them into sysctls that can be tuned live. The default values of 5 seconds and 60 seconds have been retained. Submitted by: Jason Wolfe (j at nitrology dot com) Reviewed by: gnn, rrs, hiren, bz MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D5024	2016-01-26 16:33:38 +00:00
Alexander V. Chernikov	0d6a516eb8	Convert TCP mtu checks to the new routing KPI.	2016-01-25 10:06:49 +00:00
Alexander V. Chernikov	61eee0e202	MFP r287070,r287073: split radix implementation and route table structure. There are a number of radix consumers in kernel land (pf,ipfw,nfs,route) with different requirements. In fact, the first 3 don't have _any_ requirements and the first 2 do not use radix locking. On the other hand, the routing structure does have these requirements (rnh_gen, multipath, custom to-be-added control plane functions, different locking). Additionally, radix should not know anything about its consumers' internals. So, radix code now uses tiny 'struct radix_head' structure along with internal 'struct radix_mask_head' instead of 'struct radix_node_head'. Existing consumers still use the same 'struct radix_node_head' with slight modifications: they need to pass pointer to (embedded) 'struct radix_head' to all radix callbacks. Routing code now uses new 'struct rib_head' with different locking macro: RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing information base). New net/route_var.h header was added to hold routing subsystem internal data. 'struct rib_head' was placed there. 'struct rtentry' will also be moved there soon.	2016-01-25 06:33:15 +00:00
Bjoern A. Zeeb	70a0984741	sctp_asconf_iterator_end() has an unused second argument; compiles better if you add it. Sponsored by: The FreeBSD Foundation	2016-01-23 12:56:28 +00:00
Bjoern A. Zeeb	d30c4f99ed	Noisy comments (not sure if the static would be valid for all SCTP implementations). Reorder some cleanup just to match the general order we normally use. Sponsored by: The FreeBSD Foundation	2016-01-23 12:52:08 +00:00
Bjoern A. Zeeb	765cf0b825	Try to prevent an address (assoc) leak in one way or another when sctp_initiate_iterator() fails. Sponsored by: The FreeBSD Foundation	2016-01-23 12:51:12 +00:00
Bjoern A. Zeeb	ce1d6b0efa	Use sctp_asconf_iterator_end() rather than doing the cleanup manually. Sponsored by: The FreeBSD Foundation	2016-01-23 12:50:02 +00:00
Bjoern A. Zeeb	27a01c6c0c	Try to catch a couple of SCTP teardown race conditions. Saw all the printfs already. Note: not sure the atomics are needed but without them, the condition would never trigger, and we'd still see panics (which could have been due to the insert race). Will work my way backwards in case this stays stable. Sponsored by: The FreeBSD Foundation	2016-01-23 11:05:13 +00:00
Bjoern A. Zeeb	eef5775f02	Fix build and avoid a double-free in the VIMAGE case. Sponsored by: The FreeBSD Foundation	2016-01-22 19:43:26 +00:00
Bjoern A. Zeeb	bb84e3d77d	Correct function arguments for SYSUNINITs. Sponsored by: The FreeBSD Foundation	2016-01-22 18:39:23 +00:00
Bjoern A. Zeeb	1bbe967cc4	Correct function arguments for SYSUNINITs. Obtained from: p4 @180834 Sponsored by: The FreeBSD Foundation	2016-01-22 18:37:17 +00:00
Bjoern A. Zeeb	4ce8702050	Correct function arguments for SYSUNINITs. Add #ifdef VIMAGE, as in other cases it's dead code. Obtained from: p4 @180832 Sponsored by: The FreeBSD Foundation	2016-01-22 18:35:11 +00:00
Bjoern A. Zeeb	8bdb5261e6	Correct function arguments for SYSUNINITs. Obtained from: p4 @180885 Sponsored by: The FreeBSD Foundation	2016-01-22 18:29:02 +00:00
Bjoern A. Zeeb	9ff1c4634f	Correct function arguments for SYSUNINITs. Obtained from: p4 @180886 Sponsored by: The FreeBSD Foundation	2016-01-22 18:26:58 +00:00
Bjoern A. Zeeb	f2cf0121ca	MFp4 @180887: With pr_destroy being gone, call ip_destroy from an ordered VNET_SYSUNINIT. Make ip_destroy() static. Sponsored by: The FreeBSD Foundation	2016-01-22 18:22:03 +00:00
Bjoern A. Zeeb	009e81b164	MFH @r294567	2016-01-22 15:11:40 +00:00
Bjoern A. Zeeb	1f12da0e82	Just checkpoint the WIP in order to be able to make the tree update easier. Note: this is currently not in a usable state as certain teardown parts are not called and the DOMAIN rework is missing. More to come soon and find its way to head. Obtained from: P4 //depot/user/bz/vimage/... Sponsored by: The FreeBSD Foundation	2016-01-22 15:00:01 +00:00
Gleb Smirnoff	d519cedbad	Provide new socket option TCP_CCALGOOPT, which stands for TCP congestion control algorithm options. The argument is variable length and is opaque to TCP, forwarded directly to the algorithm's ctl_output method. Provide new includes directory netinet/cc, where algorithm specific headers can be installed. The new API doesn't yet have any in tree consumers. The original code written by lstewart. Reviewed by: rrs, emax Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D711	2016-01-22 02:07:48 +00:00
Gleb Smirnoff	73e263b182	Refactor TCP_CONGESTION setsockopt handling: - Use M_TEMP instead of stack variable. - Unroll error handling, removing several levels of indentation.	2016-01-21 22:53:12 +00:00
Gleb Smirnoff	2de3e790f5	- Rename cc.h to more meaningful tcp_cc.h. - Declare it a kernel only include, which it already is. - Don't include tcp.h implicitly from tcp_cc.h	2016-01-21 22:34:51 +00:00
Gleb Smirnoff	b66d74c138	Cleanup TCP files from unnecessary interface related includes.	2016-01-21 22:24:20 +00:00
Bjoern A. Zeeb	df56caeeb1	The variable is write once only and not used. Recover the vertical space. Sponsored by: The FreeBSD Foundation MFC After: 3 days Obtained from: p4 CH=180830 Reviewed by: gnn, hiren Differential Revision: https://reviews.freebsd.org/D4898	2016-01-21 17:25:41 +00:00
Hans Petter Selasky	e936121d31	Add optimizing LRO wrapper: - Add optimizing LRO wrapper which pre-sorts all incoming packets according to the hash type and flowid. This prevents exhaustion of the LRO entries due to too many connections at the same time. Testing using a larger number of higher bandwidth TCP connections showed that the incoming ACK packet aggregation rate increased from ~1.3:1 to almost 3:1. Another test showed that for a number of TCP connections greater than 16 per hardware receive ring, where 8 TCP connections was the LRO active entry limit, there was a significant improvement in throughput due to being able to fully aggregate more than 8 TCP streams. For very few very high bandwidth TCP streams, the optimizing LRO wrapper will add CPU usage instead of reducing CPU usage. This is expected. Network drivers which want to use the optimizing LRO wrapper need to call "tcp_lro_queue_mbuf()" instead of "tcp_lro_rx()" and "tcp_lro_flush_all()" instead of "tcp_lro_flush()". Further the LRO control structure must be initialized using "tcp_lro_init_args()" passing a non-zero number into the "lro_mbufs" argument. - Make LRO statistics 64-bit. Previously 32-bit integers were used for statistics which can be prone to wrap-around. Fix this while at it and update all SYSCTL's which expose LRO statistics. - Ensure all data is freed when destroying an LRO control structure, especially leftover LRO entries. - Reduce number of memory allocations needed when setting up an LRO control structure by precomputing the total amount of memory needed. - Add own memory allocation counter for LRO. - Bump the FreeBSD version to force recompilation of all KLDs due to change of the LRO control structure size. Sponsored by: Mellanox Technologies Reviewed by: gallatin, sbruno, rrs, gnn, transport Tested by: Netflix Differential Revision: https://reviews.freebsd.org/D4914	2016-01-19 15:33:28 +00:00
Michael Tuexen	c7e732ae61	Fix a bug in INIT handling on accepted 1-to-1 style sockets when the listener is closed. This fix allows the following packetdrill test to pass: // Setup a connected, blocking 1-to-1 style socket +0.0 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3 // Check the handshake with an empty(!) cookie +0.0 bind(3, ..., ...) = 0 +0.0 listen(3, 1) = 0 +0.0 < sctp: INIT[flgs=0, tag=1, a_rwnd=1500, os=1, is=1, tsn=1] +0.0 > sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=..., os=..., is=..., tsn=1, ...] +0.0 < sctp: COOKIE_ECHO[flgs=0, len=..., val=...] +0.0 > sctp: COOKIE_ACK[flgs=0] +0.0 accept(3, ..., ...) = 4 +0.0 close(3) = 0 // Inject an INIT chunk and expect an INIT-ACK +0.0 < sctp: INIT[flgs=0, tag=3, a_rwnd=1500, os=1, is=1, tsn=1] +0.0 > sctp: INIT_ACK[flgs=0, tag=..., a_rwnd=..., os=..., is=..., tsn=..., ...] MFC after: 3 days	2016-01-15 00:26:15 +00:00
Michael Tuexen	ebee3dc229	Fail the SCTP_GET_ASSOC_NUMBER and SCTP_GET_ASSOC_ID_LIST socket options for 1-to-1 style sockets as specified in RFC 6458. MFC after: 3 days	2016-01-14 11:25:28 +00:00
Gleb Smirnoff	f73d9fd2f1	There is a bug in tcp_output()'s implementation of the TCP_SIGNATURE (RFC 2385/TCP-MD5) kernel option. If a tcpcb has TF_NOOPT flag, then tcp_addoptions() is not called, and to.to_signature is an uninitialized stack variable. The value is later used as write offset, which leads to writing to random address. Submitted by: rstone, jtl Security: SA-16:05.tcp	2016-01-14 10:22:45 +00:00
Alexander V. Chernikov	10e0e23528	Remove now-unused wrappers for various routing functions.	2016-01-14 08:54:44 +00:00
Michael Tuexen	fa89f69240	Store the timer type for logging, because the timer can be freed during processing the timeout. MFC after: 3 days	2016-01-13 14:28:12 +00:00
Alexander V. Chernikov	59747033cd	Bring RADIX_MPATH support to new routing KPI to ease migration. Move actual rte selection process from rtalloc_mpath_fib() to the rt_path_selectrte() function. Add public rt_mpath_select() to use in fibX_lookup_ functions.	2016-01-11 08:45:28 +00:00
Alexander V. Chernikov	36402a681f	Finish r275196: do not dereference rtentry in if_output() routines. The only piece of information that is required is rt_flags subset. In particular, if_loop() requires RTF_REJECT and RTF_BLACKHOLE flags to check if this particular mbuf needs to be dropped (and what error should be returned). Note that if_loop() will always return EHOSTUNREACH for "reject" routes regardless of RTF_HOST flag existence. This is due to upcoming routing changes where RTF_HOST value won't be available as lookup result. All other functions require RTF_GATEWAY flag to check if they need to return EHOSTUNREACH instead of EHOSTDOWN error. There are 11 places where non-zero 'struct route' is passed to if_output(). Most of the callers (forwarding, bpf, arp) do not care about the exact error value. In fact, the only place where this result is propagated is ip_output(). (ip6_output() passes NULL route to nd6_output_ifp()). Given that, add 3 new 'struct route' flags (RT_REJECT, RT_BLACKHOLE and RT_IS_GW) and inline function (rt_update_ro_flags()) to copy necessary rte flags to ro_flags. Call this function in ip_output() after looking up/ verifying rte. Reviewed by: ae	2016-01-09 16:34:37 +00:00
Alexander V. Chernikov	ea8d14925c	Remove sys/eventhandler.h from net/route.h Reviewed by: ae	2016-01-09 09:34:39 +00:00
Alexander V. Chernikov	f2b2e77a41	(Temporarily) remove route_redirect_event eventhandler. Such handler should pass different set of variables, instead of directly providing 2 locked route entries. Given that it hasn't been really used since at least 2012, remove current code. Will re-add it after finishing most major routing-related changes. Discussed with: np	2016-01-09 06:26:40 +00:00
Jonathan T. Looney	49b375e74b	Apply the changes from r293284 to one additional file. Discussed with: glebius	2016-01-07 11:54:20 +00:00
Gleb Smirnoff	0c39d38d21	Historically we have two fields in tcpcb to describe sender MSS: t_maxopd, and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned up after T/TCP removal. After all permutations over the years the result is that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if timestamps are in action, or is equal to t_maxopd otherwise. That's a very rough estimate of MSS reduced by options length. Throughout the code it was used in places where precision was not important, like cwnd or ssthresh calculations. With this change: - t_maxopd goes away. - t_maxseg now stores MSS not adjusted by options. - new function tcp_maxseg() is provided, that calculates MSS reduced by options length. The function gives a better estimate, since it takes into account SACK state as well. Reviewed by: jtl Differential Revision: https://reviews.freebsd.org/D3593	2016-01-07 00:14:42 +00:00
Michael Tuexen	79cadff48d	Get struct sctp_net_route in sync with struct route again.	2016-01-04 20:34:40 +00:00
Alexander V. Chernikov	45a8de88c6	Maintain consistent behavior: make fib4_lookup_nh_ext() return rt_ifp pointer by default, as done by other fib lookup functions.	2016-01-04 17:23:10 +00:00
Alexander V. Chernikov	9a1b64d5a0	Add rib_lookup_info() to provide API for retrieving individual route entries data in unified format. There are control plane functions that require information other than just next-hop data (e.g. individual rtentry fields like flags or prefix/mask). Given that the goal is to avoid rte reference/refcounting, re-use rt_addrinfo structure to store most rte fields. If caller wants to retrieve key/mask or gateway (which are sockaddrs and are allocated separately), it needs to provide sufficient-sized sockaddrs structures w/ their pointers saved in passed rt_addrinfo. Convert: * lltable new records checks (in_lltable_rtcheck(), nd6_is_new_addr_neighbor()). * rtsock pre-add/change route check. * IPv6 NS ND-proxy check (RADIX_MPATH code was eliminated because 1) we don't support RTF_ANNOUNCE ND-proxy for networks and there should not be multiple host routes for such hosts 2) if we have multiple routes we should inspect them (which is not done). 3) the entire idea of abusing KRT as storage for ND proxy seems odd. Userland programs should be used for that purpose).	2016-01-04 15:03:20 +00:00
Alexander V. Chernikov	65d2872948	Fix fib4_lookup_nh_ext() flags/flowid order messed up while merging.	2016-01-03 16:13:03 +00:00
Alexander V. Chernikov	6cdb18544d	Remove second EVENTHANDLER_REGISTER slipped in r292978. Describe the reason of doing unconditional M_PREPEND in ether_output().	2016-01-01 10:15:06 +00:00
Alexander V. Chernikov	4fb3a8208c	Implement interface link header precomputation API. Add if_requestencap() interface method which is capable of calculating various link headers for given interface. Right now there is support for INET/INET6/ARP llheader calculation (IFENCAP_LL type request). Other types are planned to support more complex calculation (L2 multipath lagg nexthops, tunnel encap nexthops, etc..). Reshape 'struct route' to be able to pass additional data (with its length) to prepend to mbuf. These two changes permit routing code to pass pre-calculated nexthop data (like L2 header for route w/gateway) down to the stack eliminating the need for other lookups. It also brings us closer to more complex scenarios like transparently handling MPLS nexthops and tunnel interfaces. Last, but not least, it removes layering violation introduced by flowtable code (ro_lle) and simplifies handling of existing if_output consumers. ARP/ND changes: Make arp/ndp stack pre-calculate link header upon installing/updating lle record. Interface link address changes are handled by re-calculating headers for all lles based on if_lladdr event. After these changes, arpresolve()/nd6_resolve() returns full pre-calculated header for supported interfaces thus simplifying if_output(). Move these lookups to separate ether_resolve_addr() function which either returns an error or a fully-prepared link header. Add <arp|nd6_>resolve_addr() compat versions to return link addresses instead of pre-calculated data. BPF changes: Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT. Despite the naming, both of them have their header "complete". The only difference is that interface source mac has to be filled by OS for AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside BPF and not pollute if_output() routines. Convert BPF to pass prepend data via new 'struct route' mechanism.
Note that it does not change non-optimized if_output(): ro_prepend handling is purely optional. Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI. It is not needed for ethernet anymore. The only remaining FDDI user is dev/pdq mostly untouched since 2007. FDDI support was eliminated from OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65). Flowtable changes: Flowtable violates layering by saving (and not correctly managing) rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated header data from that lle. Differential Revision: https://reviews.freebsd.org/D4102	2015-12-31 05:03:27 +00:00
Jonathan T. Looney	2d8868dbb7	When checking the inp_ip_minttl restriction for IPv6 packets, don't check the IPv4 header. CID: 1017920 Differential Revision: https://reviews.freebsd.org/D4727 Reviewed by: bz MFC after: 2 weeks Sponsored by: Juniper Networks	2015-12-29 19:20:39 +00:00
Allan Jude	7a3f5d11fb	Replace sys/crypto/sha2/sha2.c with lib/libmd/sha512c.c cperciva's libmd implementation is 5-30% faster The same was done for SHA256 previously in r263218 cperciva's implementation was lacking SHA-384 which I implemented, validated against OpenSSL and the NIST documentation Extend sbin/md5 to create sha384(1) Chase dependencies on sys/crypto/sha2/sha2.{c,h} and replace them with sha512{c.c,.h} Reviewed by: cperciva, des, delphij Approved by: secteam, bapt (mentor) MFC after: 2 weeks Sponsored by: ScaleEngine Inc. Differential Revision: https://reviews.freebsd.org/D3929	2015-12-27 17:33:59 +00:00
Michael Tuexen	1672adc7b1	Don't implicitly terminate a user message when moving it to the send_queue and the socket is closed. This results in strange race conditions for the application. While there, remove a stray character. MFC after: 3 days	2015-12-25 18:11:40 +00:00
Kevin Lo	ddb1359877	Fix typo (s/harware/hardware/)	2015-12-25 14:51:36 +00:00
Patrick Kelsey	281a0fd4f9	Implementation of server-side TCP Fast Open (TFO) [RFC7413]. TFO is disabled by default in the kernel build. See the top comment in sys/netinet/tcp_fastopen.c for implementation particulars. Reviewed by: gnn, jch, stas MFC after: 3 days Sponsored by: Verisign, Inc. Differential Revision: https://reviews.freebsd.org/D4350	2015-12-24 19:09:48 +00:00
Sergey Kandaurov	e62b9bca9a	Fixed comment placement. Before r12296, this comment described the udp_recvspace default value. Spotted by: ru Sponsored by: Nginx, Inc.	2015-12-24 13:57:43 +00:00
Bjoern A. Zeeb	616bc4f476	If bootverbose is enabled every vnet startup and virtual interface creation will print extra lines on the console. We are generally not interested in this (repeated) information for each VNET. Thus only print it for the default VNET. Virtual interfaces on the base system will remain printing information, but e.g. each loopback in each vnet will no longer cause a "bpf attached" line. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D4531	2015-12-22 15:00:04 +00:00
Bjoern A. Zeeb	0a03cf8ca6	Since r256624 we've been leaking routing table allocations on vnet enabled jail shutdown. Call the provided cleanup routines for IP versions 4 and 6 to plug these leaks. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D4530	2015-12-22 14:53:19 +00:00
Jonathan T. Looney	c54a41def7	Fix a panic when launching VNETs after the commit of r292309. Differential Revision: https://reviews.freebsd.org/D4645 Reviewed by: rrs Reported by: kp Tested by: kp Sponsored by: Juniper Networks	2015-12-22 13:41:50 +00:00
Michael Tuexen	fe4a59b30a	Stop processing of a SACK when the association has been aborted. MFC after: 3 days	2015-12-21 18:52:02 +00:00
Steven Hartland	d6e82913c1	Revert r292275 & r292379. glebius has concerns about these changes, so revert them until those concerns can be discussed and addressed. Sponsored by: Multiplay	2015-12-17 14:41:30 +00:00
Mark Johnston	3616095801	Fix style issues around existing SDT probes. - Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect at the moment, but will be needed for some future changes. - Don't hardcode the module component of the probe identifier. This is set automatically by the SDT framework. MFC after: 1 week	2015-12-16 23:39:27 +00:00
Steven Hartland	3a909afe8e	Fix issues introduced by r292275 * Fix panic for etherswitches which don't have a LLADDR. * Disabled DELAY in unsolicited NDA, which needs further work. * Fixed missing DELAY in carp_send_na. * style(9) fix. Reported by: kp & melifaro X-MFC-With: r292275 MFC after: 1 month Sponsored by: Multiplay	2015-12-16 22:26:28 +00:00
Randall Stewart	f4e476c893	Remove redundant extern's that make the ppc compile fail. Thanks Ed Maste for the heads up.	2015-12-16 15:16:44 +00:00
Alexander V. Chernikov	942e4b4b79	Fix ARP reply handling changed in r286955. If source of ARP request didn't pass the routing check (e.g. not in directly connected network), be polite and still answer the request instead of dropping frame. Reported by: quadro at irc@rusnet	2015-12-16 09:16:06 +00:00
Randall Stewart	55bceb1e2b	First cut of the modularization of our TCP stack. Still to do is to clean up the timer handling using the async-drain. Other optimizations may be coming to go with this. What's here will allow different tcp implementations (one included). Reviewed by: jtl, hiren, transports Sponsored by: Netflix Inc. Differential Revision: D4055	2015-12-16 00:56:45 +00:00
Steven Hartland	52e53e2de0	Fix lagg failover due to missing notifications When using lagg failover mode neither Gratuitous ARP (IPv4) nor Unsolicited Neighbour Advertisements (IPv6) are sent to notify other nodes that the address may have moved. This results in slow failover, dropped packets and network outages for the lagg interface when the primary link goes down. We now use the new if_link_state_change_cond with the force param set to allow lagg to force through link state changes and hence fire an ifnet_link_event, which is now monitored by rip and nd6. Upon receiving these events each protocol triggers the relevant notifications: * inet4 => Gratuitous ARP * inet6 => Unsolicited Neighbour Announce This also fixes the carp IPv6 NA's that stopped working after r251584 which added the ipv6_route__llma route. The new behaviour can be controlled using the sysctls: * net.link.ether.inet.arp_on_link * net.inet6.icmp6.nd6_on_link Also removed unused param from lagg_port_state and added descriptions for the sysctls while here. PR: 156226 MFC after: 1 month Sponsored by: Multiplay Differential Revision: https://reviews.freebsd.org/D4111	2015-12-15 16:02:11 +00:00
Hiren Panchasara	4d16338223	Clean up unused bandwidth entry in the TCP hostcache. Submitted by: Jason Wolfe (j at nitrology dot com) Reviewed by: rrs, hiren Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D4154	2015-12-11 06:22:58 +00:00
Michael Tuexen	9ee7a93696	Retire sctp_validate_no_locks(). This routine checks that there are no locks held for an inp, without having any lock on the inp. This breaks if the inp goes away when it is called. This happens on stress tests on a RPi B+. MFC after: 3 days	2015-12-10 11:49:32 +00:00
Hiren Panchasara	b87170f210	r290122 added 4 bytes and removed 8 in struct sackhint. Add a pad entry of 4 bytes to restore the size. Spotted by: rrs Reviewed by: rrs X-MFC with: r290122 Sponsored by: Limelight Networks	2015-12-10 03:20:10 +00:00
Alexander V. Chernikov	9977be4a64	Make in_arpinput(), inp_lookup_mcast_ifp(), icmp_reflect(), ip_dooptions(), icmp6_redirect_input(), in6_lltable_rtcheck(), in6p_lookup_mcast_ifp() and in6_selecthlim() use new routing api. Eliminate now-unused ip_rtaddr(). Fix lookup key fib6_lookup_nh_basic() which was lost during merge. Make fib6_lookup_nh_basic() and fib6_lookup_nh_extended() always return IPv6 destination address with embedded scope. Currently rw_gateway has its scope embedded, do the same for non-gatewayed destinations. Sponsored by: Yandex LLC	2015-12-09 11:14:27 +00:00
Hiren Panchasara	a934d06194	Add an option to use rfc6675 based pipe/inflight bytes calculation in newreno. MFC after: 3 weeks Sponsored by: Limelight Networks	2015-12-09 08:53:41 +00:00
Hiren Panchasara	f81bc34eac	Add an option to use rfc6675 based pipe/inflight bytes calculation in cubic. Reviewed by: gnn MFC after: 3 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D4205	2015-12-09 07:56:40 +00:00
Hiren Panchasara	021eaf7996	One of the ways to detect loss is to count duplicate acks coming back from the other end till it reaches predetermined threshold which is 3 for us right now. Once that happens, we trigger fast-retransmit to do loss recovery. Main problem with the current implementation is that we don't honor SACK information well to detect whether an incoming ack is a dupack or not. RFC6675 has latest recommendations for that. According to it, dupack is a segment that arrives carrying a SACK block that identifies previously unknown information between snd_una and snd_max even if it carries new data, changes the advertised window, or moves the cumulative acknowledgment point. With the prevalence of Selective ACK (SACK) these days, improper handling can lead to delayed loss recovery. With the fix, new behavior looks like following: 0) th_ack < snd_una --> ignore Old acks are ignored. 1) th_ack == snd_una, !sack_changed --> ignore Acks with SACK enabled but without any new SACK info in them are ignored. 2) th_ack == snd_una, window == old_window --> increment Increment on a good dupack. 3) th_ack == snd_una, window != old_window, sack_changed --> increment When SACK enabled, it's okay to have advertized window changed if the ack has new SACK info. 4) th_ack > snd_una --> reset to 0 Reset to 0 when left edge moves. 5) th_ack > snd_una, sack_changed --> increment Increment if left edge moves but there is new SACK info. Here, sack_changed is the indicator that incoming ack has previously unknown SACK info in it. Note: This fix is not fully compliant to RFC6675. That may require a few changes to current implementation in order to keep per-sackhole dupack counter and change to the way we mark/handle sack holes. PR: 203663 Reviewed by: jtl MFC after: 3 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D4225	2015-12-08 21:21:48 +00:00
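The dupack counting cases 0) to 5) listed in 021eaf7996 can be sketched as a small pure function. This is an illustration under stated assumptions, not the kernel code: `struct ack_info`, `seq_lt()` and `dupack_update()` are hypothetical names, and the `sack_enabled` flag is an illustrative addition used to decide between cases 1) and 2), which the message does not spell out. It also assumes that case 5) means "reset, then count this ack", so the counter becomes 1, and that a window change without SACK leaves the counter alone.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-ACK inputs. Field names follow the commit message;
 * sack_enabled is an illustrative addition, not a kernel field.
 */
struct ack_info {
	uint32_t th_ack;	/* cumulative ack carried by the segment */
	uint32_t snd_una;	/* oldest unacknowledged sequence number */
	uint32_t window;	/* window advertised by this segment */
	uint32_t old_window;	/* window advertised previously */
	bool sack_enabled;	/* SACK negotiated on this connection */
	bool sack_changed;	/* segment carried previously unknown SACK info */
};

/* Wrap-safe "a is before b" test on 32-bit sequence numbers. */
static bool
seq_lt(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) < 0);
}

/* Return the new duplicate-ACK count after applying cases 0) to 5). */
int
dupack_update(int dupacks, const struct ack_info *a)
{
	if (seq_lt(a->th_ack, a->snd_una))
		return (dupacks);		/* 0) old ack: ignore */

	if (a->th_ack == a->snd_una) {
		if (a->sack_enabled)		/* 1) ignore, 3) increment */
			return (a->sack_changed ? dupacks + 1 : dupacks);
		if (a->window == a->old_window)
			return (dupacks + 1);	/* 2) good dupack */
		return (dupacks);		/* assumption: window update only */
	}

	/* th_ack > snd_una: 4) reset, 5) then count new SACK info. */
	return ((a->sack_enabled && a->sack_changed) ? 1 : 0);
}
```

A caller would compare the returned count against the duplicate-ack threshold (3, per the message) to decide when to start fast retransmit.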
Alexander V. Chernikov	65ff3638df	Merge helper fib* functions used for basic lookups. Vast majority of rtalloc(9) users require only basic info from route table (e.g. "does the rtentry interface match with the interface I have?", "what is the MTU?", "Give me the IPv4 source address to use", etc..). Instead of hand-rolling lookups, checking if rtentry is up, valid, dealing with IPv6 mtu, finding "address" ifp (almost never done right), provide easy-to-use API hiding all the complexity and returning the needed info into small on-stack structure. This change also helps hiding route subsystem internals (locking, direct rtentry accesses). Additionally, using this API improves lookup performance since rtentry is not locked. (This is safe, since all the rtentry changes happen under both radix WLOCK and rtentry WLOCK). Sponsored by: Yandex LLC	2015-12-08 10:50:03 +00:00
Michael Tuexen	c979034b18	Fix the allocation of outgoing streams: * When processing a cookie, use the number of streams announced in the INIT-ACK. * When sending an INIT-ACK for an existing association, use the value from the association, not from the end-point. MFC after: 1 week	2015-12-06 16:17:57 +00:00
Alexander V. Chernikov	f8aee88f0b	Remove LLE read lock from IPv4 fast path. LLE structure is mostly unchanged during its lifecycle. To be more specific, there are 2 things relevant for fast path lookup code: 1) link-level address change. Since r286722, these updates are performed under AFDATA WLOCK. 2) Some sort of feedback indicating that this particular entry is used so we re-send arp request to perform reachability verification instead of expiring entry. The only signal that is needed from fast path is something like binary yes/no. The latter is solved by the following changes: 1) introduce special r_skip_req field which is read lockless by fast path, but updated under (new) req_mutex mutex. If this field is non-zero, then fast path will acquire lock and set it back to 0. 2) introduce simple state machine: incomplete->reachable<->verify->deleted. Before that we implicitly had incomplete->reachable->deleted state machine, with V_arpt_keep between "reachable" and "deleted". Verification was performed in runtime 5 seconds before V_arpt_keep expires. This is changed to "change state to verify 5 seconds before V_arpt_keep, set r_skip_req to non-zero value and check it every second". If the value is zero - then send arp verification probe. These changes do not introduce any significant control plane overhead: typically lle callout timer would fire 1 time more each V_arpt_keep (1200s) for used lles and up to arp_maxtries (5) for dead lles. As a result, all packets towards "reachable" lle are handled by fast path without acquiring lle read lock. Additional "req_mutex" is needed because callout / arpresolve_slow() or eventhandler might keep LLE lock for significant amount of time, which might not be feasible for fast path locking (e.g. having rmlock as ether AFDATA or lltable own lock). Differential Revision: https://reviews.freebsd.org/D3688	2015-12-05 09:50:37 +00:00
Michael Tuexen	a4889f2dd0	Fix a bug where a stream reset request wasn't retransmitted when the peer indicated "In progress". MFC after: 1 week	2015-12-04 08:49:27 +00:00
Michael Tuexen	d96bef9c77	Ensure that outgoing streams get reset when they run dry. MFC after: 1 week	2015-12-03 15:19:29 +00:00
Michael Tuexen	4821b41e21	Minor cleanup. No functional change. MFC after: 1 week	2015-12-02 22:44:42 +00:00
Michael Tuexen	60862d8e48	Adjust the MTU when accepting an SCTP association using UDP encapsulation. MFC after: 1 week	2015-12-02 16:29:36 +00:00
Andrey V. Elsukov	b4e63e2d15	In the same way fix the problem described in r291578 for IGMPv3. In case when router has a lot of multicast groups, the reply can take several packets due to MTU limitation. Also we have a limit IGMP_MAX_RESPONSE_BURST == 4, that limits the number of packets we send in one shot. Then we recalculate the timer value and schedule the remaining packets for sending. The problem is that when we call igmp_v3_dispatch_general_query() to send remaining packets, we queue new reply in the same mbuf queue. And when number of packets is bigger than IGMP_MAX_RESPONSE_BURST, we get endless reply of IGMPv3 reports. To fix this, add the check for remaining packets in the queue. MFC after: 1 week Sponsored by: Yandex LLC	2015-12-01 11:24:30 +00:00
Alexander V. Chernikov	c00c4e46e3	Remove in_setifarnh definition.	2015-11-30 06:02:35 +00:00
Alexander V. Chernikov	e8b0643eee	Add new rt_foreach_fib_walk_del() function for deleting route entries by filter function instead of picking into routing table details in each consumer. Remove now-unused rt_expunge() (eliminating last external RTF_RNH_LOCKED user). This simplifies future nexthops/multipath changes and rtrequest1_fib() locking refactoring. Actual changes: Add "rt_chain" field to permit rte grouping while doing batched delete from routing table (thus growing rte 200->208 on amd64). Add "rti_filter" / "rti_filterdata" / "rti_spare" fields to rt_addrinfo to pass filter function to various routing subsystems in standard way. Convert all rt_expunge() customers to new rt_addrinfo-based api and eliminate rt_expunge().	2015-11-30 05:51:14 +00:00
Michael Tuexen	c6d2bd4812	Take also the send queue and sent queue into account when triggering the sending of outgoing stream reset requests. MFC after: 3 days	2015-11-27 22:11:46 +00:00
Michael Tuexen	f0067f2251	When the sending of an SCTP outgoing stream reset request fails, don't report it to the user since all streams have been marked as pending. MFC after: 1 week	2015-11-26 23:12:41 +00:00
Michael Tuexen	52f175be70	When receiving an SCTP/UDP packet and the interface performed the UDP checksum computation and signals that it was OK, clear this bit when passing the packet to SCTP. Since the bits indicating a valid UDP checksum and a valid SCTP checksum are the same, the SCTP stack would assume that also an SCTP checksum check has been performed. MFC after: 1 week	2015-11-26 09:25:20 +00:00
Fabien Thomas	edd0e0b098	The r241129 description was wrong that the scenario is possible only for read locks on pcbs. The same race can happen with write lock semantics as well. The race scenario: - Two threads (1 and 2) locate pcb with writer semantics (INPLOOKUP_WLOCKPCB) and do in_pcbref() on it. - 1 and 2 both drop the inp hash lock. - Another thread (3) grabs the inp hash lock. Then it runs in_pcbfree(), which wlocks the pcb. This must happen before 1 or 2 reach INP_WLOCK()! - 1 and 2 congest in INP_WLOCK(). - 3 does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which doesn't free the pcb due to two references on it. Then it unlocks the pcb. - 1 (or 2) gets wlock on the pcb, runs in_pcbrele_wlocked(), which doesn't report inp as freed, due to 2 (or 1) still holding extra reference on it. The thread tries to do something with a disconnected pcb and crashes. Submitted by: emeric.poupon@stormshield.eu Reviewed by: gleb@ MFC after: 1 week Sponsored by: Stormshield Tested by: Cassiano Peixoto, Stormshield	2015-11-25 14:45:43 +00:00


5689 commits