It looks like as with the safety belt of DELAY() fastened (*) we can
completely tear down and free all memory for TCP (after r281599).
(*) in theory a few ticks should be good enough to make sure the timers
are all really gone. Could we use a better matric here and check a
tcbcb count as an optimization?
PR: 164763
Reviewed by: gnn, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5734
The tcp_inpcb (pcbinfo) zone should be safe to destroy.
PR: 164763
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5732
We attach the "counter" to the tcpcbs. Thus don't free the
TCP Fastopen zone before the tcpcbs are gone, as otherwise
the zone won't be empty.
With that it should be safe to destroy the "tfo" zone without
leaking the memory.
PR: 164763
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5731
While there is no dependency interaction, stopping the timer before
freeing the rest of the resources seems more natural and avoids it
being scheduled an extra time when it is no longer needed.
Reviewed by: gnn, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5733
No need to keep type stability on raw sockets zone.
We've also been running with a KASSERT since r222488 to make sure the
ipi_count is 0 on destroy.
PR: 164763
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5735
adds the new I-Data (Interleaved Data) message. This allows a user
to be able to have complete freedom from Head Of Line blocking that
was previously there due to the in-ability to send multiple large
messages without the TSN's being in sequence. The code as been
tested with Michaels various packet drill scripts as well as
inter-networking between the IETF's location in Argentina and Germany.
This is kinda critical to the performance when the CPU is slow and
network bandwidth is high, e.g. in the hypervisor.
Reviewed by: rrs, gallatin, Dexuan Cui <decui microsoft com>
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5765
And factor out tcp_lro_rx_done, which deduplicates the same logic with
netinet/tcp_lro.c
Reviewed by: gallatin (1st version), hps, zbb, np, Dexuan Cui <decui microsoft com>
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5725
So that callers could react accordingly.
Reviewed by: gallatin (no objection)
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5695
- properly V_irtualise variable access unbreaking VIMAGE kernels.
- remove the volatile from the function return type to make architecture
using gcc happy [-Wreturn-type]
"type qualifiers ignored on function return type"
I am not entirely happy with this solution putting the u_int there
but it will do for now.
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.
Submitted by: Mike Karels
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4306
Furthermore, there is no reason this needs to be a 64-bit integer
for the forseeable future.
Also, there is an inconsistency between to_flags and the mask in
tcp_addoptions(). Before r195654, to_flags was a u_long and the mask in
tcp_addoptions() was a u_int. r195654 changed to_flags to be a u_int64_t
but left the mask in tcp_addoptions() as a u_int, meaning that these
variables will only be the same width on platforms with 64-bit integers.
Convert both to_flags and the mask in tcp_addoptions() to be explicitly
32-bit variables. This may save a few cycles on 32-bit platforms, and
avoids unnecessarily mixing types.
Differential Revision: https://reviews.freebsd.org/D5584
Reviewed by: hiren
MFC after: 2 weeks
Sponsored by: Juniper Networks
struct tcpstat, because the structure can be zeroed out by netstat(1) -z,
and of course running connection counts shouldn't be touched.
Place running connection counts into separate array, and provide
separate read-only sysctl oid for it.
stack is not compliant with RFC 7323, which requires that TCP stacks send
a timestamp option on all packets (except, optionally, RSTs) after the
session is established.
This patch adds that support. It also adds a TCP signature option to the
packet, if appropriate.
PR: 206047
Differential Revision: https://reviews.freebsd.org/D4808
Reviewed by: hiren
MFC after: 2 weeks
Sponsored by: Juniper Networks
- Reorder variables by size
- Move initializer closer to where it is used
- Remove unneeded variable
Differential Revision: https://reviews.freebsd.org/D4808
Reviewed by: hiren
MFC after: 2 weeks
Sponsored by: Juniper Networks
for output and drop; connect didn't always fire a user probe
some probes were missing in fastpath
Submitted by: Hannes Mehnert
Sponsored by: REMS, EPSRC
Differential Revision: https://reviews.freebsd.org/D5525
included in loader.conf. It also fixes it so that no matter if some one incorrectly
specifies a load order, the lists and such will be initialized on demand at that
time so no one can make that mistake.
Reviewed by: hiren
Differential Revision: D5189
ACK aggregation limit is append count based, while the TCP data segment
aggregation limit is length based. Unless the network driver sets these
two limits, it's an NO-OP.
Reviewed by: adrian, gallatin (previous version), hselasky (previous version)
Approved by: adrian (mentor)
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5185
Fix a panic that occurs when a vnet interface is unavailable at the time the
vnet jail referencing said interface is stopped.
Sponsored by: FIS Global, Inc.
the sign bit doesn't cause an overflow. The overflow manifests itself
as a sorting index wrap around in the middle of the sorted array,
which is not a problem for the LRO code, but might be a problem for
the logic inside qsort().
Reviewed by: gnn @
Sponsored by: Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D5239
back and harmize the use cases among RIB, IPFW, PF yet but it's also not
the scope of this work. Prevents instant panics on teardown and frees
the FIB bits again.
Sponsored by: The FreeBSD Foundation
new addresses during restart. If this is not done, restart doesn't
work when the local socket is IPv4 only and the peer uses
IPv4 and IPv6 addresses.
MFC after: 3 days.
o Return back the buf[TCP_CA_NAME_MAX] for TCP_CONGESTION,
for TCP_CCALGOOPT use dynamically allocated *pbuf.
o For SOPT_SET TCP_CONGESTION do NULL terminating of string
taking from userland.
o For SOPT_SET TCP_CONGESTION do the search for the algorithm
keeping the inpcb lock.
o For SOPT_GET TCP_CONGESTION first strlcpy() the name
holding the inpcb lock into temporary buffer, then copyout.
Together with: lstewart
60 seconds, respectively. Turn them into sysctls that can be tuned live. The
default values of 5 seconds and 60 seconds have been retained.
Submitted by: Jason Wolfe (j at nitrology dot com)
Reviewed by: gnn, rrs, hiren, bz
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D5024
There are number of radix consumers in kernel land (pf,ipfw,nfs,route)
with different requirements. In fact, first 3 don't have _any_ requirements
and first 2 does not use radix locking. On the other hand, routing
structure do have these requirements (rnh_gen, multipath, custom
to-be-added control plane functions, different locking).
Additionally, radix should not known anything about its consumers internals.
So, radix code now uses tiny 'struct radix_head' structure along with
internal 'struct radix_mask_head' instead of 'struct radix_node_head'.
Existing consumers still uses the same 'struct radix_node_head' with
slight modifications: they need to pass pointer to (embedded)
'struct radix_head' to all radix callbacks.
Routing code now uses new 'struct rib_head' with different locking macro:
RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing
information base).
New net/route_var.h header was added to hold routing subsystem internal
data. 'struct rib_head' was placed there. 'struct rtentry' will also
be moved there soon.
Saw all the printfs already.
Note: not sure the atomics are needed but without them, the condition
would never trigger, and we'd still see panics (which could have been
due to the insert race). Will work my way backwards in case this stays
stable.
Sponsored by: The FreeBSD Foundation
easier. Note: this is currently not in a usable state as certain
teardown parts are not called and the DOMAIN rework is missing.
More to come soon and find its way to head.
Obtained from: P4 //depot/user/bz/vimage/...
Sponsored by: The FreeBSD Foundation
control algorithm options. The argument is variable length and is opaque
to TCP, forwarded directly to the algorithm's ctl_output method.
Provide new includes directory netinet/cc, where algorithm specific
headers can be installed.
The new API doesn't yet have any in tree consumers.
The original code written by lstewart.
Reviewed by: rrs, emax
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D711
Recover the vertical space.
Sponsored by: The FreeBSD Foundation
MFC After: 3 days
Obtained from: p4 CH=180830
Reviewed by: gnn, hiren
Differential Revision: https://reviews.freebsd.org/D4898
- Add optimizing LRO wrapper which pre-sorts all incoming packets
according to the hash type and flowid. This prevents exhaustion of
the LRO entries due to too many connections at the same time.
Testing using a larger number of higher bandwidth TCP connections
showed that the incoming ACK packet aggregation rate increased from
~1.3:1 to almost 3:1. Another test showed that for a number of TCP
connections greater than 16 per hardware receive ring, where 8 TCP
connections was the LRO active entry limit, there was a significant
improvement in throughput due to being able to fully aggregate more
than 8 TCP stream. For very few very high bandwidth TCP streams, the
optimizing LRO wrapper will add CPU usage instead of reducing CPU
usage. This is expected. Network drivers which want to use the
optimizing LRO wrapper needs to call "tcp_lro_queue_mbuf()" instead
of "tcp_lro_rx()" and "tcp_lro_flush_all()" instead of
"tcp_lro_flush()". Further the LRO control structure must be
initialized using "tcp_lro_init_args()" passing a non-zero number
into the "lro_mbufs" argument.
- Make LRO statistics 64-bit. Previously 32-bit integers were used for
statistics which can be prone to wrap-around. Fix this while at it
and update all SYSCTL's which expose LRO statistics.
- Ensure all data is freed when destroying a LRO control structures,
especially leftover LRO entries.
- Reduce number of memory allocations needed when setting up a LRO
control structure by precomputing the total amount of memory needed.
- Add own memory allocation counter for LRO.
- Bump the FreeBSD version to force recompilation of all KLDs due to
change of the LRO control structure size.
Sponsored by: Mellanox Technologies
Reviewed by: gallatin, sbruno, rrs, gnn, transport
Tested by: Netflix
Differential Revision: https://reviews.freebsd.org/D4914
(RFC 2385/TCP-MD5) kernel option.
If a tcpcb has TF_NOOPT flag, then tcp_addoptions() is not called,
and to.to_signature is an uninitialized stack variable. The value
is later used as write offset, which leads to writing to random
address.
Submitted by: rstone, jtl
Security: SA-16:05.tcp
Move actual rte selection process from rtalloc_mpath_fib()
to the rt_path_selectrte() function. Add public
rt_mpath_select() to use in fibX_lookup_ functions.
The only piece of information that is required is rt_flags subset.
In particular, if_loop() requires RTF_REJECT and RTF_BLACKHOLE flags
to check if this particular mbuf needs to be dropped (and what
error should be returned).
Note that if_loop() will always return EHOSTUNREACH for "reject" routes
regardless of RTF_HOST flag existence. This is due to upcoming routing
changes where RTF_HOST value won't be available as lookup result.
All other functions require RTF_GATEWAY flag to check if they need
to return EHOSTUNREACH instead of EHOSTDOWN error.
There are 11 places where non-zero 'struct route' is passed to if_output().
For most of the callers (forwarding, bpf, arp) does not care about exact
error value. In fact, the only place where this result is propagated
is ip_output(). (ip6_output() passes NULL route to nd6_output_ifp()).
Given that, add 3 new 'struct route' flags (RT_REJECT, RT_BLACKHOLE and
RT_IS_GW) and inline function (rt_update_ro_flags()) to copy necessary
rte flags to ro_flags. Call this function in ip_output() after looking up/
verifying rte.
Reviewed by: ae
Such handler should pass different set of variables, instead
of directly providing 2 locked route entries.
Given that it hasn't been really used since at least 2012, remove
current code.
Will re-add it after finishing most major routing-related changes.
Discussed with: np
and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned
up after T/TCP removal. After all permutations over the years the result is
that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum
protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if
timestamps are in action, or is equal to t_maxopd otherwise. That's a very
rough estimate of MSS reduced by options length. Throughout the code it
was used in places, where preciseness was not important, like cwnd or
ssthresh calculations.
With this change:
- t_maxopd goes away.
- t_maxseg now stores MSS not adjusted by options.
- new function tcp_maxseg() is provided, that calculates MSS reduced by
options length. The functions gives a better estimate, since it takes
into account SACK state as well.
Reviewed by: jtl
Differential Revision: https://reviews.freebsd.org/D3593