While doing the initial SACK retransmission segment while heavily cwnd
constrained, tcp_ouput can erroneously send out the entire sendbuffer
again. This may happen after an retransmission timeout, which resets
snd_nxt to snd_una while the SACK scoreboard is still populated.
Reviewed By: tuexen, #transport
PR: 264257
PR: 263445
PR: 260393
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36637
When doing Limited Transmit send an ACK when needed by the protocol
processing (like sending ACKs with a DSACK block).
PR: 264257
PR: 263445
PR: 260393
Reviewed by: rscheff@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D36631
tcp_respond() crafts a packet and sends it directly to ip[6]output(),
bypassing tcp_output(). Hence it must increment TCP send statistics.
Reviewed by: rscheff, tuexen, rrs (implicitly)
Differential revision: https://reviews.freebsd.org/D36641
- The soisconnected() call on transition from SYN_RCVD to ESTABLISHED
is also necessary for a half-synchronized connection. Fix that
just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED.
- Provide a comment that explains at what conditions the call to
soisconnected() is necessary.
- Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN.
- Extend the change to the BBR and RACK stacks.
Note: the interaction between the accept_filter(9) and the socket layer
is not fully consistent, yet. For most accept filters this call to
soisconnected() will not move the connection from the incomplete queue
to the complete. The move would happen only when the filter has received
the desired data, and soisconnected() would be called once again from
sorwakeup(). Ideally, we should mark socket as connected only there,
and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the
simultaneous open case. However, this doesn't yet work.
Reviewed by: rscheff, tuexen, rrs
Differential revision: https://reviews.freebsd.org/D36641
Only update the offset if actually retransmitting from the
scoreboard. If not done correctly, this may result in
trying to (re)-transmit data not being being in the socket
buffe and therefore resulting in a panic.
PR: 264257
PR: 263445
PR: 260393
Reviewed by: rscheff@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D36626
While here, join two unlikely cases into one if clause.
Submitted by: Ivan Rozhuk <rozhuk.im gmail.com>
PR: 265718
Reviewed by: mjg, melifaro
Differential revision: https://reviews.freebsd.org/D36584
This changes the default TCP Congestion Control (CC) to CUBIC.
For small, transactional exchanges (e.g. web objects <15kB), this
will not have a material effect. However, for long duration data
transfers, CUBIC allocates a slightly higher fraction of the
available bandwidth, when competing against NewReno CC.
Reviewed By: tuexen, mav, #transport, guest-ccui, emaste
Relnotes: Yes
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36537
Consistently refer to the CUBIC congestion control
mechanism in uppercase throughout all comments.
No functional change.
Reviewed By: #transport, tuexen, mav, guest-ccui, emaste
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36547
In doing some testing for a different problem, I have found rack retransmitting
all outstanding data every time a timeout occurs. The outstanding is sent 1ms
apart between each packet, and then the timeout runs off again. This causes
extra retransmissions when we should be waiting for an ack after sending the
very first segment.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36494
o Retire global always running ipreass_slowtimo().
o Instead use one callout entry per hash slot. The per-slot callout
would be scheduled only if a slot has entries, and would be driven
by TTL of the very last entry.
o Make net.inet.ip.fragttl read/write and document it.
o Retire IPFRAGTTL, which used to be meaningful only with PR_SLOWTIMO.
Differential revision: https://reviews.freebsd.org/D36275
This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place. After 6f3caa6d81 it definitely
became necessary always and commit message from f1ee30ccd6 confirms that.
Now that 6f3caa6d81 is effectively backed out by 07285bb4c2, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.
Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.
Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36488
It is now unused and not having it allows further clean ups.
Reviewed by: cy, glebius, kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D36452
Although originally socket was intended to use with ipfw(4) only, now
it also can be used with pf(4). On a kernel without packet filters,
it still can be used to inject traffic.
With 61f7427f02 raw sockets protosw has wildcard pr_protocol. Protocol
of a specific pcb is stored in inp_ip_p.
Reviewed by: karels
Reported by: karels
Differential revision: https://reviews.freebsd.org/D36429
Fixes: 61f7427f02
The AccECN handshake and TCP header flags are supported,
no support yet for the AccECN option. This minimalistic
implementation is sufficient to support DCTCP while
dramatically cutting the number of ACKs, and provide ECN
response from the receiver to the CC modules.
Reviewed By: #transport, #manpages, rrs, pauamma
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D21011
While a receiver should continue sending SACK blocks for the
duration of a SACK loss recovery, if for some reason the
TCP options no longer contain these SACK blocks, but we
already started maintaining the Scoreboard, keep on handling
incoming ACKs (without SACK) as belonging to the SACK recovery.
Reported by: thj
Reviewed by: tuexen, #transport
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36046
Here go cons of using inpcb for divert:
- divert(4) uses only 16 bits (local port) out of struct inpcb,
which is 424 bytes today.
- The inpcb KPI isn't able to provide hashing for divert(4),
thus it uses global inpcb list for lookups.
- divert(4) uses INET-specific part of the KPI, making INET
a requirement for IPDIVERT.
Maintain our own very simple hash lookup database instead. It
has mutex protection for write and epoch protection for lookups.
Since now so->so_pcb no longer points to struct inpcb, don't
initialize protosw methods to methods that belong to PF_INET.
Also, drop support for setting options on a divert socket. My
review of software in base and ports confirms that this has no
use and unlikely worked before.
Differential revision: https://reviews.freebsd.org/D36382
Instead of incrementing pretty random counters in the IP statistics,
create divert socket statistics structure. Export via netstat(1).
Differential revision: https://reviews.freebsd.org/D36381
Since 4.4BSD the protosw was used to implement socket types created
by socket(2) syscall and at the same to demultiplex incoming IPv4
datagrams (later copied to IPv6). This story ended with 78b1fc05b2.
These entries (e.g. IPPROTO_ICMP) in inetsw that were added to catch
packets in ip_input(), they would also be returned by pffindproto()
if user says socket(AF_INET, SOCK_RAW, IPPROTO_ICMP). Thus, for raw
sockets to work correctly, all the entries were pointing at raw_usrreq
differentiating only in the value of pr_protocol.
With 78b1fc05b2 all these entries are no longer needed, as ip_protox
is independent of protosw. Any socket syscall requesting SOCK_RAW type
would end up with rip_protosw. And this protosw has its pr_protocol
set to 0, allowing to mark socket with any protocol.
For IPv6 raw socket the change required two small fixes:
o Validate user provided protocol value
o Always use protocol number stored in inp in rip6_attach, instead
of protosw value, which is now always 0.
Differential revision: https://reviews.freebsd.org/D36380
The divert(4) is not a protocol of IPv4. It is a socket to
intercept packets from ipfw(4) to userland and re-inject them
back. It can divert and re-inject IPv4 and IPv6 packets today,
but potentially it is not limited to these two protocols. The
IPPROTO_DIVERT does not belong to known IP protocols, it
doesn't even fit into u_char. I guess, the implementation of
divert(4) was done the way it is done basically because it was
easier to do it this way, back when protocols for sockets were
intertwined with IP protocols and domains were statically
compiled in.
Moving divert(4) out of inetsw accomplished two important things:
1) IPDIVERT is getting much closer to be not dependent on INET.
This will be finalized in following changes.
2) Now divert socket no longer aliases with raw IPv4 socket.
Domain/proto selection code won't need a hack for SOCK_RAW and
multiple entries in inetsw implementing different flavors of
raw socket can merge into one without requirement of raw IPv4
being the last member of dom_protosw.
Differential revision: https://reviews.freebsd.org/D36379
o Statically initialize max_linkhdr to default value without relying
on domain(9) code doing that.
o Statically initialize max_protohdr to a sane value, without relying
on TCP being always compiled in.
o Retire max_datalen. Set, but not used.
o Don't make the domain(9) system responsible in validating these
values and updating max_hdr. Instead provide KPI max_linkhdr_grow()
and max_protohdr_grow().
o Call max_linkhdr_grow() from IEEE802.11 and max_protohdr_grow() from
TCP. Those are the only protocols today that may want to grow.
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D36376
Construct the desired hexthops directly instead of using the
"translation" layer in form of filling rt_addrinfo data.
Simplify V_rt_add_addr_allfibs handling by using recently-added
rib_copy_route() to propagate the routes to the non-primary address
fibs.
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D36166
Currently when the peer collapses its rwnd, we mark packets to be retransmitted
and use the must_retran flags like we do when a PMTU collapses to retransmit the
collapsed packets. However this causes a problem with some middle boxes that
play with the rwnd to control flow. As soon as the rwnd increases we start resending
which may be not even a rtt.. and in fact the peer may have gotten the packets. Which
means we gratuitously retransmit packets we should not.
The fix here is to make sure that a rack time has passed before retransmitting the packets.
This makes sure that the rwnd collapse was real and the packets do need retransmission.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D35166
A while back Hans optimized the LRO code. This is great but one
optimization he did degrades the timestamp precision so that
all flushed LRO entries end up with the same LRO timestamp
if there is not a hardware timestamp. The intent of the LRO timestamp
is to get as close to the time that the packet arrived as possible. Without
the LRO queuing this works out fine since a binuptime is taken and then
the rx_common code is called. But when you go through the queue path
you end up *not* updating the M_LRO_TSTMP fields.
Another issue in the LRO code is several places that cause packet reordering. In
general TCP can handle reordering but it can cause extra un-needed retransmission
as well as other oddities. We will fix all of the reordering problems.
Lets fix this so that we restore the precision to the timestamp.
Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36043
For now it is read-only, but eventually the cycle that goes over
all fragments should be refactored and this MIB should also become
read/write.
This MIB will allow SNMP daemons to implement MIB-II ipReasmTimeout MIB
straightfoward. Right now net-snmp compilation is broken by 1922eb3e9c.
The base system bsnmpd is not broken just because it ignored PR_SLOWTIMO,
and thus always returned incorrectly doubled value for ipReasmTimeout.
o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d1), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee5).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232