7487 commits

Author SHA1 Message Date
Richard Scheffenegger
a743fc8826 tcp: fix cwnd restricted SACK retransmission loop
When sending the initial SACK retransmission segment while heavily cwnd
constrained, tcp_output() can erroneously send out the entire send buffer
again. This may happen after a retransmission timeout, which resets
snd_nxt to snd_una while the SACK scoreboard is still populated.

Reviewed By:		tuexen, #transport
PR:			264257
PR:			263445
PR:			260393
MFC after:		3 days
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36637
2022-09-22 13:28:43 +02:00
Michael Tuexen
5ae83e0d87 tcp: send ACKs when requested
When doing Limited Transmit, send an ACK when the protocol processing
requires one (for example, an ACK carrying a DSACK block).

PR:			264257
PR:			263445
PR:			260393
Reviewed by:		rscheff@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36631
2022-09-22 12:12:11 +02:00
Gleb Smirnoff
9453ec6619 tcp: increment tcpstats in tcp_respond()
tcp_respond() crafts a packet and sends it directly to ip[6]output(),
bypassing tcp_output().  Hence it must increment TCP send statistics.

Reviewed by:		rscheff, tuexen, rrs (implicitly)
Differential revision:	https://reviews.freebsd.org/D36641
2022-09-21 14:03:33 -07:00
Gleb Smirnoff
493105c2a8 tcp: fix simultaneous open and refine e80062a2d4
- The soisconnected() call on transition from SYN_RCVD to ESTABLISHED
  is also necessary for a half-synchronized connection.  Fix that by
  just setting the flag when we transition SYN-SENT -> SYN-RECEIVED.
- Provide a comment that explains at what conditions the call to
  soisconnected() is necessary.
- Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN.
- Extend the change to the BBR and RACK stacks.

Note: the interaction between the accept_filter(9) and the socket layer
is not fully consistent, yet.  For most accept filters this call to
soisconnected() will not move the connection from the incomplete queue
to the complete.  The move would happen only when the filter has received
the desired data, and soisconnected() would be called once again from
sorwakeup().  Ideally, we should mark socket as connected only there,
and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the
simultaneous open case.  However, this doesn't yet work.

Reviewed by:		rscheff, tuexen, rrs
Differential revision:	https://reviews.freebsd.org/D36641
2022-09-21 14:02:49 -07:00
Gleb Smirnoff
0c7f3ae8c6 tcpcb: fix tabulation count in 4012ef7754c and abbreviate "packets"
This lines up the comments with the rest of the file.  The abbreviation
helps the lines fit in an 80-column terminal.  Not a functional change.
2022-09-19 10:29:53 -07:00
Michael Tuexen
6d9e911fba tcp: fix computation of offset
Only update the offset if actually retransmitting from the
scoreboard. If not done correctly, this may result in
trying to (re)transmit data that is not in the socket
buffer, and therefore in a panic.

PR:			264257
PR:			263445
PR:			260393
Reviewed by:		rscheff@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36626
2022-09-19 12:49:31 +02:00
Gleb Smirnoff
da6715bbb1 ip_output: always increase "cantfrag" stat if ip_fragment() fails
While here, join two unlikely cases into one if clause.

Submitted by:		Ivan Rozhuk <rozhuk.im gmail.com>
PR:			265718
Reviewed by:		mjg, melifaro
Differential revision:	https://reviews.freebsd.org/D36584
2022-09-14 19:22:40 -07:00
Gleb Smirnoff
15b73a2a14 ip_reass: use correct comparison in ipreass_callout()
Reported-by:	syzbot+55415dc73f9b89b87fce@syzkaller.appspotmail.com
2022-09-14 08:32:07 -07:00
Richard Scheffenegger
bb1d472d79 tcp: make CUBIC the default congestion control mechanism.
This changes the default TCP Congestion Control (CC) to CUBIC.
For small, transactional exchanges (e.g. web objects <15kB), this
will not have a material effect. However, for long duration data
transfers, CUBIC allocates a slightly higher fraction of the
available bandwidth, when competing against NewReno CC.

Reviewed By: tuexen, mav, #transport, guest-ccui, emaste
Relnotes: Yes
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36537
2022-09-13 12:09:21 +02:00
Richard Scheffenegger
ea6d0de299 tcp: Make all references to CUBIC uppercase
Consistently refer to the CUBIC congestion control
mechanism in uppercase throughout all comments.

No functional change.

Reviewed By: #transport, tuexen, mav, guest-ccui, emaste
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36547
2022-09-13 12:07:06 +02:00
Dag-Erling Smørgrav
c198adf394 siftr: spell PFIL_PASS correctly.
Sponsored by:	NetApp
Sponsored by:	Klara Inc.
Differential Revision: https://reviews.freebsd.org/D36539
2022-09-12 19:20:10 +02:00
Mateusz Guzik
1760a6950a Fixup build after recent getsock changes
2022-09-10 20:40:43 +00:00
Mateusz Guzik
3212ad15ab Add getsock
All but one consumer of getsock_cap() pass only 4 arguments.
Take advantage of it.
2022-09-10 19:47:47 +00:00
Gleb Smirnoff
29b4b63c59 ip_reass: optimize ipreass_drain_vnet()
- Call ipreass_reschedule() only once per slot [1]
- Aggregate stats and update them once

Suggested by:	jtl [1]
2022-09-10 02:17:15 -07:00
Gleb Smirnoff
13018bfae8 ip_reass: make stray callout assertion more verbose
Syzkaller hits this assertion, but can't find a reproducer.  I have also
never seen it hit in my testing.  Try to get more information via syzkaller.
2022-09-10 02:11:39 -07:00
Gleb Smirnoff
c8bc874172 ip_reass: fixup the just added tunable
- Don't use hardcoded hash mask
- free the memory on VNET destroy

Fixes:	1494f4776a
2022-09-09 09:19:39 -07:00
Randall Stewart
81560c5582 TCP: Rack ends up sending all that is outstanding every timeout.
While testing for a different problem, I found rack retransmitting
all outstanding data every time a timeout occurs. The outstanding data is sent
with 1 ms between packets, and then the timeout fires again. This causes
extra retransmissions when we should be waiting for an ACK after sending the
very first segment.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36494
2022-09-09 08:59:21 -04:00
Gleb Smirnoff
1494f4776a ip_reass: add loader tunable to tune the reassembly hash size
2022-09-08 13:49:58 -07:00
Gleb Smirnoff
a30cb31589 ip_reass: retire ipreass_slowtimo() in favor of per-slot callout
o Retire global always running ipreass_slowtimo().
o Instead use one callout entry per hash slot.  The per-slot callout
  would be scheduled only if a slot has entries, and would be driven
  by TTL of the very last entry.
o Make net.inet.ip.fragttl read/write and document it.
o Retire IPFRAGTTL, which used to be meaningful only with PR_SLOWTIMO.

Differential revision:	https://reviews.freebsd.org/D36275
2022-09-08 13:49:58 -07:00
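
The per-slot callout scheme described in a30cb31589 can be sketched in userland C. This is only an illustration under stated assumptions, not the kernel code: the struct and function names, the fixed slot capacity, and the integer "tick" clock are all hypothetical, and "the very last entry" is interpreted here as the oldest entry in the slot. The point it shows is that a slot's timer is armed only while the slot holds entries, and that it fires at the expiry time of the oldest one.

```c
#include <assert.h>
#include <stdbool.h>

#define SLOT_CAP 8              /* hypothetical per-slot capacity */

/* One hash slot with its own timer (all names are illustrative). */
struct slot {
	int  expiry[SLOT_CAP];  /* FIFO: expiry[0] belongs to the oldest entry */
	int  n;                 /* number of queued entries */
	bool armed;             /* is the per-slot timer scheduled? */
	int  fire_at;           /* tick at which the timer fires, if armed */
};

/* Arm the timer for the oldest entry, or disarm it if the slot is empty. */
static void
slot_reschedule(struct slot *s)
{
	s->armed = (s->n > 0);
	if (s->armed)
		s->fire_at = s->expiry[0];
}

/* Queue a new entry that expires ttl ticks from now. */
static bool
slot_insert(struct slot *s, int now, int ttl)
{
	assert(ttl > 0);
	if (s->n == SLOT_CAP)
		return (false);
	s->expiry[s->n++] = now + ttl;
	slot_reschedule(s);
	return (true);
}

/* Timer handler: drop expired entries and return how many were dropped. */
static int
slot_timeout(struct slot *s, int now)
{
	int dropped = 0;

	while (dropped < s->n && s->expiry[dropped] <= now)
		dropped++;
	for (int i = dropped; i < s->n; i++)
		s->expiry[i - dropped] = s->expiry[i];
	s->n -= dropped;
	slot_reschedule(s);
	return (dropped);
}
```

With a uniform TTL the FIFO order matches expiry order, so an idle slot costs nothing and a busy slot needs only one pending timer.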
Mateusz Guzik
dda6376b04 net: employ newly added pfil_mbuf_{in,out} where appropriate
Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D36454
2022-09-08 16:21:08 +00:00
Gleb Smirnoff
e80062a2d4 tcp: avoid call to soisconnected() on transition to ESTABLISHED
This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place.  After 6f3caa6d81 it definitely
became necessary always and commit message from f1ee30ccd6 confirms that.
Now that 6f3caa6d81 is effectively backed out by 07285bb4c2, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.

Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D36488
2022-09-08 09:16:04 -07:00
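
The idea in e80062a2d4 of notifying the socket layer only for flagged connections can be sketched as follows. This is a minimal userland illustration, not the kernel code: the flag name and value, the struct, and the counter that stands in for soisconnected() are hypothetical, and clearing the flag after the call is an assumption.

```c
#include <assert.h>

#define TF_SKETCH_INCQUEUE 0x1u  /* hypothetical flag bit */

enum tcp_state { SYN_RCVD, ESTABLISHED };

struct conn {
	enum tcp_state state;
	unsigned int   flags;
	int            soisconnected_calls;  /* stand-in for the real call */
};

/* Stand-in for soisconnected(): only records that it was invoked. */
static void
sketch_soisconnected(struct conn *c)
{
	c->soisconnected_calls++;
}

/* Move to ESTABLISHED; notify the socket layer only if flagged. */
static void
enter_established(struct conn *c)
{
	assert(c->state == SYN_RCVD);
	c->state = ESTABLISHED;
	if (c->flags & TF_SKETCH_INCQUEUE) {
		c->flags &= ~TF_SKETCH_INCQUEUE;  /* assumption: one-shot flag */
		sketch_soisconnected(c);
	}
}
```

A connection without the flag transitions silently, while a flagged one (for example a socket behind an accept filter) triggers exactly one notification.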
Mateusz Guzik
14c9a2dbfb net: retire PFIL_FWD
It is now unused and not having it allows further clean ups.

Reviewed by:	cy, glebius, kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D36452
2022-09-07 10:04:31 +00:00
Mateusz Guzik
223a73a1c4 net: remove stale altq_input reference
Code setting it was removed in:
commit 325fab802e
Author: Eric van Gyzen <vangyzen@FreeBSD.org>
Date:   Tue Dec 4 23:46:43 2018 +0000

    altq: remove ALTQ3_COMPAT code

Reviewed by:	glebius, kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D36471
2022-09-07 10:03:12 +00:00
Gleb Smirnoff
aa74cc6d6f divert(4): do not depend on ipfw(4)
Although the socket was originally intended for use with ipfw(4) only,
it can now also be used with pf(4).  On a kernel without packet filters,
it can still be used to inject traffic.
2022-09-06 20:54:57 -07:00
Gleb Smirnoff
999c9fd733 divert(4): don't check for CSUM_SCTP without INET
This compiles, but is actually dead code.

Noticed by:	bz
Fixes:		e72c522858
2022-09-06 20:54:57 -07:00
Gleb Smirnoff
0773b44e82 tcp: tcp6_connect() requires net epoch
PR:			262663
Reported & tested by:	dch
MFC after:		2 weeks
2022-09-05 10:19:11 -07:00
Gordon Bergling
347b1991b0 netdump(4): Correct a typo in source code comment
- s/occured/occurred/

MFC after:	3 days
2022-09-04 12:59:29 +02:00
Gordon Bergling
c3679af313 tcp_rack: Correct some typos in source code comments
- s/occured/occurred/

MFC after:	3 days
2022-09-04 12:58:13 +02:00
Gordon Bergling
893f36b7f1 netinet: Correct a typo in source code comment
- s/occured/occurred/

MFC after:	3 days
2022-09-04 12:57:12 +02:00
Gordon Bergling
d07a501876 tcp_hpts: Correct some typos in source code comments
- s/occured/occurred/
- s/the the/the/

MFC after:	3 days
2022-09-04 12:47:49 +02:00
Gordon Bergling
fa52f9dc9a tcp_rack: Fix two typos in source code comments
- s/overriden/overridden/

MFC after:	3 days
2022-09-03 15:05:42 +02:00
Gleb Smirnoff
74ed2e8ab2 raw ip: fix regression with multicast and RSVP
With 61f7427f02 raw sockets protosw has wildcard pr_protocol.  Protocol
of a specific pcb is stored in inp_ip_p.

Reviewed by:		karels
Reported by:		karels
Differential revision:	https://reviews.freebsd.org/D36429
Fixes:			61f7427f02
2022-09-02 12:17:09 -07:00
Richard Scheffenegger
4012ef7754 tcp: Functional implementation of Accurate ECN
The AccECN handshake and TCP header flags are supported,
no support yet for the AccECN option. This minimalistic
implementation is sufficient to support DCTCP while
dramatically cutting the number of ACKs, and provide ECN
response from the receiver to the CC modules.

Reviewed By:		#transport, #manpages, rrs, pauamma
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D21011
2022-08-31 15:05:53 +02:00
Richard Scheffenegger
c21b7b55be tcp: finish SACK loss recovery on sudden lack of SACK blocks
While a receiver should continue sending SACK blocks for the
duration of a SACK loss recovery, if for some reason the
TCP options no longer contain these SACK blocks, but we
already started maintaining the Scoreboard, keep on handling
incoming ACKs (without SACK) as belonging to the SACK recovery.

Reported by:		thj
Reviewed by:		tuexen, #transport
MFC after:		2 weeks
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36046
2022-08-31 14:49:47 +02:00
Gleb Smirnoff
e72c522858 divert(4): make it compilable and working without INET
Differential revision:	https://reviews.freebsd.org/D36383
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
f1fb051716 divert(4): maintain own cb database and stop using inpcb KPI
Here are the drawbacks of using inpcb for divert:
- divert(4) uses only 16 bits (local port) out of struct inpcb,
  which is 424 bytes today.
- The inpcb KPI isn't able to provide hashing for divert(4),
  thus it uses global inpcb list for lookups.
- divert(4) uses INET-specific part of the KPI, making INET
  a requirement for IPDIVERT.

Maintain our own very simple hash lookup database instead.  It
has mutex protection for write and epoch protection for lookups.
Since so->so_pcb no longer points to struct inpcb, don't
initialize protosw methods to methods that belong to PF_INET.
Also, drop support for setting options on a divert socket.  My
review of software in base and ports confirms that this has no
use and is unlikely to have worked before.

Differential revision:	https://reviews.freebsd.org/D36382
2022-08-30 15:09:21 -07:00
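
A "very simple hash lookup database" keyed by the 16-bit local port, as described in f1fb051716, might look like the sketch below. It is a single-threaded userland illustration: the names, the hash size, and the chaining scheme are hypothetical and not taken from the source tree, and the mutex (for writers) and epoch (for readers) protection mentioned in the commit message is deliberately omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIV_HASHSIZE 16                         /* hypothetical, power of two */
#define DIV_HASH(port) ((port) & (DIV_HASHSIZE - 1))

/* Minimal control block: the port is the only lookup key. */
struct div_pcb {
	uint16_t        port;
	struct div_pcb *next;   /* hash chain */
};

static struct div_pcb *div_hash[DIV_HASHSIZE];

/* Find the control block bound to a port, or NULL. */
static struct div_pcb *
div_lookup(uint16_t port)
{
	for (struct div_pcb *p = div_hash[DIV_HASH(port)]; p != NULL; p = p->next)
		if (p->port == port)
			return (p);
	return (NULL);
}

/* Bind a control block to a port; fail with -1 if the port is taken. */
static int
div_bind(struct div_pcb *pcb, uint16_t port)
{
	assert(pcb != NULL);
	if (div_lookup(port) != NULL)
		return (-1);
	pcb->port = port;
	pcb->next = div_hash[DIV_HASH(port)];
	div_hash[DIV_HASH(port)] = pcb;
	return (0);
}
```

Compared with walking a global list, a lookup here touches only the chain of one bucket, and the control block carries nothing beyond what divert needs.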
Gleb Smirnoff
2b1c72171e divert(4): provide statistics
Instead of incrementing pretty random counters in the IP statistics,
create divert socket statistics structure.  Export via netstat(1).

Differential revision:	https://reviews.freebsd.org/D36381
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
61f7427f02 protosw: cleanup protocols that existed merely to provide pr_input
Since 4.4BSD the protosw was used to implement socket types created
by the socket(2) syscall and at the same time to demultiplex incoming
IPv4 datagrams (later copied to IPv6).  This story ended with 78b1fc05b2.

Entries in inetsw such as IPPROTO_ICMP were added to catch packets in
ip_input(), but they would also be returned by pffindproto() if the
user called socket(AF_INET, SOCK_RAW, IPPROTO_ICMP).  Thus, for raw
sockets to work correctly, all these entries pointed at raw_usrreq,
differing only in the value of pr_protocol.

With 78b1fc05b2 all these entries are no longer needed, as ip_protox
is independent of protosw.  Any socket syscall requesting the SOCK_RAW
type ends up with rip_protosw.  This protosw has its pr_protocol
set to 0, which allows a socket to be marked with any protocol.

For IPv6 raw socket the change required two small fixes:
o Validate user provided protocol value
o Always use protocol number stored in inp in rip6_attach, instead
  of protosw value, which is now always 0.

Differential revision:	https://reviews.freebsd.org/D36380
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
8624f4347e divert: declare PF_DIVERT domain and stop abusing PF_INET
The divert(4) is not a protocol of IPv4.  It is a socket to
intercept packets from ipfw(4) to userland and re-inject them
back.  It can divert and re-inject IPv4 and IPv6 packets today,
but potentially it is not limited to these two protocols.  The
IPPROTO_DIVERT does not belong to known IP protocols, it
doesn't even fit into u_char.  I guess, the implementation of
divert(4) was done the way it is done basically because it was
easier to do it this way, back when protocols for sockets were
intertwined with IP protocols and domains were statically
compiled in.

Moving divert(4) out of inetsw accomplished two important things:

1) IPDIVERT is getting much closer to not depending on INET.
   This will be finalized in following changes.
2) The divert socket no longer aliases with the raw IPv4 socket.
   Domain/proto selection code won't need a hack for SOCK_RAW, and
   multiple entries in inetsw implementing different flavors of
   raw socket can merge into one without requiring raw IPv4
   to be the last member of dom_protosw.

Differential revision:	https://reviews.freebsd.org/D36379
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
c00605751e tcp: remove dead code left over from T/TCP,
which doesn't have any value today.
2022-08-29 19:30:12 -07:00
Gleb Smirnoff
8fc8063849 divert: merge div_output() into div_send()
No functional change intended.
2022-08-29 19:15:01 -07:00
Gleb Smirnoff
c414347bc5 mbufs: isolate max_linkhdr and max_protohdr handling in the mbuf code
o Statically initialize max_linkhdr to default value without relying
  on domain(9) code doing that.
o Statically initialize max_protohdr to a sane value, without relying
  on TCP being always compiled in.
o Retire max_datalen. Set, but not used.
o Don't make the domain(9) system responsible for validating these
  values and updating max_hdr.  Instead provide the KPI max_linkhdr_grow()
  and max_protohdr_grow().
o Call max_linkhdr_grow() from IEEE802.11 and max_protohdr_grow() from
  TCP.  Those are the only protocols today that may want to grow.

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D36376
2022-08-29 19:14:25 -07:00
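
The grow-only KPI named in c414347bc5 can be illustrated with a small userland sketch. The function names come from the commit message, but the bodies, the default values, and the size limit below are assumptions made for illustration and should not be read as the kernel implementation.

```c
#include <assert.h>

#define SKETCH_MHLEN 160u   /* hypothetical upper bound on header space */

/* Statically initialized defaults (values are illustrative only). */
static unsigned int max_linkhdr  = 16;
static unsigned int max_protohdr = 40;
static unsigned int max_hdr      = 16 + 40;

/* Raise max_linkhdr if needed; never shrink it. */
static void
max_linkhdr_grow(unsigned int newlen)
{
	if (newlen > max_linkhdr) {
		max_linkhdr = newlen;
		max_hdr = max_linkhdr + max_protohdr;
		assert(max_hdr <= SKETCH_MHLEN);
	}
}

/* Raise max_protohdr if needed; never shrink it. */
static void
max_protohdr_grow(unsigned int newlen)
{
	if (newlen > max_protohdr) {
		max_protohdr = newlen;
		max_hdr = max_linkhdr + max_protohdr;
		assert(max_hdr <= SKETCH_MHLEN);
	}
}
```

Because each setter only grows its value and recomputes max_hdr itself, a protocol can request more header space at load time without the domain code having to validate anything.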
Alexander V. Chernikov
7b3440fc30 Revert "routing: install prefix and loopback routes using new nhop-based KPI."
Temporarily revert the commit to unblock testing.

This reverts commit a1b59379db.
2022-08-29 16:20:42 +00:00
Alexander V. Chernikov
a1b59379db routing: install prefix and loopback routes using new nhop-based KPI.
Construct the desired nexthops directly instead of using the
 "translation" layer in the form of filling rt_addrinfo data.
Simplify V_rt_add_addr_allfibs handling by using recently-added
 rib_copy_route() to propagate the routes to the non-primary address
 fibs.

MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D36166
2022-08-29 10:07:58 +00:00
Michael Tuexen
c624b9a549 tcp: fix stats counter for SYN_RCVD state when TCP-FO is used
Reviewed by:		glebius
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36384
2022-08-28 18:45:59 +02:00
Randall Stewart
62ce18fc9a tcp: Rack rwnd collapse.
Currently, when the peer collapses its rwnd, we mark packets to be retransmitted
and use the must_retran flags, as we do when the PMTU collapses, to retransmit the
collapsed packets. However, this causes a problem with some middleboxes that
manipulate the rwnd to control flow. As soon as the rwnd increases we start resending,
possibly before even one RTT has elapsed, and the peer may in fact have received the
packets. This means we gratuitously retransmit packets that we should not.

The fix here is to make sure that a rack time has passed before retransmitting the packets.
This makes sure that the rwnd collapse was real and the packets do need retransmission.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D35166
2022-08-23 09:17:05 -04:00
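
The fix in 62ce18fc9a gates retransmission of collapsed packets on elapsed time. A hedged sketch of such a gate is shown below. The struct, the function name, and the microsecond clock are hypothetical, and the wait threshold is passed in as a parameter because the commit message does not say how the "rack time" is computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Per-segment bookkeeping (illustrative, not the rack data structures). */
struct seg {
	uint32_t sent_at_us;   /* when the segment was last sent */
	bool     collapsed;    /* marked because the peer shrank its rwnd */
};

/*
 * Allow a collapsed segment to be retransmitted only once wait_us
 * microseconds have passed since it was sent.  Unsigned subtraction
 * keeps the comparison correct across clock wraparound.
 */
static bool
may_retransmit_collapsed(const struct seg *s, uint32_t now_us, uint32_t wait_us)
{
	assert(s != NULL);
	if (!s->collapsed)
		return (false);
	return ((uint32_t)(now_us - s->sent_at_us) >= wait_us);
}
```

If the window reopens quickly, the gate stays closed and an ACK for the original transmission has time to arrive before anything is resent.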
Randall Stewart
4e0ce82b53 TCP Lro has a loss of timestamp precision and reorders packets.
A while back Hans optimized the LRO code. This is great but one
optimization he did degrades the timestamp precision so that
all flushed LRO entries end up with the same LRO timestamp
if there is not a hardware timestamp. The intent of the LRO timestamp
is to get as close to the time that the packet arrived as possible. Without
the LRO queuing this works out fine since a binuptime is taken and then
the rx_common code is called. But when you go through the queue path
you end up *not* updating the M_LRO_TSTMP fields.

Another issue in the LRO code is that several places cause packet reordering. In
general TCP can handle reordering, but it can cause extra unneeded retransmissions
as well as other oddities. We will fix all of the reordering problems.

Let's fix this so that we restore the precision of the timestamp.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36043
2022-08-23 09:12:31 -04:00
Gleb Smirnoff
6498153665 ip_reass: don't drain all vnets on a vnet destroy
2022-08-21 07:44:58 -07:00
Gleb Smirnoff
8338690a0a ip_reass: provide sysctl MIB returning IP fragment TTL
For now it is read-only, but eventually the cycle that goes over
all fragments should be refactored and this MIB should also become
read/write.

This MIB will allow SNMP daemons to implement the MIB-II ipReasmTimeout
object in a straightforward way.  Right now net-snmp compilation is broken by 1922eb3e9c.
The base system bsnmpd is not broken only because it ignored PR_SLOWTIMO,
and thus always returned an incorrectly doubled value for ipReasmTimeout.
2022-08-20 13:39:12 -07:00
Gleb Smirnoff
e7d02be19d protosw: refactor protosw and domain static declaration and load
o Assert that every protosw has pr_attach.  Now this structure is
  only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw.  This was suggested
  in 1996 by wollman@ (see 7b187005d1), and later reiterated
  in 2006 by rwatson@ (see 6fbb9cf860).
o Make struct domain hold a variable sized array of protosw pointers.
  For most protocols these pointers are initialized statically.
  Those domains that may have loadable protocols have spacers. IPv4
  and IPv6 have 8 spacers each (andre@ dff3237ee5).
o For inetsw and inet6sw leave a comment noting that many protosw
  entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36232
2022-08-17 11:50:32 -07:00