Commit graph

624 commits

Author SHA1 Message Date
Andre Oppermann
3c914c547e Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore.  The upper limit is still at
IP_MAXPACKET (65536 bytes).  Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found.  When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by:	cperciva (earlier version)
Reviewed by:	cperciva
Tested by:	cperciva
MFC after:	1 week (using spare struct members to preserve ABI)
2013-06-03 12:55:13 +00:00
Andre Oppermann
5628dd0893 When doing RFC3042 limited transmit on the first on second
duplicate ACK make sure we actually have new data to send.
This prevents us from sending unneccessary pure ACKs.

Reported by:	Matt Miller <matt@matthewjmiller.net>
Tested by:	Matt Miller <matt@matthewjmiller.net>
MFC after:	2 weeks
2013-04-23 14:06:32 +00:00
Andre Oppermann
982c1675ff Fix a race condition on tcp listen socket teardown with pending
connections in the accept queue and contiguous new incoming SYNs.

Compared to the original submitters patch I've moved the test
next to the SYN handling to have it together in a logical unit
and reworded the comment explaining the issue.

Submitted by:	Matt Miller <matt@matthewjmiller.net>
Submitted by:	Juan Mojica <jmojica@gmail.com>
Reviewed by:	Matt Miller (changes)
Tested by:	pho
MFC after:	1 week
2013-04-09 20:52:26 +00:00
Gleb Smirnoff
4a21e86ec1 Fix VIMAGE build. 2013-04-09 09:15:26 +00:00
Gleb Smirnoff
5923c29332 Merge from projects/counters: TCP/IP stats.
Convert 'struct ipstat' and 'struct tcpstat' to counter(9).

  This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.

Sponsored by:	Nginx, Inc.
2013-04-08 19:57:21 +00:00
Ed Maste
ce7ad6640c Keep fwd_tag around for subsequent pcb lookups
For TIMEWAIT handling tcp_input may have to jump back for an additional
pass through pcblookup.  Prior to this change the fwd_tag had been
discarded after the first lookup, so a new connection attempt delivered
locally via 'ipfw fwd' would fail to find a match.

As of r248886 the tag will be detached and freed when passed to the
socket buffer.
2013-03-29 20:51:44 +00:00
Lawrence Stewart
5b648e797b Simplify and fix a bug in cc_ack_received()'s "are we congestion window limited"
logic (refer to [1] for associated discussion). snd_cwnd and snd_wnd are
unsigned long and on 64 bit hosts, min() will truncate them to 32 bits and could
therefore potentially corrupt the result (although under normal operation,
neither variable should legitmately exceed 32 bits).

[1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html

Submitted by:	jhb
MFC after:	1 week
2013-01-22 09:44:21 +00:00
Gleb Smirnoff
b8056fae06 Fix !INET6 build after r244365. 2012-12-18 08:14:16 +00:00
Gleb Smirnoff
dd029d52fa Clear correct flag in INET6 case. 2012-12-18 08:09:44 +00:00
Andrey V. Elsukov
f491274582 Since we use different flags to detect tcp forwarding, and we share the
same code for IPv4 and IPv6 in tcp_input, we should check both
M_IP_NEXTHOP and M_IP6_NEXTHOP flags.

MFC after:	3 days
2012-12-17 20:55:33 +00:00
Gleb Smirnoff
78a7880f64 Fix a crash in tcp_input(), that happens when mbuf has a fwd_tag on it,
but later after processing and freeing the tag, we need to jump back again
to the findpcb label. Since the fwd_tag pointer wasn't NULL we tried to
process and free the tag for second time.

Reported & tested by:	Pawel Tyll <ptyll nitronet.pl>
MFC after:		3 days
2012-12-12 17:41:21 +00:00
Andre Oppermann
60ee3bb213 Back out r242262. The simplified window change/update logic wasn't
complete and ready for production use.

PR:	kern/173309
2012-11-05 09:13:06 +00:00
Andrey V. Elsukov
ffdbf9da3b Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by:	andre
2012-11-02 01:20:55 +00:00
Andre Oppermann
09440655fe Increase the initial CWND to 10 segments as defined in IETF TCPM
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.

As long as it remains a draft it is placed under a sysctl marking it
as experimental:
 net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.

This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet.  Also the restart
window isn't yet increased as allowed.  Both will be adjusted with
upcoming changes.

Is is enabled by default.  In Linux it is enabled since kernel 3.0.

MFC after:	2 weeks
2012-10-28 19:47:46 +00:00
Andre Oppermann
79ce26a08c Simplify and enhance the window change/update acceptance logic,
especially in the presence of bi-directional data transfers.

snd_wl1 tracks the right edge, including data in the reassembly
queue, of valid incoming data.  This makes it like rcv_nxt plus
reassembly.  It never goes backwards to prevent older, possibly
reordered segments from updating the window.

snd_wl2 tracks the left edge of sent data.  This makes it a duplicate
of snd_una.  However joining them right now is difficult due to
separate update dependencies in different places in the code flow.

snd_wnd tracks the current advertized send window by the peer.  In
tcp_output() the effective window is calculated by subtracting the
already in-flight data, snd_nxt less snd_una, from it.

ACK's become the main clock of window updates and will always update
the window when the left edge of what we sent is advanced.  The ACK
clock is the primary signaling mechanism in ongoing data transfers.
This works reliably even in the presence of reordering, reassembly
and retransmitted segments.  The ACK clock is most important because
it determines how much data we are allowed to inject into the network.

Zero window updates get us out of persistence mode are crucial.  Here
a segment that neither moves ACK nor SEQ but enlarges WND is accepted.

When the ACK clock is not active (that is we're not or no longer
sending any data) any segment that moves the extended right SEQ edge,
including out-of-order segments, updates the window.  This gives us
updates especially during ping-pong transfers where the peer isn't
done consuming the already acknowledged data from the receive buffer
while responding with data.

The SSH protocol is a prime candidate to benefit from the improved
bi-directional window update logic as it has its own windowing
mechanism on top of TCP and is frequently sending back protocol ACK's.

Tcpdump provided by:	darrenr
Tested by:	darrenr
MFC after:	2 weeks
2012-10-28 19:16:22 +00:00
Andre Oppermann
4faaea5505 Allow arbitrary MSS sizes and don't mind about the cluster size anymore.
We've got more cluster sizes for quite some time now and the orginally
imposed limits and the previously codified thoughts on efficiency gains
are no longer true.

MFC after:	2 weeks
2012-10-28 18:33:52 +00:00
Andre Oppermann
cf8f04f4c0 When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment.  This reduction got lost
some time ago due to a change in initialization ordering.

Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.

MFC after:	2 weeks
2012-10-28 17:25:08 +00:00
Andre Oppermann
22efabd40c Adjust the initial default CWND upon connection establishment to the
new and increased values specified by RFC5681 Section 3.1.

The even larger initial CWND per RFC3390, if enabled, is not affected.

MFC after:	2 weeks
2012-10-28 17:16:09 +00:00
Andrey V. Elsukov
c1de64a495 Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by:	Yandex LLC
Discussed with:	net@
MFC after:	2 weeks
2012-10-25 09:39:14 +00:00
Gleb Smirnoff
8ad458a471 Do not reduce ip_len by size of IP header in the ip_input()
before passing a packet to protocol input routines.
  For several protocols this mean that now protocol needs to
do subtraction itself, and for another half this means that
we do not need to add header length back to the packet.

  Make ip_stripoptions() to adjust ip_len, since now we enter
this function with a packet header whose ip_len does represent
length of entire packet, not payload only.
2012-10-23 08:33:13 +00:00
Gleb Smirnoff
8f134647ca Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

  After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

  After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by:	luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by:	ray, Olivier Cochard-Labbe <olivier cochard.me>
2012-10-22 21:09:03 +00:00
Gleb Smirnoff
105bd2113b In ip_stripoptions():
- Remove unused argument and incorrect comment.
  - Fixup ip_len after stripping.
2012-10-12 09:24:24 +00:00
Randall Stewart
ec03d5433f This small change takes care of a race condition
that can occur when both sides close at the same time.
If that occurs, without this fix the connection enters
FIN1 on both sides and they will forever send FIN|ACK at
each other until the connection times out. This is because
we stopped processing the FIN|ACK and thus did not advance
the sequence and so never ACK'd each others FIN. This
fix adjusts it so we *do* process the FIN properly and
the race goes away ;-)

MFC after:	1 month
2012-08-25 09:26:37 +00:00
Robert Watson
0989f56cff Update some stale comments regarding tcbinfo locking in the TCP input
path: read locks on tcbinfo are no longer used, so won't happen.  No
functional change.

MFC after:	3 days
2012-07-22 17:31:36 +00:00
Navdeep Parhar
09fe63205c - Updated TOE support in the kernel.
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
  These are available as t3_tom and t4_tom modules that augment cxgb(4)
  and cxgbe(4) respectively.  The cxgb/cxgbe drivers continue to work as
  usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs).  T4 iWARP in the
  works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload?  Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded?  Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by:	bz, gnn
Sponsored by:	Chelsio communications.
MFC after:	~3 months (after 9.1, and after ensuring MFC is feasible)
2012-06-19 07:34:13 +00:00
Maksim Yevmenkin
77d396fd18 Plug more refcount leaks and possible NULL deref for interface
address list.

Submitted by:	scottl@
MFC after:	3 days
2012-06-04 18:43:51 +00:00
Bjoern A. Zeeb
356ab07e2d It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading.  Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.

To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.

Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible).  Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation.  Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.

This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.

Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.

Individual driver updates will have to follow, as will SCTP.

Reported by:	gallatin, dim, ..
Reviewed by:	gallatin (glanced at?)
MFC after:	3 days
X-MFC with:	r235961,235959,235958
2012-05-28 09:30:13 +00:00
Bjoern A. Zeeb
45747ba53c MFp4 bz_ipv6_fast:
Add code to handle pre-checked TCP checksums as indicated by mbuf
  flags to save the entire computation for validation if not needed.

  In the IPv6 TCP output path only compute the pseudo-header checksum,
  set the checksum offset in the mbuf field along the appropriate flag
  as done in IPv4.

  In tcp_respond() just initialize the IPv6 payload length to 0 as
  ip6_output() will properly set it.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 02:23:26 +00:00
Bjoern A. Zeeb
3a9391defb MFp4 bz_ipv6_fast:
Factor out the tcp_hc_getmtu() call.  As the comments say it
  applies to both v4 and v6, so only write it once making it easier
  to read the protocol family specifc code.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 01:13:39 +00:00
Gleb Smirnoff
ef341ee1e3 When we receive an ICMP unreach need fragmentation datagram, we take
proposed MTU value from it and update the TCP host cache. Then
tcp_mss_update() is called on the corresponding tcpcb. It finds the
just allocated entry in the TCP host cache and updates MSS on the
tcpcb. And then we do a fast retransmit of what we have in the tcp
send buffer.

This sequence gets broken if the TCP host cache is exausted. In this
case allocation fails, and later called tcp_mss_update() finds nothing
in cache. The fast retransmit is done with not reduced MSS and is
immidiately replied by remote host with new ICMP datagrams and the
cycle repeats. This ping-pong can go up to wirespeed.

To fix this:
- tcp_mss_update() gets new parameter - mtuoffer, that is like
  offer, but needs to have min_protoh subtracted.
- tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify().
- tcp_mtudisc() now accepts not a useless error argument, but proposed
  MTU value, that is passed to tcp_mss_update() as mtuoffer.

Reported by:	az
Reported by:	Andrey Zonov <andrey zonov.org>
Reviewed by:	andre (previous version of patch)
2012-04-16 13:49:03 +00:00
Bjoern A. Zeeb
d8951c8a2f Fix PAWS (Protect Against Wrapped Sequence numbers) in cases when
hz >> 1000 and thus getting outside the timestamp clock frequenceny of
1ms < x < 1s per tick as mandated by RFC1323, leading to connection
resets on idle connections.

Always use a granularity of 1ms using getmicrouptime() making all but
relevant callouts independent of hz.

Use getmicrouptime(), not getmicrotime() as the latter may make a jump
possibly breaking TCP nfsroot mounts having our timestamps move forward
for more than 24.8 days in a second without having been idle for that
long.

PR:		kern/61404
Reviewed by:	jhb, mav, rrs
Discussed with:	silby, lstewart
Sponsored by:	Sandvine Incorporated (originally in 2011)
MFC after:	6 weeks
2012-02-15 16:09:56 +00:00
Gleb Smirnoff
9077f38738 Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT, that allow to control initial timeout, idle time, idle
re-send interval and idle send count on a per-socket basis.

Reviewed by:	andre, bz, lstewart
2012-02-05 16:53:02 +00:00
John Baldwin
1e96ae8193 Remove the assertion from tcp_input() that rcv_nxt is always greater
than or equal to rcv_adv and fix tcp_twstart() to handle this case by
assuming the last window was zero rather than a negative value.

The code in tcp_input() already safely handled this case.  It can happen
due to delayed ACKs along with a remote sender that sends data beyond
the window we previously advertised.  If we have room in our socket buffer
for the extra data beyond the advertised window, we will accept it.
However, if the ACK for that segment is delayed, then we will not
effectively fixup rcv_adv to account for that extra data until the
next segment arrives and forces out an ACK.  When that next segment
arrives, rcv_nxt will be beyond rcv_adv.

Tested by:	pjd
MFC after:	1 week
2012-01-05 22:29:11 +00:00
Ed Schouten
6472ac3d8a Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
Sergey Kandaurov
ddd0c4a969 Restore sysctl names for tcp_sendspace/tcp_recvspace.
They seem to be changed unintentionally in r226437, and there were no
any mentions of renaming in commit log message.

Reported by:	Anton Yuzhaninov <citrin citrin ru>
2011-11-02 20:58:47 +00:00
Bjoern A. Zeeb
fba0cea143 Add syntactic sugar missed in r226437 and then not added either when moving
things around in r226448 but desperately needed to always make things
compile successfully.

MFC after:	1 week
2011-10-17 00:05:31 +00:00
Andre Oppermann
873789cb0f Move the tcp_sendspace and tcp_recvspace sysctl's from
the middle of tcp_usrreq.c to the top of tcp_output.c
and tcp_input.c respectively next to the socket buffer
autosizing controls.

MFC after:	1 week
2011-10-16 20:18:39 +00:00
Andre Oppermann
9ec4a4cca5 Remove the ss_fltsz and ss_fltsz_local sysctl's which have
long been superseded by the RFC3390 initial CWND sizing.

Also remove the remnants of TCP_METRICS_CWND which used the
TCP hostcache to set the initial CWND in a non-RFC compliant
way.

MFC after:	1 week
2011-10-16 20:06:44 +00:00
Andre Oppermann
e233e2acb3 VNET virtualize tcp_sendspace/tcp_recvspace and change the
type to INT.  A long is not necessary as the TCP window is
limited to 2**30.  A larger initial window isn't useful.

MFC after:	1 week
2011-10-16 15:08:43 +00:00
Attilio Rao
4af309c810 For the INP_TIMEWAIT case, there is no valid tcpcb object tied to the
inpcb object.
Skip the TCP_SIGNATURE check in that case as it is consistent with the
output path (no TCP_SIGNATURE for outcoming packets in TIMEWAIT state)
and also because for TIMEWAIT state the verify may be less effective.

Sponsored by:		Sandvine Incorporated
Reported by:		rwatson
No objections by:	rwatson
MFC after:		3 days
2011-10-06 14:29:38 +00:00
Bjoern A. Zeeb
b233773bb9 Increase the defaults for the maximum socket buffer limit,
and the maximum TCP send and receive buffer limits from 256kB
to 2MB.

For sb_max_adj we need to add the cast as already used in the sysctl
handler to not overflow the type doing the maths.

Note that this is just the defaults.  They will allow more memory
to be consumed per socket/connection if needed but not change the
default "idle" memory consumption.   All values are still tunable
by sysctls.

Suggested by:	gnn
Discussed on:	arch (Mar and Aug 2011)
MFC after:	3 weeks
Approved by:	re (kib)
2011-08-25 09:20:13 +00:00
Bjoern A. Zeeb
6f69742441 Fix compilation in case of defined(INET) && defined(IPFIREWALL_FORWARD)
but no INET6.

Reported by:	avg
Tested by:	avg
MFC after:	4 weeks
X-MFC with:	r225044
Approved by:	re (kib)
2011-08-20 18:45:38 +00:00
Bjoern A. Zeeb
8a006adb24 Add support for IPv6 to ipfw fwd:
Distinguish IPv4 and IPv6 addresses and optional port numbers in
user space to set the option for the correct protocol family.
Add support in the kernel for carrying the new IPv6 destination
address and port.
Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change
the address in the IP header.
Add support for IPv6 forwarding to a non-local destination.
Add a regession test uitilizing VIMAGE to check all 20 possible
combinations I could think of.

Obtained from:	David Dolson at Sandvine Incorporated
		(original version for ipfw fwd IPv6 support)
Sponsored by:	Sandvine Incorporated
PR:		bin/117214
MFC after:	4 weeks
Approved by:	re (kib)
2011-08-20 17:05:11 +00:00
Robert Watson
d3c1f00350 Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc.  For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).

Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.

(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI.  However, it won't be useful without
other previous possibly less MFCable changes.)

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
Robert Watson
fa046d8774 Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
  inpcb counter.  This lock is now relegated to a small number of
  allocation and free operations, and occasional operations that walk
  all connections (including, awkwardly, certain UDP multicast receive
  operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
  looking up connections and bound sockets, manipulated using new
  INP_HASH_*() macros.  This lock, combined with inpcb locks, protects
  the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required.  As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb.  Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed.  In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup.  New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:

  INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
  INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
  TCP, UDPv4, and UDPv6.  pcbinfo ipi_lock acquisitions are largely
  eliminated, and global hash lock hold times are dramatically reduced
  compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
  may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
  is no longer available -- hash lookup locks are now held only very
  briefly during inpcb lookup, rather than for potentially extended
  periods.  However, the pcbinfo ipi_lock will still be acquired if a
  connection state might change such that a connection is added or
  removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
  due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
  callers to acquire hash locks and perform one or more lookups atomically
  with 4-tuple allocation: this is required only for TCPv6, as there is no
  in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
  locking, which relates to source address selection.  This needs
  attention, as it likely significantly reduces parallelism in this code
  for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
  somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
  is no longer sufficient.  A second check once the inpcb lock is held
  should do the trick, keeping the general case from requiring the inpcb
  lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
  which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
  undesirable, and probably another argument is required to take care of
  this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics.  It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
John Baldwin
5891ebd6cd Oops, fix order of sequence numbers in KASSERT()'s to catch negative
receive windows to match the labels in the panic message.

Submitted by:	trociny
2011-05-14 14:41:40 +00:00
John Baldwin
f701e30d7f Handle a rare edge case with nearly full TCP receive buffers. If a TCP
buffer fills up causing the remote sender to enter into persist mode, but
there is still room available in the receive buffer when a window probe
arrives (either due to window scaling, or due to the local application
very slowing draining data from the receive buffer), then the single byte
of data in the window probe is accepted.  However, this can cause rcv_nxt
to be greater than rcv_adv.  This condition will only last until the next
ACK packet is pushed out via tcp_output(), and since the previous ACK
advertised a zero window, the ACK should be pushed out while the TCP
pcb is write-locked.

During the window while rcv_nxt is greather than rcv_adv, a few places
would compute the remaining receive window via rcv_adv - rcv_nxt.
However, this value was then (uint32_t)-1.  On a 64 bit machine this
could expand to a positive 2^32 - 1 when cast to a long.  In particular,
when calculating the receive window in tcp_output(), the result would be
that the receive window was computed as 2^32 - 1 resulting in advertising
a far larger window to the remote peer than actually existed.

Fix various places that compute the remaining receive window to either
assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
window as full if rcv_nxt is greather than rcv_adv.

Reviewed by:	bz
MFC after:	1 month
2011-05-02 21:05:52 +00:00
Bjoern A. Zeeb
29bd2010d4 Fix a mismerge from p4 in that in_localaddr() is not available without INET.
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-30 16:30:18 +00:00
Bjoern A. Zeeb
b287c6c70c Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness.  To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-30 11:21:29 +00:00
John Baldwin
672dc4aea2 TCP reuses t_rxtshift to determine the backoff timer used for both the
persist state and the retransmit timer.  However, the code that implements
"bad retransmit recovery" only checks t_rxtshift to see if an ACK has been
received in during the first retransmit timeout window.  As a result, if
ticks has wrapped over to a negative value and a socket is in the persist
state, it can incorrectly treat an ACK from the remote peer as a
"bad retransmit recovery" and restore saved values such as snd_ssthresh and
snd_cwnd.  However, if the socket has never had a retransmit timeout, then
these saved values will be zero, so snd_ssthresh and snd_cwnd will be set
to 0.

If the socket is in fast recovery (this can be caused by excessive
duplicate ACKs such as those fixed by 220794), then each ACK that arrives
triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd
to be no larger than snd_ssthresh.  In effect, the socket's send window
is permamently stuck at 0 even though the remote peer is advertising a
much larger window and pending data is only sent via TCP window probes
(so one byte every few seconds).

Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that
the various snd_*_prev fields in the pcb are valid and only perform
"bad retransmit recovery" if this flag is set in the pcb.  The flag is set
on the first retransmit timeout that occurs and is cleared on subsequent
retransmit timeouts or when entering the persist state.

Reviewed by:	bz
MFC after:	2 weeks
2011-04-29 15:40:12 +00:00
Attilio Rao
2903309aca Add the possibility to verify MD5 hash of incoming TCP packets.
As long as this is a costy function, even when compiled in (along with
the option TCP_SIGNATURE), it can be disabled via the
net.inet.tcp.signature_verify_input sysctl.

Sponsored by:	Sandvine Incorporated
Reviewed by:	emaste, bz
MFC after:	2 weeks
2011-04-25 17:13:40 +00:00
Lawrence Stewart
891b8ed467 Use the full and proper company name for Swinburne University of Technology
throughout the source tree.

Requested by:	Grenville Armitage, Director of CAIA at Swinburne University of
			Technology
MFC after:	3 days
2011-04-12 08:13:18 +00:00
John Baldwin
766282cbe7 Clamp the initial advertised receive window when responding to a SYN/ACK
to the maximum allowed window.  Growing the window too large would cause
an underflow in the calculations in tcp_output() to decide if a window
update should be sent which would prevent the persist timer from being
started if data was pending and the other end of the connection advertised
an initial window size of 0.

PR:		kern/154006
Submitted by:	Stefan `Sec` Zehl  sec 42 org
Reviewed by:	bz
MFC after:	1 week
2011-03-30 12:35:39 +00:00
Lawrence Stewart
d64a46ea1a Reset the last_sack_ack SACK hint for TCP input processing to ensure that the
hint is 0 when no SACK data is received to update the hint with. This was
accidentally omitted from r216753.

Sponsored by:	FreeBSD Foundation
MFC after:	10 weeks
X-MFC with:	216753
2011-01-10 06:12:01 +00:00
John Baldwin
79e955ed63 Trim extra spaces before tabs. 2011-01-07 21:40:34 +00:00
Lawrence Stewart
39bc9de532 - Add some helper hook points to the TCP stack. The hooks allow Khelp modules to
access inbound/outbound events and associated data for established TCP
  connections. The hooks only run if at least one hook function is registered
  for the hook point, ensuring the impact on the stack is effectively nil when
  no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual
  data to any registered Khelp module hook functions.

- Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp
  modules to associate per-connection data with the TCP control block.

- Bump __FreeBSD_version and add a note to UPDATING regarding to ABI changes
  introduced by this commit and r216753.

In collaboration with:	David Hayes <dahayes at swin edu au> and
				Grenville Armitage <garmitage at swin edu au>
Sponsored by:	FreeBSD Foundation
Reviewed by:	bz, others along the way
MFC after:	3 months
2010-12-28 12:13:30 +00:00
Lawrence Stewart
6157935fa5 Set ssthresh appropriately on RTO. This change was accidentally not ported from
the pre modular CC stack.

Sponsored by:	FreeBSD Foundation
Submitted by:	David Hayes <dahayes at swin edu au>
MFC after:	9 weeks
X-MFC with:	r215166
2010-12-02 01:01:37 +00:00
Lawrence Stewart
dbc4240942 This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
  algorithms to be used in the net stack. Algorithms can maintain per-connection
  state if required, and connections maintain their own algorithm pointer, which
  allows different connections to concurrently use different algorithms. The
  TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
  programmatically query or change the congestion control algorithm respectively
  from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
  possible. Care was also taken to develop the framework in a way that should
  allow integration with other congestion aware transport protocols (e.g. SCTP)
  in the future. The hope is that we will one day be able to share a single set
  of congestion control algorithm modules between all congestion aware transport
  protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
  and use it to decouple the meaning of recovery from a congestion event and
  recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
  congestion control protocols don't generally need to recover from packet loss
  and need a different way to note a congestion recovery episode within the
  stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
  and ensures the stack always uses the appropriate mechanisms for recovering
  from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
  massage it into module form. NewReno is always built into the kernel and will
  remain the default algorithm for the forseeable future. Implementations of
  additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
  that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with:	David Hayes <dahayes at swin edu au> and
			Grenville Armitage <garmitage at swin edu au>
Sponsored by:	Cisco URP, FreeBSD Foundation
Reviewed by:	rpaulo
Tested by:	David Hayes (and many others over the years)
MFC after:	3 months
2010-11-12 06:41:55 +00:00
Andre Oppermann
1c18314d17 Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework.  It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI.  It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI.  It is always set to 0.

These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned.  The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.
2010-09-16 21:06:45 +00:00
Andre Oppermann
8502ec25dc Use timestamp modulo comparison macro for automatic receive buffer
scaling to correctly handle wrapping of ticks value.

MFC after:	1 week
2010-08-27 12:34:53 +00:00
Andre Oppermann
b7d747ecec Untangle the net.inet.tcp.log_in_vain and net.inet.tcp.log_debug
sysctl's and remove any side effects.

Both sysctl's share the same backend infrastructure and due to the
way it was implemented enabling net.inet.tcp.log_in_vain would also
cause log_debug output to be generated.  This was surprising and
eventually annoying to the user.

The log output backend is kept the same but a little shim is inserted
to properly separate log_in_vain and log_debug and to remove any side
effects.

PR:		kern/137317
MFC after:	1 week
2010-08-18 17:39:47 +00:00
Bjoern A. Zeeb
82cea7e6f3 MFP4: @176978-176982, 176984, 176990-176994, 177441
"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by:	jhb
Discussed with:	rwatson
Sponsored by:	The FreeBSD Foundation
Sponsored by:	CK Software GmbH
MFC after:	6 days
2010-04-29 11:52:42 +00:00
Rui Paulo
9c251892c0 Honor the CE bit even when the CWR bit is set.
PR:		145600
Submitted by:	Richard Scheffenegger <rs at netapp.com>
MFC after:	1 week
2010-04-10 12:47:06 +00:00
Robert Watson
66f80e90ef Wrap use of rw_try_upgrade() on pcbinfo with macro INP_INFO_TRY_UPGRADE()
to match other pcbinfo locking macros.

MFC after:	1 week
2010-03-06 21:24:11 +00:00
Robert Watson
f681a5fdd4 Remove tcp_input lock statistics; these are intended for debugging only
and are not intended to ship in 8.0 as they dirty additional cache
lines in a performance-critical per-packet path.

MFC after:	3 days
2009-10-06 20:35:41 +00:00
Robert Watson
883e9bc41d In tcp_input(), we acquire a global write lock at first only if a
segment is likely to trigger a TCP state change (i.e., FIN/RST/SYN).
If we later have to upgrade the lock, we acquire an inpcb reference
and drop both global/inpcb locks before reacquiring in-order.  In
that gap, the connection may transition into TIMEWAIT, so we need
to loop back and reevaluate the inpcb after relocking.

MFC after:	3 days
Reported by:	Kamigishi Rei <spambox at haruhiism.net>
Reviewed by:	bz
2009-10-05 22:24:13 +00:00
Robert Watson
315e3e38fa Many network stack subsystems use a single global data structure to hold
all pertinent statatistics for the subsystem.  These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored.  This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

      if_bridge
      if_cxgb
      if_gif
      ip_mroute
      ipdivert
      pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack.  However, that change is too agressive for
this point in the release cycle.

Reviewed by:	bz
Approved by:	re (kib)
2009-08-02 19:43:32 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Julian Elischer
7973fba3a4 Somewhere along the line accept sockets stopped honoring the
FIB selected for them. Fix this.

Reviewed by:	ambrisko
Approved by:	re (kib)
MFC after:	3 days
2009-07-28 19:43:27 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Robert Watson
8c0fec805f Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references.  The following routines now return references:

  ifaddr_byindex
  ifa_ifwithaddr
  ifa_ifwithbroadaddr
  ifa_ifwithdstaddr
  ifa_ifwithnet
  ifaof_ifpforaddr
  ifa_ifwithroute
  ifa_ifwithroute_fib
  rt_getifa
  rt_getifa_fib
  IFP_TO_IA
  ip_rtaddr
  in6_ifawithifp
  in6ifa_ifpforlinklocal
  in6ifa_ifpwithaddr
  in6_ifadd
  carp_iamatch6
  ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

  IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc).  In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit.  Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by:	bz
Obtained from:	Apple, Inc. (portions)
MFC after:	6 weeks (portions)
2009-06-23 20:19:09 +00:00
John Baldwin
6dfb8b316c Fix edge cases with ticks wrapping from INT_MAX to INT_MIN in the handling
of the per-tcpcb t_badtrxtwin.

Submitted by:	bde
2009-06-16 19:00:12 +00:00
John Baldwin
1a0e7cfc42 Trim extra ()'s.
Submitted by:	bde
2009-06-11 14:36:13 +00:00
John Baldwin
0e8cc7e748 Change a few members of tcpcb that store cached copies of ticks to be ints
instead of unsigned longs.  This fixes a few overflow edge cases on 64-bit
platforms.  Specifically, if an idle connection receives a packet shortly
before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep
alive timer fires after 2^31 clock ticks, the keep alive timer will think
that the connection has been idle for a very long time and will immediately
drop the connection instead of sending a keep alive probe.

Reviewed by:	silby, gnn, lstewart
MFC after:	1 week
2009-06-10 18:27:15 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Robert Watson
f93bfb23dc Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from:	TrustedBSD Project
2009-06-02 18:26:17 +00:00
Zachary Loafman
81ad7eb017 Correct handling of SYN packets that are to the left of the current window of an ESTABLISHED connection.
Reviewed by:        net@, gnn
Approved by:        dfr (mentor)
2009-05-27 17:02:10 +00:00
Robert Watson
78b5071407 Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel.  This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after:	3 days
2009-04-11 22:07:19 +00:00
Kip Macy
80cb9f211a Import "flowid" support for serializing flows across transmit queues
Reviewed by:	rwatson and jeli
2009-04-10 06:16:14 +00:00
Robert Watson
ad71fe3c35 Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted.  Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

  INP_TIMEWAIT
  INP_SOCKREF
  INP_ONESBCAST
  INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after:      1 week (or after dependencies are MFC'd)
Reviewed by:    bz
2009-03-15 09:58:31 +00:00
Lawrence Stewart
24cb0f2232 Add TCP Appropriate Byte Counting (RFC 3465) support to kernel.
The new behaviour is on by default, and can be disabled by setting the
net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour.

The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks
the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools
that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled.

Reviewed by:	rpaulo, gnn
Approved by:	gnn, kmacy (mentors)
Sponsored by:	FreeBSD Foundation
2009-01-15 06:44:22 +00:00
Bjoern A. Zeeb
dcdb4371ca Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by:	rwatson during review [1]
Discussed with:	rwatson
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-17 12:52:34 +00:00
Robert Watson
d15fb96522 Enhance one comment relating to recent TCP locking changes, and fix a
typo in another.

MFC after:	6 weeks
2008-12-09 15:49:02 +00:00
Robert Watson
252ca42863 Move from solely write-locking the global tcbinfo in tcp_input()
to read-locking in the TCP input path, allowing greater TCP
input parallelism where multiple ithreads or ithread and netisr
are able to run in parallel.  Previously, most TCP input paths
held a write lock on the global tcbinfo lock, effectively
serializing TCP input.

Before looking up the connection, acquire a write lock if a
potentially state-changing flag is set on the TCP segment header
(FIN, RST, SYN), and otherwise a read lock.  We may later have
to upgrade to a write lock in certain cases (ACKs received by the
syncache or during TIMEWAIT) in order to support global state
transitions, but this is never required for steady-state packets.

Upgrading from a write lock to a read lock must be done as a
trylock operation to avoid deadlocks, and actually violates the
lock order as the tcbinfo lock preceeds the inpcb lock held at
the time of upgrade.  If the trylock fails, we bump the refcount
on the inpcb, drop both locks, and re-acquire in-order.  If
another thread has freed the connection while the locks are
dropped, we free the inpcb and repeat the lookup (this should
hardly ever or never happen in practice).

For now, maintain a number of new counters measuring how many
times various cases execute, and in particular whether various
optimistic assumptions about when read locks can be used, whether
upgrades are done using the fast path, and whether connections
close in practice in the above-described race, actually occur.

MFC after:	6 weeks
Discussed with:	kmacy
Reviewed by:	bz, gnn, kmacy
Tested by:	kmacy
2008-12-08 20:27:00 +00:00
Bjoern A. Zeeb
4b79449e2f Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
Marko Zec
97021c2464 Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by:  bz, julian
Approved by:  julian (mentor)
Obtained from:        //depot/projects/vimage-commit2/...
X-MFC after:  never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-11-26 22:32:07 +00:00
Marko Zec
44e33a0758 Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions.  As a rule,
initialization at instatiation for such variables should never be
introduced again from now on.  Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact.  In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-19 09:39:34 +00:00
Bjoern A. Zeeb
8e5c87f4b6 Fix typo and while here another one.
Reviewed by:	keramida
Reported by:	keramida
MFC after:	2 months (with r184720)
2008-11-06 16:30:20 +00:00
Bjoern A. Zeeb
91d6cfa6b1 Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

Move the TSO logic back to tcp_mss() and out of tcp_mss_update().
We tried to avoid that initially but if were are called from
tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb
there, called into tcp_mtudisc() and tcp_mss_update() which
then would reenable TSO on the tcpcb based on TSO capabilities
of the interface as learnt in tcp_maxmtu/6().
So if TSO was enabled on the (possibly new) outgoing interface
it was turned back on, which lead to an endless loop between
tcp_output() and tcp_mtudisc() until we overflew the stack.

Reported by:	kmacy
MFC after:	2 months (along with r182851)
2008-11-06 13:25:59 +00:00
Bjoern A. Zeeb
6f01cac68a Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

In case we return early and got a metricptr to pass the hostcache
info back to the caller we need to initialize the data to a defined
state (zero it) as tcp_hc_get() would do if there was no hit.
Without that the caller would check on random stack garbage which
could lead to undefined results.

This only affected tcp_mss() if there was no routing entry for the peer,
tcp_mtudisc() was not affected.

MFC after:	2 months (along with r182851)
2008-11-06 12:33:33 +00:00
Robert Watson
dd8ac7f990 In both dropwithreset paths in tcp_input.c, drop the tcbinfo lock
sooner to decomplicate locking and eliminate the need for a rather
chatty comment about why we have to handle the global lock in a
special way for the benefit of ipfw and pf cred rules.

MFC after:	3 days
2008-10-26 22:03:52 +00:00
Robert Watson
4c95fd23d6 Remove endearing but syntactically unnecessary "return;" statements
directly before the final closeing brackets of some TCP functions.

MFC after:	3 days
2008-10-26 19:33:22 +00:00
Robert Watson
6c8286e42d Don't pass curthread to sbreserve_locked() in tcp_do_segment(), as the
netisr or ithread's socket buffer size limit is not the right limit to
use.  Instead, pass NULL as the other two calls to sbreserve_locked()
in the TCP input path (tcp_mss()) do.

In practice, this is a no-op, as ithreads and the netisr run without a
process limit on socket buffer use, and a NULL thread pointer leads to
not using the process's limit, if any.  However, if tcp_input() is
called in other contexts that do have limits, this may prevent the
incorrect limit from being used.

MFC after:	3 days
2008-10-07 09:41:07 +00:00
Marko Zec
8b615593fc Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
Robert Watson
014ea782b1 As a follow-on to r183323, correct another case where ip_output() was
called without an inpcb pointer despite holding the tcbinfo global
lock, which lead to a deadlock or panic when ipfw tried to further
acquire it recursively.

Reported by:    Stefan Ehmann <shoesoft at gmx dot net>
MFC after:      3 days
2008-09-25 17:26:54 +00:00
Robert Watson
a0ca087183 When dropping a packet and issuing a reset during TCP segment handling,
unconditionally drop the tcbinfo lock (after all, we assert it lines
before), but call tcp_dropwithreset() under both inpcb and inpcbinfo
locks only if we pass in an tcpcb.  Otherwise, if the pointer is NULL,
firewall code may later recurse the global tcbinfo lock trying to look
up an inpcb.

This is an instance where a layering violation leads not only
potentially to code reentrace and recursion, but also to lock
recursion, and was revealed by the conversion to rwlocks because
acquiring a read lock on an rwlock already held with a write lock is
forbidden.  When these locks were mutexes, they simply recursed.

Reported by:	Stefan Ehmann <shoesoft at gmx dot net>
MFC after:	3 days
2008-09-24 11:07:03 +00:00
Bjoern A. Zeeb
c10eb6d10a Work around an integer division resulting in 0 and thus the
congestion window not being incremented, if cwnd > maxseg^2.
As suggested in RFC2581 increment the cwnd by 1 in this case.

See http://caia.swin.edu.au/reports/080829A/CAIA-TR-080829A.pdf
for more details.

Submitted by:	Alana Huebner, Lawrence Stewart,
		Grenville Armitage (caia.swin.edu.au)
Reviewed by:	dwmalone, gnn, rpaulo
MFC After:	3 days
2008-09-09 07:35:21 +00:00
Bjoern A. Zeeb
3cee92e074 Split tcp_mss() in tcp_mss() and tcp_mss_update() where the former
calls the latter.

Merge tcp_mss_update() with code from tcp_mtudisc() basically
doing the same thing.

This gives us one central place where we calcuate and check mss values
to update t_maxopd (maximum mss + options length) instead of two slightly
different but almost equal implementations to maintain.

PR:		kern/118455
Reviewed by:	silby (back in March)
MFC after:	2 months
2008-09-07 18:50:25 +00:00
Julian Elischer
ac957cd271 A bunch of formatting fixes brough to light by, or created by the Vimage commit
a few days ago.
2008-08-20 01:05:56 +00:00
Bjoern A. Zeeb
603724d3ab Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
Rui Paulo
f2512ba12a MFp4 (//depot/projects/tcpecn/):
TCP ECN support. Merge of my GSoC 2006 work for NetBSD.
  TCP ECN is defined in RFC 3168.

Partly reviewed by:	dwmalone, silby
Obtained from:		NetBSD
2008-07-31 15:10:09 +00:00
Julian Elischer
8b07e49a00 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different
  packet streams to be routed by more than just the destination address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as a EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.
  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extent that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X] when for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility call setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being reponded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect withthe proces that set up the tunnel.
     thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routes
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from anay to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing the data.
  instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibilty of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the desination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
Robert Watson
8501a69cc9 Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after:	3 months
Tested by:	kris (superset of committered patch)
2008-04-17 21:38:18 +00:00
Robert Watson
7a3244ccb7 Add further TCP inpcb locking assertions to some TCP input code paths.
MFC after:	1 month
2008-04-07 12:41:45 +00:00
Bjoern A. Zeeb
c3b02504bc Some "cleanup" of tcp_mss():
- Move the assigment of the socket down before we first need it.
  No need to do it at the beginning and then drop out the function
  by one of the returns before using it 100 lines further down.
- Use t_maxopd which was assigned the "tcp_mssdflt" for the corrrect
  AF already instead of another #ifdef ? : #endif block doing the same.
- Remove an unneeded (duplicate) assignment of mss to t_maxseg just before
  we possibly change mss and re-do the assignment without using t_maxseg
  in between.

Reviewed by:	silby
No objections:	net@ (silence)
MFC after:	5 days
2008-03-02 08:40:47 +00:00
Bjoern A. Zeeb
af92e6cf95 Fix indentation (whitespace changes only).
MFC after:	6 days
2008-03-01 22:27:15 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Mike Silbersack
4b421e2daa Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by:	re (kensmith)
2007-10-07 20:44:24 +00:00
Mike Silbersack
e31d8aa3da Improve the debugging message:
TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received data after socket was closed, sending RST and removing tcpcb

So that it also includes how many bytes of data were received.  It now looks
like this:

TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received X bytes of data after socket was closed, sending RST and removing tcpcb

Approved by:	re (gnn)
2007-10-07 00:07:27 +00:00
Ken Smith
a258946554 Make sure that either inp is NULL or we have obtained a lock on it before
jumping to dropunlock to avoid a panic.  While here move the calls to
ipsec4_in_reject() and ipsec6_in_reject() so they are after we obtain
the lock on inp.

Original patch to avoid panic:	pjd
Review of locking adjustments:	gnn, sam
Approved by:			re (rwatson)
2007-09-10 14:49:32 +00:00
Dag-Erling Smørgrav
218cbbea9a Make tcpstates[] static, and make sure TCPSTATES is defined before
<netinet/tcp_fsm.h> is included into any compilation unit that needs
tcpstates[].  Also remove incorrect extern declarations and TCPDEBUG
conditionals.  This allows kernels both with and without TCPDEBUG to
build, and unbreaks the tinderbox.

Approved by:	re (rwatson)
2007-07-30 11:06:42 +00:00
Matt Jacob
24face5416 Fix compilation problems- tcpstates is only available if TCPDEBUG
is set.

Approved by:	re (in spirit)
2007-07-29 01:31:33 +00:00
Andre Oppermann
773673c133 Provide a sysctl to toggle reporting of TCP debug logging:
sys.net.inet.tcp.log_debug = 1

It defaults to enabled for the moment and is to be turned off for
the next release like other diagnostics from development branches.

It is important to note that sysctl sys.net.inet.tcp.log_in_vain
uses the same logging function as log_debug.  Enabling of the former
also causes the latter to engage, but not vice versa.

Use consistent terminology in tcp log messages:

 "ignored" means a segment contains invalid flags/information and
   is dropped without changing state or issuing a reply.

 "rejected" means a segments contains invalid flags/information but
   is causing a reply (usually RST) and may cause a state change.

Approved by:	re (rwatson)
2007-07-28 12:20:39 +00:00
Andre Oppermann
19bc77c549 o Move all detailed checks for RST in LISTEN state from tcp_input() to
syncache_rst().
o Fix tests for flag combinations of RST and SYN, ACK, FIN.  Before
  a RST for a connection in syncache did not properly free the entry.
o Add more detailed logging.

Approved by:	re (rwatson)
2007-07-28 11:51:44 +00:00
Mike Silbersack
c325962b47 Export the contents of the syncache to netstat.
Approved by: re (kensmith)
MFC after: 2 weeks
2007-07-27 00:57:06 +00:00
Andre Oppermann
564aab1fe6 Fix comments in tcp_do_segment().
Approved by:	re (kensmith)
2007-07-25 18:48:24 +00:00
Peter Wemm
9fb5d4c064 Fix cast-qualifiers warning when INET6 is not present
Approved by:  re (rwatson)
2007-07-05 05:55:57 +00:00
George V. Neville-Neil
b2630c2934 Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing
2007-07-03 12:13:45 +00:00
George V. Neville-Neil
2cb64cb272 Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by:    bz
Approved by:    re
Supported by:   Secure Computing
2007-07-01 11:41:27 +00:00
Andre Oppermann
f194524fb1 Fix a case in tcp_do_segment() where tcp_update_sack_list() would
be called with an incorrect segment end value.  tcp_reass() may
trim segments when they overlap with already existing ones in the
reassembly queue.  Instead of saving the segment end value before
the call to tcp_reass() compute it on the fly based on the effective
segment length afterwards.

This bug was not really problematic as no information got lost and
the eventual SACK information computation was correct nontheless.

MFC after:	1 week
2007-06-10 21:07:21 +00:00
Andre Oppermann
e8949f7407 Fix style for comments, be more verbose and add some more. 2007-06-10 20:59:22 +00:00
Andre Oppermann
5396d0f8d8 Remove some bogosity from the SYN_SENT case in tcp_do_segment
and simplify handling of the send/receive window scaling.  No
change in effective behavour.

RFC1323 requires the window field in a SYN (i.e., a <SYN> or
<SYN,ACK>) segment itself never be scaled.

Noticed by:	yar
2007-06-09 21:09:49 +00:00
Andre Oppermann
8d573cc158 Make log messages more verbose and simpler to understand for non-experts.
Update comments to be more conscious, verbose and fully reflect reality.
2007-05-28 23:27:44 +00:00
Andre Oppermann
e885b205c6 Fix indentation of the syncache_expand() section in tcp_input(). 2007-05-28 11:35:40 +00:00
Andre Oppermann
a160e6302c Refactor and rewrite in parts the SYN handling code on listen sockets
in tcp_input():

 o tighten the checks on allowed TCP flags to be RFC793 and
   tcp-secure conform
 o log check failures to syslog at LOG_DEBUG level
 o rearrange the code flow to be easier to follow
 o add KASSERTs to validate assumptions of the code flow

Add sysctl net.inet.tcp.syncache.rst_on_sock_fail defaulting to enable
that controls the behavior on socket creation failure for a otherwise
successful 3-way handshake.  The socket creation can fail due to global
memory shortage, listen queue limits and file descriptor limits.  The
sysctl allows to chose between two options to deal with this.  One is
to send a reset to the other endpoint to notify it about the failure
(default).  The other one is to ignore and treat the failure as a
transient error and have the other endpoint retransmit for another try.

Reviewed by:	rwatson (in general)
2007-05-28 11:03:53 +00:00
Andre Oppermann
df541e5fc1 Add tcp_log_addrs() function to generate and standardized TCP log line
for use thoughout the tcp subsystem.

It is IPv4 and IPv6 aware creates a line in the following format:

 "TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags <RST>"

A "\n" is not included at the end.  The caller is supposed to add
further information after the standard tcp log header.

The function returns a NUL terminated string which the caller has
to free(s, M_TCPLOG) after use.  All memory allocation is done
with M_NOWAIT and the return value may be NULL in memory shortage
situations.

Either struct in_conninfo || (struct tcphdr && (struct ip || struct
ip6_hdr) have to be supplied.

Due to ip[6].h header inclusion limitations and ordering issues the
struct ip and struct ip6_hdr parameters have to be casted and passed
as void * pointers.

tcp_log_addrs(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr,
    void *ip6hdr)

Usage example:

 struct ip *ip;
 char *tcplog;

 if (tcplog = tcp_log_addrs(NULL, th, (void *)ip, NULL)) {
	log(LOG_DEBUG, "%s; %s: Connection attempt to closed port\n",
	    tcplog, __func__);
	free(s, M_TCPLOG);
 }
2007-05-18 19:58:37 +00:00
Andre Oppermann
2104448fe7 Move TIME_WAIT related functions and timer handling from files
other than repo copied tcp_subr.c into tcp_timewait.c#1.284:

 tcp_input.c#1.350 tcp_timewait() -> tcp_twcheck()

 tcp_timer.c#1.92 tcp_timer_2msl_reset() -> tcp_tw_2msl_reset()
 tcp_timer.c#1.92 tcp_timer_2msl_stop() -> tcp_tw_2msl_stop()
 tcp_timer.c#1.92 tcp_timer_2msl_tw() -> tcp_tw_2msl_scan()

This is a mechanical move with appropriate renames and making
them static if used only locally.

The tcp_tw_2msl_scan() cleanup function is still run from the
tcp_slowtimo() in tcp_timer.c.
2007-05-16 17:14:25 +00:00
Andre Oppermann
ec9c755352 Complete the (mechanical) move of the TCP reassembly and timewait
functions from their origininal place to their own files.

TCP Reassembly from tcp_input.c -> tcp_reass.c
TCP Timewait   from tcp_subr.c  -> tcp_timewait.c
2007-05-13 22:16:13 +00:00
Robert Watson
f2565d68a4 Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.
2007-05-10 15:58:48 +00:00
Maxim Konovalov
d30d90dc80 o Fix style(9) bugs introduced in the last commit.
Pointed out by:	bde
2007-05-09 11:39:46 +00:00
Maxim Konovalov
10fe523e99 o Unbreak "options TCPDEBUG" && "nooptions INET6" kernel build.
PR:		kern/112517
Submitted by:	vd
2007-05-09 06:09:40 +00:00
Andre Oppermann
3529149e9a Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a decdicated sack_enable int for this bool.  Change all users accordingly.
2007-05-06 15:56:31 +00:00
Andre Oppermann
0ca3f933eb o Remove redundant tcp reassembly check in header prediction code
o Rearrange code to make intent in TCPS_SYN_SENT case more clear
 o Assorted style cleanup
 o Comment clarification for tcp_dropwithreset()
2007-05-06 15:41:06 +00:00
Andre Oppermann
c5ad39b910 Reorder the TCP header prediction test to check for the most volatile
values first to spend less time on a fallback to normal processing.
2007-05-06 15:23:51 +00:00
Andre Oppermann
679d9708b6 Remove the defunct remains of the TCPS_TIME_WAIT cases from tcp_do_segment
and change it to a void function.

We use a compressed structure for TCPS_TIME_WAIT to save memory.  Any late
late segments arriving for such a connection is handled directly in the TW
code.
2007-05-06 15:16:05 +00:00
Robert Watson
1cd6eadfbb Tweak comment at end of tcp_input() when calling into tcp_do_segment(): the
pcbinfo lock will be released as well, not just the pcb lock.
2007-05-04 17:45:52 +00:00
Andre Oppermann
9fa198bead o Fix INP lock leak in the minttl case
o Remove indirection in the decision of unlocking inp
o Further annotation of locking in tcp_input()
2007-04-23 19:41:47 +00:00
Andre Oppermann
df47e4377b o Remove unncessary TOF_SIGLEN flag from struct tcpopt
o Correctly set to->to_signature in tcp_dooptions()
o Update comments
2007-04-20 15:28:01 +00:00
Andre Oppermann
7824d002c0 Add more KASSERT's. 2007-04-20 15:21:29 +00:00
Andre Oppermann
4d6e713043 Remove bogus check for accept queue length and associated failure handling
from the incoming SYN handling section of tcp_input().

Enforcement of the accept queue limits is done by sonewconn() after the
3WHS is completed.  It is not necessary to have an earlier check before a
connection request enters the SYN cache awaiting the full handshake.  It
rather limits the effectiveness of the syncache by preventing legit and
illegit connections from entering it and having them shaken out before we
hit the real limit which may have vanished by then.

Change return value of syncache_add() to void.  No status communication
is required.
2007-04-20 14:34:54 +00:00
Andre Oppermann
e207f80039 Simplifly syncache_expand() and clarify its semantics. Zero is returned
when the ACK is invalid and doesn't belong to any registered connection,
either in syncache or through SYN cookies.  True but a NULL struct socket
is returned when the 3WHS completed but the socket could not be created
due to insufficient resources or limits reached.

For both cases an RST is sent back in tcp_input().

A logic error leading to a panic is fixed where syncache_expand() would
free the mbuf on socket allocation failure but tcp_input() later supplies
it to tcp_dropwithreset() to issue a RST to the peer.

Reported by:	kris (the panic)
2007-04-20 13:51:34 +00:00
Robert Watson
215c8d75b8 Remove unused variable tcbinfo_mtx. 2007-04-15 21:03:23 +00:00
Andre Oppermann
b8152ba793 Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by:	rwatson (earlier version)
2007-04-11 09:45:16 +00:00
Andre Oppermann
995a77176f Add INP_INFO_UNLOCK_ASSERT() and use it in tcp_input(). Also add some
further INP_INFO_WLOCK_ASSERT() while there.
2007-04-04 18:30:16 +00:00
Andre Oppermann
0c38fd0a7a Move last tcpcb initialization for the inbound connection case from
tcp_input() to syncache_socket() where it belongs and the majority
of it already happens.

The "tp->snd_up = tp->snd_una" is removed as it is done with the
tcp_sendseqinit() macro a few lines earlier.
2007-04-04 16:13:45 +00:00
Andre Oppermann
5dd9dfefd6 Retire unused TCP_SACK_DEBUG. 2007-04-04 14:44:15 +00:00
Andre Oppermann
b728e90260 In tcp_dooptions() skip over SACK options if it is a SYN segment. 2007-04-04 14:39:49 +00:00
Andre Oppermann
1929eae1cc When blackholing do a 'dropunlock' in the new world order to prevent the
INP_INFO_LOCK from leaking.

Reported by:	ache
Found by:	rwatson
2007-03-28 12:58:13 +00:00
Maxim Konovalov
14739780bd o Use a define for a buffer size.
Prodded by:	db

o Add missed vars for TCPDEBUG in tcp_do_segment().

Prodded by:	tinderbox
2007-03-24 22:15:02 +00:00
Andre Oppermann
302ce8d690 Split tcp_input() into its two functional parts:
o tcp_input() now handles TCP segment sanity checks and preparations
   including the INPCB lookup and syncache.
 o tcp_do_segment() handles all data and ACK processing and is IPv4/v6
   agnostic.

Change all KASSERT() messages to ("%s: ", __func__).

The changes in this commit are primarily of mechanical nature and no
functional changes besides the function split are made.

Discussed with:	rwatson
2007-03-23 20:16:50 +00:00
Andre Oppermann
4dfdffe9e2 Tidy up some code to conform better to surroundings and style(9), 0 = NULL
and space/tab.
2007-03-23 19:11:22 +00:00
Andre Oppermann
fc30a25199 Bring SACK option handling in tcp_dooptions() in line with all other
options and ajust users accordingly.
2007-03-23 18:33:21 +00:00
Andre Oppermann
ad3f9ab320 ANSIfy function declarations and remove register keywords for variables.
Consistently apply style to all function declarations.
2007-03-21 19:37:55 +00:00
Andre Oppermann
b10fbdeafa Tidy up IPFIREWALL_FORWARD sections and comments. 2007-03-21 18:56:03 +00:00
Andre Oppermann
794235b737 Update and clarify comments in first section of tcp_input(). 2007-03-21 18:52:58 +00:00
Andre Oppermann
db33b3e6a7 Tidy up the ACCEPTCONN section of tcp_input(), ajust comments and remove
old dead T/TCP code.
2007-03-21 18:49:43 +00:00
Andre Oppermann
574b696407 Tidy up tcp_log_in_vain and blackhole. 2007-03-21 18:36:49 +00:00
Andre Oppermann
85c497918c Make TCP_DROP_SYNFIN a standard part of TCP. Disabled by default it
doesn't impede normal operation negatively and is only a few lines of
code.  It's close relatives blackhole and log_in_vain aren't options
either.
2007-03-21 18:25:28 +00:00
Andre Oppermann
e406f5a1c9 Remove tcp_minmssoverload DoS detection logic. The problem it tried to
protect us from wasn't really there and it only bloats the code.  Should
the problem surface in the future we can simply resurrect it from cvs
history.
2007-03-21 18:05:54 +00:00
Andre Oppermann
6489fe6553 Match up SYSCTL declaration style. 2007-03-19 19:00:51 +00:00
Andre Oppermann
02a1a64357 Consolidate insertion of TCP options into a segment from within tcp_output()
and syncache_respond() into its own generic function tcp_addoptions().

tcp_addoptions() is alignment agnostic and does optimal packing in all cases.

In struct tcpopt rename to_requested_s_scale to just to_wscale.

Add a comment with quote from RFC1323: "The Window field in a SYN (i.e.,
a <SYN> or <SYN,ACK>) segment itself is never scaled."

Reviewed by:	silby, mohans, julian
Sponsored by:	TCP/IP Optimization Fundraise 2005
2007-03-15 15:59:28 +00:00
Qing Li
95ad8418dc This patch is provided to fix a couple of deployment issues observed
in the field. In one situation, one end of the TCP connection sends
a back-to-back RST packet, with delayed ack, the last_ack_sent variable
has not been update yet. When tcp_insecure_rst is turned off, the code
treats the RST as invalid because last_ack_sent instead of rcv_nxt is
compared against th_seq. Apparently there is some kind of firewall that
sits in between the two ends and that RST packet is the only RST
packet received. With short lived HTTP connections, the symptom is
a large accumulation of connections over a short period of time .

The +/-(1) factor is to take care of implementations out there that
generate RST packets with these types of sequence numbers. This
behavior has also been observed in live environments.

Reviewed by:	silby, Mike Karels
MFC after:	1 week
2007-03-07 23:21:59 +00:00
Mohan Srinivasan
4a32dc299f In the SYN_SENT case, Initialize the snd_wnd before the call to tcp_mss().
The TCP hostcache logic in tcp_mss() depends on the snd_wnd being initialized.
2007-02-28 20:48:00 +00:00
Mohan Srinivasan
7c72af8770 Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.

Reviewed by: gnn, silby.
2007-02-26 22:25:21 +00:00
Robert Watson
afdb42748d Rename two identically named log_in_vain variables: tcp_input.c's static
log_in_vain to tcp_log_in_vain, and udp_usrreq's global log_in_vain to
udp_log_in_vain.

MFC after:	1 week
2007-02-20 10:20:03 +00:00
Andre Oppermann
6741ecf595 Auto sizing TCP socket buffers.
Normally the socket buffers are static (either derived from global
defaults or set with setsockopt) and do not adapt to real network
conditions. Two things happen: a) your socket buffers are too small
and you can't reach the full potential of the network between both
hosts; b) your socket buffers are too big and you waste a lot of
kernel memory for data just sitting around.

With automatic TCP send and receive socket buffers we can start with a
small buffer and quickly grow it in parallel with the TCP congestion
window to match real network conditions.

FreeBSD has a default 32K send socket buffer. This supports a maximal
transfer rate of only slightly more than 2Mbit/s on a 100ms RTT
trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send
buffer auto scaling and the default values below it supports 20Mbit/s
at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or
1000%. For the receive side it looks slightly better with a default of
64K buffer size.

New sysctls are:
  net.inet.tcp.sendbuf_auto=1 (enabled)
  net.inet.tcp.sendbuf_inc=8192 (8K, step size)
  net.inet.tcp.sendbuf_max=262144 (256K, growth limit)
  net.inet.tcp.recvbuf_auto=1 (enabled)
  net.inet.tcp.recvbuf_inc=16384 (16K, step size)
  net.inet.tcp.recvbuf_max=262144 (256K, growth limit)

Tested by:	many (on HEAD and RELENG_6)
Approved by:	re
MFC after:	1 month
2007-02-01 18:32:13 +00:00
Bjoern A. Zeeb
1d54aa3ba9 MFp4: 92972, 98913 + one more change
In ip6_sprintf no longer use and return one of eight static buffers
for printing/logging ipv6 addresses.
The caller now has to hand in a sufficiently large buffer as first
argument.
2006-12-12 12:17:58 +00:00
Robert Watson
aed5570872 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
John-Mark Gurney
e16fa5ca55 fix calculating to_tsecr... This prevents the rtt calculations from
going all wonky...
2006-09-26 01:21:46 +00:00
Bruce M Simpson
f1edc3bde5 Always set the IP version in the TCP input path, to preserve
the header field for possible later IPSEC SPD lookup, even
when the kernel is built without 'options INET6'.

PR:		kern/57760
MFC after:	1 week
Submitted by:	Joachim Schueth
2006-09-23 16:26:31 +00:00
Andre Oppermann
bf6d304ab2 Rewrite of TCP syncookies to remove locking requirements and to enhance
functionality:

 - Remove a rwlock aquisition/release per generated syncookie.  Locking
   is now integrated with the bucket row locking of syncache itself and
   syncookies no longer add any additional lock overhead.
 - Syncookie secrets are different for and stored per syncache buck row.
   Secrets expire after 16 seconds and are reseeded on-demand.
 - The computational overhead for syncookie generation and verification
   is one MD5 hash computation as before.
 - Syncache can be turned off and run with syncookies only by setting the
   sysctl net.inet.tcp.syncookies_only=1.

This implementation extends the orginal idea and first implementation
of FreeBSD by using not only the initial sequence number field to store
information but also the timestamp field if present.  This way we can
keep track of the entire state we need to know to recreate the session in
its original form.  Almost all TCP speakers implement RFC1323 timestamps
these days.  For those that do not we still have to live with the known
shortcomings of the ISN only SYN cookies.  The use of the timestamp field
causes the timestamps to be randomized if syncookies are enabled.

The idea of SYN cookies is to encode and include all necessary information
about the connection setup state within the SYN-ACK we send back and thus
to get along without keeping any local state until the ACK to the SYN-ACK
arrives (if ever).  Everything we need to know should be available from
the information we encoded in the SYN-ACK.

A detailed description of the inner working of the syncookies mechanism
is included in the comments in tcp_syncache.c.

Reviewed by:	silby (slightly earlier version)
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-09-13 13:08:27 +00:00
Ruslan Ermilov
751dea2935 Back when we had T/TCP support, we used to apply different
timeouts for TCP and T/TCP connections in the TIME_WAIT
state, and we had two separate timed wait queues for them.
Now that is has gone, the timeout is always 2*MSL again,
and there is no reason to keep two queues (the first was
unused anyway!).

Also, reimplement the remaining queue using a TAILQ (it
was technically impossible before, with two queues).
2006-09-07 13:06:00 +00:00
Andre Oppermann
233dcce118 First step of TSO (TCP segmentation offload) support in our network stack.
o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
 o add CSUM_TSO flag to mbuf pkthdr csum_flags field
 o add tso_segsz field to mbuf pkthdr
 o enhance ip_output() packet length check to allow for large TSO packets
 o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
 o adjust all callers of tcp_maxmtu[46]() accordingly

Discussed on:	-current, -net
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-09-06 21:51:59 +00:00
Mohan Srinivasan
464469c713 Fixes an edge case bug in timewait handling where ticks rolling over causing
the timewait expiry to be exactly 0 corrupts the timewait queues (and that entry).
Reviewed by:	silby
2006-08-11 21:15:23 +00:00
Bjoern A. Zeeb
421d8aa603 Use INPLOOKUP_WILDCARD instead of just 1 more consistently.
OKed by: rwatson (some weeks ago)
2006-06-29 10:49:49 +00:00
Andre Oppermann
8bfb19180d Some cleanups and janitorial work to tcp_syncache:
o don't assign remote/local host/port information manually between provided
   struct in_conninfo and struct syncache, bcopy() it instead
 o rename sc_tsrecent to sc_tsreflect in struct syncache to better capture
   the purpose of this field
 o rename sc_request_r_scale to sc_requested_r_scale for ditto reasons
 o fix IPSEC error case printf's to report correct function name
 o in syncache_socket() only transpose enhanced tcp options parameters to
   struct tcpcb when the inpcb doesn't has TF_NOOPT set
 o in syncache_respond() reorder stack variables
 o in syncache_respond() remove bogus KASSERT()

No functional changes.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-06-26 16:14:19 +00:00
Andre Oppermann
f72167f4d1 Some cleanups and janitorial work to tcp_dooptions():
o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its
   usage accordingly
 o update the comments to the tcp_dooptions() invocation in
   tcp_input():after_listen to reflect reality
 o move the logic checking the echoed timestamp out of tcp_dooptions() to the
   only place that uses it next to the invocation described in the previous
   item
 o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others
 o add comments in to struct tcpopt.to_flags #defines

No functional changes.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-06-26 15:35:25 +00:00
David Malone
5e1aa27995 When we receive an out-of-window SYN for an "ESTABLISHED" connection,
ACK the SYN as required by RFC793, rather than ignoring it. NetBSD
have had a similar change since 1999.

PR:		93236
Submitted by:	Grant Edwards <grante@visi.com>
MFC after:	1 month
2006-06-19 12:33:52 +00:00
Andre Oppermann
351630c40d Add locking to TCP syncache and drop the global tcpinfo lock as early
as possible for the syncache_add() case.  The syncache timer no longer
aquires the tcpinfo lock and timeout/retransmit runs can happen in
parallel with bucket granularity.

On a P4 the additional locks cause a slight degression of 0.7% in tcp
connections per second.  When IP and TCP input are deserialized and
can run in parallel this little overhead can be neglected. The syncookie
handling still leaves room for improvement and its random salts may be
moved to the syncache bucket head structures to remove the second lock
operation currently required for it.  However this would be a more
involved change from the way syncookies work at the moment.

Reviewed by:	rwatson
Tested by:	rwatson, ps (earlier version)
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-06-17 17:32:38 +00:00
Paul Saab
4f590175b7 Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.
2006-04-21 09:25:40 +00:00
Robert Watson
3cbe7fafa5 Modify tcp_timewait() to accept an inpcb reference, not a tcptw
reference.  For now, we allow the possibility that the in_ppcb
pointer in the inpcb may be NULL if a timewait socket has had its
tcptw structure recycled.  This allows tcp_timewait() to
consistently unlock the inpcb.

Reported by:	Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after:	3 months
2006-04-09 16:59:19 +00:00
Robert Watson
a460ae4b4c Don't unlock a timewait structure if the pointer is NULL in
tcp_timewait().  This corrects a bug (or lack of fixing of a bug)
in tcp_input.c:1.295.

Submitted by:	Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after:	3 months
2006-04-05 08:45:59 +00:00
Robert Watson
ae0e714308 Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being
NULL.  We currently do allow this to happen, but may want to remove that
possibility in the future.  This case can occur when a socket is left
open after TCP wraps up, and the timewait state is recycled.  This will
be cleaned up in the future.

Found by:	Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after:	3 months
2006-04-04 12:26:07 +00:00
Robert Watson
afa39e25c4 Change inp_ppcb from caddr_t to void *, fix/remove associated related
casts.

Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.

Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.

Don't assign tp to the results to intotcpcb() during variable declation
at the top of functions, as that is before the asserts relating to
locking have been performed.  Do this later in the function after
appropriate assertions have run to allow that operation to be conisdered
safe.

MFC after:	3 months
2006-04-03 13:33:55 +00:00
Robert Watson
623dce13c6 Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
  never NULL, converting dozens of unnecessary NULL checks into
  assertions, and eliminating dozens of unnecessary error handling
  cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
  longer required to ensure so_pcb != NULL.  For example, the receive
  code no longer requires the pcbinfo lock, and the send code only
  requires it if building a new connection on an otherwise unconnected
  socket triggered via sendto() with an address.  This should
  significnatly reduce tcbinfo lock contention in the receive and send
  cases.

- In order to support the invariant that so_pcb != NULL, it is now
  necessary for the TCP code to not discard the tcpcb any time a
  connection is dropped, but instead leave the tcpcb until the socket
  is shutdown.  This case is handled by setting INP_DROPPED, to
  substitute for using a NULL so_pcb to indicate that the connection
  has been dropped.  This requires the inpcb lock, but not the pcbinfo
  lock.

- Unlike all other protocols in the tree, TCP may need to retain access
  to the socket after the file descriptor has been closed.  Set
  SS_PROTOREF in tcp_detach() in order to prevent the socket from being
  freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
  or not it needs to free the socket when the connection finally does
  close.  The typical case where this occurs is if close() is called on
  a TCP socket before all sent data in the send socket buffer has been
  transmitted or acknowledged.  If INP_SOCKREF is found when the
  connection is dropped, we release the inpcb, tcpcb, and socket instead
  of flagging INP_DROPPED.

- Abort and detach protocol switch methods no longer return failures,
  nor attempt to free sockets, as the socket layer does this.

- Annotate the existence of a long-standing race in the TCP timer code,
  in which timers are stopped but not drained when the socket is freed,
  as waiting for drain may lead to deadlocks, or have to occur in a
  context where waiting is not permitted.  This race has been handled
  by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
  versa), which is not normally permitted, but may be true of a inpcb
  and tcpcb have been freed.  Add a counter to test how often this race
  has actually occurred, and a large comment for each instance where
  we compare potentially freed memory with NULL.  This will have to be
  fixed in the near future, but requires is to further address how to
  handle the timer shutdown shutdown issue.

- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
  so no longer need to return a pointer to indicate whether the argument
  passed in is still valid.

- Un-macroize debugging and locking setup for various protocol switch
  methods for TCP, as it lead to more obscurity, and as locking becomes
  more customized to the methods, offers less benefit.

- Assert copyright on tcp_usrreq.c due to significant modifications that
  have been made as part of this work.

These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs.  Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.

MFC after:	3 months
2006-04-01 16:36:36 +00:00
Robert Watson
1c53f80637 Explicitly assert socket pointer is non-NULL in tcp_input() so as to
provide better debugging information.

Prefer explicit comparison to NULL for tcpcb pointers rather than
treating them as booleans.

MFC after:	1 month
2006-03-26 01:33:41 +00:00
Andre Oppermann
464fcfbc5c Rework TCP window scaling (RFC1323) to properly scale the send window
right from the beginning and partly clean up the differences in handling
between SYN_SENT and SYN_RCVD (syncache).

Further changes to this code to come.  This is a first incremental step
to a general overhaul and streamlining of the TCP code.

PR:		kern/15095
PR:		kern/92690 (partly)
Reviewed by:	qingli (and tested with ANVL)
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-02-28 23:05:59 +00:00
Qing Li
4b8e98d632 This patch fixes the problem where the current TCP code can not handle
simultaneous open. Both the bug and the patch were verified using the
ANVL test suite.

PR:		kern/74935
Submitted by:	qingli (before I became committer)
Reviewed by:	andre
MFC after:	5 days
2006-02-23 21:14:34 +00:00
Andre Oppermann
8e8aab7aec Remove unneeded includes and provide more accurate description
to others.

Submitted by:	garys
PR:		kern/86437
2006-02-18 17:05:00 +00:00
Andre Oppermann
eaf80179e2 Have TCP Inflight disable itself if the RTT is below a certain
threshold.  Inflight doesn't make sense on a LAN as it has
trouble figuring out the maximal bandwidth because of the coarse
tick granularity.

The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold
in milliseconds below which inflight will disengage.  It defaults
to 10ms.

Tested by:	Joao Barros <joao.barros-at-gmail.com>,
		Rich Murphey <rich-at-whiteoaklabs.com>
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-02-16 19:38:07 +00:00
Andre Oppermann
0270746230 Do not derefence the ip header pointer in the IPv6 case.
This fixes a bug in the previous commit.

Found by:	Coverity Prevent(tm)
Coverity ID:	CID253
Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-01-18 18:59:30 +00:00
George V. Neville-Neil
34f83c52e7 Check the correct TTL in both the IPv6 and IPv4 cases.
Submitted by:	glebius
Reviewed by:	gnn, bz
Found with:     Coverity Prevent(tm)
2006-01-14 16:39:31 +00:00
Andre Oppermann
ef39adf007 Consolidate all IP Options handling functions into ip_options.[ch] and
include ip_options.h into all files making use of IP Options functions.

From ip_input.c rev 1.306:
  ip_dooptions(struct mbuf *m, int pass)
  save_rte(m, option, dst)
  ip_srcroute(m0)
  ip_stripoptions(m, mopt)

From ip_output.c rev 1.249:
  ip_insertoptions(m, opt, phlen)
  ip_optcopy(ip, jp)
  ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)

No functional changes in this commit.

Discussed with:	rwatson
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-18 20:12:40 +00:00
Robert Watson
a65e12b09d Convert if (tp->t_state == TCPS_LISTEN) panic() into a KASSERT.
MFC after:	2 weeks
2005-10-19 09:37:52 +00:00
Paul Saab
4d3b134633 Remove a KASSERT in the sack path that fails because of a interaction
between sack and a bug in the "bad retransmit recovery" logic. This is
a workaround, the underlying bug will be fixed later.

Submitted by:   Mohan Srinivasan, Noritoshi Demizu
2005-08-24 02:48:45 +00:00
Andre Oppermann
936cd18dad Add socketoption IP_MINTTL. May be used to set the minimum acceptable
TTL a packet must have when received on a socket.  All packets with a
lower TTL are silently dropped.  Works on already connected/connecting
and listening sockets for RAW/UDP/TCP.

This option is only really useful when set to 255 preventing packets
from outside the directly connected networks reaching local listeners
on sockets.

Allows userland implementation of 'The Generalized TTL Security Mechanism
(GTSM)' according to RFC3682.  Examples of such use include the Cisco IOS
BGP implementation command "neighbor ttl-security".

MFC after:	2 weeks
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-08-22 16:13:08 +00:00
Paul Saab
d758711729 Fix for a bug in newreno partial ack handling where if a large amount
of data is partial acked, snd_cwnd underflows, causing a burst.

Found, Submitted by:	Noritoshi Demizu
Approved by:		re
2005-07-05 19:23:02 +00:00
Paul Saab
482ac96888 Fix for a bug in the change that defers sack option processing until
after PAWS checks. The symptom of this is an inconsistency in the cached
sack state, caused by the fact that the sack scoreboard was not being
updated for an ACK handled in the header prediction path.

Found by:	Andrey Chernov.
Submitted by:	Noritoshi Demizu, Raja Mukerji.
Approved by:	re
2005-07-01 22:54:18 +00:00
Paul Saab
69e0362019 Fix for a SACK crash caused by a bug in tcp_reass(). tcp_reass()
does not clear tlen and frees the mbuf (leaving th pointing at
freed memory), if the data segment is a complete duplicate.
This change works around that bug. A fix for the tcp_reass() bug
will appear later (that bug is benign for now, as neither th nor
tlen is referenced in tcp_input() after the call to tcp_reass()).

Found by:	Pawel Jakub Dawidek.
Submitted by:	Raja Mukerji, Noritoshi Demizu.
Approved by:	re
2005-07-01 22:52:46 +00:00
Simon L. B. Nielsen
0a389eab22 Fix ipfw packet matching errors with address tables.
The ipfw tables lookup code caches the result of the last query.  The
kernel may process multiple packets concurrently, performing several
concurrent table lookups.  Due to an insufficient locking, a cached
result can become corrupted that could cause some addresses to be
incorrectly matched against a lookup table.

Submitted by:	ru
Reviewed by:	csjp, mlaier
Security:	CAN-2005-2019
Security:	FreeBSD-SA-05:13.ipfw

Correct bzip2 permission race condition vulnerability.

Obtained from:	Steve Grubb via RedHat
Security:	CAN-2005-0953
Security:	FreeBSD-SA-05:14.bzip2
Approved by:	obrien

Correct TCP connection stall denial of service vulnerability.

A TCP packets with the SYN flag set is accepted for established
connections, allowing an attacker to overwrite certain TCP options.

Submitted by:	Noritoshi Demizu
Reviewed by:	andre, Mohan Srinivasan
Security:	CAN-2005-2068
Security:	FreeBSD-SA-05:15.tcp

Approved by:	re (security blanket), cperciva
2005-06-29 21:36:49 +00:00