Commit graph

3455 commits

Author SHA1 Message Date
Max Laier
05206588f2 Make in-kernel multicast protocols for pfsync and carp work after enabling
dynamic resizing of multicast membership array.

Reported and testing by:	Maxim Konovalov, Scott Ullrich
Reminded by:			thompsa
MFC after:			2 weeks
2006-07-08 00:01:01 +00:00
Robert Watson
be54a5eeb3 Remove unneeded mac.h include.
MFC after:	3 days
2006-07-06 13:25:01 +00:00
Oleg Bulyzhin
6372145725 Complete timebase (time_second -> time_uptime) conversion.
PR:		kern/94249
Reviewed by:	andre (few months ago)
Approved by:	glebius (mentor)
2006-07-05 23:37:21 +00:00
Maxim Konovalov
764a094c3f o Kill BUGS section as it is not valid since rev. 1.4 alias_pptp.c.
Spotted by:	ru.unix.bsd activists
MFC after:	1 week
2006-07-04 20:39:38 +00:00
Yaroslav Tykhiy
4b97d7affd There is a consensus that ifaddr.ifa_addr should never be NULL,
except in places dealing with ifaddr creation or destruction; and
in such special places incomplete ifaddrs should never be linked
to system-wide data structures.  Therefore we can eliminate all the
superfluous checks for "ifa->ifa_addr != NULL" and get ready
to the system crashing honestly instead of masking possible bugs.

Suggested by:	glebius, jhb, ru
2006-06-29 19:22:05 +00:00
Yaroslav Tykhiy
ad67537233 Use TAILQ_FOREACH consistently. 2006-06-29 17:09:47 +00:00
Gleb Smirnoff
4d09f5a030 Fix URL to Bellovin's paper.
Submitted by:	Anton Yuzhaninov <citrin rambler-co.ru>
2006-06-29 13:38:36 +00:00
Bjoern A. Zeeb
333ad3bc40 Eliminate the offset argument from send_reject. It's not been
used since FreeBSD-SA-06:04.ipfw.
Adopt send_reject6 to what had been done for legacy IP: no longer
send or permit sending rejects for any but the first fragment.

Discussed with: oleg, csjp (some weeks ago)
2006-06-29 11:17:16 +00:00
Bjoern A. Zeeb
421d8aa603 Use INPLOOKUP_WILDCARD instead of just 1 more consistently.
OKed by: rwatson (some weeks ago)
2006-06-29 10:49:49 +00:00
Pawel Jakub Dawidek
835d4b8924 - Use suser_cred(9) instead of directly checking cr_uid.
- Change the order of conditions to first verify that we actually need
  to check for privileges and then eventually check them.

Reviewed by:	rwatson
2006-06-27 11:35:53 +00:00
Andre Oppermann
cc477a6347 In syncache_respond() do not reply with a MSS that is larger than what
the peer announced to us but make it at least tcp_minmss in size.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-06-26 17:54:53 +00:00
Andre Oppermann
8bfb19180d Some cleanups and janitorial work to tcp_syncache:
o don't assign remote/local host/port information manually between provided
   struct in_conninfo and struct syncache, bcopy() it instead
 o rename sc_tsrecent to sc_tsreflect in struct syncache to better capture
   the purpose of this field
 o rename sc_request_r_scale to sc_requested_r_scale for ditto reasons
 o fix IPSEC error case printf's to report correct function name
 o in syncache_socket() only transpose enhanced tcp options parameters to
   struct tcpcb when the inpcb doesn't has TF_NOOPT set
 o in syncache_respond() reorder stack variables
 o in syncache_respond() remove bogus KASSERT()

No functional changes.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-06-26 16:14:19 +00:00
Andre Oppermann
f72167f4d1 Some cleanups and janitorial work to tcp_dooptions():
o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its
   usage accordingly
 o update the comments to the tcp_dooptions() invocation in
   tcp_input():after_listen to reflect reality
 o move the logic checking the echoed timestamp out of tcp_dooptions() to the
   only place that uses it next to the invocation described in the previous
   item
 o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others
 o add comments in to struct tcpopt.to_flags #defines

No functional changes.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-06-26 15:35:25 +00:00
Andre Oppermann
dfabcc1d29 Reverse the source/destination parameters to in[6]_pcblookup_hash() in
syncache_respond() for the #ifdef MAC case.

Submitted by:	Tai-hwa Liang <avatar-at-mmlab.cse.yzu.edu.tw>
2006-06-26 09:43:55 +00:00
Robert Watson
b4470c1639 In tcp6_usr_attach(), return immediately if SS_ISDISCONNECTED, to
avoid dereferencing an uninitialized inp variable.

Submitted by:	Michiel Boland <michiel at boland dot org>
MFC after:	1 month
2006-06-26 09:38:08 +00:00
Andre Oppermann
a846263567 Decrement the global syncache counter in syncache_expand() when the entry
is removed from the bucket.  This fixes the syncache statistics.
2006-06-25 11:11:33 +00:00
Andre Oppermann
649ac0ce5f Move the syncookie MD5 context from globals to the stack to make it MP safe. 2006-06-22 15:07:45 +00:00
Hajimu UMEMOTO
a0a59ae4af - Pullup even when the extention header is unknown, to prevent
infinite loop with net.inet6.ip6.fw.deny_unknown_exthdrs=0.
- Teach ipv6 and ipencap as they appear in an IPv4/IPv6 over IPv6
  tunnel.
- Test the next extention header even when the routing header type
  is unknown with net.inet6.ip6.fw.deny_unknown_exthdrs=0.

Found by:	xcast-fan-club
MFC after:	1 week
2006-06-22 13:22:54 +00:00
Andre Oppermann
c9f7b0ad5b Allocate a zero'ed syncache hashtable. mtx_init() tests the supplied
memory location for already existing/initialized mutexes.  With random
data in the memory location this fails (ie. after a soft reboot).

Reported by:	brueffer, YAMAMOTO Shigeru
Submitted by:	YAMAMOTO Shigeru <shigeru-at-iij.ad.jp>
2006-06-20 08:11:30 +00:00
David Malone
5e1aa27995 When we receive an out-of-window SYN for an "ESTABLISHED" connection,
ACK the SYN as required by RFC793, rather than ignoring it. NetBSD
have had a similar change since 1999.

PR:		93236
Submitted by:	Grant Edwards <grante@visi.com>
MFC after:	1 month
2006-06-19 12:33:52 +00:00
Andre Oppermann
6593a94979 Remove T/TCP RFC1644 Connection Count comparison macros. They are no longer
used and needed.

Sponsored by:   TCP/IP Optimization Fundraise 2005
2006-06-18 14:24:12 +00:00
Andre Oppermann
2f1a4ccfc1 Do not access syncache entry before it was allocated for the TF_NOOPT case
in syncache_add().

Found by:	Coverity Prevent
CID:		1473
2006-06-18 13:03:42 +00:00
Andre Oppermann
8411d000a1 Move all syncache related structures to tcp_syncache.c. They are only used
there.

This unbreaks userland programs that include tcp_var.h.

Discussed with:	rwatson
2006-06-18 12:26:11 +00:00
Andre Oppermann
bdfbf1e203 Remove double lock acquisition in syncookie_lookup() which came from last
minute conversions to macros.

Pointy hat to:	andre
2006-06-18 11:48:03 +00:00
Andre Oppermann
ee2e4c1d4e Fix the !INET6 compile.
Reported by:	alc
2006-06-17 18:42:07 +00:00
Andre Oppermann
93f0d0c5bf Rearrange fields in struct syncache and syncache_head to make them more
cache line friendly.

Sponsored by:   TCP/IP Optimization Fundraise 2005
2006-06-17 17:57:36 +00:00
Andre Oppermann
0c529372f0 ANSIfy and tidy up comments.
Sponsored by:   TCP/IP Optimization Fundraise 2005
2006-06-17 17:49:11 +00:00
Andre Oppermann
351630c40d Add locking to TCP syncache and drop the global tcpinfo lock as early
as possible for the syncache_add() case.  The syncache timer no longer
aquires the tcpinfo lock and timeout/retransmit runs can happen in
parallel with bucket granularity.

On a P4 the additional locks cause a slight degression of 0.7% in tcp
connections per second.  When IP and TCP input are deserialized and
can run in parallel this little overhead can be neglected. The syncookie
handling still leaves room for improvement and its random salts may be
moved to the syncache bucket head structures to remove the second lock
operation currently required for it.  However this would be a more
involved change from the way syncookies work at the moment.

Reviewed by:	rwatson
Tested by:	rwatson, ps (earlier version)
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-06-17 17:32:38 +00:00
Oleg Bulyzhin
254c472561 Add support of 'tablearg' feature for:
- 'tag' & 'untag' action parameters.
- 'tagged' & 'limit' rule options.
Rule examples:
	pipe 1 tag tablearg ip from table(1) to any
	allow ip from any to table(2) tagged tablearg
	allow tcp from table(3) to any 25 setup limit src-addr tablearg

sbin/ipfw/ipfw2.c:
1) new macros
   GET_UINT_ARG - support of 'tablearg' keyword, argument range checking.
   PRINT_UINT_ARG - support of 'tablearg' keyword.
2) strtoport(): do not silently truncate/accept invalid port list expressions
   like: '1,2-abc' or '1,2-3-4' or '1,2-3x4'. style(9) cleanup.

Approved by:	glebius (mentor)
MFC after:	1 month
2006-06-15 09:39:22 +00:00
Oleg Bulyzhin
58a0fab73f install_state(): style(9) cleanup
Approved by:	glebius (mentor)
MFC after:	1 month
2006-06-15 08:54:29 +00:00
Andrew Thompson
5feebeeb53 Enable proxy ARP answers on any of the bridged interfaces if proxy record
belongs to another interface within the bridge group.

PR:		kern/94408
Submitted by:	Eygene A. Ryabinkin
MFC after:	1 month
2006-06-09 00:33:30 +00:00
Oleg Bulyzhin
458009ae93 install_state() should properly initialize 'addr_type' field of newly created
flows for O_LIMIT rules.  Otherwise 'ipfw -d show' is unable to display
PARENT rules properly.
(This bug was exposed by ipfw2.c rev.1.90)

Approved by:	glebius (mentor)
MFC after:	2 weeks
2006-06-08 11:27:45 +00:00
Oleg Bulyzhin
d2dc1907e8 Fix following rules: pipe X (tag|altq) Y ...
Approved by:	glebius (mentor)
MFC after:	2 weeks
2006-06-08 11:13:23 +00:00
Robert Watson
f2de87fec4 Push acquisition of pcbinfo lock out of tcp_usr_attach() into
tcp_attach() after the call to soreserve(), as it doesn't require
the global lock.  Rearrange inpcb locking here also.

MFC after:	1 month
2006-06-04 09:31:34 +00:00
Robert Watson
d8ab0ec661 When entering a timer on a tcpcb, don't continue processing if it has been
dropped.  This prevents a bug introduced during the socket/pcb refcounting
work from occuring, in which occasionally the retransmit timer may fire
after a connection has been reset, resulting in the resulting R|A TCP
packet having a source port of 0, as the port reservation has been
released.

While here, fixing up some RUNLOCK->WUNLOCK bugs.

MFC after:	1 month
2006-06-03 19:37:08 +00:00
Robert Watson
f24618aaf0 Acquire udbinfo lock after call to soreserve() rather than before, as it
is not required.  This simplifies error-handling, and reduces the time
that this lock is held.

MFC after:	1 month
2006-06-03 19:29:26 +00:00
Christian S.J. Peron
16d878cc99 Fix the following bpf(4) race condition which can result in a panic:
(1) bpf peer attaches to interface netif0
	(2) Packet is received by netif0
	(3) ifp->if_bpf pointer is checked and handed off to bpf
	(4) bpf peer detaches from netif0 resulting in ifp->if_bpf being
	    initialized to NULL.
	(5) ifp->if_bpf is dereferenced by bpf machinery
	(6) Kaboom

This race condition likely explains the various different kernel panics
reported around sending SIGINT to tcpdump or dhclient processes. But really
this race can result in kernel panics anywhere you have frequent bpf attach
and detach operations with high packet per second load.

Summary of changes:

- Remove the bpf interface's "driverp" member
- When we attach bpf interfaces, we now set the ifp->if_bpf member to the
  bpf interface structure. Once this is done, ifp->if_bpf should never be
  NULL. [1]
- Introduce bpf_peers_present function, an inline operation which will do
  a lockless read bpf peer list associated with the interface. It should
  be noted that the bpf code will pickup the bpf_interface lock before adding
  or removing bpf peers. This should serialize the access to the bpf descriptor
  list, removing the race.
- Expose the bpf_if structure in bpf.h so that the bpf_peers_present function
  can use it. This also removes the struct bpf_if; hack that was there.
- Adjust all consumers of the raw if_bpf structure to use bpf_peers_present

Now what happens is:

	(1) Packet is received by netif0
	(2) Check to see if bpf descriptor list is empty
	(3) Pickup the bpf interface lock
	(4) Hand packet off to process

From the attach/detach side:

	(1) Pickup the bpf interface lock
	(2) Add/remove from bpf descriptor list

Now that we are storing the bpf interface structure with the ifnet, there is
is no need to walk the bpf interface list to locate the correct bpf interface.
We now simply look up the interface, and initialize the pointer. This has a
nice side effect of changing a bpf interface attach operation from O(N) (where
N is the number of bpf interfaces), to O(1).

[1] From now on, we can no longer check ifp->if_bpf to tell us whether or
    not we have any bpf peers that might be interested in receiving packets.

In collaboration with:	sam@
MFC after:	1 month
2006-06-02 19:59:33 +00:00
Robert Watson
ad3a630f7e Minor restyling and cleanup around ipport_tick().
MFC after:	1 month
2006-06-02 08:18:27 +00:00
Oleg Bulyzhin
6a7d5cb645 Implement internal (i.e. inside kernel) packet tagging using mbuf_tags(9).
Since tags are kept while packet resides in kernelspace, it's possible to
use other kernel facilities (like netgraph nodes) for altering those tags.

Submitted by:	Andrey Elsukov <bu7cher at yandex dot ru>
Submitted by:	Vadim Goncharov <vadimnuclight at tpu dot ru>
Approved by:	glebius (mentor)
Idea from:	OpenBSD PF
MFC after:	1 month
2006-05-24 13:09:55 +00:00
Maxim Konovalov
d45e4f9945 o In udp|rip_disconnect() acquire a socket lock before the socket
state modification.  To prevent races do that while holding inpcb
lock.

Reviewed by:	rwatson
2006-05-21 19:28:46 +00:00
Maxim Konovalov
635354c446 o Add missed error check: in ip_ctloutput() sooptcopyin() returns a
result but we never examine it.

Reviewed by:	rwatson
MFC after:	2 weeks
2006-05-21 17:52:08 +00:00
Bruce M Simpson
8d7d85149e Initialize the new members of struct ip_moptions as
a defensive programming measure.

Note that whilst these members are not used by the ip_output()
path, we are passing an instance of struct ip_moptions here
which is declared on the stack (which could be considered a
bad thing).

ip_output() does not consume struct ip_moptions, but in case it
does in future, declare an in_multi vector on the stack too to
behave more like ip_findmoptions() does.
2006-05-18 19:51:08 +00:00
Gleb Smirnoff
e5f88c4492 Since m_pullup() can return a new mbuf, change gre_input2() to
return mbuf back to gre_input(). If the former returns mbuf back
to the latter, then pass it to raw_input().

Coverity ID:	829
2006-05-16 11:15:22 +00:00
Gleb Smirnoff
ffb761f624 - Backout one line from 1.78. The tp can be freed by tcp_drop().
- Style next line.

Coverity ID:	912
2006-05-16 10:51:26 +00:00
Maxim Konovalov
eb16472f74 o In rip_disconnect() do not call rip_abort(), just mark a socket
as not connected.  In soclose() case rip_detach() will kill inpcb for
us later.

It makes rawconnect regression test do not panic a system.

Reviewed by:	rwatson
X-MFC after:	with all 1th April inpcb changes
2006-05-15 09:28:57 +00:00
Max Laier
0e7185f6e7 Use only lower 64bit of src/dest (and src/dest port) for hashing of IPv6
connections and get rid of the flow_id as it is not guaranteed to be stable
some (most?) current implementations seem to just zero it out.

PR:		kern/88664
Reported by:	jylefort
Submitted by:	Joost Bekkers (w/ changes)
Tested by	"regisr" <regisrApoboxDcom>
2006-05-14 23:42:24 +00:00
Bruce M Simpson
3548bfc964 Fix a long-standing limitation in IPv4 multicast group membership.
By making the imo_membership array a dynamically allocated vector,
this minimizes disruption to existing IPv4 multicast code. This
change breaks the ABI for the kernel module ip_mroute.ko, and may
cause a small amount of churn for folks working on the IGMPv3 merge.

Previously, sockets were subject to a compile-time limitation on
the number of IPv4 group memberships, which was hard-coded to 20.
The imo_membership relationship, however, is 1:1 with regards to
a tuple of multicast group address and interface address. Users who
ran routing protocols such as OSPF ran into this limitation on machines
with a large system interface tree.
2006-05-14 14:22:49 +00:00
Max Laier
656faadcb8 Remove ip6fw. Since ipfw has full functional IPv6 support now and - in
contrast to ip6fw - is properly lockes, it is time to retire ip6fw.
2006-05-12 20:39:23 +00:00
Max Laier
e93187482d Reintroduce net.inet6.ip6.fw.enable sysctl to dis/enable the ipv6 processing
seperately.  Also use pfil hook/unhook instead of keeping the check
functions in pfil just to return there based on the sysctl.  While here fix
some whitespace on a nearby SYSCTL_ macro.
2006-05-12 04:41:27 +00:00
Max Laier
432288dcb6 Don't claim "(+ipv6)" if we didn't build with INET6. 2006-05-11 15:22:38 +00:00
Robert Watson
59b8854eee Modify UDP to use sosend_dgram() instead of sosend(). This allows
for signicantly optimized UDP socket I/O when using a single UDP
socket from many threads or processes that share it, by avoiding
significant locking and other overhead in the general sosend()
path that isn't necessary for simple datagram sockets.  Specifically,
this change results in a significant performance improvement for
threaded name service in BIND9 under load.

Suggested by:	Jinmei_Tatsuya at isc dot org
2006-05-06 11:24:59 +00:00
Bjoern A. Zeeb
91b309a1c4 Make sure the ip data pointer is correct before touching it again
after ipsec4_output processing else KAME IPSec using the handbook
configuration with gif(4) will panic the kernel.

Problem reported by:    t. patterson <tp lot.org>
Tested by:              t. patterson <tp lot.org>
2006-05-05 07:31:03 +00:00
Robert Watson
3127286870 Only return (tw) from tcp_twclose() if reuse is passed, otherwise
return NULL.  In principle this shouldn't change the behavior, but
avoids returning a potentially invalid/inappropriate pointer to
the caller.

Found with:	Coverity Prevent (tm)
Submitted by:	pjd
MFC after:	3 months
2006-05-05 06:50:23 +00:00
Pawel Jakub Dawidek
1d7d0bfe5e /tmp/cvsTXPIwQ 2006-05-05 06:24:34 +00:00
Marcel Moolenaar
7c5a8ab212 In in_pcbdrop(), fix !INVARIANTS build. 2006-04-25 23:23:13 +00:00
Robert Watson
8e3f3b169e Rename 'last' to 'inp' in udp_append(): the name 'last' is due to
the fact that the loop through inpcb's in udp_input() tracks the
last inpcb while looping.  We keep that name in the calling loop
but not in the delivery routine itself.

MFC after:	3 months
2006-04-25 17:38:08 +00:00
Robert Watson
10702a2840 Abstract inpcb drop logic, previously just setting of INP_DROPPED in TCP,
into in_pcbdrop().  Expand logic to detach the inpcb from its bound
address/port so that dropping a TCP connection releases the inpcb resource
reservation, which since the introduction of socket/pcb reference count
updates, has been persisting until the socket closed rather than being
released implicitly due to prior freeing of the inpcb on TCP drop.

MFC after:	3 months
2006-04-25 11:17:35 +00:00
Robert Watson
c78cbc7b1d Instead of calling tcp_usr_detach() from tcp_usr_abort(), break out
common pcb tear-down logic into tcp_detach(), which is called from
either.  Invoke tcp_drop() from the tcp_usr_abort() path rather than
tcp_disconnect(), as we want to drop it immediately not perform a
FIN sequence.  This is one reason why some people were experiencing
panics in sodealloc(), as the netisr and aborting thread were
simultaneously trying to tear down the socket.  This bug could often
be reproduced using repeated runs of the listenclose regression test.

MFC after:	3 months
PR:		96090
Reported by:	Peter Kostouros <kpeter at melbpc dot org dot au>, kris
Tested by:	Peter Kostouros <kpeter at melbpc dot org dot au>, kris
2006-04-24 08:20:02 +00:00
Robert Watson
9106a6d6b0 Replace isn_mtx direct use with ISN_*() lock macros so that locking
details/strategy can be changed without touching every use.

MFC after:	3 months
2006-04-23 12:27:42 +00:00
Robert Watson
4c0e8f41f6 Introduce a new TCP mutex, isn_mtx, which protects the initial sequence
number state, rather than re-using pcbinfo.  This introduces some
additional mutex operations during isn query, but avoids hitting the TCP
pcbinfo lock out of yet another frequently firing TCP timer.

MFC after:	3 months
2006-04-22 19:23:24 +00:00
Robert Watson
602cc7f12b Assert the inpcb lock when rehashing an inpcb.
Improve consistency of style around some current assertions.

MFC after:	3 months
2006-04-22 19:15:20 +00:00
Robert Watson
6466b28a40 Remove pcbinfo locking from in_setsockaddr() and in_setpeeraddr();
holding the inpcb lock is sufficient to prevent races in reading
the address and port, as both the inpcb lock and pcbinfo lock are
required to change the address/port.

Improve consistency of spelling in assertions about inp != NULL.

MFC after:	3 months
2006-04-22 19:10:02 +00:00
Paul Saab
4f590175b7 Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.
2006-04-21 09:25:40 +00:00
Gleb Smirnoff
4cbb118526 Merge rev. 1.240 of ip_output.c, so that IPFIREWALL_FORWARD_EXTENDED
kernel option will affect both forwarding methods - classic and fast.
2006-04-18 09:20:16 +00:00
Robert Watson
3cbe7fafa5 Modify tcp_timewait() to accept an inpcb reference, not a tcptw
reference.  For now, we allow the possibility that the in_ppcb
pointer in the inpcb may be NULL if a timewait socket has had its
tcptw structure recycled.  This allows tcp_timewait() to
consistently unlock the inpcb.

Reported by:	Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after:	3 months
2006-04-09 16:59:19 +00:00
Mohan Srinivasan
1714e18e79 Eliminate debug code that catches bugs in the hinting of sack variables
(tcp_sack_output_debug checks cached hints aginst computed values by walking the
scoreboard and reports discrepancies). The sack hinting code has been stable for
many months now so it is time for the debug code to go. Leaving tcp_sack_output_debug
ifdef'ed out in case we need to resurrect it at a later point.
2006-04-06 17:21:16 +00:00
Robert Watson
a460ae4b4c Don't unlock a timewait structure if the pointer is NULL in
tcp_timewait().  This corrects a bug (or lack of fixing of a bug)
in tcp_input.c:1.295.

Submitted by:	Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after:	3 months
2006-04-05 08:45:59 +00:00
Mohan Srinivasan
1f65c2cd31 Certain (bad) values of sack blocks can end up corrupting the sack scoreboard.
Make the checks in tcp_sack_doack() more robust to prevent this.

Submitted by: Raja Mukerji (raja@mukerji.com)
Reviewed by:  Mohan Srinivasan
2006-04-05 00:11:04 +00:00
Gleb Smirnoff
a73b656763 Add a tunable net.inet.tcp.maxtcptw, that allows to set a limit
on tcptw zone independently from setting a limit on socket zone.
2006-04-04 14:31:37 +00:00
Robert Watson
ae0e714308 Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being
NULL.  We currently do allow this to happen, but may want to remove that
possibility in the future.  This case can occur when a socket is left
open after TCP wraps up, and the timewait state is recycled.  This will
be cleaned up in the future.

Found by:	Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after:	3 months
2006-04-04 12:26:07 +00:00
Robert Watson
cb895fb9b0 In TCP notify routines, check inpcb for INP_TIMEWAIT and INP_DROPPED.
The INP_DROPPED check replaces the current NULL checks; the INP_TIMEWAIT
checks appear to have always been required, but not been there, which
is/was a bug.  This avoids unconditionally casting of in_ppcb to a tcpcb,
when it may be a twtcb, which may have resulted in obscure ICMP-related
panics in earlier releases.

MFC after:	3 months
2006-04-03 14:07:50 +00:00
Robert Watson
afa39e25c4 Change inp_ppcb from caddr_t to void *, fix/remove associated related
casts.

Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.

Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.

Don't assign tp to the results to intotcpcb() during variable declation
at the top of functions, as that is before the asserts relating to
locking have been performed.  Do this later in the function after
appropriate assertions have run to allow that operation to be conisdered
safe.

MFC after:	3 months
2006-04-03 13:33:55 +00:00
Robert Watson
43f56a32a0 Style tweaks: convert to ANSI from K&R function prototypes.
MFC after:	3 months
2006-04-03 12:59:27 +00:00
Robert Watson
2fc5ae87d0 Update comment on tcp_close() for new world order.
MFC after:	3 months
2006-04-03 12:52:13 +00:00
Robert Watson
e6e65783d6 Clarify comment on handling of non-timewait TCP states in
tcp_usr_detach().

MFC after:	3 months
2006-04-03 12:43:56 +00:00
Robert Watson
fa38deac65 Fix up locking surrounding tcp_drop sysctl: in the new world order, we
don't free inpcbs until after the socket is closed, so we always need
to unlock an inpcb after calling tcp_drop() on it.

MFC after:	3 months
2006-04-03 11:57:12 +00:00
Robert Watson
3d2d3ef434 After checking for SO_ISDISCONNECTED in tcp_usr_accept(), return
immediately rather than jumping to the normal output handling, which
assumes we've pulled out the inpcb, which hasn't happened at this
point (and isn't necessary).

Return ECONNABORTED instead of EINVAL when the inpcb has entered
INP_TIMEWAIT or INP_DROPPED, as this is the documented error value.

This may correct the panic seen by Ganbold.

MFC after:	1 month
Reported by:	Ganbold <ganbold at micom dot mng dot net>
2006-04-03 09:52:55 +00:00
Robert Watson
a34f6c1e1d Correct incorrect assertion in div_bind(): inp must not be NULL here.
Reported by:	tegge
MFC after:	3 months
2006-04-03 09:01:17 +00:00
Robert Watson
953b5606df During reformulation of tcp_usr_detach(), the call to initiate TCP
disconnect for fully connected sockets was dropped, meaning that if
the socket was closed while the connection was alive, it would be
leaked.  Structure tcp_usr_detach() so that there are two clear
parts: initiating disconnect, and reclaiming state, and reintroduce
the tcp_disconnect() call in the first part.

MFC after:	3 months
2006-04-02 16:42:51 +00:00
Robert Watson
34af7bae80 Properly handle an edge case previously not handled correctly: a
socket can have a tcp connection that has entered time wait
attached to it, in the event that shutdown() is called on the
socket and the FINs properly exchange before close().  In this
case we don't detach or free the inpcb, just leave the tcptw
detached and freed, but we must release the inpcb lock (which we
didn't previously).

MFC after:	3 months
2006-04-01 23:53:25 +00:00
Robert Watson
623dce13c6 Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
  never NULL, converting dozens of unnecessary NULL checks into
  assertions, and eliminating dozens of unnecessary error handling
  cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
  longer required to ensure so_pcb != NULL.  For example, the receive
  code no longer requires the pcbinfo lock, and the send code only
  requires it if building a new connection on an otherwise unconnected
  socket triggered via sendto() with an address.  This should
  significnatly reduce tcbinfo lock contention in the receive and send
  cases.

- In order to support the invariant that so_pcb != NULL, it is now
  necessary for the TCP code to not discard the tcpcb any time a
  connection is dropped, but instead leave the tcpcb until the socket
  is shutdown.  This case is handled by setting INP_DROPPED, to
  substitute for using a NULL so_pcb to indicate that the connection
  has been dropped.  This requires the inpcb lock, but not the pcbinfo
  lock.

- Unlike all other protocols in the tree, TCP may need to retain access
  to the socket after the file descriptor has been closed.  Set
  SS_PROTOREF in tcp_detach() in order to prevent the socket from being
  freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
  or not it needs to free the socket when the connection finally does
  close.  The typical case where this occurs is if close() is called on
  a TCP socket before all sent data in the send socket buffer has been
  transmitted or acknowledged.  If INP_SOCKREF is found when the
  connection is dropped, we release the inpcb, tcpcb, and socket instead
  of flagging INP_DROPPED.

- Abort and detach protocol switch methods no longer return failures,
  nor attempt to free sockets, as the socket layer does this.

- Annotate the existence of a long-standing race in the TCP timer code,
  in which timers are stopped but not drained when the socket is freed,
  as waiting for drain may lead to deadlocks, or have to occur in a
  context where waiting is not permitted.  This race has been handled
  by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
  versa), which is not normally permitted, but may be true of a inpcb
  and tcpcb have been freed.  Add a counter to test how often this race
  has actually occurred, and a large comment for each instance where
  we compare potentially freed memory with NULL.  This will have to be
  fixed in the near future, but requires is to further address how to
  handle the timer shutdown shutdown issue.

- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
  so no longer need to return a pointer to indicate whether the argument
  passed in is still valid.

- Un-macroize debugging and locking setup for various protocol switch
  methods for TCP, as it lead to more obscurity, and as locking becomes
  more customized to the methods, offers less benefit.

- Assert copyright on tcp_usrreq.c due to significant modifications that
  have been made as part of this work.

These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs.  Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.

MFC after:	3 months
2006-04-01 16:36:36 +00:00
Robert Watson
14ba8add01 Update in_pcb-derived basic socket types following changes to
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
  never NULL, converting dozens of unnecessary NULL checks into
  assertions, and eliminating dozens of unnecessary error handling
  cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
  longer required to ensure so_pcb != NULL.  For example, in protocol
  shutdown methods, and in raw IP send.

- Abort and detach protocol switch methods no longer return failures,
  nor attempt to free sockets, as the socket layer does this.

- Invoke in_pcbfree() after in_pcbdetach() in order to free the
  detached in_pcb structure for a socket.

MFC after:	3 months
2006-04-01 16:20:54 +00:00
Robert Watson
4c7c478d0f Break out in_pcbdetach() into two functions:
- in_pcbdetach(), which removes the link between an inpcb and its
  socket.

- in_pcbfree(), which frees a detached pcb.

Unlike the previous in_pcbdetach(), neither of these functions will
attempt to conditionally free the socket, as they are responsible only
for managing in_pcb memory.  Mirror these changes into in6_pcbdetach()
by breaking it into in6_pcbdetach() and in6_pcbfree().

While here, eliminate undesired checks for NULL inpcb pointers in
sockets, as we will now have as an invariant that sockets will always
have valid so_pcb pointers.

MFC after:	3 months
2006-04-01 16:04:42 +00:00
Robert Watson
bc725eafc7 Chance protocol switch method pru_detach() so that it returns void
rather than an error.  Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF.  so_pcb is now entirely owned and
managed by the protocol code.  Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit.  In their current state they may leak
memory or panic.

MFC after:	3 months
2006-04-01 15:42:02 +00:00
Robert Watson
ac45e92ff2 Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful.  Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit.  This will be corrected shortly in followup
commits to these components.

MFC after:      3 months
2006-04-01 15:15:05 +00:00
Robert Watson
6882aa2c03 Define two new inpcb flags in the inp_vflag field, which for whatever
reason, seems to be where new flags are getting defined:

INP_DROPPED - The protocol has terminated this connection and the socket
              is not reusable: when the socket code enters the protocol,
              an error is immediately returned.  This will substitute for
              NULLing the so_pcb socket field, helping to implement the
              invariant that all valid sockets have valid pcb's in TCP.

INP_SOCKREF - The protocol has become the owner of the socket reference,
              and will need to free it when freeing the pcb, which will
              be used when a TCP socket is closed but still has queued
              data.

MFC after:	1 month
2006-03-26 11:30:31 +00:00
Robert Watson
a07b8fd178 Minor style tweak: tab after #define, not space.
MFC after:	1 month
2006-03-26 11:26:12 +00:00
Robert Watson
1c53f80637 Explicitly assert socket pointer is non-NULL in tcp_input() so as to
provide better debugging information.

Prefer explicit comparison to NULL for tcpcb pointers rather than
treating them as booleans.

MFC after:	1 month
2006-03-26 01:33:41 +00:00
Gleb Smirnoff
0fa0801895 o Introduce carp_multicast_cleanup(), which removes and frees
multicast addresses from carp interface. [1]
o Rewrite carpdetach(), so that it does the following things: [1]
  - Stops callouts.
  - Decrements carp_suppress_preempt, if needed.
  - Downs interface and sets CARP state to INIT.
  - Calls carp_multicast_cleanup().
  - Detaches softc from carp_if and if we are the last frees
    the carp_if.
o Use new carpdetach() in carp_clone_destroy().
o In carp_ifdetach() acquire the carp_if lock and cleanup all
  interfaces hanging on carp_if. [1]
o Make carp_ifdetach() static and use EVENT(9) to call it
  from if_detach(). [2]
o In carp_setrun() exit if the softc doesn't have a valid pointer
  to parent. [1]

Obtained from:	OpenBSD [1]
Submitted by:	Dan Lukes <dan obluda.cz> [2]
PR:		kern/82908 [2]
2006-03-21 14:29:48 +00:00
Giorgos Keramidas
f92e18f4fc Add descriptions for the sysctls:
net.inet.icmp.drop_redirect
    net.inet.icmp.log_redirect
    net.inet.icmp.icmplim
    net.inet.icmp.icmplim_output

Approved & text by:	andre
2006-03-20 21:44:12 +00:00
David Malone
fcd1001c63 Make net.inet.ip.portrange.reservedhigh and
net.inet.ip.portrange.reservedlow apply to IPv6 aswell as IPv4.

We could have made new sysctls for IPv6, but that potentially makes
things complicated for mapped addresses. This seems like the least
confusing option and least likely to cause obscure problems in the
future.

This change makes the mac_portacl module useful with IPv6 apps.

Reviewed by:	ume
MFC after:	1 month
2006-03-19 11:48:48 +00:00
Robert Watson
92c07a345e Change soabort() from returning int to returning void, since all
consumers ignore the return value, soabort() is required to succeed,
and protocols produce errors here to report multiple freeing of the
pcb, which we hope to eliminate.
2006-03-16 07:03:14 +00:00
Andrew Thompson
7f2d8767e0 Further refine the bridge hack in the arp code. Only do the special arp
handling for interfaces which are actually in the bridge group, ignore all
others.

MFC after:	3 days
2006-03-07 21:40:44 +00:00
Gleb Smirnoff
e2779391fa - Do not leak read lock in IP_FW_TABLE_GETSIZE case of ipfw_ctl().
- Acquire read (not write) lock in case of IP_FW_TABLE_LIST.

In collaboration with:	ru
2006-03-03 12:10:59 +00:00
Andre Oppermann
464fcfbc5c Rework TCP window scaling (RFC1323) to properly scale the send window
right from the beginning and partly clean up the differences in handling
between SYN_SENT and SYN_RCVD (syncache).

Further changes to this code to come.  This is a first incremental step
to a general overhaul and streamlining of the TCP code.

PR:		kern/15095
PR:		kern/92690 (partly)
Reviewed by:	qingli (and tested with ANVL)
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-02-28 23:05:59 +00:00
Qing Li
4b8e98d632 This patch fixes the problem where the current TCP code can not handle
simultaneous open. Both the bug and the patch were verified using the
ANVL test suite.

PR:		kern/74935
Submitted by:	qingli (before I became committer)
Reviewed by:	andre
MFC after:	5 days
2006-02-23 21:14:34 +00:00
Hajimu UMEMOTO
9bf40914d2 Obey opt_inet6.h in kernel build directory.
Reported by:	Peter Losher <plosher-keyword-freebsd.a36e57__at__plosh.net>
MFC after:	3 days
2006-02-20 12:30:32 +00:00
Andre Oppermann
8e8aab7aec Remove unneeded includes and provide more accurate description
to others.

Submitted by:	garys
PR:		kern/86437
2006-02-18 17:05:00 +00:00
Andre Oppermann
da3482e0f6 Add missing TH_PUSH to the TH_FLAGS enumeration.
Submitted by:	Andre Albsmeier <Andre.Albsmeier-at-siemens.com>
PR:		kern/85203
2006-02-18 16:50:08 +00:00
Andre Oppermann
eaf80179e2 Have TCP Inflight disable itself if the RTT is below a certain
threshold.  Inflight doesn't make sense on a LAN as it has
trouble figuring out the maximal bandwidth because of the coarse
tick granularity.

The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold
in milliseconds below which inflight will disengage.  It defaults
to 10ms.

Tested by:	Joao Barros <joao.barros-at-gmail.com>,
		Rich Murphey <rich-at-whiteoaklabs.com>
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-02-16 19:38:07 +00:00
Andre Oppermann
cf744713e8 In in_pcbconnect_setup() reduce code duplication and use ip_rtaddr()
to find the outgoing interface for this connection.

Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	2 weeks
2006-02-16 15:45:28 +00:00
Andre Oppermann
a4684d742d Make sysctl_msec_to_ticks(SYSCTL_HANDLER_ARGS) generally available instead
of being private to tcp_timer.c.

Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-02-16 15:40:36 +00:00
Ruslan Ermilov
ea9dce1461 When sending a packet from dummynet, indicate that we're forwarding
it so that ip_id etc. don't get overwritten.  This fixes forwarding
of fragmented IP packets through a dummynet pipe -- fragments came
out with modified and different(!) ip_id's, making it impossible to
reassemble a datagram at the receiver side.

Submitted by:	Alexander Karptsov (reworked by me)
MFC after:	3 days
2006-02-14 06:36:39 +00:00
Qing Li
eee9df08bd Set the M_ZERO flag when calling uma_zalloc() to allocate a syncache entry.
Reviewed by:	andre, glebius
MFC after:	3 days
2006-02-09 21:29:02 +00:00
Qing Li
c1fd993af9 Redo the previous fix by setting the UMA_ZONE_ZINIT bit in the syncache
zone, eliminating the need to call bzero() after each syncache entry
allocation.

Suggested by:	glebius
Reviewed by:	andre
MFC after:	3 days
2006-02-08 23:32:57 +00:00
Qing Li
737b12e98f Fixes a crash due to the memory of the newly allocated syncache entry
in syncache_lookup() is not cleared and may lead to an arbitrary and
bogus rtentry pointer which later gets free'd.

Reviewed by: andre
MFC after: 3 days
2006-02-07 19:59:46 +00:00
Oleg Bulyzhin
6edb555dbc Fix five years old bug in ip_reass(): if we are using 'full' (i.e. including
pseudo header) hardware rx checksum offloading ip_reass() fails to calculate
TCP/UDP checksum for reassembled packet correctly.  This also should fix
recent 'NFS over UDP over bge' issue exposed by if_bge.c rev. 1.123

Reviewed by:	sam (earlier version), bde
Approved by:	glebius (mentor)
MFC after:	2 weeks
2006-02-07 11:48:10 +00:00
Hajimu UMEMOTO
d5e8a67ee9 Never select the PCB that has INP_IPV6 flag and is bound to :: if
we have another PCB which is bound to 0.0.0.0.  If a PCB has the
INP_IPV6 flag, then we set its cost higher than IPv4 only PCBs.

Submitted by:	Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from:	KAME
MFC after:	1 week
2006-02-04 07:59:17 +00:00
Gleb Smirnoff
a7908db153 Dropping the lock in the transmit_event() is not safe, because we
store some pipe pointers on stack. If user reconfigures dummynet
in the interlock gap, we can work with freed pipes after relock.

To fix this, we decided not to send packets in transmit_event(),
but fill a queue. At the end of dummynet() and dummynet_io(),
after the lock is dropped, if there is something in the queue
we run dummynet_send() to process the queue.

In collaboration with:	ru
2006-02-03 11:38:19 +00:00
Gleb Smirnoff
ce62866023 Axe unused function. 2006-02-03 10:42:28 +00:00
Christian S.J. Peron
f5cdbcf14c Use PFIL_HOOKED macros in if_bridge and pass the right argument to
rw_assert. This un-breaks the build.

Submitted by:	Kostik Belousov
Pointy hat to:	csjp
2006-02-02 16:41:20 +00:00
Christian S.J. Peron
604afec496 Somewhat re-factor the read/write locking mechanism associated with the packet
filtering mechanisms to use the new rwlock(9) locking API:

- Drop the variables stored in the phil_head structure which were specific to
  conditions and the home rolled read/write locking mechanism.
- Drop some includes which were used for condition variables
- Drop the inline functions, and convert them to macros. Also, move these
  macros into pfil.h
- Move pfil list locking macros intp phil.h as well
- Rename ph_busy_count to ph_nhooks. This variable will represent the number
  of IN/OUT hooks registered with the pfil head structure
- Define PFIL_HOOKED macro which evaluates to true if there are any
  hooks to be ran by pfil_run_hooks
- In the IP/IP6 stacks, change the ph_busy_count comparison to use the new
  PFIL_HOOKED macro.
- Drop optimization in pfil_run_hooks which checks to see if there are any
  hooks to be ran, and returns if not. This check is already performed by the
  IP stacks when they call:

        if (!PFIL_HOOKED(ph))
                goto skip_hooks;

- Drop in assertion which makes sure that the number of hooks never drops
  below 0 for good measure. This in theory should never happen, and if it
  does than there are problems somewhere
- Drop special logic around PFIL_WAITOK because rw_wlock(9) does not sleep
- Drop variables which support home rolled read/write locking mechanism from
  the IPFW firewall chain structure.
- Swap out the read/write firewall chain lock internal to use the rwlock(9)
  API instead of our home rolled version
- Convert the inlined functions to macros

Reviewed by:	mlaier, andre, glebius
Thanks to:	jhb for the new locking API
2006-02-02 03:13:16 +00:00
Andre Oppermann
1dfcf0d2a3 Move the IPSEC related code blocks to their own file to unclutter
and signifincantly improve the readability of ip_input() and
ip_output() again.

The resulting IPSEC hooks in ip_input() and ip_output() may be
used later on for making IPSEC loadable.

This move is mostly mechanical and should preserve current IPSEC
behaviour as-is.  Nothing shall prevent improvements in the way
IPSEC interacts with the IPv4 stack.

Discussed with:	bz, gnn, rwatson; (earlier version)
2006-02-01 13:55:03 +00:00
Ruslan Ermilov
e46c3da737 Brain-o (use standard int types now). 2006-02-01 06:15:37 +00:00
Ruslan Ermilov
bc7eeed4c9 Fix multicast routing on 64-bit platforms.
Tested on:	amd64
MFC after:	3 days
2006-01-31 22:39:35 +00:00
Andrew Thompson
235073f4c0 Now that the bridge also processes Ethernet frames as itself, two arp replies
will be sent if there is an address on the bridge. Exclude the bridge from the
special arp handling.

This has been tested with all combinations of addresses on the bridge and members.

Pointed out by:	Michal Mertl
2006-01-31 21:29:41 +00:00
Gleb Smirnoff
25af0bb50e Add some initial locking to gif(4). It doesn't covers the whole driver,
however IPv4-in-IPv4 tunnels are now stable on SMP. Details:

- Add per-softc mutex.
- Hold the mutex on output.

The main problem was the rtentry, placed in softc. It could be
freed by ip_output(). Meanwhile, another thread being in
in_gif_output() can read and write this rtentry.

Reported by:	many
Tested by:	Alexander Shiryaev <aixp mail.ru>
2006-01-30 08:39:09 +00:00
Andrew Thompson
74948aa6f3 Back out of r1.148, it causes two arp replies to be sent with different mac
addresses. One for the bridged interface with the IP address assigned but then
another with the mac for the bridge itself.
2006-01-29 23:21:01 +00:00
Andre Oppermann
ab48768b20 When doing IP forwarding with [FAST_]IPSEC compiled into the kernel
ip_forward() would report back a zero MTU in ICMP needfrag messages
because on a IPSEC SP lookup failure no MTU got computed.

Fix this by changing the logic to compute a new MTU in any case if
IPSEC didn't do it.

Change MTU computation logic to use egress interface MTU if available
or the next smaller MTU compared to the current packet size instead
of falling back to a very small fixed MTU.

Fix associated comment.

PR:		kern/91412
MFC after:	3 days
2006-01-24 17:57:19 +00:00
Andre Oppermann
1dec73a153 In ip_mdq() compute the TV_DELTA the correct way around.
PR:		kern/91851
Submitted by:	SAKAI Hiroaki <sakai.hiroaki-at-jp.fujitsu.com>
MFC after:	3 days
2006-01-24 17:09:12 +00:00
Andre Oppermann
31343a3da2 In in_control() remove the temporary in_ifaddr structure from the
ia_hash only if it actually is an AF_INET address.  All other places
test for sa_family == AF_INET but this one.

PR:		kern/92091
Submitted by:	Seth Kingsley <sethk-at-meowfishies.com>
MFC after:	3 days
2006-01-24 16:19:31 +00:00
Oleg Bulyzhin
44a515834f Fix minor bug in uRPF:
If net.link.ether.inet.useloopback=1 and we send broadcast packet using our
  own source ip address it may be rejected by uRPF rules.

  Same bug was fixed for IPv6 in rev. 1.115 by suz.

PR:		kern/76971
Approved by:	glebius (mentor)
MFC after:	3 days
2006-01-24 13:38:06 +00:00
Gleb Smirnoff
0b4ae859ac Implement 'ipfw fwd laddr,port' feature for UDP. According to ipfw(8)
it should work, however it never did. People expect it to work.

PR:		kern/90834
2006-01-24 09:08:54 +00:00
Gleb Smirnoff
1c0b0f523d Fix build. 2006-01-23 20:10:49 +00:00
Andre Oppermann
06003a1e7c Simplify ip_next_mtu() and make its logic more easy to see while
silencing code analysis tools.

Found by:	Coverity Prevent(tm)
Coverity ID:	CID341
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-01-23 17:06:32 +00:00
Robert Watson
136d4f1cf2 Convert remaining functions to ANSI C function declarations; remove
'register' where present.

MFC after:	1 week
2006-01-22 01:16:25 +00:00
Robert Watson
d0c75d36b9 Convert last remaining function in ip_gre.c to ANSI C function
declaration.

MFC after:	1 week
2006-01-22 01:08:30 +00:00
Bjoern A. Zeeb
3f2e28fe9f Fix stack corruptions on amd64.
Vararg functions have a different calling convention than regular
functions on amd64. Casting a varag function to a regular one to
match the function pointer declaration will hide the varargs from
the caller and we will end up with an incorrectly setup stack.

Entirely remove the varargs from these functions and change the
functions to match the declaration of the function pointers.
Remove the now unnecessary casts.

Lots of explanations and help from:     peter
Reviewed by:                            peter
PR:                                     amd64/89261
MFC after:                              6 days
2006-01-21 10:44:34 +00:00
Christian S.J. Peron
9c57c204be - Change the return type for init_tables from void to int so we can propagate
errors from rn_inithead back to the ipfw initialization function.
- Check return value of rn_inithead for failure, if table allocation has
  failed for any reason, free up any tables we have created and return ENOMEM
- In ipfw_init check the return value of init_tables and free up any mutexes or
  UMA zones which may have been created.
- Assert that the supplied table is not NULL before attempting to dereference.

This fixes panics which were a result of invalid memory accesses due to failed
table allocation. This is an issue mainly because the R_Zalloc function is a
malloc(M_NOWAIT) wrapper, thus making it possible for allocations to fail.

Found by:	Coverity Prevent (tm)
Coverity ID:	CID79
MFC after:	1 week
2006-01-20 05:35:27 +00:00
Christian S.J. Peron
e9186cb94b Destroy the dynamic rule zone in the event that we fail to insert the
initial default rule.

MFC after:	1 week
2006-01-20 03:21:25 +00:00
Andre Oppermann
0270746230 Do not derefence the ip header pointer in the IPv6 case.
This fixes a bug in the previous commit.

Found by:	Coverity Prevent(tm)
Coverity ID:	CID253
Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-01-18 18:59:30 +00:00
Andre Oppermann
8f8d29f686 In in_delayed_cksum() we can't perform a m_pullup() as it may
change the mbuf pointer and we don't have any way of passing
it back to the callers.  Instead just fail silently without
updating the checksum but leaving the mbuf+chain intact.

A search in our GNATS database did not turn up any match for
the existing warning message when this case is encountered.

Found by:	Coverity Prevent(tm)
Coverity ID:	CID779
Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-01-18 18:49:16 +00:00
Andre Oppermann
79eb490467 In syncache_expand() insert a proper syncache_free() to fix a case
that currently can't be triggered.  But better be safe than sorry
later on.  Additionally it properly silences Coverity Prevent for
future tests.

Found by:	Coverity Prevent(tm)
Coverity ID:	CID802
Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-01-18 18:25:03 +00:00
Andre Oppermann
39550088cf Prevent dereferencing a NULL route pointer when trying to update the
route MTU.

This bug is very difficult to reach and not remotely exploitable.

Found by:	Coverity Prevent(tm)
Coverity ID:	CID162
Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-01-18 15:05:05 +00:00
Andre Oppermann
5d691e6da8 Return mbuf pointer or NULL from ip_fastforward() as the mbuf pointer
may have changed by m_pullup() during fastforward processing.

While this is a bug it is actually never triggered in real world
situations and it is not remotely exploitable.

Found by:	Coverity Prevent(tm)
Coverity ID:	CID780
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-01-18 14:24:39 +00:00
Robert Watson
d248c7d7f5 Modify the IP fragment reassembly code so that it uses a new UMA zone,
ipq_zone, to allocate fragment headers from, rather than using cast mbuf
storage.  This was one of the few remaining uses of mbuf storage for
local data structures that relied on dtom().  Implement the resource
limit on ipq's using UMA zone limits, but preserve current sysctl
semantics using a sysctl proc.

MFC after:	3 weeks
2006-01-15 18:58:21 +00:00
Robert Watson
dfa60d9354 Staticize ipqlock, since it is local to ip_input.c.
MFC after:	3 days
2006-01-15 17:05:48 +00:00
George V. Neville-Neil
34f83c52e7 Check the correct TTL in both the IPv6 and IPv4 cases.
Submitted by:	glebius
Reviewed by:	gnn, bz
Found with:     Coverity Prevent(tm)
2006-01-14 16:39:31 +00:00
Gleb Smirnoff
ecedca7441 UMA can return NULL not only in case when our zone is full, but
also in case of generic memory shortage. In the latter case we may
not find an old entry.

Found with:	Coverity Prevent(tm)
2006-01-14 13:04:08 +00:00
Robert Watson
e5bc0aa3c3 Remove dead code: 'opts' is not used in udp_append(), only in udp_input(),
so no need to assign it to NULL or conditionally free it.

Found with:	Coverity Prevent(tm)
MFC after:	3 days
2006-01-14 11:18:32 +00:00
Andrew Thompson
54c427e0e2 Include the bridge interface itself in the special arp handling.
PR:		90973
MFC after:	1 week
2006-01-12 21:05:30 +00:00
Colin Percival
9ed97bee65 Correct insecure temporary file usage in texindex. [06:01]
Correct insecure temporary file usage in ee. [06:02]
Correct a race condition when setting file permissions, sanitize file
names by default, and fix a buffer overflow when handling files
larger than 4GB in cpio. [06:03]
Fix an error in the handling of IP fragments in ipfw which can cause
a kernel panic. [06:04]

Security:	FreeBSD-SA-06:01.texindex
Security:	FreeBSD-SA-06:02.ee
Security:	FreeBSD-SA-06:03.cpio
Security:	FreeBSD-SA-06:04.ipfw
2006-01-11 08:02:16 +00:00
Andrew Thompson
73ff045c57 Add RFC 3378 EtherIP support. This change makes it possible to add gif
interfaces to bridges, which will then send and receive IP protocol 97 packets.
Packets are Ethernet frames with an EtherIP header prepended.

Obtained from:	NetBSD
MFC after:	2 weeks
2005-12-21 21:29:45 +00:00
Xin LI
92e0a4a2a4 Use consistent indent character as other IPPROTO_* lines did. 2005-12-20 09:38:03 +00:00
George V. Neville-Neil
496f9fc522 Add protocol number for SCTP.
Submitted by:	Randall Stewart rrs at cisco.com
MFC after:	1 week
2005-12-20 09:24:04 +00:00
Gleb Smirnoff
3939390679 Add a knob to suppress logging of attempts to modify
permanent ARP entries.

Submitted by:	Andrew Alcheyev <buddy telenet.ru>
2005-12-18 19:11:56 +00:00
Ed Maste
bd2b686fe8 Add descriptions for sysctl -d.
Approved by:	glebius
Silence from:	rwatson (mentor)
2005-12-16 15:01:44 +00:00
Gleb Smirnoff
6e02dbdfa3 Cleanup __FreeBSD_version. 2005-12-16 13:10:32 +00:00
John Baldwin
636a309adb Use %t (ptrdiff_t modifier) to print a couple of pointer differences rather
than casting them to int.
2005-12-15 21:57:32 +00:00
Maxime Henrion
e59898ff36 Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() to
match the type of the variable they are exporting.

Spotted by:	Thomas Hurst <tom@hur.st>
MFC after:	3 days
2005-12-14 22:27:48 +00:00
Gleb Smirnoff
40b1ae9e00 Add a new feature for optimizining ipfw rulesets - substitution of the
action argument with the value obtained from table lookup. The feature
is now applicable only to "pipe", "queue", "divert", "tee", "netgraph"
and "ngtee" rules.

An example usage:

  ipfw pipe 1000 config bw 1000Kbyte/s
  ipfw pipe 4000 config bw 4000Kbyte/s
  ipfw table 1 add x.x.x.x 1000
  ipfw table 1 add x.x.x.y 4000
  ipfw pipe tablearg ip from table(1) to any

In the example above the rule will throw different packets to different pipes.

TODO:
  - Support "skipto" action, but without searching all rules.
  - Improve parser, so that it warns about bad rules. These are:
    - "tablearg" argument to action, but no "table" in the rule. All
      traffic will be blocked.
    - "tablearg" argument to action, but "table" searches for entry with
      a specific value. All traffic will be blocked.
    - "tablearg" argument to action, and two "table" looks - for src and
      for dst. The last lookup will match.
2005-12-13 12:16:03 +00:00
Gleb Smirnoff
bbce982bd5 When we drop packet due to no space in output interface output queue, also
increase the ifp->if_snd.ifq_drops.

PR:		72440
Submitted by:	ikob
2005-12-06 11:16:11 +00:00
Gleb Smirnoff
95d1f36f82 Optimize parallel processing of ipfw(4) rulesets eliminating the locking
of the radix lookup tables. Since several rnh_lookup() can run in
parallel on the same table, we can piggyback on the shared locking
provided by ipfw(4).
  However, the single entry cache in the ip_fw_table can't be used lockless,
so it is removed. This pessimizes two cases: processing of bursts of similar
packets and matching one packet against the same table several times during
one ipfw_chk() lookup. To optimize the processing of similar packet bursts
administrator should use stateful firewall. To optimize the second problem
a solution will be provided soon.

Details:
  o Since we piggyback on the ipfw(4) locking, and the latter is per-chain,
    the tables are moved from the global declaration to the
    struct ip_fw_chain.
  o The struct ip_fw_table is shrunk to one entry and thus vanished.
  o All table manipulating functions are extended to accept the struct
    ip_fw_chain * argument.
  o All table modifing functions use IPFW_WLOCK_ASSERT().
2005-12-06 10:45:49 +00:00
Ruslan Ermilov
f4e9888107 Fix -Wundef. 2005-12-04 02:12:43 +00:00
Hajimu UMEMOTO
8846bbf3ce obey opt_inet6.h and opt_ipsec.h in kernel build directory.
Requested by:	hrs
2005-11-29 17:56:11 +00:00
Gleb Smirnoff
b090e4ce1f Garbage-collect now unused struct _ipfw_insn_pipe and flush_pipe_ptrs(),
thus removing a few XXXes.
  Document the ABI breakage in UPDATING.
2005-11-29 08:59:41 +00:00
Gleb Smirnoff
99b41b34fb First step in removing welding between ipfw(4) and dummynet.
o Do not use ipfw_insn_pipe->pipe_ptr in locate_flowset(). The
  _ipfw_insn_pipe isn't touched by this commit to preserve ABI
  compatibility.
o To optimize the lookup of the pipe/flowset in locate_flowset()
  introduce hashes for pipes and queues:
  - To preserve ABI compatibility utilize the place of global list
    pointer for SLIST_ENTRY.
  - Introduce locate_flowset(queue nr) and locate_pipe(pipe nr).
o Rework all the dummynet code to deal with the hashes, not global
  lists. Also did some style(9) changes in the code blocks that were
  touched by this sweep:
  - Be conservative about flowset and pipe variable names on stack,
    use "fs" and "pipe" everywhere.
  - Cleanup whitespaces.
  - Sort variables.
  - Give variables more meaningful names.
  - Uppercase and dots in comments.
  - ENOMEM when malloc(9) failed.
2005-11-29 00:11:01 +00:00
Ruslan Ermilov
fc1eaecf4a Fix prototype. 2005-11-24 14:17:35 +00:00
Paul Saab
d0a14f55c3 Fix for a bug that causes SACK scoreboard corruption when the limit
on holes per connection is reached.

Reported by:	Patrik Roos
Submitted by:	Mohan Srinivasan
Reviewed by:	Raja Mukerji, Noritoshi Demizu
2005-11-21 19:22:10 +00:00
Andre Oppermann
22f2c8b5db Remove 'ipprintfs' which were protected under DIAGNOSTIC. It doesn't
have any know to enable it from userland and could only be enabled by
either setting it to 1 at compile time or through the kernel debugger.

In the future it may be brought back as KTR tracing points.

Discussed with:	rwatson
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-19 17:04:52 +00:00
Andre Oppermann
c444cdded2 Move MAX_IPOPTLEN and struct ipoption back into ip_var.h as
userland programs depend on it.

Pointed out by:	le
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-19 14:01:32 +00:00
Andre Oppermann
ef39adf007 Consolidate all IP Options handling functions into ip_options.[ch] and
include ip_options.h into all files making use of IP Options functions.

From ip_input.c rev 1.306:
  ip_dooptions(struct mbuf *m, int pass)
  save_rte(m, option, dst)
  ip_srcroute(m0)
  ip_stripoptions(m, mopt)

From ip_output.c rev 1.249:
  ip_insertoptions(m, opt, phlen)
  ip_optcopy(ip, jp)
  ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)

No functional changes in this commit.

Discussed with:	rwatson
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-18 20:12:40 +00:00
Andre Oppermann
147f74d176 Purge layer specific mbuf flags on layer crossings to avoid confusing
upper or lower layers.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-18 16:23:26 +00:00
Andre Oppermann
e86ebebc52 Rework icmp_error() to deal with truncated IP packets from
ip_forward() when doing extended quoting in error messages.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-18 14:48:42 +00:00
Andre Oppermann
780b2f698c In ip_forward() copy as much into the temporary error mbuf as we
have free space in it.  Allocate correct mbuf from the beginning.
This allows icmp_error() to quote the entire TCP header in error
messages.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-18 14:44:48 +00:00
Gleb Smirnoff
218837618a MFOpenBSD 1.62:
Prevent backup CARP hosts from replying to arp requests, fixes strangeness
  with some layer-3 switches. From Bill Marquette.

Tested by:	Kazuaki Oda <kaakun highway.ne.jp>
2005-11-17 12:56:40 +00:00
Ruslan Ermilov
433aaf04cb Unbreak for !INET6 case. 2005-11-14 12:50:23 +00:00
Ruslan Ermilov
4a0d6638b3 - Store pointer to the link-level address right in "struct ifnet"
rather than in ifindex_table[]; all (except one) accesses are
  through ifp anyway.  IF_LLADDR() works faster, and all (except
  one) ifaddr_byindex() users were converted to use ifp->if_addr.

- Stop storing a (pointer to) Ethernet address in "struct arpcom",
  and drop the IFP2ENADDR() macro; all users have been converted
  to use IF_LLADDR() instead.
2005-11-11 16:04:59 +00:00
SUZUKI Shinsuke
d9a989231e fixed a bug that uRPF does not work properly for an IPv6 packet bound for the sending machine itself (this is a bug introduced due to a change in ip6_input.c:Rev.1.83)
Pointed out by: Sean McNeil and J.R.Oldroyd
MFC after: 3 days
2005-11-10 22:10:39 +00:00
Ruslan Ermilov
303989a2f3 Use sparse initializers for "struct domain" and "struct protosw",
so they are easier to follow for the human being.
2005-11-09 13:29:16 +00:00
Andrew Thompson
4e7e0183e1 Move the cloned interface list management in to if_clone. For some drivers the
softc lists and associated mutex are now unused so these have been removed.

Calling if_clone_detach() will now destroy all the cloned interfaces for the
driver and in most cases is all thats needed to unload.

Idea by:	brooks
Reviewed by:	brooks
2005-11-08 20:08:34 +00:00
Gleb Smirnoff
e1ff74c58d Rework ARP retransmission algorythm so that ARP requests are
retransmitted without suppression, while there is demand for
such ARP entry. As before, retransmission is rate limited to
one packet per second. Details:
  - Remove net.link.ether.inet.host_down_time
  - Do not set/clear RTF_REJECT flag on route, to
    avoid rt_check() returning error. We will generate error
    ourselves.
  - Return EWOULDBLOCK on first arp_maxtries failed
    requests , and return EHOSTDOWN/EHOSTUNREACH
    on further requests.
  - Retransmit ARP request always, independently from return
    code. Ratelimit to 1 pps.
2005-11-08 12:05:57 +00:00
Andre Oppermann
34333b16cd Retire MT_HEADER mbuf type and change its users to use MT_DATA.
Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it.  It only adds a layer of confusion.  The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit.  For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-02 13:46:32 +00:00
Robert Watson
5bb84bc84b Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
Robert Watson
d374e81efd Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN.  This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit.  This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface.  This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.
2005-10-30 19:44:40 +00:00
Gleb Smirnoff
f3d30eb20d First fill in structure with valid values, and only then attach it
to the global list.

Reviewed by:	rwatson
2005-10-28 20:29:42 +00:00
Yaroslav Tykhiy
9f4abef9a3 Since carp(4) interfaces presently are kinda fake yet possess
IP addresses, mark them with LOOPBACK so that routing daemons
take them easy for link-state routing protocols.

Reviewed by:	glebius
2005-10-26 05:57:35 +00:00
Max Laier
1e4b360655 Fix build after in6_joingroup change. It remains unclear if DAD breaks CARP
or not.
2005-10-22 14:54:02 +00:00
Gleb Smirnoff
bfb26eecfb In in_addprefix() compare not only route addresses, but their masks,
too. This fixes problem when connected prefixes overlap.

Obtained from:	OpenBSD (rev. 1.40 by claudio);
		[ I came to this fix myself, and then found out that
		  OpenBSD had already fixed it the same way.]
2005-10-22 14:50:27 +00:00
SUZUKI Shinsuke
743eee666f sync with KAME regarding NDP
- introduced fine-grain-timer to manage ND-caches and IPv6 Multicast-Listeners
- supports Router-Preference <draft-ietf-ipv6-router-selection-07.txt>
- better prefix lifetime management
- more spec-comformant DAD advertisement
- updated RFC/internet-draft revisions

Obtained from: KAME
Reviewed by: ume, gnn
MFC after: 2 month
2005-10-21 16:23:01 +00:00
Robert Watson
a65e12b09d Convert if (tp->t_state == TCPS_LISTEN) panic() into a KASSERT.
MFC after:	2 weeks
2005-10-19 09:37:52 +00:00
Andrew Thompson
febd0759f3 Change the reference counting to count the number of cloned interfaces for each
cloner. This ensures that ifc->ifc_units is not prematurely freed in
if_clone_detach() before the clones are destroyed, resulting in memory modified
after free. This could be triggered with if_vlan.

Assert that all cloners have been destroyed when freeing the memory.

Change all simple cloners to destroy their clones with ifc_simple_destroy() on
module unload so the reference count is properly updated. This also cleans up
the interface destroy routines and allows future optimisation.

Discussed with:	brooks, pjd, -current
Reviewed by:	brooks
2005-10-12 19:52:16 +00:00
Maxim Konovalov
d46ff6bd1e o INP_ONESBCAST is inpcb.inp_vflag flag not inp_flags. The confusion
with IP_PORTRANGE_HIGH leads to the incorrect checksum calculation.

PR:		kern/87306
Submitted by:	Rickard Lind
Reviewed by:	bms
MFC after:	2 weeks
2005-10-12 18:13:25 +00:00
Philip Paeps
7691747aac Unbreak the net.inet6.tcp6.getcred sysctl.
This makes inetd/auth work again in IPv6 setups.

Pointy hat to:	ume/KAME
2005-10-12 09:24:18 +00:00
Andrew Thompson
f69453ca8b When bridging is enabled and an ARP request is recieved on a member interface,
the arp code will search all local interfaces for a match. This triggers a
kernel log if the bridge has been assigned an address.

arp: ac🇩🇪48:18:83:3d is using my IP address 192.168.0.142!

bridge0: flags=8041<UP,RUNNING,MULTICAST> mtu 1500
        inet 192.168.0.142 netmask 0xffffff00
        ether ac🇩🇪48:18:83:3d

Silence this warning for 6.0 to stop unnecessary bug reports, the code will need
to be reworked.

Approved by:	mlaier (mentor)
MFC after:	3 days
2005-10-04 19:50:02 +00:00
Andre Oppermann
1fd7af262a Correct brainfart in SO_BINTIME test.
Pointed out by:	nate
Pointy hat to:	andre
2005-10-04 18:19:21 +00:00
Andre Oppermann
e5fbf72cd8 Make SO_BINTIME timestamps available on raw_ip sockets.
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-10-04 18:07:11 +00:00
Robert Watson
c48b03fb69 Unlock Giant symmetrically with respect to lock acquire order as that's
generally nicer.

Spotted by:	johan
MFC after:	1 week
2005-10-03 11:34:29 +00:00
Robert Watson
1fa9efeffb Acquire Giant conditionally in in_addmulti() and in_delmulti() based on
whether the interface being accessed is IFF_NEEDSGIANT or not.  This
avoids lock order reversals when calling into the interface ioctl
handler, which could potentially lead to deadlock.

The long term solution is to eliminate non-MPSAFE network drivers.

Discussed with:	jhb
MFC after:	1 week
2005-10-03 11:09:39 +00:00
Maxim Konovalov
ac827533df o Teach sysctl_drop() how to deal with the sockets in TIME_WAIT state.
This is a special case because tcp_twstart() destroys a tcp control
block via tcp_discardcb() so we cannot call tcp_drop(struct *tcpcb) on
such connections.  Use tcp_twclose() instead.

MFC after:	5 days
2005-10-02 08:43:57 +00:00
Max Laier
b6de9e91bd Remove bridge(4) from the tree. if_bridge(4) is a full functional
replacement and has additional features which make it superior.

Discussed on:	-arch
Reviewed by:	thompsa
X-MFC-after:	never (RELENG_6 as transition period)
2005-09-27 18:10:43 +00:00
Andre Oppermann
b2828ad291 Implement IP_DONTFRAG IP socket option enabling the Don't Fragment
flag on IP packets.  Currently this option is only repected on udp
and raw ip sockets.  On tcp sockets the DF flag is controlled by the
path MTU discovery option.

Sending a packet larger than the MTU size of the egress interface
returns an EMSGSIZE error.

Discussed with:	rwatson
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-09-26 20:25:16 +00:00
Andre Oppermann
fe53256dc2 Use monotonic 'time_uptime' instead of 'time_second' as timebase
for rt->rt_rmx.rmx_expire.
2005-09-19 22:54:55 +00:00
Andre Oppermann
e6b9152d20 Use monotonic 'time_uptime' instead of 'time_second' as timebase
for timeouts.
2005-09-19 22:31:45 +00:00
Robert Watson
b1c53bc9c0 Take a first cut at cleaning up ifnet removal and multicast socket
panics, which occur when stale ifnet pointers are left in struct
moptions hung off of inpcbs:

- Add in_ifdetach(), which matches in6_ifdetach(), and allows the
  protocol to perform early tear-down on the interface early in
  if_detach().

- Annotate that if_detach() needs careful consideration.

- Remove calls to in_pcbpurgeif0() in the handling of SIOCDIFADDR --
  this is not the place to detect interface removal!  This also
  removes what is basically a nasty (and now unnecessary) hack.

- Invoke in_pcbpurgeif0() from in_ifdetach(), in both raw and UDP
  IPv4 sockets.

It is now possible to run the msocket_ifnet_remove regression test
using HEAD without panicking.

MFC after:	3 days
2005-09-18 17:36:28 +00:00
Andre Oppermann
db1240661f Do not ignore all other TCP options (eg. timestamp, window scaling)
when responding to TCP SYN packets with TCP_MD5 enabled and set.

PR:		kern/82963
Submitted by:	<demizu at dd.iij4u.or.jp>
MFC after:	3 days
2005-09-14 15:06:22 +00:00
Bjoern A. Zeeb
75398603ad Fix panic when kernel compiled without INET6 by rejecting
IPv6 opcodes which are behind #if(n)def INET6 now.

PR:		kern/85826
MFC after:	3 days
2005-09-14 07:53:54 +00:00
Andre Oppermann
ffabe3dce8 In tcp_ctlinput() do not swap ip->ip_len a second time. It
has been done in icmp_input() already.

This fixes the ICMP_UNREACH_NEEDFRAG case where no MTU was
proposed in the ICMP reply.

PR:		kern/81813
Submitted by:	Vitezslav Novy <vita at fio.cz>
MFC after:	3 days
2005-09-10 07:43:29 +00:00
Gleb Smirnoff
a20e25385c - Do not hold route entry lock, when calling arprequest(). One such
call was introduced by me in 1.139, the other one was present before.
- Do all manipulations with rtentry and la before dropping the lock.
- Copy interface address from route into local variable before dropping
  the lock. Supply this copy as argument to arprequest()

LORs fixed:
		http://sources.zabbadoz.net/freebsd/lor/003.html
		http://sources.zabbadoz.net/freebsd/lor/037.html
		http://sources.zabbadoz.net/freebsd/lor/061.html
		http://sources.zabbadoz.net/freebsd/lor/062.html
		http://sources.zabbadoz.net/freebsd/lor/064.html
		http://sources.zabbadoz.net/freebsd/lor/068.html
		http://sources.zabbadoz.net/freebsd/lor/071.html
		http://sources.zabbadoz.net/freebsd/lor/074.html
		http://sources.zabbadoz.net/freebsd/lor/077.html
		http://sources.zabbadoz.net/freebsd/lor/093.html
		http://sources.zabbadoz.net/freebsd/lor/135.html
		http://sources.zabbadoz.net/freebsd/lor/140.html
		http://sources.zabbadoz.net/freebsd/lor/142.html
		http://sources.zabbadoz.net/freebsd/lor/145.html
		http://sources.zabbadoz.net/freebsd/lor/152.html
		http://sources.zabbadoz.net/freebsd/lor/158.html
2005-09-09 10:06:27 +00:00
Gleb Smirnoff
5d40d65b5a When a carp(4) interface is being destroyed and is in a promiscous mode,
first interface is detached from parent and then bpfdetach() is called.
If the interface was the last carp(4) interface attached to parent, then
the mutex on parent is destroyed. When bpfdetach() calls if_setflags()
we panic on destroyed mutex.

To prevent the above scenario, clear pointer to parent, when we detach
ourselves from parent.
2005-09-09 08:41:39 +00:00
Sam Leffler
245c31ccaf clear lock on error in O_LIMIT case of install_state
Submitted by:	Ted Unangst
MFC after:	3 days
2005-09-04 17:33:40 +00:00
Andre Oppermann
e0aec68255 Use the correct mbuf type for MGET(). 2005-08-30 16:35:27 +00:00
Gleb Smirnoff
e3ea67a077 Add newline to debuging printf.
PR:		kern/85271
Submitted by:	Simon Morgan
2005-08-26 15:27:18 +00:00
Gleb Smirnoff
360856f60e - Refuse hashsize of 0, since it is invalid.
- Use defined constant instead of 512.
2005-08-25 13:57:00 +00:00
Gleb Smirnoff
510b360fc0 When we have a published ARP entry for some IP address, do reply on
ARP requests only on the network where this IP address belong, to.

Before this change we did replied on all interfaces. This could
lead to an IP address conflict with host we are doing ARP proxy
for.

PR:		kern/75634
Reviewed by:	andre
2005-08-25 13:25:57 +00:00
Paul Saab
4d3b134633 Remove a KASSERT in the sack path that fails because of a interaction
between sack and a bug in the "bad retransmit recovery" logic. This is
a workaround, the underlying bug will be fixed later.

Submitted by:   Mohan Srinivasan, Noritoshi Demizu
2005-08-24 02:48:45 +00:00
Paul Saab
b24de0e665 Fix up the comment for MAX_SACK_BLKS.
Submitted by:	Noritoshi Demizu
2005-08-24 02:47:16 +00:00
Andre Oppermann
ef8fd90476 Remove unnecessary IPSEC includes.
MFC after:	2 weeks
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-08-23 14:42:40 +00:00
Andre Oppermann
23655387e9 o Fix a logic error when not doing mbuf cluster allocation.
o Change an old panic() to a clean function exit.

MFC after:	2 weeks
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-08-22 22:13:41 +00:00
Andre Oppermann
936cd18dad Add socketoption IP_MINTTL. May be used to set the minimum acceptable
TTL a packet must have when received on a socket.  All packets with a
lower TTL are silently dropped.  Works on already connected/connecting
and listening sockets for RAW/UDP/TCP.

This option is only really useful when set to 255 preventing packets
from outside the directly connected networks reaching local listeners
on sockets.

Allows userland implementation of 'The Generalized TTL Security Mechanism
(GTSM)' according to RFC3682.  Examples of such use include the Cisco IOS
BGP implementation command "neighbor ttl-security".

MFC after:	2 weeks
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-08-22 16:13:08 +00:00
Andre Oppermann
6b773dff30 Always quote the entire TCP header when responding and allocate an mbuf
cluster if needed.

Fixes the TCP issues raised in I-D draft-gont-icmp-payload-00.txt.

This aids in-the-wild debugging a lot and allows the receiver to do
more elaborate checks on the validity of the response.

MFC after:	2 weeks
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-08-22 14:12:18 +00:00
Andre Oppermann
d56ea155bd Handle pure layer 2 broad- and multicasts properly and simplify related
checks.

PR:		kern/85052
Submitted by:	Dmitrij Tejblum <tejblum at yandex-team.ru>
MFC after:	3 days
2005-08-22 12:06:26 +00:00
Andre Oppermann
bb10780f9f Commit correct version of the change and note the name of the new
sysctl: net.inet.icmp.quotelen and defaults to 8 bytes.

Pointy hat to:	andre
2005-08-21 15:18:00 +00:00
Andre Oppermann
e875dfb826 Add a sysctl to change to length of the quotation of the original
packet in an ICMP reply.  The minimum of 8 bytes is internally
enforced.  The maximum quotation is the remaining space in the
reply mbuf.

This option is added in response to the issues raised in I-D
draft-gont-icmp-payload-00.txt.

MFC after:	2 weeks
Spnsored by:	TCP/IP Optimizations Fundraise 2005
2005-08-21 15:09:07 +00:00
Andre Oppermann
a0866c8d4e Add an option to have ICMP replies to non-local packets generated with
the IP address the packet came through in.  This is useful for routers
to show in traceroutes the actual path a packet has taken instead of
the possibly different return path.

The new sysctl is named net.inet.icmp.reply_from_interface and defaults
to off.

MFC after:	2 weeks
2005-08-21 12:29:39 +00:00
Gleb Smirnoff
1ae954096e In order to support CARP interfaces kernel was taught to handle more
than one interface in one subnet. However, some userland apps rely on
the believe that this configuration is impossible.

Add a sysctl switch net.inet.ip.same_prefix_carp_only. If the switch
is on, then kernel will refuse to add an additional interface to
already connected subnet unless the interface is CARP. Default
value is off.

PR:			bin/82306
In collaboration with:	mlaier
2005-08-18 10:34:30 +00:00
Bjoern A. Zeeb
bd2e5495d1 Fix broken build of rev. 1.108 in case of no INET6 and IPFIREWALL
compiled into kernel.

Spotted and tested by:	Michal Mertl <mime at traveller.cz>
2005-08-14 18:20:33 +00:00
Bjoern A. Zeeb
9066356ba1 * Add dynamic sysctl for net.inet6.ip6.fw.
* Correct handling of IPv6 Extension Headers.
* Add unreach6 code.
* Add logging for IPv6.

Submitted by:	sysctl handling derived from patch from ume needed for ip6fw
Obtained from:	is_icmp6_query and send_reject6 derived from similar
		functions of netinet6,ip6fw
Reviewed by:	ume, gnn; silence on ipfw@
Test setup provided by: CK Software GmbH
MFC after:	6 days
2005-08-13 11:02:34 +00:00
Craig Rodrigues
eee9fe3078 Add NATM_LOCK() and NATM_UNLOCK() in places where npcb_add() and
npcb_free() are called, in order to eliminate witness panics.
This was overlooked in removal of GIANT from ATM.

Reviewed by: rwatson
2005-08-12 02:38:20 +00:00
Gleb Smirnoff
1ed7bf1e3b o Fix a race between three threads: output path,
incoming ARP packet and route request adding/removing
  ARP entries. The root of the problem is that
  struct llinfo_arp was accessed without any locks.
  To close race we will use locking provided by
  rtentry, that references this llinfo_arp:
  - Make arplookup() return a locked rtentry.
  - In arpresolve() hold the lock provided by
    rt_check()/arplookup() until the end of function,
    covering all accesses to the rtentry itself and
    llinfo_arp it refers to.
  - In in_arpinput() do not drop lock provided by
    arplookup() during first part of the function.
  - Simplify logic in the first part of in_arpinput(),
    removing one level of indentation.
  - In the second part of in_arpinput() hold rtentry
    lock while copying address.

o Fix a condition when route entry is destroyed, while
  another thread is contested on its lock:
  - When storing a pointer to rtentry in llinfo_arp list,
    always add a reference to this rtentry, to prevent
    rtentry being destroyed via RTM_DELETE request.
  - Remove this reference when removing entry from
    llinfo_arp list.

o Further cleanup of arptimer():
  - Inline arptfree() into arptimer().
  - Use official queue(3) way to pass LIST.
  - Hold rtentry lock while reading its structure.
  - Do not check that sdl_family is AF_LINK, but
    assert this.

Reviewed by:	sam
Stress test:	http://www.holm.cc/stress/log/cons141.html
Stress test:	http://people.freebsd.org/~pho/stress/log/cons144.html
2005-08-11 08:25:48 +00:00
David E. O'Brien
c11ba30c9a Remove public declarations of variables that were forgotten when they were
made static.
2005-08-10 07:10:02 +00:00
David E. O'Brien
31793d594b Match IPv6 and use a static struct pr_usrreqs nousrreqs. 2005-08-10 06:41:04 +00:00
Robert Watson
a2dc1f5021 Add helper function ip_findmoptions(), which accepts an inpcb, and attempts
to atomically return either an existing set of IP multicast options for the
PCB, or a newlly allocated set with default values.  The inpcb is returned
locked.  This function may sleep.

Call ip_moptions() to acquire a reference to a PCB's socket options, and
perform the update of the options while holding the PCB lock.  Release the
lock before returning.

Remove garbage collection of multicast options when values return to the
default, as this complicates locking substantially.  Most applications
allocate a socket either to be multicast, or not, and don't tend to keep
around sockets that have previously been used for multicast, then used for
unicast.

This closes a number of race conditions involving multiple threads or
processes modifying the IP multicast state of a socket simultaenously.

MFC after:	7 days
2005-08-09 17:19:21 +00:00
Robert Watson
13f4c340ae Propagate rename of IFF_OACTIVE and IFF_RUNNING to IFF_DRV_OACTIVE and
IFF_DRV_RUNNING, as well as the move from ifnet.if_flags to
ifnet.if_drv_flags.  Device drivers are now responsible for
synchronizing access to these flags, as they are in if_drv_flags.  This
helps prevent races between the network stack and device driver in
maintaining the interface flags field.

Many __FreeBSD__ and __FreeBSD_version checks maintained and continued;
some less so.

Reviewed by:	pjd, bz
MFC after:	7 days
2005-08-09 10:20:02 +00:00
Gleb Smirnoff
9bd8ca3014 In preparation for fixing races in ARP (and probably in other
L2/L3 mappings) make rt_check() return a locked rtentry.
2005-08-09 08:39:56 +00:00
Robert Watson
dd5a318ba3 Introduce in_multi_mtx, which will protect IPv4-layer multicast address
lists, as well as accessor macros.  For now, this is a recursive mutex
due code sequences where IPv4 multicast calls into IGMP calls into
ip_output(), which then tests for a multicast forwarding case.

For support macros in in_var.h to check multicast address lists, assert
that in_multi_mtx is held.

Acquire in_multi_mtx around iteration over the IPv4 multicast address
lists, such as in ip_input() and ip_output().

Acquire in_multi_mtx when manipulating the IPv4 layer multicast addresses,
as well as over the manipulation of ifnet multicast address lists in order
to keep the two layers in sync.

Lock down accesses to IPv4 multicast addresses in IGMP, or assert the
lock when performing IGMP join/leave events.

Eliminate spl's associated with IPv4 multicast addresses, portions of
IGMP that weren't previously expunged by IGMP locking.

Add in_multi_mtx, igmp_mtx, and if_addr_mtx lock order to hard-coded
lock order in WITNESS, in that order.

Problem reported by:	Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after:		10 days
2005-08-03 19:29:47 +00:00
Robert Watson
bccb41014a Modify network protocol consumers of the ifnet multicast address lists
to lock if_addr_mtx.

Problem reported by:	Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after:		1 week
2005-08-02 23:51:22 +00:00
Hajimu UMEMOTO
4dad226e45 recover the line which was wrongly disappeared during scope cleanup.
tcpdrop(8) should work for IPv6, again.
2005-08-01 12:08:49 +00:00
Bjoern A. Zeeb
9e669156d4 Add support for IPv6 over GRE [1]. PR kern/80340 includes the
FreeBSD specific ip_newid() changes NetBSD does not have.
Correct handling of non AF_INET packets passed to bpf [2].

PR:		kern/80340[1], NetBSD PRs 29150[1], 30844[2]
Obtained from:	NetBSD ip_gre.c rev. 1.34,1.35, if_gre.c rev. 1.56
Submitted by:	Gert Doering <gert at greenie.muc.de>[2]
MFC after:	4 days
2005-08-01 08:14:21 +00:00
Hajimu UMEMOTO
c85ed85b1c include scope6_var.h for in6_clearscope(). 2005-07-26 00:19:58 +00:00
Hajimu UMEMOTO
29da8af658 include netinet6/scope6_var.h. 2005-07-25 12:36:43 +00:00
Hajimu UMEMOTO
a1f7e5f8ee scope cleanup. with this change
- most of the kernel code will not care about the actual encoding of
  scope zone IDs and won't touch "s6_addr16[1]" directly.
- similarly, most of the kernel code will not care about link-local
  scoped addresses as a special case.
- scope boundary check will be stricter.  For example, the current
  *BSD code allows a packet with src=::1 and dst=(some global IPv6
  address) to be sent outside of the node, if the application do:
    s = socket(AF_INET6);
    bind(s, "::1");
    sendto(s, some_global_IPv6_addr);
  This is clearly wrong, since ::1 is only meaningful within a single
  node, but the current implementation of the *BSD kernel cannot
  reject this attempt.

Submitted by:	JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp>
Obtained from:	KAME
2005-07-25 12:31:43 +00:00
Giorgos Keramidas
a09ad79379 Misc spelling and/or English fixes in comments.
Reviewed by:	glebius, andre
2005-07-23 00:59:13 +00:00
Hajimu UMEMOTO
6c4eaa873f move RFC3542 related definitions into ip6.h.
Submitted by:	Keiichi SHIMA <keiichi__at__iijlab.net>
Reviewed by:	mlaier
Obtained from:	KAME
2005-07-20 10:30:52 +00:00
Hajimu UMEMOTO
77b6f9ed40 add missing RFC3542 definition.
Submitted by:	Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from:	KAME
2005-07-20 09:17:41 +00:00
Hajimu UMEMOTO
18b35df8fe update comments:
- RFC2292bis -> RFC3542
  - typo fixes

Submitted by:	Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from:	KAME
2005-07-20 08:59:45 +00:00
Robert Watson
de35559f82 Remove no-op spl references in in_pcb.c, since in_pcb locking has been
basically complete for several years now.  Update one spl comment to
reference the locking strategy.

MFC after:	3 days
2005-07-19 12:24:27 +00:00
Robert Watson
f59a9ebf10 Remove no-op spl's and most comment references to spls, as TCP locking
is believed to be basically done (modulo any remaining bugs).

MFC after:	3 days
2005-07-19 12:21:26 +00:00
Robert Watson
b77634d046 Remove spl() calls from ip_slowtimo(), as IP fragment queue locking was
merged several years ago.

Submitted by:	gnn
MFC after:	1 day
2005-07-19 12:14:22 +00:00
Max Laier
6de8d9dc52 Export pfsyncstats via sysctl "net.inet.pfsync" in order to print them with
netstat (seperate commit).

Requested by:	glebius
MFC after:	1 week
2005-07-14 22:22:51 +00:00
Robert Watson
3c308b091f Eliminate MAC entry point mac_create_mbuf_from_mbuf(), which is
redundant with respect to existing mbuf copy label routines.  Expose
a new mac_copy_mbuf() routine at the top end of the Framework and
use that; use the existing mpo_copy_mbuf_label() routine on the
bottom end.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA, SPAWAR
Approved by:	re (scottl)
2005-07-05 23:39:51 +00:00
Paul Saab
d758711729 Fix for a bug in newreno partial ack handling where if a large amount
of data is partial acked, snd_cwnd underflows, causing a burst.

Found, Submitted by:	Noritoshi Demizu
Approved by:		re
2005-07-05 19:23:02 +00:00
Max Laier
b4373150d9 Remove ambiguity from hlen. IPv4 is now indicated by is_ipv4 and we need a
proper hlen value for IPv6 to implement O_REJECT and O_LOG.

Reviewed by:	glebius, brooks, gnn
Approved by:	re (scottl)
2005-07-03 15:42:22 +00:00
Andrew Thompson
2fcb030ad5 Check the alignment of the IP header before passing the packet up to the
packet filter. This would cause a panic on architectures that require strict
alignment such as sparc64 (tier1) and ia64/ppc (tier2).

This adds two new macros that check the alignment, these are compile time
dependent on __NO_STRICT_ALIGNMENT which is set for i386 and amd64 where
alignment isn't need so the cost is avoided.

 IP_HDR_ALIGNED_P()
 IP6_HDR_ALIGNED_P()

Move bridge_ip_checkbasic()/bridge_ip6_checkbasic() up so that the alignment
is checked for ipfw and dummynet too.

PR:		ia64/81284
Obtained from:	NetBSD
Approved by:	re (dwhite), mlaier (mentor)
2005-07-02 23:13:31 +00:00
Paul Saab
482ac96888 Fix for a bug in the change that defers sack option processing until
after PAWS checks. The symptom of this is an inconsistency in the cached
sack state, caused by the fact that the sack scoreboard was not being
updated for an ACK handled in the header prediction path.

Found by:	Andrey Chernov.
Submitted by:	Noritoshi Demizu, Raja Mukerji.
Approved by:	re
2005-07-01 22:54:18 +00:00
Paul Saab
69e0362019 Fix for a SACK crash caused by a bug in tcp_reass(). tcp_reass()
does not clear tlen and frees the mbuf (leaving th pointing at
freed memory), if the data segment is a complete duplicate.
This change works around that bug. A fix for the tcp_reass() bug
will appear later (that bug is benign for now, as neither th nor
tlen is referenced in tcp_input() after the call to tcp_reass()).

Found by:	Pawel Jakub Dawidek.
Submitted by:	Raja Mukerji, Noritoshi Demizu.
Approved by:	re
2005-07-01 22:52:46 +00:00
Gleb Smirnoff
a196a3c8aa When doing ARP load balancing source IP is taken in network byte order,
so residue of division for all hosts on net is the same, and thus only
one VHID answers. Change source IP in host byte order.

Reviewed by:	mlaier
Approved by:	re (scottl)
2005-07-01 08:22:13 +00:00
Simon L. B. Nielsen
0a389eab22 Fix ipfw packet matching errors with address tables.
The ipfw tables lookup code caches the result of the last query.  The
kernel may process multiple packets concurrently, performing several
concurrent table lookups.  Due to an insufficient locking, a cached
result can become corrupted that could cause some addresses to be
incorrectly matched against a lookup table.

Submitted by:	ru
Reviewed by:	csjp, mlaier
Security:	CAN-2005-2019
Security:	FreeBSD-SA-05:13.ipfw

Correct bzip2 permission race condition vulnerability.

Obtained from:	Steve Grubb via RedHat
Security:	CAN-2005-0953
Security:	FreeBSD-SA-05:14.bzip2
Approved by:	obrien

Correct TCP connection stall denial of service vulnerability.

A TCP packets with the SYN flag set is accepted for established
connections, allowing an attacker to overwrite certain TCP options.

Submitted by:	Noritoshi Demizu
Reviewed by:	andre, Mohan Srinivasan
Security:	CAN-2005-2068
Security:	FreeBSD-SA-05:15.tcp

Approved by:	re (security blanket), cperciva
2005-06-29 21:36:49 +00:00
Paul Saab
5a53ca1627 - Postpone SACK option processing until after PAWS checks. SACK option
processing is now done in the ACK processing case.
- Merge tcp_sack_option() and tcp_del_sackholes() into a new function
  called tcp_sack_doack().
- Test (SEG.ACK < SND.MAX) before processing the ACK.

Submitted by:	Noritoshi Demizu
Reveiewed by:	Mohan Srinivasan, Raja Mukerji
Approved by:	re
2005-06-27 22:27:42 +00:00
Poul-Henning Kamp
dca9c930da Libalias incorrectly applies proxy rules to the global divert
socket: it should only look for existing translation entries,
not create new ones (no matter how it got the idea).

Approved by:	re(scottl)
2005-06-27 22:21:42 +00:00
Gleb Smirnoff
59dde15e82 Disable checksum processing in LibAlias, when it works as a
kernel module. LibAlias is not aware about checksum offloading,
so the caller should provide checksum calculation. (The only
current consumer is ng_nat(4)). When TCP packet internals has
been changed and it requires checksum recalculation, a cookie
is set in th_x2 field of TCP packet, to inform caller that it
needs to recalculate checksum. This ugly hack would be removed
when LibAlias is made more kernel friendly.

Incremental checksum updates are left as is, since they don't
conflict with offloading.

Approved by:	re (scottl)
2005-06-27 07:36:02 +00:00
David Malone
01399f34a5 Fix some long standing bugs in writing to the BPF device attached to
a DLT_NULL interface. In particular:

        1) Consistently use type u_int32_t for the header of a
           DLT_NULL device - it continues to represent the address
           family as always.
        2) In the DLT_NULL case get bpf_movein to store the u_int32_t
           in a sockaddr rather than in the mbuf, to be consistent
           with all the DLT types.
        3) Consequently fix a bug in bpf_movein/bpfwrite which
           only permitted packets up to 4 bytes less than the MTU
           to be written.
        4) Fix all DLT_NULL devices to have the code required to
           allow writing to their bpf devices.
        5) Move the code to allow writing to if_lo from if_simloop
           to looutput, because it only applies to DLT_NULL devices
           but was being applied to other devices that use if_simloop
           possibly incorrectly.

PR:		82157
Submitted by:	Matthew Luckie <mjl@luckie.org.nz>
Approved by:	re (scottl)
2005-06-26 18:11:11 +00:00
Stephan Uphoff
68d376254c Fix a timer ticks wrap around bug for minmssoverload processing.
Approved by:	re (scottl,dwhite)
MFC after:	4 weeks
2005-06-25 22:24:45 +00:00
Warner Losh
d980b05275 Add back missing copyright and license statement. This is identical
to the statement in ip_mroute.h, as well as being the same as what
OpenBSD has done with this file.  It matches the copyright in NetBSD's
1.1 through 1.14 versions of the file as well, which they subsequently
added back.

It appears to have been lost in the 4.4-lite1 import for FreeBSD 2.0,
but where and why I've not investigated further.  OpenBSD had the same
problem.  NetBSD had a copyright notice until Multicast 3.5 was
integrated verbatim back in 1995.  This appears to be the version that
made it into 4.4-lite1.

Approved by: re (scottl)
MFC after: 3 days
2005-06-23 18:42:58 +00:00
Paul Saab
9004ded9df Fix for a bug in tcp_sack_option() causing crashes.
Submitted by:	Noritoshi Demizu, Mohan Srinivasan.
Approved by:	re (scottl blanket SACK)
2005-06-23 00:18:54 +00:00
Bjoern A. Zeeb
67df9f3896 Fix IP(v6) over IP tunneling most likely broken with ifnet changes.
Reviewed by:	gnn
Approved by:	re (dwhite), rwatson (mentor)
2005-06-20 08:39:30 +00:00
Gleb Smirnoff
72f2d6578c - Don't use legacy function in a non-legacy one. This gives us
possibility to compile libalias without legacy support.
- Use correct way to mark variable as unused.

Approved by:	re (dwhite)
2005-06-20 08:31:48 +00:00
Max Laier
e4c959952b In verify_rev_path6():
- do not use static memory as we are under a shared lock only
 - properly rtfree routes allocated with rtalloc
 - rename to verify_path6()
 - implement the full functionality of the IPv4 version

Also make O_ANTISPOOF work with IPv6.

Reviewed by:	gnn
Approved by:	re (blanket)
2005-06-16 14:55:58 +00:00
Max Laier
ad7abe197d Fix indentation in INET6 section in preperation of more serious work.
Approved by:	re (blanket ip6fw removal)
2005-06-16 13:20:36 +00:00
Max Laier
cf21d53cbf When doing matching based on dst_ip/src_ip make sure we are really looking
on an IPv4 packet as these variables are uninitialized if not.  This used to
allow arbitrary IPv6 packets depending on the value in the uninitialized
variables.

Some opcodes (most noteably O_REJECT) do not support IPv6 at all right now.

Reviewed by:	brooks, glebius
Security:	IPFW might pass IPv6 packets depending on stack contents.
Approved by:	re (blanket)
2005-06-12 16:27:10 +00:00
Brooks Davis
fc74a9f93a Stop embedding struct ifnet at the top of driver softcs. Instead the
struct ifnet or the layer 2 common structure it was embedded in have
been replaced with a struct ifnet pointer to be filled by a call to the
new function, if_alloc(). The layer 2 common structure is also allocated
via if_alloc() based on the interface type. It is hung off the new
struct ifnet member, if_l2com.

This change removes the size of these structures from the kernel ABI and
will allow us to better manage them as interfaces come and go.

Other changes of note:
 - Struct arpcom is no longer referenced in normal interface code.
   Instead the Ethernet address is accessed via the IFP2ENADDR() macro.
   To enforce this ac_enaddr has been renamed to _ac_enaddr.
 - The second argument to ether_ifattach is now always the mac address
   from driver private storage rather than sometimes being ac_enaddr.

Reviewed by:	sobomax, sam
2005-06-10 16:49:24 +00:00
Brian Feldman
b34d56f1ef Modify send_pkt() to return the generated packet and have the caller
do the subsequent ip_output() in IPFW.  In ipfw_tick(), the keep-alive
packets must be generated from the data that resides under the
stateful lock, but they must not be sent at that time, as this would
cause a lock order reversal with the normal ordering (interface's
lock, then locks belonging to the pfil hooks).

In practice, this caused deadlocks when using IPFW and if_bridge(4)
together to do stateful transparent filtering.

MFC after: 1 week
2005-06-10 12:28:17 +00:00
Andrew Thompson
c8b0129238 Add dummynet(4) support to if_bridge, this code is largely based on bridge.c.
This is the final piece to match bridge.c in functionality, we can now be a
drop-in replacement.

Approved by:	mlaier (mentor)
2005-06-10 01:25:22 +00:00
Paul Saab
e912f906d0 Fix a mis-merge. Remove a redundant call to tcp_sackhole_insert
Submitted by:	Mohan Srinivasan
2005-06-09 17:55:29 +00:00
Paul Saab
8b9bbaaa94 Fix for a crash in tcp_sack_option() caused by hitting the limit on
the number of sack holes.

Reported by:	Andrey Chernov
Submitted by:	Noritoshi Demizu
Reviewed by:	Raja Mukerji
2005-06-09 14:01:04 +00:00
Paul Saab
db4b83fe49 Fix for a bug in the change that walks the scoreboard backwards from
the tail (in tcp_sack_option()). The bug was caused by incorrect
accounting of the retransmitted bytes in the sackhint.

Reported by:    Kris Kennaway.
Submitted by:   Noritoshi Demizu.
2005-06-06 19:46:53 +00:00
Andrew Thompson
8f86751705 Add hooks into the networking layer to support if_bridge. This changes struct
ifnet so a buildworld is necessary.

Approved by:	mlaier (mentor)
Obtained from:	NetBSD
2005-06-05 03:13:13 +00:00
Brian Feldman
5278d40bcc Better explain, then actually implement the IPFW ALTQ-rule first-match
policy.  It may be used to provide more detailed classification of
traffic without actually having to decide its fate at the time of
classification.

MFC after:	1 week
2005-06-04 19:04:31 +00:00
Paul Saab
9d17a7a64a Changes to tcp_sack_option() that
- Walks the scoreboard backwards from the tail to reduce the number of
  comparisons for each sack option received.
- Introduce functions to add/remove sack scoreboard elements, making
  the code more readable.

Submitted by:   Noritoshi Demizu
Reviewed by:    Raja Mukerji, Mohan Srinivasan
2005-06-04 08:03:28 +00:00
Max Laier
57cd6d263b Add support for IPv4 only rules to IPFW2 now that it supports IPv6 as well.
This is the last requirement before we can retire ip6fw.

Reviewed by:	dwhite, brooks(earlier version)
Submitted by:	dwhite (manpage)
Silence from:	-ipfw
2005-06-03 01:10:28 +00:00
Ian Dowse
ba5da2a06f Use IFF_LOCKGIANT/IFF_UNLOCKGIANT around calls to the interface
if_ioctl routine. This should fix a number of code paths through
soo_ioctl() that could call into Giant-locked network drivers without
first acquiring Giant.
2005-06-02 00:04:08 +00:00
Robert Watson
303939942c When aborting tcp_attach() due to a problem allocating or attaching the
tcpcb, lock the inpcb before calling in_pcbdetach() or in6_pcbdetach(),
as they expect the inpcb to be passed locked.

MFC after:	7 days
2005-06-01 12:14:56 +00:00
Robert Watson
e6e0b5ffd1 Assert tcbinfo lock, inpcb lock in tcp_disconnect().
Assert tcbinfo lock, inpcb lock in in tcp_usrclosed().

MFC after:	7 days
2005-06-01 12:08:15 +00:00
Robert Watson
e3d5315d01 Assert tcbinfo lock in tcp_drop() due to its call of tcp_close()
Assert tcbinfo lock in tcp_close() due to its call to in{,6}_detach()
Assert tcbinfo lock in tcp_drop_syn_sent() due to its call to tcp_drop()

MFC after:	7 days
2005-06-01 12:06:07 +00:00
Robert Watson
1e2d989d0d Assert that tcbinfo is locked in tcp_input() before calling into
tcp_drop().

MFC after:	7 days
2005-06-01 12:03:18 +00:00
Robert Watson
416738a781 Assert the tcbinfo lock whenever tcp_close() is to be called by
tcp_input().

MFC after:	7 days
2005-06-01 11:49:14 +00:00
Robert Watson
7609aad7d9 Assert tcbinfo lock in tcp_attach(), as it is required; the caller
(tcp_usr_attach()) currently grabs it.

MFC after:	7 days
2005-06-01 11:44:43 +00:00
Robert Watson
fe6bfc3730 Commit correct version of previous commit (in_pcb.c:1.164). Use the
local variables as currently named.

MFC after:	7 days
2005-06-01 11:43:39 +00:00
Robert Watson
6b348152be Assert pcbinfo lock in in_pcbdisconnect() and in_pcbdetach(), as the
global pcb lists are modified.

MFC after:	7 days
2005-06-01 11:39:42 +00:00
Robert Watson
3ca1570c82 Slight white space tweak.
MFC after:	7 days
2005-06-01 11:38:35 +00:00
Robert Watson
277afaff66 De-spl UDP.
MFC after:	3 days
2005-06-01 11:24:00 +00:00
Seigo Tanimura
29ea671b36 Let OSPFv3 go through ipfw. Some more additional checks would be
desirable, though.
2005-05-28 07:46:44 +00:00
Paul Saab
808f11b768 This is conform with the terminology in
M.Mathis and J.Mahdavi,
  "Forward Acknowledgement: Refining TCP Congestion Control"
  SIGCOMM'96, August 1996.

Submitted by:   Noritoshi Demizu, Raja Mukerji
2005-05-25 17:55:27 +00:00
Paul Saab
64b5fbaa04 Rewrite of tcp_sack_option(). Kentaro Kurahone (NetBSD) pointed out
that if we sort the incoming SACK blocks, we can update the scoreboard
in one pass of the scoreboard. The added overhead of sorting upto 4
sack blocks is much lower than traversing (potentially) large
scoreboards multiple times. The code was updating the scoreboard with
multiple passes over it (once for each sack option). The rewrite fixes
that, reducing the complexity of the main loop from O(n^2) to O(n).

Submitted by:   Mohan Srinivasan, Noritoshi Demizu.
Reviewed by:    Raja Mukerji.
2005-05-23 19:22:48 +00:00
Paul Saab
2cdbfa66ee Replace t_force with a t_flag (TF_FORCEDATA).
Submitted by:   Raja Mukerji.
Reviewed by:    Mohan, Silby, Andre Opperman.
2005-05-21 00:38:29 +00:00
Paul Saab
4fc5324557 Introduce routines to alloc/free sack holes. This cleans up the code
considerably.

Submitted by:   Noritoshi Demizu.
Reviewed by:    Raja Mukerji, Mohan Srinivasan.
2005-05-16 19:26:46 +00:00
Gleb Smirnoff
32247f8629 - When carp interface is destroyed, and it affects global preemption
suppresion counter, decrease the latter. [1]
- Add sysctl to monitor preemption suppression.

PR:		kern/80972 [1]
Submitted by:	Frank Volf [1]
MFC after:	1 week
2005-05-15 01:44:26 +00:00
Paul Saab
fdace17f81 Fix for a bug where the "nexthole" sack hint is out of sync with the
real next hole to retransmit from the scoreboard, caused by a bug
which did not update the "nexthole" hint in one case in
tcp_sack_option().

Reported by:    Daniel Eriksson
Submitted by:   Mohan Srinivasan
2005-05-13 18:02:02 +00:00
Gleb Smirnoff
b3cf6808ce In div_output() explicitly set m->m_nextpkt to NULL. If divert socket
is not userland, but ng_ksocket, then m->m_nextpkt may be non-NULL. In
this case we would panic in sbappend.
2005-05-13 11:44:37 +00:00
Paul Saab
0077b0163f When looking for the next hole to retransmit from the scoreboard,
or to compute the total retransmitted bytes in this sack recovery
episode, the scoreboard is traversed. While in sack recovery, this
traversal occurs on every call to tcp_output(), every dupack and
every partial ack. The scoreboard could potentially get quite large,
making this traversal expensive.

This change optimizes this by storing hints (for the next hole to
retransmit and the total retransmitted bytes in this sack recovery
episode) reducing the complexity to find these values from O(n) to
constant time.

The debug code that sanity checks the hints against the computed
value will be removed eventually.

Submitted by:   Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.
2005-05-11 21:37:42 +00:00
Colin Percival
fe2eee8231 Fix two issues which were missed in FreeBSD-SA-05:08.kmem.
Reported by:	Uwe Doering
2005-05-07 00:41:36 +00:00
Gleb Smirnoff
cbfbc555e0 Add a workaround for 64-bit archs: store unsigned long return value in
temporary variable, check it and then cast to in_addr_t.
2005-05-06 13:01:31 +00:00
Gleb Smirnoff
6293e003c9 s/DEBUG/LIBALIAS_DEBUG/, since DEBUG is defined in LINT and
not supported for kernel build.
2005-05-06 11:07:49 +00:00
Colin Percival
fd94099ec2 If we are going to
1. Copy a NULL-terminated string into a fixed-length buffer, and
2. copyout that buffer to userland,
we really ought to
0. Zero the entire buffer
first.

Security: FreeBSD-SA-05:08.kmem
2005-05-06 02:50:00 +00:00
Gleb Smirnoff
e9d5db2888 More bits for kernel version:
- copy inet_aton() from libc
- disable getservbyname() lookup and accept only numeric port
2005-05-05 22:00:32 +00:00
Gleb Smirnoff
75bc262006 Always include alias.h before alias_local.h 2005-05-05 21:55:17 +00:00
Gleb Smirnoff
f87fe393ce When used in kernel define NO_FW_PUNCH, NO_LOGGING, NO_USE_SOCKETS. 2005-05-05 21:53:17 +00:00
Gleb Smirnoff
c8d3ca728f Fix argument order for bcopy() in last commit.
Noticed by:	njl
Pointy hat to:	glebius
2005-05-05 21:40:49 +00:00
Gleb Smirnoff
efdc8fbf79 Use bcopy() instead of memmove(). 2005-05-05 21:10:51 +00:00
Gleb Smirnoff
ae0440572f Hide fflush(3) under ifdef DEBUG. 2005-05-05 21:07:34 +00:00
Gleb Smirnoff
c8564bffd2 Things required to build libalias as kernel module:
- kernel module declarations and handler.
- macros to map malloc(3) calls to malloc(9) ones.
- malloc(9) declarations.
- call finishoff() from module handler MOD_UNLOAD case
  instead of atexit(3).
- use panic(9) instead of abort(3)
- take time from time_second instead of gettimeofday(2)
- define INADDR_NONE
2005-05-05 21:05:38 +00:00
Gleb Smirnoff
00fc9a5bb9 Add NO_USE_SOCKETS knob, which cuts off functionality socket binding. 2005-05-05 20:25:12 +00:00
Gleb Smirnoff
40106c140f Add NO_LOGGING knob, which cuts off functionality of debug logging to a file. 2005-05-05 20:22:09 +00:00
Gleb Smirnoff
c649a2e033 Play with includes so that libalias can be compiled both as userland
library and kernel module.
2005-05-05 19:27:32 +00:00
Andre Oppermann
9e4ca6315d If we don't get a suggested MTU during path MTU discovery
look up the packet size of the packet that generated the
response, step down the MTU by one step through ip_next_mtu()
and try again.

Suggested by:	dwmalone
2005-05-04 13:48:44 +00:00
Gleb Smirnoff
1f8f08e1c9 Cleanup IPFW2 ifdefs. 2005-05-04 13:24:37 +00:00
Gleb Smirnoff
c3c2f9a9ba Makefile is not needed here. 2005-05-04 13:24:12 +00:00
Andre Oppermann
4c037f8d6e Add another step of 1280 (gif(4) tunnels) to ip_next_mtu(). 2005-05-04 13:23:54 +00:00
Gleb Smirnoff
a1429ad928 IPFW version 2 is the only option in HEAD and RELENG_5.
Thus, cleanup unnecessary now ifdefs.
2005-05-04 13:12:52 +00:00
Andre Oppermann
c773494edd Pass icmp_error() the MTU argument directly instead of
an interface pointer.  This simplifies a couple of uses
and removes some XXX workarounds.
2005-05-04 13:09:19 +00:00
Robert Watson
b60d26c9b9 Remove now unused inirw variable from previous use of COMMON_END().
Reported by:	csjp
2005-05-01 14:01:38 +00:00
Peter Grehan
73fddedac8 Fix typo in last commit.
Approved by:	rwatson
2005-05-01 13:06:05 +00:00
Robert Watson
d1401c9000 Slide unlocking of the tcbinfo lock earlier in tcp_usr_send(), as it's
needed only for implicit connect cases.  Under load, especially on SMP,
this can greatly reduce contention on the tcbinfo lock.

NB: Ambiguities about the state of so_pcb need to be resolved so that
all use of the tcbinfo lock in non-implicit connection cases can be
eliminated.

Submited by:	Kazuaki Oda <kaakun at highway dot ne dot jp>
2005-05-01 11:11:38 +00:00
Brooks Davis
31519b13c8 Introduce a struct icmphdr which contains the type, code, and cksum
fields of an ICMP packet.

Use this to allow ipfw to pullup only these values since it does not use
the rest of the packet and it was failed on ICMP packets because they
were not long enough.

struct icmp should probably be modified to use these at some point, but
that will break a fair bit of code so it can wait for another day.

On the off chance that adding this struct breaks something in ports,
bump __FreeBSD_version.

Reported by:	Randy Bush <randy at psg dot com>
Tested by:	Randy Bush <randy at psg dot com>
2005-04-26 18:10:21 +00:00
Paul Saab
91232d6ccc Remove some code that snuck in by accident.
Submitted by:	Mohan Srinivasan
2005-04-21 20:29:40 +00:00
Paul Saab
be3f3b5ead Fix for interaction problems between TCP SACK and TCP Signature.
If TCP Signatures are enabled, the maximum allowed sack blocks aren't
going to fit. The fix is to compute how many sack blocks fit and tack
these on last. Also on SYNs, defer padding until after the SACK
PERMITTED option has been added.

Found by:	Mohan Srinivasan.
Submitted by:	Mohan Srinivasan, Noritoshi Demizu.
Reviewed by:	Raja Mukerji.
2005-04-21 20:26:07 +00:00
Paul Saab
97b76190eb Undo rev 1.71 as it is the wrong change. 2005-04-21 20:24:43 +00:00
Paul Saab
a6235da61e - Make the sack scoreboard logic use the TAILQ macros. This improves
code readability and facilitates some anticipated optimizations in
  tcp_sack_option().
- Remove tcp_print_holes() and TCP_SACK_DEBUG.

Submitted by:	Raja Mukerji.
Reviewed by:	Mohan Srinivasan, Noritoshi Demizu.
2005-04-21 20:11:01 +00:00
Paul Saab
a3047bc036 Fix for 2 bugs related to TCP Signatures :
- If the peer sends the Signature option in the SYN, use of Timestamps
  and Window Scaling were disabled (even if the peer supports them).
- The sender must not disable signatures if the option is absent in
  the received SYN. (See comment in syncache_add()).

Found, Submitted by:	Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>.
Reviewed by:		Mohan Srinivasan <mohans at yahoo-inc dot com>.
2005-04-21 20:09:09 +00:00
Andre Oppermann
1aedbd9c80 Move Path MTU discovery ICMP processing from icmp_input() to
tcp_ctlinput() and subject it to active tcpcb and sequence
number checking.  Previously any ICMP unreachable/needfrag
message would cause an update to the TCP hostcache.  Now only
ICMP PMTU messages belonging to an active TCP session with
the correct src/dst/port and sequence number will update the
hostcache and complete the path MTU discovery process.

Note that we don't entirely implement the recommended counter
measures of Section 7.2 of the paper.  However we close down
the possible degradation vector from trivially easy to really
complex and resource intensive.  In addition we have limited
the smallest acceptable MTU with net.inet.tcp.minmss sysctl
for some time already, further reducing the effect of any
degradation due to an attack.

Security:	draft-gont-tcpm-icmp-attacks-03.txt Section 7.2
MFC after:	3 days
2005-04-21 14:29:34 +00:00
Andre Oppermann
1600372b6b Ignore ICMP Source Quench messages for TCP sessions. Source Quench is
ineffective, depreciated and can be abused to degrade the performance
of active TCP sessions if spoofed.

Replace a bogus call to tcp_quench() in tcp_output() with the direct
equivalent tcpcb variable assignment.

Security:	draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
MFC after:	3 days
2005-04-21 12:37:12 +00:00
Gleb Smirnoff
9dc1f8e41e Remove anti-LOR bandaid, it is not needed now.
Sponsored by:	Rambler
2005-04-20 09:32:05 +00:00
Poul-Henning Kamp
6196e2db3e Make DUMMYNET compile without INET6 2005-04-19 10:12:21 +00:00
Poul-Henning Kamp
d137deac11 typo 2005-04-19 10:04:38 +00:00
Poul-Henning Kamp
7292e12676 Make IPFIREWALL compile without INET6 2005-04-19 09:56:14 +00:00
Brooks Davis
8195404bed Add IPv6 support to IPFW and Dummynet.
Submitted by:	Mariano Tortoriello and Raffaele De Lorenzo (via luigi)
2005-04-18 18:35:05 +00:00
Paul Saab
b7c755717c Rewrite of tcp_update_sack_list() to make it simpler and more readable
than our original OpenBSD derived version.

Submitted by:	Noritoshi Demizu
Reviewed by:	Mohan Srinivasan, Raja Mukerji
2005-04-18 18:10:56 +00:00
Brooks Davis
27a2f39bcf Centralized finding the protocol header in IP packets in preperation for
IPv6 support.  The header in IPv6 is more complex then in IPv4 so we
want to handle skipping over it in one location.

Submitted by:	Mariano Tortoriello and Raffaele De Lorenzo (via luigi)
2005-04-15 00:47:44 +00:00
Paul Saab
25e6f9ed4b Fix for a TCP SACK bug where more than (win/2) bytes could have been
in flight in SACK recovery.

Found by:	Noritoshi Demizu
Submitted by:	Mohan Srinivasan <mohans at yahoo-inc dot com>
		Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>
		Raja Mukerji <raja at moselle dot com>
2005-04-14 20:09:52 +00:00
Paul Saab
cf09195ba5 - Tighten up the Timestamp checks to prevent a spoofed segment from
setting ts_recent to an arbitrary value, stopping further
  communication between the two hosts.
- If the Echoed Timestamp is greater than the current time,
  fall back to the non RFC 1323 RTT calculation.

Submitted by:	Raja Mukerji (raja at moselle dot com)
Reviewed by:	Noritoshi Demizu, Mohan Srinivasan
2005-04-10 05:24:59 +00:00
Paul Saab
e346eeff65 - If the reassembly queue limit was reached or if we couldn't allocate
a reassembly queue state structure, don't update (receiver) sack
  report.
- Similarly, if tcp_drain() is called, freeing up all items on the
  reassembly queue, clean the sack report.

Found, Submitted by:	Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by:	Mohan Srinivasan (mohans at yahoo-inc dot com),
		Raja Mukerji (raja at moselle dot com).
2005-04-10 05:21:29 +00:00
Paul Saab
b962fa74b5 When the rightmost SACK block expands, rcv_lastsack should be updated.
(Fix for kern/78226).

Submitted by : Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by  : Mohan Srinivasan (mohans at yahoo-inc dot com),
               Raja Mukerji (raja at moselle dot com).
2005-04-10 05:20:10 +00:00
Paul Saab
da39f5b963 Remove some unused sack fields.
Submitted by : Noritoshi Demizu, Mohan Srinivasan.
2005-04-10 05:19:22 +00:00
Maxim Konovalov
800af1fb81 o Nano optimize ip_reass() code path for the first fragment: do not
try to reasseble the packet from the fragments queue with the only
fragment, finish with the first fragment as soon as we create a queue.

Spotted by:	Vijay Singh

o Drop the fragment if maxfragsperpacket == 0, no chances we
will be able to reassemble the packet in future.

Reviewed by:	silby
2005-04-08 10:25:13 +00:00
Maxim Konovalov
29f2a6ec18 o Tweak the comment a bit. 2005-04-08 08:43:21 +00:00
Maxim Konovalov
e99971bf2f o Disable random port allocation when ip.portrange.first ==
ip.portrange.last and there is the only port for that because:
a) it is not wise; b) it leads to a panic in the random ip port
allocation code.  In general we need to disable ip port allocation
randomization if the last - first delta is ridiculous small.

PR:		kern/79342
Spotted by:	Anjali Kulkarni
Glanced at by:	silby
MFC after:	2 weeks
2005-04-08 08:42:10 +00:00
Gleb Smirnoff
8351d04f34 When a packet has been reinjected into ipfw(4) after dummynet(4) processing
we have a non-NULL args.rule. If the same packet later is subject to "tee"
rule, its original is sent again into ipfw_chk() and it reenters at the same
rule. This leads to infinite loop and frozen router.

Assign args.rule to NULL, any time we are going to send packet back to
ipfw_chk() after a tee rule. This is a temporary workaround, which we
will leave for RELENG_5. In HEAD we are going to make divert(4) save
next rule the same way as dummynet(4) does.

PR:		kern/79546
Submitted by:	Oleg Bulyzhin
Reviewed by:	maxim, andre
MFC after:	3 days
2005-04-06 14:00:33 +00:00
Brooks Davis
a0d17f7e98 Use ACTION_PTR(r) instead of (r->cmd + r->act_ofs).
Reviewed by:	md5
2005-04-06 00:26:08 +00:00
Brooks Davis
f4ff11976d Make dummynet_flush() match its prototype. 2005-04-05 23:38:16 +00:00
Poul-Henning Kamp
a8bc22b47a natd core dumps when -reverse switch is used because of a bug in
libalias.

In /usr/src/lib/libalias/alias.c, the functions LibAliasIn and
LibAliasOutTry call the legacy PacketAliasIn/PacketAliasOut instead
of LibAliasIn/LibAliasOut when the PKT_ALIAS_REVERSE option is set.
In this case, the context variable "la" gets lost because the legacy
compatibility routines expect "la" to be global.  This was obviously
an oversight when rewriting the PacketAlias* functions to the
LibAlias* functions.

The fix (as shown in the patch below) is to remove the legacy
subroutine calls and replace with the new ones using the "la" struct
as the first arg.

Submitted by:	Gil Kloepfer <fgil@kloepfer.org>
Confirmed by:	<nicolai@catpipe.net>
PR:		76839
MFC after:	3 days
2005-04-05 13:04:35 +00:00
Gleb Smirnoff
4cb39345c0 When several carp interfaces are attached to Ethernet interface,
carp_carpdev_state_locked() is called every time carp interface is attached.
The first call backs up flags of the first interface, and the second
call backs up them again, erasing correct values.
  To solve this, a carp_sc_state_locked() function is introduced. It is
called when interface is attached to parent, instead of calling
carp_carpdev_state_locked. carp_carpdev_state_locked() calls
carp_sc_state_locked() for each sc in chain.

Reported by:	Yuriy N. Shkandybin, sem
2005-03-30 11:44:43 +00:00
Gleb Smirnoff
d1a4742962 - Don't free mbuf, passed to interface output method if the latter
returns error. In this case mbuf has already been freed. [1]
- Remove redundant declaration.

PR:		kern/78893 [1]
Submitted by:	Liang Yi [1]
Reviewed by:	sam
MFC after:	1 day
2005-03-29 13:43:09 +00:00
Sam Leffler
812d865346 eliminate extraneous null ptr checks
Noticed by:	Coverity Prevent analysis tool
2005-03-29 01:10:46 +00:00
Sam Leffler
5309f84168 deal with malloc failures
Noticed by:	Coverity Prevent analysis tool
Together with:	mdodd
2005-03-26 22:20:22 +00:00
Maxim Konovalov
6ee79c59d2 o Document net.inet.ip.portrange.random* sysctls.
o Correct a comment about random port allocation threshold
implementation.

Reviewed by:	silby, ru
MFC after:	3 days
2005-03-23 09:26:38 +00:00
Gleb Smirnoff
d4d2297060 ifma_protospec is a pointer. Use NULL when assigning or compating it. 2005-03-20 14:31:45 +00:00
Gleb Smirnoff
50bb170471 Remove a workaround from previos revision. It proved to be incorrect.
Add two another workarounds for carp(4) interfaces:
- do not add connected route when address is assigned to carp(4) interface
- do not add connected route when other interface goes down

Embrace workarounds with #ifdef DEV_CARP
2005-03-20 10:27:17 +00:00
Gleb Smirnoff
ee6f227017 If vhid exists return more informative EEXIST instead of EINVAL. While here
remove redundant brackets.
2005-03-18 13:41:38 +00:00
Gleb Smirnoff
9860bab349 Fix a potential crash that could occur when CARP_LOG is being used.
Obtained from:	OpenBSD (pat)
2005-03-18 13:18:34 +00:00
Sam Leffler
6a9909b5e6 plug resource leak
Noticed by:	Coverity Prevent analysis tool
2005-03-16 05:27:19 +00:00
Robert Watson
d2bc35ab29 In tcp_usr_send(), broaden coverage of the socket buffer lock in the
non-OOB case so that the sbspace() check is performed under the same
lock instance as the append to the send socket buffer.

MFC after:	1 week
2005-03-14 22:15:14 +00:00
Gleb Smirnoff
422a115a4a Embrace with #ifdef DEV_CARP carp-related code. 2005-03-13 11:23:22 +00:00
Gleb Smirnoff
0504a89fdd Add antifootshooting workaround, which will make all routes "connected"
to carp(4) interfaces host routes. This prevents a problem, when connected
network is routed to carp(4) interface.
2005-03-10 15:26:45 +00:00
Paul Saab
e891d82b56 Add limits on the number of elements in the sack scoreboard both
per-connection and globally. This eliminates potential DoS attacks
where SACK scoreboard elements tie up too much memory.

Submitted by:	Raja Mukerji (raja at moselle dot com).
Reviewed by:	Mohan Srinivasan (mohans at yahoo-inc dot com).
2005-03-09 23:14:10 +00:00
Gleb Smirnoff
2ef4a436e0 Make ARP do not complain about wrong interface if correct interface
is a carp one and address matched it.

Reviewed by:	brooks
2005-03-09 10:00:01 +00:00
Joe Marcus Clarke
70037e98c4 Fix a problem in the Skinny ALG where a specially crafted packet could cause
a libalias application (e.g.  natd, ppp, etc.) to crash.  Note: Skinny support
is not enabled in natd or ppp by default.

Approved by:	secteam (nectar)
MFC after:	1 day
Secuiryt:	This fixes a remote DoS exploit
2005-03-03 03:06:37 +00:00
Gleb Smirnoff
b82936c5d4 Fix typo. Unbreak build. Take pointy hat. 2005-03-02 09:11:18 +00:00
Gleb Smirnoff
d220759b41 Add more locking when reading/writing to carp softc. When carp softc is
attached to a parent interface we use its mutex to lock the softc. This
means that in several places like carp_ioctl() we lock softc conditionaly.
This should be redesigned.

To avoid LORs when MII announces us a link state change, we schedule
a quick callout and call carp_carpdev_state_locked() from it.

Initialize callouts using NET_CALLOUT_MPSAFE.

Sponsored by:	Rambler
Reviewed by:	mlaier
2005-03-01 13:14:33 +00:00
Gleb Smirnoff
d92d54d54d - Add carp_mtx. Use it to protect list of all carp interfaces.
- In carp_send_ad_all() walk through list of all carp interfaces
  instead of walking through list of all interfaces.

Sponsored by:	Rambler
Reviewed by:	mlaier
2005-03-01 12:36:07 +00:00
Gleb Smirnoff
31199c8463 Use NET_CALLOUT_MPSAFE macro. 2005-03-01 12:01:17 +00:00
Gleb Smirnoff
3a84d72a78 Revert change to struct ifnet. Use ifnet pointer in softc. Embedding
ifnet into smth will soon be removed.

Requested by:	brooks
2005-03-01 10:59:14 +00:00
Gleb Smirnoff
4358dfc32f Remove debugging printf.
Reviewed by:	mlaier
2005-03-01 09:31:36 +00:00
Yaroslav Tykhiy
630481bb92 Support running carp(4) over a vlan(4) parent interface.
Encouraged by:	glebius
2005-02-28 16:19:11 +00:00
Gleb Smirnoff
1d0a237660 Remove unused field from carp softc.
OK'ed by:	mcbride@OpenBSD
2005-02-28 11:57:03 +00:00
Gleb Smirnoff
3e07def4cd Fix tcpdump(8) on carp(4) interface:
- Use our loop DLT type, not OpenBSD. [1]
- The fields that are converted to network byte order are not 32-bit
  fields but 16-bit fields, so htons should be used in htonl. [1]
- Secondly, ip_input changes ip->ip_len into its value without
  the ip-header length. So, restore the length to make bpf happy. [1]
- Use bpf_mtap2(), use temporary af1, since bpf_mtap2 doesn't
  understand uint8_t af identifier.

Submitted by:	Frank Volf [1]
2005-02-28 11:54:36 +00:00
Paul Saab
8291294024 If the receiver sends an ack that is out of [snd_una, snd_max],
ignore the sack options in that segment. Else we'd end up
corrupting the scoreboard.

Found by:	Raja Mukerji (raja at moselle dot com)
Submitted by:	Mohan Srinivasan
2005-02-27 20:39:04 +00:00
Max Laier
a4e5390551 Unbreak the build. carp_iamatch6 and carp_macmatch6 are not supposed to be
static as they are used elsewhere.
2005-02-27 11:32:26 +00:00
Gleb Smirnoff
e8c34a71eb Remove carp_softc.sc_ifp member in favor of union pointers in struct ifnet.
Obtained from:	OpenBSD
2005-02-26 13:55:07 +00:00
Gleb Smirnoff
5c1f0f6de5 Staticize local functions. 2005-02-26 10:33:14 +00:00
Gleb Smirnoff
88bf82a62e New lines when logging. 2005-02-25 11:26:39 +00:00
Gleb Smirnoff
947b7cf3c6 Embrace macros with do {} while (0)
Submitted by:	maxim
2005-02-25 10:49:47 +00:00
Gleb Smirnoff
39aeaa0eb5 Call carp_carpdev_state() from carp_set_addr6(). See log for rev 1.4.
Sponsored by:	Rambler
2005-02-25 10:12:11 +00:00
Gleb Smirnoff
1e9e65729b Improve logging:
- Simplify CARP_LOG() and making it working (we don't have addlog in FreeBSD).
 - Introduce CARP_DEBUG() which logs with LOG_DEBUG severity when
   net.inet.carp.log > 1
 - Use CARP_DEBUG to log state changes of carp interfaces.

After CARP_LOG() cleanup it appeared that carp_input_c() does not need sc
argument. Remove it.

Sponsored by:	Rambler
2005-02-25 10:09:44 +00:00
Gleb Smirnoff
6fba4c0bae Fix problem when master comes up with one interface down, and preempts
mastering on all other interfaces:

- call carp_carpdev_state() on initialize instead of just setting to INIT
- in carp_carpdev_state() check that interface is UP, instead of checking
  that it is not DOWN, because a rebooted machine may have interface in
  UNKNOWN state.

Sponsored by:	Rambler
Obtained from:	OpenBSD (partially)
2005-02-24 09:05:28 +00:00
Sam Leffler
db77984c5b fix potential invalid index into ip_protox array
Noticed by:	Coverity Prevent analysis tool
2005-02-23 00:38:12 +00:00
Maxime Henrion
2368737719 Unbreak CARP build on 64-bit architectures.
Tested on:	sparc64
2005-02-23 00:20:33 +00:00
Andre Oppermann
099dd0430b Bring back the full packet destination manipulation for 'ipfw fwd'
with the kernel compile time option:

 options IPFIREWALL_FORWARD_EXTENDED

This option has to be specified in addition to IPFIRWALL_FORWARD.

With this option even packets targeted for an IP address local
to the host can be redirected.  All restrictions to ensure proper
behaviour for locally generated packets are turned off.  Firewall
rules have to be carefully crafted to make sure that things like
PMTU discovery do not break.

Document the two kernel options.

PR:		kern/71910
PR:		kern/73129
MFC after:	1 week
2005-02-22 17:40:40 +00:00
Gleb Smirnoff
67df421496 Remove promisc counter from parent interface in carp_clone_destroy(),
so that parent interface is not left in promiscous mode after carp
interface is destroyed.

This is not perfect, since promisc counter is added when carp
interface is assigned an IP address. However, when address is removed
parent interface is still in promiscuous mode. Only removal of
carp interface removes promisc from parent. Same way in OpenBSD.

Sponsored by:	Rambler
2005-02-22 16:24:55 +00:00
Gleb Smirnoff
a97719482d Add CARP (Common Address Redundancy Protocol), which allows multiple
hosts to share an IP address, providing high availability and load
balancing.

Original work on CARP done by Michael Shalayeff, with many
additions by Marco Pfatschbacher and Ryan McBride.

FreeBSD port done solely by Max Laier.

Patch by:	mlaier
Obtained from:	OpenBSD (mickey, mcbride)
2005-02-22 13:04:05 +00:00
Gleb Smirnoff
797127a9bf We can make code simplier after last change.
Noticed by:	Andrew Thompson
2005-02-22 08:35:24 +00:00
Gleb Smirnoff
3a1757b9c0 In in_pcbconnect_setup() jailed sockets are treated specially: if local
address is not supplied, then jail IP is choosed and in_pcbbind() is called.
Since udp_output() does not save local addr after call to in_pcbconnect_setup(),
in_pcbbind() is called for each packet, and this is incorrect.

So, we shall treat jailed sockets specially in udp_output(), we will save
their local address.

This fixes a long standing bug with broken sendto() system call in jails.

PR:		kern/26506
Reviewed by:	rwatson
MFC after:	2 weeks
2005-02-22 07:50:02 +00:00
Gleb Smirnoff
914d092f5d In in_pcbconnect_setup() remove a check that route points at
loopback interface. Nobody have explained me sense of this check.
It breaks connect() system call to a destination address which is
loopback routed (e.g. blackholed).

Reviewed by:	silence on net@
MFC after:	2 weeks
2005-02-22 07:39:15 +00:00
Robert Watson
0daccb9c94 In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections.  As part of this state transition, solisten() calls into the
protocol to update protocol-layer state.  There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.

This change does the following:

- Pushes the socket state transition from the socket layer solisten() to
  to socket "library" routines called from the protocol.  This permits
  the socket routines to be called while holding the protocol mutexes,
  preventing a race exposing the incomplete socket state transition to TCP
  after the TCP state transition has completed.  The check for a socket
  layer state transition is performed by solisten_proto_check(), and the
  actual transition is performed by solisten_proto().

- Holds the socket lock for the duration of the socket state test and set,
  and over the protocol layer state transition, which is now possible as
  the socket lock is acquired by the protocol layer, rather than vice
  versa.  This prevents additional state related races in the socket
  layer.

This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another.  Similar changes are likely require
elsewhere in the socket/protocol code.

Reported by:		Peter Holm <peter@holm.cc>
Review and fixes from:	emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod:	gnn
2005-02-21 21:58:17 +00:00
Paul Saab
7643c37cf2 Remove 2 (SACK) fields from the tcpcb. These are only used by a
function that is called from tcp_input(), so they oughta be passed on
the stack instead of stuck in the tcpcb.

Submitted by:	Mohan Srinivasan
2005-02-17 23:04:56 +00:00
Paul Saab
7776346f83 Fix for a SACK (receiver) bug where incorrect SACK blocks are
reported to the sender - in the case where the sender sends data
outside the window (as WinXP does :().

Reported by:	Sam Jensen <sam at wand dot net dot nz>
Submitted by:	Mohan Srinivasan
2005-02-16 01:46:17 +00:00
Paul Saab
8db456bf17 - Retransmit just one segment on initiation of SACK recovery.
Remove the SACK "initburst" sysctl.
- Fix bugs in SACK dupack and partialack handling that can cause
  large bursts while in SACK recovery.

Submitted by:	Mohan Srinivasan
2005-02-14 21:01:08 +00:00
Maxim Konovalov
9945c0e21f o Add handling of an IPv4-mapped IPv6 address.
o Use SYSCTL_IN() macro instead of direct call of copyin(9).

Submitted by:	ume

o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where
most of tcp sysctls live.
o There are net.inet[6].tcp[6].getcred sysctls already, no needs in
a separate struct tcp_ident_mapping.

Suggested by:	ume
2005-02-14 07:37:51 +00:00
Gleb Smirnoff
1af305441d Jump to common action checks after doing specific once. This fixes adding
of divert rules, which I break in previous commit.

Pointy hat to:	glebius
2005-02-06 11:13:59 +00:00
Maxim Konovalov
212a79b010 o Implement net.inet.tcp.drop sysctl and userland part, tcpdrop(8)
utility:

    The tcpdrop command drops the TCP connection specified by the
    local address laddr, port lport and the foreign address faddr,
    port fport.

Obtained from:	OpenBSD
Reviewed by:	rwatson (locking), ru (man page), -current
MFC after:	1 month
2005-02-06 10:47:12 +00:00
Gleb Smirnoff
670742a102 Add a ng_ipfw node, implementing a quick and simple interface between
ipfw(4) and netgraph(4) facilities.

Reviewed by:	andre, brooks, julian
2005-02-05 12:06:33 +00:00
Hajimu UMEMOTO
6d0a982bdf teach scope of IPv6 address to net.inet6.tcp6.getcred.
MFC after:	1 week
2005-02-04 14:43:05 +00:00
Robert Watson
06456da2c6 Update an additional reference to the rate of ISN tick callouts that was
missed in tcp_subr.c:1.216: projected_offset must also reflect how often
the tcp_isn_tick() callout will fire.

MFC after:	2 weeks
Submitted by:	silby
2005-01-31 01:35:01 +00:00
Christian S.J. Peron
0ba04c87b3 Change the state allocator from using regular malloc to using
a UMA zone instead. This should eliminate a bit of the locking
overhead associated with with malloc and reduce the memory
consumption associated with each new state.

Reviewed by:	rwatson, andre
Silence on:	ipfw@
MFC after:	1 week
2005-01-31 00:48:39 +00:00
Robert Watson
54082796aa Have tcp_isn_tick() fire 100 times a second, rather than HZ times a
second; since the default hz has changed to 1000 times a second,
this resulted in unecessary work being performed.

MFC after:		2 weeks
Discussed with:		phk, cperciva
General head nod:	silby
2005-01-30 23:30:28 +00:00
Robert Watson
024105493d Prefer (NULL) spelling of (0) for pointers.
MFC after:	3 days
2005-01-30 19:29:47 +00:00
Robert Watson
77c16eed7c Remove clause three from tcp_syncache.c license per permission of
McAfee.  Update copyright to McAfee from NETA.
2005-01-30 19:28:27 +00:00
Alan Cox
7258e9687b Correctly move the packet header in ip_insertoptions().
Reported by: Anupam Chanda
Reviewed by: sam@
MFC after: 2 weeks
2005-01-23 19:43:46 +00:00
Ruslan Ermilov
24a0682c64 Sort sections. 2005-01-20 09:17:07 +00:00
Gleb Smirnoff
28935658c4 - Reduce number of arguments passed to dummynet_io(), we already have cookie
in struct ip_fw_args itself.
- Remove redundant &= 0xffff from dummynet_io().
2005-01-16 11:13:18 +00:00
Gleb Smirnoff
6c69a7c30b o Clean up interface between ip_fw_chk() and its callers:
- ip_fw_chk() returns action as function return value. Field retval is
  removed from args structure. Action is not flag any more. It is one
  of integer constants.
- Any action-specific cookies are returned either in new "cookie" field
  in args structure (dummynet, future netgraph glue), or in mbuf tag
  attached to packet (divert, tee, some future action).

o Convert parsing of return value from ip_fw_chk() in ipfw_check_{in,out}()
  to a switch structure, so that the functions are more readable, and a future
  actions can be added with less modifications.

Approved by:	andre
MFC after:	2 months
2005-01-14 09:00:46 +00:00
Paul Saab
8d03f2b53b Fix a TCP SACK related crash resulting from incorrect computation
of len in tcp_output(), in the case where the FIN has already been
transmitted. The mis-computation of len is because of a gcc
optimization issue, which this change works around.

Submitted by:	Mohan Srinivasan
2005-01-12 21:40:51 +00:00
Brian Somers
2a4cd52421 include "alias.h", not <alias.h>
MFC after:	3 days
2005-01-10 10:54:06 +00:00
Warner Losh
c398230b64 /* -> /*- for license, minor formatting changes 2005-01-07 01:45:51 +00:00
Mike Silbersack
a69968ee4e Add a sysctl (net.inet.tcp.insecure_rst) which allows one to specify
that the RFC 793 specification for accepting RST packets should be
following.  When followed, this makes one vulnerable to the attacks
described in "slipping in the window", but it may be necessary in
some odd circumstances.
2005-01-03 07:08:37 +00:00
Mike Silbersack
5f311da2cc Port randomization leads to extremely fast port reuse at high
connection rates, which is causing problems for some users.

To retain the security advantage of random ports and ensure
correct operation for high connection rate users, disable
port randomization during periods of high connection rates.

Whenever the connection rate exceeds randomcps (10 by default),
randomization will be disabled for randomtime (45 by default)
seconds.  These thresholds may be tuned via sysctl.

Many thanks to Igor Sysoev, who proved the necessity of this
change and tested many preliminary versions of the patch.

MFC After:	20 seconds
2005-01-02 01:50:57 +00:00
Robert Watson
74d4630b71 Remove an errant blank line apparently introduced in
ip_output.c:1.194.
2004-12-25 22:59:42 +00:00
Robert Watson
42cf3289c3 In the dropafterack case of tcp_input(), it's OK to release the TCP
pcbinfo lock before calling tcp_output(), as holding just the inpcb
lock is sufficient to prevent garbage collection.
2004-12-25 22:26:13 +00:00
Robert Watson
e0bef1cb35 Revert parts of tcp_input.c:1.255 associated with the header predicted
cases for tcp_input():

While it is true that the pcbinfo lock provides a pseudo-reference to
inpcbs, both the inpcb and pcbinfo locks are required to free an
un-referenced inpcb.  As such, we can release the pcbinfo lock as
long as the inpcb remains locked with the confidence that it will not
be garbage-collected.  This leads to a less conservative locking
strategy that should reduce contention on the TCP pcbinfo lock.

Discussed with: sam
2004-12-25 22:23:13 +00:00
Robert Watson
452d9f5b1c Attempt to consistently use () around return values in calls to
return() in newer code (sysctl, ISN, timewait).
2004-12-23 01:34:26 +00:00
Robert Watson
06da46b241 Remove an XXXRW comment relating to whether or not the TCP timers are
MPSAFE: they are now believed to be.

Correct a typo in a second comment.

MFC after:	2 weeks
2004-12-23 01:27:13 +00:00
Robert Watson
db0aae38b6 Remove the now unused tcp_canceltimers() function. tcpcb timers are
now stopped as part of tcp_discardcb().

MFC after:	2 weeks
2004-12-23 01:25:59 +00:00
Robert Watson
950ab1e470 Remove an annotation of a minor race relating to the update of
multiple MIB entries using sysctl in short order, which might
result in unexpected values for tcp_maxidle being generated by
tcp_slowtimo.  In practice, this will not happen, or at least,
doesn't require an explicit comment.

MFC after:	2 weeks
2004-12-23 01:21:54 +00:00
Gleb Smirnoff
5e5da86597 In certain cases ip_output() can free our route, so check
for its presence before RTFREE().

Noticed by:	ru
2004-12-10 07:51:14 +00:00
Gleb Smirnoff
d2a09f901a Revert last change.
Andre:
  First lets get major new features into the kernel in a clean and nice way,
  and then start optimizing. In this case we don't have any obfusication that
  makes later profiling and/or optimizing difficult in any way.

Requested by:	csjp, sam
2004-12-10 07:47:17 +00:00
Christian S.J. Peron
fbf2edb6e4 This commit adds a shared locking mechanism very similar to the
mechanism used by pfil.  This shared locking mechanism will remove
a nasty lock order reversal which occurs when ucred based rules
are used which results in hard locks while mpsafenet=1.

So this removes the debug.mpsafenet=0 requirement when using
ucred based rules with IPFW.

It should be noted that this locking mechanism does not guarantee
fairness between read and write locks, and that it will favor
firewall chain readers over writers. This seemed acceptable since
write operations to firewall chains protected by this lock tend to
be less frequent than reads.

Reviewed by:	andre, rwatson
Tested by:	myself, seanc
Silence on:	ipfw@
MFC after:	1 month
2004-12-10 02:17:18 +00:00
Gleb Smirnoff
f5a19d3909 Check that DUMMYNET_LOADED before seeking dummynet m_tag.
Reviewed by:	andre
MFC after:	1 week
2004-12-09 16:41:47 +00:00
Max Laier
067a8bab8a More fixing of multiple addresses in the same prefix. This time do not try
to arp resolve "secondary" local addresses.

Found and submitted by:	ru
With additions from:	OpenBSD (rev. 1.47)
Reviewed by:		ru
2004-12-09 00:12:41 +00:00
Ruslan Ermilov
5cae05ad33 Time out routes created by redirect. 2004-12-06 22:27:22 +00:00
Gleb Smirnoff
98335aa976 - Make route cacheing optional, configurable via IFF_LINK0 flag.
- Turn it off by default.

Requested by:	many
Reviewed by:	andre
Approved by:	julian (mentor)
MFC after:	3 days
2004-12-06 19:02:43 +00:00
Robert Watson
79a9e59c89 Assert the tcptw inpcb lock in tcp_timer_2msl_reset(), as fields in
the tcptw undergo non-atomic read-modify-writes.

MFC after:	2 weeks
2004-12-05 22:47:29 +00:00
Robert Watson
b9155d92b2 Assert inpcb lock in:
tcpip_fillheaders()
  tcp_discardcb()
  tcp_close()
  tcp_notify()
  tcp_new_isn()
  tcp_xmit_bandwidth_limit()

Fix a locking comment in tcp_twstart(): the pcbinfo will be locked (and
is asserted).

MFC after:	2 weeks
2004-12-05 22:27:53 +00:00
Robert Watson
6fbed4af22 Minor grammer fix in comment. 2004-12-05 22:20:59 +00:00
Robert Watson
89924e5865 Pass the inpcb reference into ip_getmoptions() rather than just the
inp->inp_moptions pointer, so that ip_getmoptions() can perform
necessary locking when doing non-atomic reads.

Lock the inpcb by default to copy any data to local variables, then
unlock before performing sooptcopyout().

MFC after:	2 weeks
2004-12-05 22:08:37 +00:00
Robert Watson
92c71ab30b Define INP_UNLOCK_ASSERT() to assert that an inpcb is unlocked.
MFC after:	2 weeks
2004-12-05 22:07:14 +00:00
Robert Watson
5c918b56d8 Push the inpcb argument into ip_setmoptions() when setting IP multicast
socket options, so that it is available for locking.
2004-12-05 21:38:33 +00:00
Robert Watson
993d9505d4 Start working through inpcb locking for ip_ctloutput() by cleaning up
modifications to the inpcb IP options mbuf:

- Lock the inpcb before passing it into ip_pcbopts() in order to prevent
  simulatenous reads and read-modify-writes that could result in races.
- Pass the inpcb reference into ip_pcbopts() instead of the option chain
  pointer in the inpcb.
- Assert the inpcb lock in ip_pcbots.
- Convert one or two uses of a pointer as a boolean or an integer
  comparison to a comparison with NULL for readability.
2004-12-05 19:11:09 +00:00
Paul Saab
7d5ed1ceea Fixes a bug in SACK causing us to send data beyond the receive window.
Found by: Pawel Worach and Daniel Hartmeier
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
2004-11-29 18:47:27 +00:00
Robert Watson
2be3bf2244 Assert the inpcb lock in tcp_xmit_timer() as it performs read-modify-
write of various time/rtt-related fields in the tcpcb.
2004-11-28 11:06:22 +00:00
Robert Watson
18ad5842c5 Expand coverage of the receive socket buffer lock when handling urgent
pointer updates: test available space while holding the socket buffer
mutex, and continue to hold until until the pointer update has been
performed.

MFC after:	2 weeks
2004-11-28 11:01:31 +00:00
Robert Watson
c8443a1dc0 Do export the advertised receive window via the tcpi_rcv_space field of
struct tcp_info.
2004-11-27 20:20:11 +00:00
Robert Watson
b8af5dfa81 Implement parts of the TCP_INFO socket option as found in Linux 2.6.
This socket option allows processes query a TCP socket for some low
level transmission details, such as the current send, bandwidth, and
congestion windows.  Linux provides a 'struct tcpinfo' structure
containing various variables, rather than separate socket options;
this makes the API somewhat fragile as it makes it dificult to add
new entries of interest as requirements and implementation evolve.
As such, I've included a large pad at the end of the structure.
Right now, relatively few of the Linux API fields are filled in, and
some contain no logical equivilent on FreeBSD.  I've include __'d
entries in the structure to make it easier to figure ou what is and
isn't omitted.  This API/ABI should be considered unstable for the
time being.
2004-11-26 18:58:46 +00:00
Mike Silbersack
6a220ed80a Fix a problem where our TCP stack would ignore RST packets if the receive
window was 0 bytes in size.  This may have been the cause of unsolved
"connection not closing" reports over the years.

Thanks to Michiel Boland for providing the fix and providing a concise
test program for the problem.

Submitted by:	Michiel Boland
MFC after:	2 weeks
2004-11-25 19:04:20 +00:00
Robert Watson
de30ea131f In tcp_reass(), assert the inpcb lock on the passed tcpcb, since the
contents of the tcpcb are read and modified in volume.

In tcp_input(), replace th comparison with 0 with a comparison with
NULL.

At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in
tcp_input(), assert 'headlocked'.  Try to improve consistency between
various assertions regarding headlocked to be more informative.

MFC after:	2 weeks
2004-11-23 23:41:20 +00:00
Robert Watson
cce83ffb5a tcp_timewait() performs multiple non-atomic reads on the tcptw
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_2msl_rest(), which require it.  Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.

In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest()
is called, which requires that lock.

In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.

Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).

In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.

In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.

In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.

MFC after:	2 weeks
2004-11-23 17:21:30 +00:00
Robert Watson
b42ff86e73 De-spl tcp_slowtimo; tcp_maxidle assignment is subject to possible
but unlikely races that could be corrected by having tcp_keepcnt
and tcp_keepintvl modifications go through handler functions via
sysctl, but probably is not worth doing.  Updates to multiple
sysctls within evaluation of a single addition are unlikely.

Annotate that tcp_canceltimers() is currently unused.

De-spl tcp_timer_delack().

De-spl tcp_timer_2msl().

MFC after:	2 weeks
2004-11-23 16:45:07 +00:00
Robert Watson
7258e91f0f Assert the inpcb lock in tcp_twstart(), which does both read-modify-write
on the tcpcb, but also calls into tcp_close() and tcp_twrespond().

Annotate that tcp_twrecycleable() requires the inpcb lock because it does
a series of non-atomic reads of the tcpcb, but is currently called
without the inpcb lock by the caller.  This is a bug.

Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write
of the timewait structure/inpcb, and calls in_pcbdetach() which requires
the lock.

Assert the inpcb lock in tcp_twrespond(), as it performs multiple
non-atomic reads of the tcptw and inpcb structures, as well as calling
mac_create_mbuf_from_inpcb(), tcpip_fillheaders(), which require the
inpcb lock.

MFC after:	2 weeks
2004-11-23 16:23:13 +00:00
Robert Watson
8263bab34d Assert inpcb lock in tcp_quench(), tcp_drop_syn_sent(), tcp_mtudisc(),
and tcp_drop(), due to read-modify-write of TCP state variables.

MFC after:	2 weeks
2004-11-23 16:06:15 +00:00
Robert Watson
8438db0f59 Assert the tcbinfo write lock in tcp_new_isn(), as the tcbinfo lock
protects access to the ISN state variables.

Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize
timer-driven isn bumping.

Staticize internal ISN variables since they're not used outside of
tcp_subr.c.

MFC after:	2 weeks
2004-11-23 15:59:43 +00:00
Robert Watson
ca127a3e80 Remove "Unlocked read" annotations associated with previously unlocked
use of socket buffer fields in the TCP input code.  These references
are now protected by use of the receive socket buffer lock.

MFC after:	1 week
2004-11-22 13:16:27 +00:00
Robert Watson
98734750b4 s/send/sent/ in comment describing TCPS_SYN_RECEIVED. 2004-11-21 14:38:04 +00:00
Gleb Smirnoff
c1384b5ae2 - Since divert protocol is not connection oriented, remove SS_ISCONNECTED flag
from divert sockets.
- Remove div_disconnect() method, since it shouldn't be called now.
- Remove div_abort() method. It was never called directly, since protocol
  doesn't have listen queue. It was called only from div_disconnect(),
  which is removed now.

Reviewed by:	rwatson, maxim
Approved by:	julian (mentor)
MT5 after:	1 week
MT4 after:	1 month
2004-11-18 13:49:18 +00:00
Max Laier
9a6a6eeba2 Fix host route addition for more than one address to a loopback interface
after allowing more than one address with the same prefix.

Reported by:	Vladimir Grebenschikov <vova NO fbsd SPAM ru>
Submitted by:	ru (also NetBSD rev. 1.83)
Pointyhat to:	mlaier
2004-11-17 23:14:03 +00:00
Max Laier
81d96ce8a4 Merge copyright notices.
Requested by:	njl
2004-11-13 17:05:40 +00:00
Gleb Smirnoff
ea0bd57615 Fix ng_ksocket(4) operation as a divert socket, which is pretty useful
and has been broken twice:

- in the beginning of div_output() replace KASSERT with assignment, as
  it was in rev. 1.83. [1] [to be MFCed]
- refactor changes introduced in rev. 1.100: do not prepend a new tag
  unconditionally. Before doing this check whether we have one. [2]

A small note for all hacking in this area:
when divert socket is not a real userland, but ng_ksocket(4), we receive
_the same_ mbufs, that we transmitted to socket. These mbufs have rcvif,
the tags we've put on them. And we should treat them correctly.

Discussed with:	mlaier [1]
Silence from:	green [2]
Reviewed by:	maxim
Approved by:	julian (mentor)
MFC after:	1 week
2004-11-12 22:17:42 +00:00
Max Laier
48321abefe Change the way we automatically add prefix routes when adding a new address.
This makes it possible to have more than one address with the same prefix.
The first address added is used for the route. On deletion of an address
with IFA_ROUTE set, we try to find a "fallback" address and hand over the
route if possible.
I plan to MFC this in 4 weeks, hence I keep the - now obsolete - argument to
in_ifscrub as it must be considered KAPI as it is not static in in.c. I will
clean this after the MFC.

Discussed on:	arch, net
Tested by:	many testers of the CARP patches
Nits from:	ru, Andrea Campi <andrea+freebsd_arch webcom it>
Obtained from:	WIDE via OpenBSD
MFC after:	1 month
2004-11-12 20:53:51 +00:00
Poul-Henning Kamp
e21e4c19c9 Add missing '='
Spotted by:	obrien
2004-11-11 19:02:01 +00:00
Andre Oppermann
5e7b233055 Fix a double-free in the 'hlen > m->m_len' sanity check.
Bug report by:	<james@towardex.com>
MFC after:	2 weeks
2004-11-09 09:40:32 +00:00
SUZUKI Shinsuke
3d54848fc2 support TCP-MD5(IPv4) in KAME-IPSEC, too.
MFC after: 3 week
2004-11-08 18:49:51 +00:00
Poul-Henning Kamp
756d52a195 Initialize struct pr_userreqs in new/sparse style and fill in common
default elements in net_init_domain().

This makes it possible to grep these structures and see any bogosities.
2004-11-08 14:44:54 +00:00
Robert Watson
d6915262af Do some re-sorting of TCP pcbinfo locking and assertions: make sure to
retain the pcbinfo lock until we're done using a pcb in the in-bound
path, as the pcbinfo lock acts as a pseuo-reference to prevent the pcb
from potentially being recycled.  Clean up assertions and make sure to
assert that the pcbinfo is locked at the head of code subsections where
it is needed.  Free the mbuf at the end of tcp_input after releasing
any held locks to reduce the time the locks are held.

MFC after:	3 weeks
2004-11-07 19:19:35 +00:00
Andre Oppermann
e9a4cd2426 Fix a double-free in the 'm->m_len < sizeof (struct ip)' sanity check.
Bug report by:	<james@towardex.com>
MFC after:	2 weeks
2004-11-06 10:47:36 +00:00
Poul-Henning Kamp
c83c1318f5 Hide udp_in6 behind #ifdef INET6 2004-11-04 07:14:03 +00:00
Bruce M Simpson
38f061057b When performing IP fast forwarding, immediately drop traffic which is
destined for a blackhole route.

This also means that blackhole routes do not need to be bound to lo(4)
or disc(4) interfaces for the net.inet.ip.fastforwarding=1 case.

Submitted by:	james at towardex dot com
Sponsored by:	eXtensible Open Router Project <URL:http://www.xorp.org/>
MFC after:	3 weeks
2004-11-04 02:14:38 +00:00
Robert Watson
d4b509bd7f Until this change, the UDP input code used global variables udp_in,
udp_in6, and udp_ip6 to pass socket address state between udp_input(),
udp_append(), and soappendaddr_locked().  While file in the default
configuration, when running with multiple netisrs or direct ithread
dispatch, this can result in races wherein user processes using
recvmsg() get back the wrong source IP/port.  To correct this and
related races:

- Eliminate udp_ip6, which is believed to be generated but then never
  used.  Eliminate ip_2_ip6_hdr() as it is now unneeded.

- Eliminate setting, testing, and existence of 'init' status fields
  for the IPv6 structures.  While with multiple UDP delivery this
  could lead to amortization of IPv4 -> IPv6 conversion when
  delivering an IPv4 UDP packet to an IPv6 socket, it added
  substantial complexity and side effects.

- Move global structures into the stack, declaring udp_in in
  udp_input(), and udp_in6 in udp_append() to be used if a conversion
  is required.  Pass &udp_in into udp_append().

- Re-annotate comments to reflect updates.

With this change, UDP appears to operate correctly in the presence of
substantial inbound processing parallelism.  This solution avoids
introducing additional synchronization, but does increase the
potential stack depth.

Discovered by:	kris (Bug Magnet)
MFC after:	3 weeks
2004-11-04 01:25:23 +00:00
Andre Oppermann
c94c54e4df Remove RFC1644 T/TCP support from the TCP side of the network stack.
A complete rationale and discussion is given in this message
and the resulting discussion:

 http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel.  Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols)  and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on:	-arch
2004-11-02 22:22:22 +00:00
Robert Watson
ab5c14d828 Correct a bug in TCP SACK that could result in wedging of the TCP stack
under high load: only set function state to loop and continuing sending
if there is no data left to send.

RELENG_5_3 candidate.

Feet provided:	Peter Losher <Peter underscore Losher at isc dot org>
Diagnosed by:	Aniel Hartmeier <daniel at benzedrine dot cx>
Submitted by:	mohan <mohans at yahoo-inc dot com>
2004-10-30 12:02:50 +00:00
Robert Watson
c427483381 Add a matching tunable for net.inet.tcp.sack.enable sysctl. 2004-10-26 08:59:09 +00:00
Bruce M Simpson
d6fa5d2806 Check that rt_mask(rt) is non-NULL before dereferencing it, in the
RTM_ADD case, thus avoiding a panic.

Submitted by:	Iasen Kostov
2004-10-26 03:31:58 +00:00
Andre Oppermann
84bb6a2e75 IPDIVERT is a module now and tell the other parts of the kernel about it.
IPDIVERT depends on IPFIREWALL being loaded or compiled into the kernel.
2004-10-25 20:02:34 +00:00
Ruslan Ermilov
a35d88931c For variables that are only checked with defined(), don't provide
any fake value.
2004-10-24 15:33:08 +00:00
Andre Oppermann
cd109b0d82 Shave 40 unused bytes from struct tcpcb. 2004-10-22 19:55:04 +00:00
Andre Oppermann
21dcc96f4a When printing the initialization string and IPDIVERT is not compiled into the
kernel refer to it as "loadable" instead of "disabled".
2004-10-22 19:18:06 +00:00
Andre Oppermann
24fc79b0a4 Refuse to unload the ipdivert module unless the 'force' flag is given to kldunload.
Reflect the fact that IPDIVERT is a loadable module in the divert(4) and ipfw(8)
man pages.
2004-10-22 19:12:01 +00:00
Andre Oppermann
57bbe2e1ab Destroy the UMA zone on unload. 2004-10-19 22:51:20 +00:00
Andre Oppermann
2de1a9eb6e Slightly extend the locking during unload to fully cover the protocol
deregistration.  This does not entirely close the race but narrows the
even previously extremely small chance of a race some more.
2004-10-19 22:08:13 +00:00
Robert Watson
279128e295 Annotate a newly introduced race present due to the unloading of
protocols: it is possible for sockets to be created and attached
to the divert protocol between the test for sockets present and
successful unload of the registration handler.  We will need to
explore more mature APIs for unregistering the protocol and then
draining consumers, or an atomic test-and-unregister mechanism.
2004-10-19 21:35:42 +00:00
Andre Oppermann
72584fd2c0 Convert IPDIVERT into a loadable module. This makes use of the dynamic loadability
of protocols.  The call to divert_packet() is done through a function pointer.  All
semantics of IPDIVERT remain intact.  If IPDIVERT is not loaded ipfw will refuse to
install divert rules and  natd will complain about 'protocol not supported'.  Once
it is loaded both will work and accept rules and open the divert socket.  The module
can only be unloaded if no divert sockets are open.  It does not close any divert
sockets when an unload is requested but will return EBUSY instead.
2004-10-19 21:14:57 +00:00
Andre Oppermann
969bb53e80 Properly declare the "net.inet" sysctl subtree. 2004-10-19 21:06:14 +00:00
Andre Oppermann
539be79a9d Pre-emptively define IPPROTO_SPACER to 32767, the same value as PROTO_SPACER
to document that this value is globally assigned for a special purpose and
may not be reused within the IPPROTO number space.
2004-10-19 20:59:01 +00:00
Andre Oppermann
dff3237ee5 Make use of the PROTO_SPACER functionality for dynamically loadable
protocols in inetsw[] and define initially eight spacer slots.

Remove conflicting declaration 'struct pr_usrreqs nousrreqs'.  It is
now declared and initialized in kern/uipc_domain.c.
2004-10-19 15:58:22 +00:00
Andre Oppermann
de38924dc0 Support for dynamically loadable and unloadable IP protocols in the ipmux.
With pr_proto_register() it has become possible to dynamically load protocols
within the PF_INET domain.  However the PF_INET domain has a second important
structure called ip_protox[] that is derived from the 'struct protosw inetsw[]'
and takes care of the de-multiplexing of the various protocols that ride on
top of IP packets.

The functions ipproto_[un]register() allow to dynamically adjust the ip_protox[]
array mux in a consistent and easy way.  To register a protocol within
ip_protox[] the existence of a corresponding and matching protocol definition
in inetsw[] is required.  The function does not allow to overwrite an already
registered protocol.  The unregister function simply replaces the mux slot with
the default index pointer to IPPROTO_RAW as it was previously.
2004-10-19 15:45:57 +00:00
Andre Oppermann
1cf15713ed Add a macro for the destruction of INP_INFO_LOCK's used by loadable modules. 2004-10-19 14:34:13 +00:00
Andre Oppermann
de1c2ac4bf Make comments more clear. Change the order of one if() statement to check the
more likely variable first.
2004-10-19 14:31:56 +00:00
Robert Watson
81158452be Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
  mutex, avoiding sofree() having to drop the socket mutex and re-order,
  which could lead to races permitting more than one thread to enter
  sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
  the protocol to the socket, preventing races in clearing and
  evaluation of the reference such that sofree() might be called more
  than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket.  The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets.  The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after:	3 days
Reviewed by:	dwhite
Discussed with:	gnn, dwhite, green
Reported by:	Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by:	Vlad <marchenko at gmail dot com>
2004-10-18 22:19:43 +00:00
Robert Watson
6b8e5a9862 Don't release the udbinfo lock until after the last use of UDP inpcb
in udp_input(), since the udbinfo lock is used to prevent removal of
the inpcb while in use (i.e., as a form of reference count) in the
in-bound path.

RELENG_5 candidate.
2004-10-12 20:03:56 +00:00
Robert Watson
00fcf9d12d Modify the thrilling "%D is using my IP address %s!" message so that
it isn't printed if the IP address in question is '0.0.0.0', which is
used by nodes performing DHCP lookup, and so constitute a false
positive as a report of misconfiguration.
2004-10-12 17:10:40 +00:00
Robert Watson
6c67b8b695 When the access control on creating raw sockets was modified so that
processes in jail could create raw sockets, additional access control
checks were added to raw IP sockets to limit the ways in which those
sockets could be used.  Specifically, only the socket option IP_HDRINCL
was permitted in rip_ctloutput().  Other socket options were protected
by a call to suser().  This change was required to prevent processes
in a Jail from modifying system properties such as multicast routing
and firewall rule sets.

However, it also introduced a regression: processes that create a raw
socket with root privilege, but then downgraded credential (i.e., a
daemon giving up root, or a setuid process switching back to the real
uid) could no longer issue other unprivileged generic IP socket option
operations, such as IP_TOS, IP_TTL, and the multicast group membership
options, which prevented multicast routing daemons (and some other
tools) from operating correctly.

This change pushes the access control decision down to the granularity
of individual socket options, rather than all socket options, on raw
IP sockets.  When rip_ctloutput() doesn't implement an option, it will
now pass the request directly to in_control() without an access
control check.  This should restore the functionality of the generic
IP socket options for raw sockets in the above-described scenarios,
which may be confirmed with the ipsockopt regression test.

RELENG_5 candidate.

Reviewed by:	csjp
2004-10-12 16:47:25 +00:00
Robert Watson
cf2942b67c Acquire the send socket buffer lock around tcp_output() activities
reaching into the socket buffer.  This prevents a number of potential
races, including dereferencing of sb_mb while unlocked leading to
a NULL pointer deref (how I found it).  Potentially this might also
explain other "odd" TCP behavior on SMP boxes (although  haven't
seen it reported).

RELENG_5 candidate.
2004-10-09 16:48:51 +00:00
Robert Watson
fcf4e3a168 When running with debug.mpsafenet=0, initialize IP multicast routing
callouts as non-CALLOUT_MPSAFE.  Otherwise, they may trigger an
assertion regarding Giant if they enter other parts of the stack from
the callout.

MFC after:	3 days
Reported by:	Dikshie < dikshie at ppk dot itb dot ac dot id >
2004-10-07 14:13:35 +00:00
Paul Saab
a55db2b6e6 - Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
  retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
  number of sack retransmissions done upon initiation of sack recovery.

Submitted by:	Mohan Srinivasan <mohans@yahoo-inc.com>
2004-10-05 18:36:24 +00:00
Brian Feldman
c99ee9e042 Add support to IPFW for matching by TCP data length. 2004-10-03 00:47:15 +00:00
Brian Feldman
6daf7ebd28 Add support to IPFW for classification based on "diverted" status
(that is, input via a divert socket).
2004-10-03 00:26:35 +00:00
Brian Feldman
974dfe3084 Add to IPFW the ability to do ALTQ classification/tagging. 2004-10-03 00:17:46 +00:00
Brian Feldman
88ef2880c1 Validate the action pointer to be within the rule size, so that trying to
add corrupt ipfw rules would not potentially panic the system or worse.
2004-09-30 17:42:00 +00:00
Max Laier
d6a8d58875 Add an additional struct inpcb * argument to pfil(9) in order to enable
passing along socket information. This is required to work around a LOR with
the socket code which results in an easy reproducible hard lockup with
debug.mpsafenet=1. This commit does *not* fix the LOR, but enables us to do
so later. The missing piece is to turn the filter locking into a leaf lock
and will follow in a seperate (later) commit.

This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in
forseeable future.

Suggested by:		rwatson
A lot of work by:	csjp (he'd be even more helpful w/o mentor-reviews ;)
Reviewed by:		rwatson, csjp
Tested by:		-pf, -ipfw, LINT, csjp and myself
MFC after:		3 days

LOR IDs:		14 - 17 (not fixed yet)
2004-09-29 04:54:33 +00:00
Robert Watson
48ac555d83 Assign so_pcb to NULL rather than 0 as it's a pointer.
Spotted by:	dwhite
2004-09-29 04:01:13 +00:00
Maxim Konovalov
4bc37f9836 o Turn net.inet.ip.check_interface sysctl off by default.
When net.inet.ip.check_interface was MFCed to RELENG_4 3+ years ago in
rev. 1.130.2.17 ip_input.c it was 1 by default but shortly changed to
0 (accidently?) in rev. 1.130.2.20 in RELENG_4 only.  Among with the
fact this knob is not documented it breaks POLA especially in bridge
environment.

OK'ed by:	andre
Reviewed by:	-current
2004-09-24 12:18:40 +00:00
Andre Oppermann
db09bef308 Fix an out of bounds write during the initialization of the PF_INET protocol
family to the ip_protox[] array.  The protocol number of IPPROTO_DIVERT is
larger than IPPROTO_MAX and was initializing memory beyond the array.
Catch all these kinds of errors by ignoring protocols that are higher than
IPPROTO_MAX or 0 (zero).

Add more comments ip_init().
2004-09-16 18:33:39 +00:00
Andre Oppermann
76ff6dcf46 Clarify some comments for the M_FASTFWD_OURS case in ip_input(). 2004-09-15 20:17:03 +00:00
Andre Oppermann
e098266191 Remove the last two global variables that are used to store packet state while
it travels through the IP stack.  This wasn't much of a problem because IP
source routing is disabled by default but when enabled together with SMP and
preemption it would have very likely cross-corrupted the IP options in transit.

The IP source route options of a packet are now stored in a mtag instead of the
global variable.
2004-09-15 20:13:26 +00:00
Andre Oppermann
bda337d05e Do not allow 'ipfw fwd' command when IPFIREWALL_FORWARD is not compiled into
the kernel.  Return EINVAL instead.
2004-09-13 19:27:23 +00:00
Andre Oppermann
f91248c1ad If we have to 'ipfw fwd'-tag a packet the second time in ipfw_pfil_out() don't
prepend an already existing tag again.  Instead unlink it and prepend it again
to have it as the first tag in the chain.

PR:		kern/71380
2004-09-13 19:20:14 +00:00
Andre Oppermann
f4fca2d8d3 Make comments more clear for the packet changed cases after pfil hooks. 2004-09-13 17:09:06 +00:00
Andre Oppermann
eedc0a7535 Fix ip_input() fallback for the destination modified cases (from the packet
filters).  After the ipfw to pfil move ip_input() expects M_FASTFWD_OURS
tagged packets to have ip_len and ip_off in host byte order instead of
network byte order.

PR:		kern/71652
Submitted by:	mlaier (patch)
2004-09-13 17:01:53 +00:00
Andre Oppermann
7c0102f575 Make 'ipfw tee' behave as inteded and designed. A tee'd packet is copied
and sent to the DIVERT socket while the original packet continues with the
next rule.  Unlike a normally diverted packet no IP reassembly attemts are
made on tee'd packets and they are passed upwards totally unmodified.

Note: This will not be MFC'd to 4.x because of major infrastucture changes.

PR:		kern/64240 (and many others collapsed into that one)
2004-09-13 16:46:05 +00:00
Gleb Smirnoff
324398687f Check flag do_bridge always, even if kernel was compiled without
BRIDGE support. This makes dynamic bridge.ko working.

Reviewed by:	sam
Approved by:	julian (mentor)
MFC after:	1 week
2004-09-09 12:34:07 +00:00
John-Mark Gurney
cb459254a2 revert comment from rev1.158 now that rev1.225 backed it out..
MFC after:	3 days
2004-09-06 15:48:38 +00:00
Gleb Smirnoff
f46a6aac29 Recover normal behavior: return EINVAL to attempt to add a divert rule
when module is built without IPDIVERT.

Silence from:	andre
Approved by:	julian (mentor)
2004-09-05 20:06:50 +00:00
John-Mark Gurney
b5d47ff592 fix up socket/ip layer violation... don't assume/know that
SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...
2004-09-05 02:34:12 +00:00
Andre Oppermann
3161f583ca Apply error and success logic consistently to the function netisr_queue() and
its users.

netisr_queue() now returns (0) on success and ERRNO on failure.  At the
moment ENXIO (netisr queue not functional) and ENOBUFS (netisr queue full)
are supported.

Previously it would return (1) on success but the return value of IF_HANDOFF()
was interpreted wrongly and (0) was actually returned on success.  Due to this
schednetisr() was never called to kick the scheduling of the isr.  However this
was masked by other normal packets coming through netisr_dispatch() causing the
dequeueing of waiting packets.

PR:		kern/70988
Found by:	MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after:	3 days
2004-08-27 18:33:08 +00:00
Andre Oppermann
a9c92b54a9 In the case the destination of a packet was changed by the packet filter
to point to a local IP address; and the packet was sourced from this host
we fill in the m_pkthdr.rcvif with a pointer to the loopback interface.

Before the function ifunit("lo0") was used to obtain the ifp.  However
this is sub-optimal from a performance point of view and might be dangerous
if the loopback interface has been renamed.  Use the global variable 'loif'
instead which always points to the loopback interface.

Submitted by:	brooks
2004-08-27 15:39:34 +00:00
Andre Oppermann
319c4c256a Remove a junk line left over from the recent IPFW to PFIL_HOOKS conversion. 2004-08-27 15:32:28 +00:00
Andre Oppermann
c21fd23260 Always compile PFIL_HOOKS into the kernel and remove the associated kernel
compile option.  All FreeBSD packet filters now use the PFIL_HOOKS API and
thus it becomes a standard part of the network stack.

If no hooks are connected the entire packet filter hooks section and related
activities are jumped over.  This removes any performance impact if no hooks
are active.

Both OpenBSD and DragonFlyBSD have integrated PFIL_HOOKS permanently as well.
2004-08-27 15:16:24 +00:00
Ruslan Ermilov
9bfe6d472a Revert the last change to sys/modules/ipfw/Makefile and fix a
standalone module build in a better way.

Silence from:	andre
MFC after:	3 days
2004-08-26 14:18:30 +00:00
Pawel Jakub Dawidek
a7f3feff1b Allocate memory when dumping pipes with M_WAITOK flag.
On a system with huge number of pipes, M_NOWAIT failes almost always,
because of memory fragmentation.
My fix is different than the patch proposed by Pawel Malachowski,
because in FreeBSD 5.x we cannot sleep while holding dummynet mutex
(in 4.x there is no such lock).
My fix is also ugly, but there is no easy way to prepare nice and clean fix.

PR:		kern/46557
Submitted by:	Eugene Grosbein <eugen@grosbein.pp.ru>
Reviewed by:	mlaier
2004-08-25 09:31:30 +00:00
Max Laier
ca7a789aa6 Allow early drop for non-ALTQ enabled queues in an ALTQ-enabled kernel.
Previously the early drop was disabled unconditionally for ALTQ-enabled
kernels.

This should give some benefit for the normal gateway + LAN-server case with
a busy LAN leg and an ALTQ managed uplink.

Reviewed and style help from:	cperciva, pjd
2004-08-22 16:42:28 +00:00
Robert Watson
392e840716 When sliding the m_data pointer forward, update m_pktrhdr.len as well
as m_len, or the pkthdr length will be inconsistent with the actual
length of data in the mbuf chain.  The symptom of this occuring was
"out of data" warnings from in_cksum_skip() on large UDP packets sent
via the loopback interface.

Foot shot:	green
2004-08-22 01:32:48 +00:00
Christian S.J. Peron
5090559b7f When a prison is given the ability to create raw sockets (when the
security.jail.allow_raw_sockets sysctl MIB is set to 1) where privileged
access to jails is given out, it is possible for prison root to manipulate
various network parameters which effect the host environment. This commit
plugs a number of security holes associated with the use of raw sockets
and prisons.

This commit makes the following changes:

- Add a comment to rtioctl warning developers that if they add
  any ioctl commands, they should use super-user checks where necessary,
  as it is possible for PRISON root to make it this far in execution.
- Add super-user checks for the execution of the SIOCGETVIFCNT
  and SIOCGETSGCNT IP multicast ioctl commands.
- Add a super-user check to rip_ctloutput(). If the calling cred
  is PRISON root, make sure the socket option name is IP_HDRINCL,
  otherwise deny the request.

Although this patch corrects a number of security problems associated
with raw sockets and prisons, the warning in jail(8) should still
apply, and by default we should keep the default value of
security.jail.allow_raw_sockets MIB to 0 (or disabled) until
we are certain that we have tracked down all the problems.

Looking forward, we will probably want to eliminate the
references to curthread.

This may be a MFC candidate for RELENG_5.

Reviewed by:	rwatson
Approved by:	bmilekic (mentor)
2004-08-21 17:38:57 +00:00
Robert Watson
e6ccd70936 When prepending space onto outgoing UDP datagram payloads to hold the
UDP/IP header, make sure that space is also allocated for the link
layer header.  If an mbuf must be allocated to hold the UDP/IP header
(very likely), then this will avoid an additional mbuf allocation at
the link layer.  This trick is also used by TCP and other protocols to
avoid extra calls to the mbuf allocator in the ethernet (and related)
output routines.
2004-08-21 16:14:04 +00:00
Andre Oppermann
ce63226177 Fix a stupid typo which prevented an ipfw KLD unload from successfully cleaning
up its remains.  Do not terminate 'if' lines with ';'.

Spotted by:	claudio@OpenBSD.ORG (sitting 3m from my desk)
Pointy hat to:	andre
2004-08-20 00:36:55 +00:00
Andre Oppermann
70222723f3 When unloading ipfw module use callout_drain() to make absolutely sure that
all callouts are stopped and finished.  Move it before IPFW_LOCK() to avoid
deadlocking when draining callouts.
2004-08-19 23:31:40 +00:00
Andre Oppermann
6f2d4ea6f8 For IPv6 access pointer to tcpcb only after we have checked it is valid.
Found by:	Coverity's automated analysis (via Ted Unangst)
2004-08-19 20:16:17 +00:00
Andre Oppermann
50ab727669 Give a useful error message if someone tries to compile IPFIREWALL into the
kernel without specifying PFIL_HOOKS as well.
2004-08-19 18:38:23 +00:00
Andre Oppermann
9108601915 Do not unconditionally ignore IPDIVERT and IPFIREWALL_FORWARD when building
the ipfw KLD.

 For IPFIREWALL_FORWARD this does not have any side effects.  If the module
 has it but not the kernel it just doesn't do anything.

 For IPDIVERT the KLD will be unloadable if the kernel doesn't have IPDIVERT
 compiled in too.  However this is the least disturbing behaviour.  The user
 can just recompile either module or the kernel to match the other one.  The
 access to the machine is not denied if ipfw refuses to load.
2004-08-19 17:59:26 +00:00
Andre Oppermann
e4c97eff8e Bring back the sysctl 'net.inet.ip.fw.enable' to unbreak the startup scripts
and to be able to disable ipfw if it was compiled directly into the kernel.
2004-08-19 17:38:47 +00:00
Robert Watson
5c32ea6517 Push down pcbinfo and inpcb locking from udp_send() into udp_output().
This provides greater context for the locking and allows us to avoid
locking the pcbinfo structure if not binding operations will take
place (i.e., already bound, connected, and no expliti sendto()
address).
2004-08-19 01:13:10 +00:00
Robert Watson
4c2bb15a89 In in_pcbrehash(), do assert the inpcb lock as well as the pcbinfo lock. 2004-08-19 01:11:17 +00:00
Robert Watson
0f48e25b63 Fix build of ip_input.c with "options IPSEC" -- the "pass:" label
is used with both FAST_IPSEC and IPSEC, but was defined for only
FAST_IPSEC.
2004-08-18 03:11:04 +00:00
Peter Wemm
1e5cc10dc2 Make the kernel compile again if you are not using PFIL_HOOKS 2004-08-18 00:37:46 +00:00
Andre Oppermann
9b932e9e04 Convert ipfw to use PFIL_HOOKS. This is change is transparent to userland
and preserves the ipfw ABI.  The ipfw core packet inspection and filtering
functions have not been changed, only how ipfw is invoked is different.

However there are many changes how ipfw is and its add-on's are handled:

 In general ipfw is now called through the PFIL_HOOKS and most associated
 magic, that was in ip_input() or ip_output() previously, is now done in
 ipfw_check_[in|out]() in the ipfw PFIL handler.

 IPDIVERT is entirely handled within the ipfw PFIL handlers.  A packet to
 be diverted is checked if it is fragmented, if yes, ip_reass() gets in for
 reassembly.  If not, or all fragments arrived and the packet is complete,
 divert_packet is called directly.  For 'tee' no reassembly attempt is made
 and a copy of the packet is sent to the divert socket unmodified.  The
 original packet continues its way through ip_input/output().

 ipfw 'forward' is done via m_tag's.  The ipfw PFIL handlers tag the packet
 with the new destination sockaddr_in.  A check if the new destination is a
 local IP address is made and the m_flags are set appropriately.  ip_input()
 and ip_output() have some more work to do here.  For ip_input() the m_flags
 are checked and a packet for us is directly sent to the 'ours' section for
 further processing.  Destination changes on the input path are only tagged
 and the 'srcrt' flag to ip_forward() is set to disable destination checks
 and ICMP replies at this stage.  The tag is going to be handled on output.
 ip_output() again checks for m_flags and the 'ours' tag.  If found, the
 packet will be dropped back to the IP netisr where it is going to be picked
 up by ip_input() again and the directly sent to the 'ours' section.  When
 only the destination changes, the route's 'dst' is overwritten with the
 new destination from the forward m_tag.  Then it jumps back at the route
 lookup again and skips the firewall check because it has been marked with
 M_SKIP_FIREWALL.  ipfw 'forward' has to be compiled into the kernel with
 'option IPFIREWALL_FORWARD' to enable it.

 DUMMYNET is entirely handled within the ipfw PFIL handlers.  A packet for
 a dummynet pipe or queue is directly sent to dummynet_io().  Dummynet will
 then inject it back into ip_input/ip_output() after it has served its time.
 Dummynet packets are tagged and will continue from the next rule when they
 hit the ipfw PFIL handlers again after re-injection.

 BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as
 they did before.  Later this will be changed to dedicated ETHER PFIL_HOOKS.

More detailed changes to the code:

 conf/files
	Add netinet/ip_fw_pfil.c.

 conf/options
	Add IPFIREWALL_FORWARD option.

 modules/ipfw/Makefile
	Add ip_fw_pfil.c.

 net/bridge.c
	Disable PFIL_HOOKS if ipfw for bridging is active.  Bridging ipfw
	is still directly invoked to handle layer2 headers and packets would
	get a double ipfw when run through PFIL_HOOKS as well.

 netinet/ip_divert.c
	Removed divert_clone() function.  It is no longer used.

 netinet/ip_dummynet.[ch]
	Neither the route 'ro' nor the destination 'dst' need to be stored
	while in dummynet transit.  Structure members and associated macros
	are removed.

 netinet/ip_fastfwd.c
	Removed all direct ipfw handling code and replace it with the new
	'ipfw forward' handling code.

 netinet/ip_fw.h
	Removed 'ro' and 'dst' from struct ip_fw_args.

 netinet/ip_fw2.c
	(Re)moved some global variables and the module handling.

 netinet/ip_fw_pfil.c
	New file containing the ipfw PFIL handlers and module initialization.

 netinet/ip_input.c
	Removed all direct ipfw handling code and replace it with the new
	'ipfw forward' handling code.  ip_forward() does not longer require
	the 'next_hop' struct sockaddr_in argument.  Disable early checks
	if 'srcrt' is set.

 netinet/ip_output.c
	Removed all direct ipfw handling code and replace it with the new
	'ipfw forward' handling code.

 netinet/ip_var.h
	Add ip_reass() as general function.  (Used from ipfw PFIL handlers
	for IPDIVERT.)

 netinet/raw_ip.c
	Directly check if ipfw and dummynet control pointers are active.

 netinet/tcp_input.c
	Rework the 'ipfw forward' to local code to work with the new way of
	forward tags.

 netinet/tcp_sack.c
	Remove include 'opt_ipfw.h' which is not needed here.

 sys/mbuf.h
	Remove m_claim_next() macro which was exclusively for ipfw 'forward'
	and is no longer needed.

Approved by:	re (scottl)
2004-08-17 22:05:54 +00:00
Robert Watson
a4f757cd5d White space cleanup for netinet before branch:
- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by:	re (scottl)
Submitted by:	Xin LI <delphij@frontfree.net>
2004-08-16 18:32:07 +00:00
David E. O'Brien
5af87d0ea1 Put the 'antispoof' opcode in the proper place in the opcode list such
that it doesn't break the ipfw2 ABI.
2004-08-16 12:05:19 +00:00
David Malone
1f44b0a1b5 Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD
have already done this, so I have styled the patch on their work:

        1) introduce a ip_newid() static inline function that checks
        the sysctl and then decides if it should return a sequential
        or random IP ID.

        2) named the sysctl net.inet.ip.random_id

        3) IPv6 flow IDs and fragment IDs are now always random.
        Flow IDs and frag IDs are significantly less common in the
        IPv6 world (ie. rarely generated per-packet), so there should
        be smaller performance concerns.

The sysctl defaults to 0 (sequential IP IDs).

Reviewed by:	andre, silby, mlaier, ume
Based on:	NetBSD
MFC after:	2 months
2004-08-14 15:32:40 +00:00
Poul-Henning Kamp
e7581f0fc2 Fix outgoing ICMP on global instance. 2004-08-14 14:21:09 +00:00
Christian S.J. Peron
31c88a3043 Add the ability to associate ipfw rules with a specific prison ID.
Since the only thing truly unique about a prison is it's ID, I figured
this would be the most granular way of handling this.

This commit makes the following changes:

- Adds tokenizing and parsing for the ``jail'' command line option
  to the ipfw(8) userspace utility.
- Append the ipfw opcode list with O_JAIL.
- While Iam here, add a comment informing others that if they
  want to add additional opcodes, they should append them to the end
  of the list to avoid ABI breakage.
- Add ``fw_prid'' to the ipfw ucred cache structure.
- When initializing ucred cache, if the process is jailed,
  set fw_prid to the prison ID, otherwise set it to -1.
- Update man page to reflect these changes.

This change was a strong motivator behind the ucred caching
mechanism in ipfw.

A sample usage of this new functionality could be:

    ipfw add count ip from any to any jail 2

It should be noted that because ucred based constraints
are only implemented for TCP and UDP packets, the same
applies for jail associations.

Conceptual head nod by:	pjd
Reviewed by:	rwatson
Approved by:	bmilekic (mentor)
2004-08-12 22:06:55 +00:00
David Malone
849112666a In tcp6_ctlinput, lock tcbinfo around the call to syncache_unreach
so that the locks held are the same as the IPv4 case.

Reviewed by:	rwatson
2004-08-12 18:19:36 +00:00
Andre Oppermann
9d804f818c Fix two cases of incorrect IPQ_UNLOCK'ing in the merged ip_reass() function.
The first one was going to 'dropfrag', which unlocks the IPQ, before the lock
was aquired; The second one doing a unlock and then a 'goto dropfrag' which
led to a double-unlock.

Tripped over by:	des
2004-08-12 08:37:42 +00:00
Robert Watson
c19c5239a6 When udp_send() fails, make sure to free the control mbufs as well as
the data mbuf.  This was done in most error cases, but not the case
where the inpcb pointer is surprisingly NULL.
2004-08-12 01:34:27 +00:00
Andre Oppermann
420a281164 Backout removal of UMA_ZONE_NOFREE flag for all zones which are established
for structures with timers in them.  It might be that a timer might fire
even when the associated structure has already been free'd.  Having type-
stable storage in this case is beneficial for graceful failure handling and
debugging.

Discussed with:	bosko, tegge, rwatson
2004-08-11 20:30:08 +00:00
Andre Oppermann
4efb805c0c Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP and
TCP code.  This flag would have prevented giving back excessive free slabs
to the global pool after a transient peak usage.
2004-08-11 17:08:31 +00:00
Andre Oppermann
67d0b24ed1 Make use of in_localip() function and replace previous direct LIST_FOREACH
loops over INADDR_HASH.
2004-08-11 12:32:10 +00:00
Andre Oppermann
2eccc90b61 Add the function in_localip() which returns 1 if an internet address is for
the local host and configured on one of its interfaces.
2004-08-11 11:49:48 +00:00
Andre Oppermann
6e234ede37 Only invoke verify_path() for verrevpath and versrcreach when we have an IP packet. 2004-08-11 11:41:11 +00:00
Andre Oppermann
767981878c Only check for local broadcast addresses if the mbuf is flagged with M_BCAST. 2004-08-11 10:49:56 +00:00
Andre Oppermann
0b17fba7bc Consistently use NULL for pointer comparisons. 2004-08-11 10:46:15 +00:00
Andre Oppermann
de2e5d1e20 Make IP fastforwarding ALTQ-aware by adding the input traffic conditioner
check and disabling the early output interface queue length check.
2004-08-11 10:42:59 +00:00
Andre Oppermann
2f6e6e9b4c Correct the displayed bandwidth calculation for a readout via sysctl. The
saved value does not have to be scaled with HZ; it is already in bytes per
second.  Only the multiply by eight remains to show bits per second (bps).
2004-08-11 10:12:16 +00:00
Robert Watson
27f74fd0ed Assert the locks of inpcbinfo's and inpcb's passed into in_pcbconnect()
and in_pcbconnect_setup(), since these functions frob the port and
address state of inpcbs.
2004-08-11 04:35:20 +00:00
Andre Oppermann
bb7c5b3055 Make a comment that IP source routing is not SMP and PREEMPTION safe. 2004-08-09 16:17:37 +00:00
Andre Oppermann
a5053398d4 Make a comment that "ipfw forward" is not SMP and PREEMPTION safe. 2004-08-09 16:16:10 +00:00
Andre Oppermann
5f9541ecbd New ipfw option "antispoof":
For incoming packets, the packet's source address is checked if it
 belongs to a directly connected network.  If the network is directly
 connected, then the interface the packet came on in is compared to
 the interface the network is connected to.  When incoming interface
 and directly connected interface are not the same, the packet does
 not match.

Usage example:

 ipfw add deny ip from any to any not antispoof in

Manpage education by:	ru
2004-08-09 16:12:10 +00:00
Robert Watson
f31f65a708 Pass pcbinfo structures to in6_pcbnotify() rather than pcbhead
structures, allowing in6_pcbnotify() to lock the pcbinfo and each
inpcb that it notifies of ICMPv6 events.  This prevents inpcb
assertions from firing when IPv6 generates and delievers event
notifications for inpcbs.

Reported by:	kuriyama
Tested by:	kuriyama
2004-08-06 03:45:45 +00:00
Robert Watson
9c1df6951f When iterating the UDP inpcb list processing an inbound broadcast
or multicast packet, we don't need to acquire the inpcb mutex
unless we are actually using inpcb fields other than the bound port
and address.  Since we hold the pcbinfo lock already, these can't
change.  Defer acquiring the inpcb mutex until we have a high
chance of a match.  This avoids about 120 mutex operations per UDP
broadcast packet received on one of my work systems.

Reviewed by:	sam
2004-08-06 02:08:31 +00:00
Robert Watson
98aed8ca56 Now that IPv6 performs basic in6pcb and inpcb locking, enable inpcb
lock assertions even if IPv6 is compiled into the kernel.  Previously,
inclusion of IPv6 and locking assertions would result in a rapid
assertion failure as IPv6 was not properly locking inpcbs.
2004-08-04 18:27:55 +00:00
Joe Marcus Clarke
5c7e7e80cc Fix Skinny and PPTP NAT'ing after the introduction of the {ip,tcp,udp}_next
functions.  Basically, the ip_next() function was used to get the PPTP and
Skinny headers when tcp_next() should have been used instead.  Symptoms of
this included a segfault in natd when trying to process a PPTP or Skinny
packet.

Approved by:	des
2004-08-04 15:17:08 +00:00
Andre Oppermann
81007fd4eb o Delayed checksums are now calculated in divert_packet() for diverted packets
Remove the XXX-escaped code that did it in ip_output()'s IPHACK section.
2004-08-03 14:13:36 +00:00
Andre Oppermann
24a098ea9b o Move the inflight sysctls to their own sub-tree under net.inet.tcp to be
more consistent with the other sysctls around it.
2004-08-03 13:54:11 +00:00
Andre Oppermann
f0cada84b1 o Move all parts of the IP reassembly process into the function ip_reass() to
make it fully self-contained.
o ip_reass() now returns a new mbuf with the reassembled packet and ip->ip_len
  including the IP header.
o Computation of the delayed checksum is moved into divert_packet().

Reviewed by:	silby
2004-08-03 12:31:38 +00:00
Jeffrey Hsu
2ff39e1543 Fix bug with tracking the previous element in a list.
Found by:	edrt@citiz.net
Submitted by:	pavlin@icir.org
2004-08-03 02:01:44 +00:00
Yaroslav Tykhiy
a4eb4405e3 Disallow a particular kind of port theft described by the following scenario:
Alice is too lazy to write a server application in PF-independent
	manner.  Therefore she knocks up the server using PF_INET6 only
	and allows the IPv6 socket to accept mapped IPv4 as well.  An evil
	hacker known on IRC as cheshire_cat has an account in the same
	system.  He starts a process listening on the same port as used
	by Alice's server, but in PF_INET.  As a consequence, cheshire_cat
	will distract all IPv4 traffic supposed to go to Alice's server.

Such sort of port theft was initially enabled by copying the code that
implemented the RFC 2553 semantics on IPv4/6 sockets (see inet6(4)) for
the implied case of the same owner for both connections.  After this
change, the above scenario will be impossible.  In the same setting,
the user who attempts to start his server last will get EADDRINUSE.

Of course, using IPv4 mapped to IPv6 leads to security complications
in the first place, but there is no reason to make it even more unsafe.

This change doesn't apply to KAME since it affects a FreeBSD-specific
part of the code.  It doesn't modify the out-of-box behaviour of the
TCP/IP stack either as long as mapping IPv4 to IPv6 is off by default.

MFC after:	1 month
2004-07-28 13:03:07 +00:00
Jayanth Vijayaraghavan
5d3b1b7556 Fix a bug in the sack code that was causing data to be retransmitted
with the FIN bit set for all segments, if a FIN has already been sent before.
The fix will allow the FIN bit to be set for only the last segment, in case
it has to be retransmitted.

Fix another bug that would have caused snd_nxt to be pulled by len if
there was an error from ip_output. snd_nxt should not be touched
during sack retransmissions.
2004-07-28 02:15:14 +00:00
Jayanth Vijayaraghavan
e9f2f80e09 Fix for a SACK bug where the very last segment retransmitted
from the SACK scoreboard could result in the next (untransmitted)
segment to be skipped.
2004-07-26 23:41:12 +00:00
John-Mark Gurney
0aa8ce5012 compare pointer against NULL, not 0
when inpcb is NULL, this is no longer invalid since jlemon added the
tcp_twstart function... this prevents close "failing" w/ EINVAL when it
really was successful...

Reviewed by:	jeremy (NetBSD)
2004-07-26 21:29:56 +00:00
Colin Percival
56f21b9d74 Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with:	rwatson, scottl
Requested by:	jhb
2004-07-26 07:24:04 +00:00
Andre Oppermann
55db762b76 Extend versrcreach by checking against the rt_flags for RTF_REJECT and
RTF_BLACKHOLE as well.

To quote the submitter:

 The uRPF loose-check implementation by the industry vendors, at least on Cisco
 and possibly Juniper, will fail the check if the route of the source address
 is pointed to Null0 (on Juniper, discard or reject route). What this means is,
 even if uRPF Loose-check finds the route, if the route is pointed to blackhole,
 uRPF loose-check must fail. This allows people to utilize uRPF loose-check mode
 as a pseudo-packet-firewall without using any manual filtering configuration --
 one can simply inject a IGP or BGP prefix with next-hop set to a static route
 that directs to null/discard facility. This results in uRPF Loose-check failing
 on all packets with source addresses that are within the range of the nullroute.

Submitted by:	James Jun <james@towardex.com>
2004-07-21 19:55:14 +00:00
Robert Watson
2d01d331c6 M_PREPEND() the IP header on to the front of an outgoing raw IP packet
using M_DONTWAIT rather than M_WAITOK to avoid sleeping on memory
while holding a mutex.
2004-07-20 20:52:30 +00:00
Jayanth Vijayaraghavan
04f0d9a0ea Let IN_FASTREOCOVERY macro decide if we are in recovery mode.
Nuke sackhole_limit for now. We need to add it back to limit the total
number of sack blocks in the system.
2004-07-19 22:37:33 +00:00
Jayanth Vijayaraghavan
f787edd847 Fix a potential panic in the SACK code that was causing
1) data to be sent to the right of snd_recover.
2) send more data then whats in the send buffer.

The fix is to postpone sack retransmit to a subsequent recovery episode
if the current retransmit pointer is beyond snd_recover.

Thanks to Mohan Srinivasan for helping fix the bug.

Submitted by:Daniel Lang
2004-07-19 22:06:01 +00:00
David Malone
932312d60b Fix the !INET6 build.
Reported by:	alc
2004-07-17 21:40:14 +00:00
David Malone
969860f3ed The tcp syncache code was leaving the IPv6 flowlabel uninitialised
for the SYN|ACK packet and then letting in6_pcbconnect set the
flowlabel later. Arange for the syncache/syncookie code to set and
recall the flow label so that the flowlabel used for the SYN|ACK
is consistent. This is done by using some of the cookie (when tcp
cookies are enabeled) and by stashing the flowlabel in syncache.

Tested and Discovered by:	Orla McGann <orly@cnri.dit.ie>
Approved by:			ume, silby
MFC after:			1 month
2004-07-17 19:44:13 +00:00
Max Laier
c550f2206d Define semantic of M_SKIP_FIREWALL more precisely, i.e. also pass associated
icmp_error() packets. While here retire PACKET_TAG_PF_GENERATED (which
served the same purpose) and use M_SKIP_FIREWALL in pf as well. This should
speed up things a bit as we get rid of the tag allocations.

Discussed with:	juli
2004-07-17 05:10:06 +00:00
Juli Mallett
765d141c78 Make M_SKIP_FIREWALL a global (and semantic) flag, preventing anything from
using M_PROTO6 and possibly shooting someone's foot, as well as allowing the
firewall to be used in multiple passes, or with a packet classifier frontend,
that may need to explicitly allow a certain packet.  Presently this is handled
in the ipfw_chk code as before, though I have run with it moved to upper
layers, and possibly it should apply to ipfilter and pf as well, though this
has not been investigated.

Discussed with:	luigi, rwatson
2004-07-17 02:40:13 +00:00
Hajimu UMEMOTO
8a59da300c when IN6P_AUTOFLOWLABEL is set, the flowlabel is not set on
outgoing tcp connections.

Reported by:	Orla McGann <orly@cnri.dit.ie>
Reviewed by:	Orla McGann <orly@cnri.dit.ie>
Obtained from:	KAME
2004-07-16 18:08:13 +00:00
Poul-Henning Kamp
3e019deaed Do a pass over all modules in the kernel and make them return EOPNOTSUPP
for unknown events.

A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".
2004-07-15 08:26:07 +00:00
Stefan Farfeleder
439dfb0c35 Remove erroneous semicolons. 2004-07-13 16:06:19 +00:00
Robert Watson
7cfc690440 After each label in tcp_input(), assert the inpcbinfo and inpcb lock
state that we expect.
2004-07-12 19:28:07 +00:00
Brian Somers
0ac4013324 Change the following environment variables to kernel options:
bootp -> BOOTP
    bootp.nfsroot -> BOOTP_NFSROOT
    bootp.nfsv3 -> BOOTP_NFSV3
    bootp.compat -> BOOTP_COMPAT
    bootp.wired_to -> BOOTP_WIRED_TO

- i.e. back out the previous commit.  It's already possible to
pxeboot(8) with a GENERIC kernel.

Pointed out by: dwmalone
2004-07-08 22:35:36 +00:00
Brian Somers
59e1ebc9b5 Change the following kernel options to environment variables:
BOOTP -> bootp
    BOOTP_NFSROOT -> bootp.nfsroot
    BOOTP_NFSV3 -> bootp.nfsv3
    BOOTP_COMPAT -> bootp.compat
    BOOTP_WIRED_TO -> bootp.wired_to

This lets you PXE boot with a GENERIC kernel by putting this sort of thing
in loader.conf:

    bootp="YES"
    bootp.nfsroot="YES"
    bootp.nfsv3="YES"
    bootp.wired_to="bge1"

or even setting the variables manually from the OK prompt.
2004-07-08 13:40:33 +00:00
Dag-Erling Smørgrav
de47739e71 Push WARNS back up to 6, but define NO_WERROR; I want the warts out in the
open where people can see them and hopefully fix them.
2004-07-06 12:15:24 +00:00
Dag-Erling Smørgrav
9fa0fd2682 Introduce inline {ip,udp,tcp}_next() functions which take a pointer to an
{ip,udp,tcp} header and return a void * pointing to the payload (i.e. the
first byte past the end of the header and any required padding).  Use them
consistently throughout libalias to a) reduce code duplication, b) improve
code legibility, c) get rid of a bunch of alignment warnings.
2004-07-06 12:13:28 +00:00
Dag-Erling Smørgrav
e3e2c21639 Rewrite twowords() to access its argument through a char pointer and not
a short pointer.  The previous implementation seems to be in a gray zone
of the C standard, and GCC generates incorrect code for it at -O2 or
higher on some platforms.
2004-07-06 09:22:18 +00:00
Dag-Erling Smørgrav
95347a8ee0 Temporarily lower WARNS to 3 while I figure out the alignment issues on
alpha.
2004-07-06 08:44:41 +00:00
Dag-Erling Smørgrav
ed01a58215 Make libalias WARNS?=6-clean. This mostly involves renaming variables
named link, foo_link or link_foo to lnk, foo_lnk or lnk_foo, fixing
signed / unsigned comparisons, and shoving unused function arguments
under the carpet.

I was hoping WARNS?=6 might reveal more serious problems, and perhaps
the source of the -O2 breakage, but found no smoking gun.
2004-07-05 11:10:57 +00:00
Dag-Erling Smørgrav
ffcb611a9d Parenthesize return values. 2004-07-05 10:55:23 +00:00
Dag-Erling Smørgrav
f311ebb4ec Mechanical whitespace cleanup. 2004-07-05 10:53:28 +00:00
Poul-Henning Kamp
e6bbb69149 Add LibAliasOutTry() which checks a packet for a hit in the tables, but
does not create a new entry if none is found.
2004-07-04 12:53:07 +00:00
Ruslan Ermilov
1a0a934547 Mechanically kill hard sentence breaks. 2004-07-02 23:52:20 +00:00
Jayanth Vijayaraghavan
a0445c2e2c On receiving 3 duplicate acknowledgements, SACK recovery was not being entered correctly.
Fix this problem by separating out the SACK and the newreno cases. Also, check
if we are in FASTRECOVERY for the sack case and if so, turn off dupacks.

Fix an issue where the congestion window was not being incremented by ssthresh.

Thanks to Mohan Srinivasan for finding this problem.
2004-07-01 23:34:06 +00:00
Ruslan Ermilov
c9a246418d Bumped document date.
Fixed markup.
Fixed examples to match the new API.
2004-07-01 17:51:48 +00:00
Poul-Henning Kamp
e3e244bff6 Rwatson, write 100 times for tomorrow:
First unlock, then assign NULL to pointer.
2004-06-27 21:54:34 +00:00
Pawel Jakub Dawidek
0a44517d3a Those are unneeded too. 2004-06-27 09:06:10 +00:00
Pawel Jakub Dawidek
46e3b1cbe7 Add two missing includes and remove two uneeded.
This is quite serious fix, because even with MAC framework compiled in,
MAC entry points in those two files were simply ignored.
2004-06-27 09:03:22 +00:00
Robert Watson
1e4d7da707 Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:

- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
  acquire the socket buffer lock and use the _locked() variants of both
  calls.  Note that the _locked() sowakeup() versions unlock the mutex on
  return.  This is done in uipc_send(), divert_packet(), mroute
  socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().

- When the socket buffer lock is dropped before a sowakeup(), remove the
  explicit unlock and use the _locked() sowakeup() variant.  This is done
  in soisdisconnecting(), soisdisconnected() when setting the can't send/
  receive flags and dropping data, and in uipc_rcvd() which adjusting
  back-pressure on the sockets.

For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduce mutex costs.
2004-06-26 19:10:39 +00:00
Robert Watson
3f9d1ef905 Remove spl's from TCP protocol entry points. While not all locking
is merged here yet, this will ease the merge process by bringing the
locked and unlocked versions into sync.
2004-06-26 17:50:50 +00:00
Paul Saab
652178a12a White space & spelling fixes
Submitted by:	Xin LI <delphij@frontfree.net>
2004-06-25 04:11:26 +00:00
Bruce M Simpson
37332f049f Whitespace. 2004-06-25 02:29:58 +00:00
Robert Watson
5905999b2f Broaden scope of the socket buffer lock when processing an ACK so that
the read and write of sb_cc are atomic.  Call sbdrop_locked() instead
of sbdrop() since we already hold the socket buffer lock.
2004-06-24 03:07:27 +00:00
Robert Watson
927c5cea3f Protect so_oobmark with with SOCKBUF_LOCK(&so->so_rcv), and broaden
locking in tcp_input() for TCP packets with urgent data pointers to
hold the socket buffer lock across testing and updating oobmark
from just protecting sb_state.

Update socket locking annotations
2004-06-24 02:57:12 +00:00
Robert Watson
a138d21769 In ip_ctloutput(), acquire the inpcb lock around some of the basic
inpcb flag and status updates.
2004-06-24 02:05:47 +00:00
Robert Watson
d67ec3dd48 When asserting non-Giant locks in the network stack, also assert
Giant if debug.mpsafenet=0, as any points that require synchronization
in the SMPng world also required it in the Giant-world:

- inpcb locks (including IPv6)
- inpcbinfo locks (including IPv6)
- dummynet subsystem lock
- ipfw2 subsystem lock
2004-06-24 02:01:48 +00:00
Robert Watson
3f11a2f374 Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted.  sbreserve() now acquires
the lock before calling sbreserve_locked().  In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order.  In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.
2004-06-24 01:37:04 +00:00
Paul Saab
76947e3222 Move the sack sysctl's under net.inet.tcp.sack
net.inet.tcp.do_sack -> net.inet.tcp.sack.enable
net.inet.tcp.sackhole_limit -> net.inet.tcp.sack.sackhole_limit

Requested by:	wollman
2004-06-23 21:34:07 +00:00
Paul Saab
6d90faf3d8 Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by:	gnn
Obtained from:	Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
2004-06-23 21:04:37 +00:00
Robert Watson
bb7479a613 Acquire socket lock around frobbing of socket state in divert sockets. 2004-06-22 04:00:51 +00:00
Robert Watson
ffcbc0e4c5 Prefer use of the inpcb as a MAC label source for outgoing packets sent
via divert sockets, when available.
2004-06-22 03:58:50 +00:00
Robert Watson
d330008e3b If debug.mpsafenet is set, initialize TCP callouts as CALLOUT_MPSAFE. 2004-06-20 21:44:50 +00:00
Robert Watson
1f82efb3b7 Assert the inpcb lock before letting MAC check whether we can deliver
to the inpcb in tcp_input().
2004-06-20 20:17:29 +00:00
Robert Watson
1b83216eda IP multicast code no longer needs to acquire Giant before appending
an mbuf onto a socket buffer.  This is left over from debug.mpsafenet
affecting the forwarding/bridging plane only.
2004-06-20 20:10:05 +00:00
Robert Watson
4e397bc524 In tcp_ctloutput(), don't hold the inpcb lock over a call to
ip_ctloutput(), as it may need to perform blocking memory allocations.
This also improves consistency with locking relative to other points
that call into ip_ctloutput().

Bumped into by:	Grover Lines <grover@ceribus.net>
2004-06-18 20:22:21 +00:00
Bruce M Simpson
4f450ff9a5 Check that m->m_pkthdr.rcvif is not NULL before checking if a packet
was received on a broadcast address on the input path. Under certain
circumstances this could result in a panic, notably for locally-generated
packets which do not have m_pkthdr.rcvif set.

This is a similar situation to that which is solved by
src/sys/netinet/ip_icmp.c rev 1.66.

PR:		kern/52935
2004-06-18 12:58:45 +00:00
Bruce M Simpson
f3e0b7ef7f Appease GCC. 2004-06-18 09:53:58 +00:00
Bruce M Simpson
5214cb3f59 If SO_DEBUG is enabled for a TCP socket, and a received segment is
encapsulated within an IPv6 datagram, do not abuse the 'ipov' pointer
when registering trace records.  'ipov' is specific to IPv4, and
will therefore be uninitialized.

[This fandango is only necessary in the first place because of our
host-byte-order IP field pessimization.]

PR:		kern/60856
Submitted by:	Galois Zheng
2004-06-18 03:31:07 +00:00
Bruce M Simpson
da181cc144 Don't set FIN on a retransmitted segment after a FIN has been sent,
unless the segment really contains the last of the data for the stream.

PR:		kern/34619
Obtained from:	OpenBSD (tcp_output.c rev 1.47)
Noticed by:	Joseph Ishac
Reviewed by:	George Neville-Neil
2004-06-18 02:47:59 +00:00
Bruce M Simpson
27de0135ce Ensure that dst is bzeroed before calling rtalloc_ign(), to avoid possible
routing table corruption.

PR:		kern/40563, freebsd4/432 (KAME)
Obtained from:	NetBSD (in_gif.c rev 1.26.10.1)
Requested by:	Jean-Luc Richier
2004-06-18 02:04:07 +00:00
Max Laier
7c1fe95333 Commit pf version 3.5 and link additional files to the kernel build.
Version 3.5 brings:
 - Atomic commits of ruleset changes (reduce the chance of ending up in an
   inconsistent state).
 - A 30% reduction in the size of state table entries.
 - Source-tracking (limit number of clients and states per client).
 - Sticky-address (the flexibility of round-robin with the benefits of
   source-hash).
 - Significant improvements to interface handling.
 - and many more ...
2004-06-16 23:24:02 +00:00
Max Laier
a306c902b8 Prepare for pf 3.5 import:
- Remove pflog and pfsync modules. Things will change in such a fashion
   that there will be one module with pf+pflog that can be loaded into
   GENERIC without problems (which is what most people want). pfsync is no
   longer possible as a module.
 - Add multicast address for in-kernel multicast pfsync protocol. Protocol
   glue will follow once the import is done.
 - Add one more mbuf tag
2004-06-16 22:59:06 +00:00
Maxim Konovalov
ef14c36965 o connect(2): if there is no a route to the destination
do not pick up the first local ip address for the source
ip address, return ENETUNREACH instead.

Submitted by:	Gleb Smirnoff
Reviewed by:	-current (silence)
2004-06-16 10:02:36 +00:00
Bruce M Simpson
d420fcda27 Fix build for IPSEC && !INET6
PR:		kern/66125
Submitted by:	Cyrille Lefevre
2004-06-16 09:35:07 +00:00
Bruce M Simpson
49b19bfc47 Reverse a patch which has no effect on -CURRENT and should probably be
applied directly to -STABLE.

Noticed by:	iedowse
Pointy hat to:	bms
2004-06-16 08:50:14 +00:00
Bruce M Simpson
57ab3660ff In ip_forward(), when calculating the MTU in effect for an IPSEC transport
mode tunnel, take the per-route MTU into account, *if* and *only if* it
is non-zero (as found in struct rt_metrics/rt_metrics_lite).

PR:		kern/42727
Obtained from:	NetBSD (ip_input.c rev 1.151)
2004-06-16 08:33:09 +00:00
Bruce M Simpson
e6b0a57025 In ip_forward(), set m->m_pkthdr.len correctly such that the mbuf chain
is sane, and ipsec4_getpolicybyaddr() will therefore complete.

PR:		kern/42727
Obtained from:	KAME (kame/freebsd4/sys/netinet/ip_input.c rev 1.42)
2004-06-16 08:28:54 +00:00
Bruce M Simpson
34e3ccb34b Disconnect a temporarily-connected UDP socket in out-of-mbufs case. This
fixes the problem of UDP sockets getting wedged in a connected state (and
bound to their destination) under heavy load.
Temporary bind/connect should probably be deleted in future
as an optimization, as described in "A Faster UDP" [Partridge/Pink 1993].

Notes:
 - INP_LOCK() is already held in udp_output(). The connection is in effect
   happening at a layer lower than the socket layer, therefore in theory
   socket locking should not be needed.
 - Inlining the in_pcbdisconnect() operation buys us nothing (in the case
   of the current state of the code), as laddr is not part of the
   inpcb hash or the udbinfo hash. Therefore there should be no need
   to rehash after restoring laddr in the error case (this was a
   concern of the original author of the patch).

PR:		kern/41765
Requested by:	gnn
Submitted by:	Jinmei Tatuya (with cleanups)
Tested by:	spray(8)
2004-06-16 05:41:00 +00:00
Robert Watson
a97719a4c5 Convert GIANT_REQUIRED to NET_ASSERT_GIANT for socket access. 2004-06-16 03:36:06 +00:00
Robert Watson
7721f5d760 Grab the socket buffer send or receive mutex when performing a
read-modify-write on the sb_state field.  This commit catches only
the "easy" ones where it doesn't interact with as yet unmerged
locking.
2004-06-15 03:51:44 +00:00
Robert Watson
c0b99ffa02 The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality.  This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties.  The
following fields are moved to sb_state:

  SS_CANTRCVMORE            (so_state)
  SS_CANTSENDMORE           (so_state)
  SS_RCVATMARK              (so_state)

Rename respectively to:

  SBS_CANTRCVMORE           (so_rcv.sb_state)
  SBS_CANTSENDMORE          (so_snd.sb_state)
  SBS_RCVATMARK             (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts).  In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
2004-06-14 18:16:22 +00:00
Max Laier
02b199f158 Link ALTQ to the build and break with ABI for struct ifnet. Please recompile
your (network) modules as well as any userland that might make sense of
sizeof(struct ifnet).
This does not change the queueing yet. These changes will follow in a
seperate commit. Same with the driver changes, which need case by case
evaluation.

__FreeBSD_version bump will follow.

Tested-by:	(i386)LINT
2004-06-13 17:29:10 +00:00
Doug Rabson
b8b3323469 Add a new driver to support IP over firewire. This driver is intended to
conform to the rfc2734 and rfc3146 standard for IP over firewire and
should eventually supercede the fwe driver. Right now the broadcast
channel number is hardwired and we don't support MCAP for multicast
channel allocation - more infrastructure is required in the firewire
code itself to fix these problems.
2004-06-13 10:54:36 +00:00
Robert Watson
310e7ceb94 Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
  manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
  copy while holding the socket lock, then release the socket lock
  to externalize to userspace.
2004-06-13 02:50:07 +00:00
Robert Watson
395a08c904 Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
  soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
  so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
  various contexts in the stack, both at the socket and protocol
  layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
  this could result in frobbing of a non-present socket if
  sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
  they don't free the socket.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 20:47:32 +00:00
Christian S.J. Peron
d316f2cf4f Modify ip fw so that whenever UID or GID constraints exist in a
ruleset, the pcb is looked up once per ipfw_chk() activation.

This is done by extracting the required information out of the PCB
and caching it to the ipfw_chk() stack. This should greatly reduce
PCB looking contention and speed up the processing of UID/GID based
firewall rules (especially with large UID/GID rulesets).

Some very basic benchmarks were taken which compares the number
of in_pcblookup_hash(9) activations to the number of firewall
rules containing UID/GID based contraints before and after this patch.

The results can be viewed here:
o http://people.freebsd.org/~csjp/ip_fw_pcb.png

Reviewed by:	andre, luigi, rwatson
Approved by:	bmilekic (mentor)
2004-06-11 22:17:14 +00:00
Robert Watson
c1d587c848 Remove unneeded Giant acquisition in divert_packet(), which is
left over from debug.mpsafenet affecting only the forwarding
plane.  Giant is now acquired in the ithread/netisr or in the
system call code.
2004-06-11 04:06:51 +00:00
Robert Watson
c14800e6ff Lock down parallel router_info list for tracking multicast IGMP
versions of various routers seen:

- Introduce igmp_mtx.
- Protect global variable 'router_info_head' and list fields
  in struct router_info with this mutex, as well as
  igmp_timers_are_running.
- find_rti() asserts that the caller acquires igmp_mtx.
- Annotate a failure to check the return value of
  MALLOC(..., M_NOWAIT).
2004-06-11 03:42:37 +00:00
Ruslan Ermilov
dd4d62c7d8 init_tables() must be run after sys/net/route.c:route_init(). 2004-06-10 20:20:37 +00:00
Ruslan Ermilov
cd8b5ae0ae Introduce a new feature to IPFW2: lookup tables. These are useful
for handling large sparse address sets.  Initial implementation by
Vsevolod Lobko <seva@ip.net.ua>, refined by me.

MFC after:	1 week
2004-06-09 20:10:38 +00:00
Hajimu UMEMOTO
cad1917d48 do not send icmp response if the original packet is encrypted.
Obtained from:	KAME
MFC after:	1 week
2004-06-07 09:56:59 +00:00
Bosko Milekic
ac830b58d1 Move the locking of the pcb into raw_output(). Organize code so
that m_prepend() is not called with possibility to wait while the
pcb lock is held.  What still needs revisiting is whether the
ripcbinfo lock is really required here.

Discussed with: rwatson
2004-06-03 03:15:29 +00:00
Poul-Henning Kamp
5dba30f15a add missing #include <sys/module.h> 2004-05-30 20:27:19 +00:00
Poul-Henning Kamp
41ee9f1c69 Add some missing <sys/module.h> includes which are masked by the
one on death-row in <sys/kernel.h>
2004-05-30 17:57:46 +00:00
Christian S.J. Peron
b5ef991561 Add a super-user check to ipfw_ctl() to make sure that the calling
process is a non-prison root. The security.jail.allow_raw_sockets
sysctl variable is disabled by default, however if the user enables
raw sockets in prisons, prison-root should not be able to interact
with firewall rule sets.

Approved by:	rwatson, bmilekic (mentor)
2004-05-25 15:02:12 +00:00
Yaroslav Tykhiy
4658dc8325 When checking for possible port theft, skip over a TCP inpcb
unless it's in the closed or listening state (remote address
== INADDR_ANY).

If a TCP inpcb is in any other state, it's impossible to steal
its local port or use it for port theft.  And if there are
both closed/listening and connected TCP inpcbs on the same
localIP:port couple, the call to in_pcblookup_local() will
find the former due to the design of that function.

No objections raised in:	-net, -arch
MFC after:			1 month
2004-05-20 06:35:02 +00:00
Maxim Konovalov
a49b21371a o Calculate a number of bytes to copy (cnt) correctly:
+----+-+-+-+-+----+----+- - - - - - - - - - - -  -+----+
  |    | |C| | |    |    |                          |    |
  | IP |N|O|L|P|    | IP |                          | IP |
  | #1 |O|D|E|T|    | #2 |                          | #n |
  |    |P|E|N|R|    |    |                          |    |
  +----+-+-+-+-+----+----+- - - - - - - - - - - -  -+----+
               ^    ^<---- cnt - (IPOPT_MINOFF - 1) ---->|
               |    |
src            |    +-- cp[IPOPT_OFF + 1] + sizeof(struct in_addr)
               |
dst            +-- cp[IPOPT_OFF + 1]

PR:		kern/66386
Submitted by:	Andrei Iltchenko
MFC after:	3 weeks
2004-05-11 19:14:44 +00:00
Maxim Konovalov
d0946241ac o IFNAMSIZ does include the trailing \0.
Approved by:	andre

o Document net.inet.icmp.reply_src.
2004-05-07 01:24:53 +00:00
Andre Oppermann
2bde81acd6 Provide the sysctl net.inet.ip.process_options to control the processing
of IP options.

 net.inet.ip.process_options=0  Ignore IP options and pass packets unmodified.
 net.inet.ip.process_options=1  Process all IP options (default).
 net.inet.ip.process_options=2  Reject all packets with IP options with ICMP
  filter prohibited message.

This sysctl affects packets destined for the local host as well as those
only transiting through the host (routing).

IP options do not have any legitimate purpose anymore and are only used
to circumvent firewalls or to exploit certain behaviours or bugs in TCP/IP
stacks.

Reviewed by:	sam (mentor)
2004-05-06 18:46:03 +00:00
Robert Watson
c18b97c630 Switch to using the inpcb MAC label instead of socket MAC label when
labeling new mbufs created from sockets/inpcbs in IPv4.  This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed.  In
particular:

- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().

While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-05-04 02:11:47 +00:00
Robert Watson
87f2bb8caf Assert inpcb lock in udp_append().
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-05-04 01:08:15 +00:00
Robert Watson
cbe42d48bd Assert the inpcb lock on 'last' in udp_append(), since it's always
called with it, and also requires it.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-05-04 00:10:16 +00:00
Maxim Konovalov
1a0c4873ed o Fix misindentation in the previous commit. 2004-05-03 17:15:34 +00:00
Andre Oppermann
7652802b06 Back out a change that slipped into the previous commit for which other
supporting parts have not yet been committed.

Remove pre-mature IP options ignoring option.
2004-05-03 16:07:13 +00:00
Andre Oppermann
06bb56f43c Optimize IP fastforwarding some more:
o New function ip_findroute() to reduce code duplication for the
  route lookup cases. (luigi)

o Store ip_len in host byte order on the stack instead of using
  it via indirection from the mbuf.  This allows to defer the host
  byte conversion to a later point and makes a quicker fallback to
  normal ip_input() processing. (luigi)

o Check if route is dampned with RTF_REJECT flag and drop packet
  already here when ARP is unable to resolve destination address.
  An ICMP unreachable is sent to inform the sender.

o Check if interface output queue is full and drop packet already
  here.  No ICMP notification is sent because signalling source quench
  is depreciated.

o Check if media_state is down (used for ethernet type interfaces)
  and drop the packet already here.  An ICMP unreachable is sent to
  inform the sender.

o Do not account sent packets to the interface address counters.  They
  are only for packets with that 'ia' as source address.

o Update and clarify some comments.

Submitted by:	luigi (most of it)
2004-05-03 13:52:47 +00:00
Darren Reed
2f3f1e6773 Rename m_claim_next_hop() to m_claim_next(), as suggested by Max Laier. 2004-05-02 15:10:17 +00:00
Darren Reed
7fbb130049 oops, I forgot this file in a prior commit (change was still sitting here,
uncommitted):

Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h
alongside all of the other macros that work ok mbuf's and tag's.
2004-05-02 15:07:37 +00:00
Darren Reed
ab884d993e Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h alongside
all of the other macros that work ok mbuf's and tag's.
2004-05-02 06:36:30 +00:00
Bosko Milekic
5a59cefcd1 Give jail(8) the feature to allow raw sockets from within a
jail, which is less restrictive but allows for more flexible
jail usage (for those who are willing to make the sacrifice).
The default is off, but allowing raw sockets within jails can
now be accomplished by tuning security.jail.allow_raw_sockets
to 1.

Turning this on will allow you to use things like ping(8)
or traceroute(8) from within a jail.

The patch being committed is not identical to the patch
in the PR.  The committed version is more friendly to
APIs which pjd is working on, so it should integrate
into his work quite nicely.  This change has also been
presented and addressed on the freebsd-hackers mailing
list.

Submitted by: Christian S.J. Peron <maneo@bsdpro.com>
PR: kern/65800
2004-04-26 19:46:52 +00:00
Mike Silbersack
80dd2a81fb Tighten up reset handling in order to make reset attacks as difficult as
possible while maintaining compatibility with the widest range of TCP stacks.

The algorithm is as follows:

---
For connections in the ESTABLISHED state, only resets with
sequence numbers exactly matching last_ack_sent will cause a reset,
all other segments will be silently dropped.

For connections in all other states, a reset anywhere in the window
will cause the connection to be reset.  All other segments will be
silently dropped.
---

The necessity of accepting all in-window resets was discovered
by jayanth and jlemon, both of whom have seen TCP stacks that
will respond to FIN-ACK packets with resets not meeting the
strict last_ack_sent check.

Idea by:        Darren Reed
Reviewed by:    truckman, jlemon, others(?)
2004-04-26 02:56:31 +00:00
Luigi Rizzo
b2a8ac7ca5 Another small set of changes to reduce diffs with the new arp code. 2004-04-25 15:00:17 +00:00
Luigi Rizzo
491522eade remove a stale comment on the behaviour of arpresolve 2004-04-25 14:06:23 +00:00
Luigi Rizzo
cfff63f1b8 Start the arp timer at init time.
It runs so rarely that it makes no sense to wait until the first request.
2004-04-25 12:50:14 +00:00
Luigi Rizzo
cd46a114fc This commit does two things:
1. rt_check() cleanup:
    rt_check() is only necessary for some address families to gain access
    to the corresponding arp entry, so call it only in/near the *resolve()
    routines where it is actually used -- at the moment this is
    arpresolve(), nd6_storelladdr() (the call is embedded here),
    and atmresolve() (the call is just before atmresolve to reduce
    the number of changes).
    This change will make it a lot easier to decouple the arp table
    from the routing table.

    There is an extra call to rt_check() in if_iso88025subr.c to
    determine the routing info length. I have left it alone for
    the time being.

    The interface of arpresolve() and nd6_storelladdr() now changes slightly:
     + the 'rtentry' parameter (really a hint from the upper level layer)
       is now passed unchanged from *_output(), so it becomes the route
       to the final destination and not to the gateway.
     + the routines will return 0 if resolution is possible, non-zero
       otherwise.
     + arpresolve() returns EWOULDBLOCK in case the mbuf is being held
       waiting for an arp reply -- in this case the error code is masked
       in the caller so the upper layer protocol will not see a failure.

2. arpcom untangling
    Where possible, use 'struct ifnet' instead of 'struct arpcom' variables,
    and use the IFP2AC macro to access arpcom fields.
    This mostly affects the netatalk code.

=== Detailed changes: ===
net/if_arcsubr.c
   rt_check() cleanup, remove a useless variable

net/if_atmsubr.c
   rt_check() cleanup

net/if_ethersubr.c
   rt_check() cleanup, arpcom untangling

net/if_fddisubr.c
   rt_check() cleanup, arpcom untangling

net/if_iso88025subr.c
   rt_check() cleanup

netatalk/aarp.c
   arpcom untangling, remove a block of duplicated code

netatalk/at_extern.h
   arpcom untangling

netinet/if_ether.c
   rt_check() cleanup (change arpresolve)

netinet6/nd6.c
   rt_check() cleanup (change nd6_storelladdr)
2004-04-25 09:24:52 +00:00
Mike Silbersack
6b2fc10b64 Wrap two long lines in the previous commit. 2004-04-23 23:29:49 +00:00
Andre Oppermann
2d166c0202 Correct an edge case in tcp_mss() where the cached path MTU
from tcp_hostcache would have overridden a (now) lower MTU of
an interface or route that changed since first PMTU discovery.
The bug would have caused TCP to redo the PMTU discovery when
not strictly necessary.

Make a comment about already pre-initialized default values
more clear.

Reviewed by:	sam
2004-04-23 22:44:59 +00:00
Andre Oppermann
22b5770b99 Add the option versrcreach to verify that a valid route to the
source address of a packet exists in the routing table.  The
default route is ignored because it would match everything and
render the check pointless.

This option is very useful for routers with a complete view of
the Internet (BGP) in the routing table to reject packets with
spoofed or unrouteable source addresses.

Example:

 ipfw add 1000 deny ip from any to any not versrcreach

also known in Cisco-speak as:

  ip verify unicast source reachable-via any

Reviewed by:	luigi
2004-04-23 14:28:38 +00:00
Andre Oppermann
b62dccc7e5 Fix a potential race when purging expired hostcache entries.
Spotted by:	luigi
2004-04-23 13:54:28 +00:00
Mike Silbersack
174624e01d Take out an unneeded variable I forgot to remove in the last commit,
and make two small whitespace fixes so that diffs vs rev 1.142 are minimal.
2004-04-22 08:34:55 +00:00
Mike Silbersack
6ac48b7409 Simplify random port allocation, and add net.inet.ip.portrange.randomized,
which can be used to turn off randomized port allocation if so desired.

Requested by:	alfred
2004-04-22 08:32:14 +00:00
Bruce M Simpson
de9f59f850 Fix a typo in a comment. 2004-04-20 19:04:24 +00:00
Mike Silbersack
6dd946b3f7 Switch from using sequential to random ephemeral port allocation,
implementation taken directly from OpenBSD.

I've resisted committing this for quite some time because of concern over
TIME_WAIT recycling breakage (sequential allocation ensures that there is a
long time before ports are recycled), but recent testing has shown me that
my fears were unwarranted.
2004-04-20 06:45:10 +00:00
Mike Silbersack
c1537ef063 Enhance our RFC1948 implementation to perform better in some pathlogical
TIME_WAIT recycling cases I was able to generate with http testing tools.

In short, as the old algorithm relied on ticks to create the time offset
component of an ISN, two connections with the exact same host, port pair
that were generated between timer ticks would have the exact same sequence
number.  As a result, the second connection would fail to pass the TIME_WAIT
check on the server side, and the SYN would never be acknowledged.

I've "fixed" this by adding random positive increments to the time component
between clock ticks so that ISNs will *always* be increasing, no matter how
quickly the port is recycled.

Except in such contrived benchmarking situations, this problem should never
come up in normal usage...  until networks get faster.

No MFC planned, 4.x is missing other optimizations that are needed to even
create the situation in which such quick port recycling will occur.
2004-04-20 06:33:39 +00:00
Luigi Rizzo
ac912b2dc8 Replace Bcopy with 'the real thing' as in the rest of the file. 2004-04-18 11:45:49 +00:00
Luigi Rizzo
e6e51f0518 In an effort to simplify the routing code, try to deprecate rtalloc()
in favour of rtalloc_ign(), which is what would end up being called
anyways.

There are 25 more instances of rtalloc() in net*/ and
about 10 instances of rtalloc_ign()
2004-04-14 01:13:14 +00:00
Warner Losh
f36cfd49ad Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson
2004-04-07 20:46:16 +00:00
Ruslan Ermilov
390cdc6a76 Fixed a bug in previous revision: compute the payload checksum before
we convert ip_len into a network byte order; in_delayed_cksum() still
expects it in host byte order.

The symtom was the ``in_cksum_skip: out of data by %d'' complaints
from the kernel.

To add to the previous commit log.  These fixes make tcpdump(1) happy
by not complaining about UDP/TCP checksum being bad for looped back
IP multicast when multicast router is deactivated.

Reported by:	Vsevolod Lobko
2004-04-07 10:01:39 +00:00
Bruce Evans
30a4ab088a Fixed misspelling of IPPORT_MAX as USHRT_MAX. Don't include <sys/limits.h>
to implement this mistake.

Fixed some nearby style bugs (initialization in declaration, misformatting
of this initialization, missing blank line after the declaration, and
comparision of the non-boolean result of the initialization with 0 using
"!".  In KNF, "!" is not even used to compare booleans with 0).
2004-04-06 10:59:11 +00:00
Robert Watson
47f32f6fa6 Two missed in previous commit -- compare pointer with NULL rather than
using it as a boolean.
2004-04-05 00:52:05 +00:00
Robert Watson
24459934e9 Prefer NULL to 0 when checking pointer values as integers or booleans. 2004-04-05 00:49:07 +00:00
Pawel Jakub Dawidek
52710de1cb Fix a panic possibility caused by returning without releasing locks.
It was fixed by moving problemetic checks, as well as checks that
doesn't need locking before locks are acquired.

Submitted by:		Ryan Sommers <ryans@gamersimpact.com>
In co-operation with:	cperciva, maxim, mlaier, sam
Tested by:		submitter (previous patch), me (current patch)
Reviewed by:		cperciva, mlaier (previous patch), sam (current patch)
Approved by:		sam
Dedicated to:		enough!
2004-04-04 20:14:55 +00:00
Luigi Rizzo
f7c5baa1c6 + arpresolve(): remove an unused argument
+ struct ifnet: remove unused fields, move ipv6-related field close
  to each other, add a pointer to l3<->l2 translation tables (arp,nd6,
  etc.) for future use.

+ struct route: remove an unused field, move close to each
  other some fields that might likely go away in the future
2004-04-04 06:14:55 +00:00
Daniel Eischen
ab39bc9a92 Unbreak natd.
Reported and submitted by:	Sean McNeil (sean at mcneil.com)
2004-04-02 17:57:57 +00:00
Dag-Erling Smørgrav
e271f829b8 Raise WARNS level to 2. 2004-03-31 21:33:55 +00:00
Dag-Erling Smørgrav
2871c50186 Deal with aliasing warnings.
Reviewed by:	ru
Approved by:	silence on the lists
2004-03-31 21:32:58 +00:00
Robert Watson
7101d752b2 Invert the logic of NET_LOCK_GIANT(), and remove the one reference to it.
Previously, Giant would be grabbed at entry to the IP local delivery code
when debug.mpsafenet was set to true, as that implied Giant wouldn't be
grabbed in the driver path.  Now, we will use this primitive to
conditionally grab Giant in the event the entire network stack isn't
running MPSAFE (debug.mpsafenet == 0).
2004-03-28 23:12:19 +00:00
Pawel Jakub Dawidek
56dc72c3b6 Remove unused argument. 2004-03-28 15:48:00 +00:00
Pawel Jakub Dawidek
b0330ed929 Reduce 'td' argument to 'cred' (struct ucred) argument in those functions:
- in_pcbbind(),
	- in_pcbbind_setup(),
	- in_pcbconnect(),
	- in_pcbconnect_setup(),
	- in6_pcbbind(),
	- in6_pcbconnect(),
	- in6_pcbsetport().
"It should simplify/clarify things a great deal." --rwatson

Requested by:	rwatson
Reviewed by:	rwatson, ume
2004-03-27 21:05:46 +00:00
Pawel Jakub Dawidek
6823b82399 Remove unused argument.
Reviewed by:	ume
2004-03-27 20:41:32 +00:00
Hajimu UMEMOTO
a5d1aae31a Validate IPv6 socket options more carefully to avoid a panic.
PR:		kern/61513
Reviewed by:	cperciva, nectar
2004-03-26 19:52:18 +00:00
Pawel Jakub Dawidek
8da601dfb7 Remove unused function.
It was used in FreeBSD 4.x, but now we're using cr_canseesocket().
2004-03-25 15:12:12 +00:00
Ruslan Ermilov
26f16ebeb1 Untangle IP multicast routing interaction with delayed payload checksums.
Compute the payload checksum for a locally originated IP multicast where
God intended, in ip_mloopback(), rather than doing it in ip_output() and
only when multicast router is active.  This is more correct as we do not
fool ip_input() that the packet has the correct payload checksum when in
fact it does not (when multicast router is inactive).  This is also more
efficient if we don't join the multicast group we send to, thus allowing
the hardware to checksum the payload.
2004-03-25 08:46:27 +00:00
Robert Watson
bdae44a844 Lock down global variables in if_gre:
- Add gre_mtx to protect global softc list.
- Hold gre_mtx over various list operations (insert, delete).
- Centralize if_gre interface teardown in gre_destroy(), and call this
  from modevent unload and gre_clone_destroy().
- Export gre_mtx to ip_gre.c, which walks the gre list to look up gre
  interfaces during encapsulation.  Add a wonking comment on how we need
  some sort of drain/reference count mechanism to keep gre references
  alive while in use and simultaneous destroy.

This commit does not lockdown softc data, which follows in a future
commit.
2004-03-22 16:04:43 +00:00
Matthew N. Dodd
2964fb6538 - Fix indentation lost by 'diff -b'.
- Un-wrap short line.
2004-03-21 18:51:26 +00:00
Matthew N. Dodd
64bf80ce1b Remove interface type specific code from arprequest(), and in_arpinput().
The AF_ARP case in the (*if_output)() routine will handle the interface type
specific bits.

Obtained from:	NetBSD
2004-03-21 06:36:05 +00:00
Dag-Erling Smørgrav
f0f93429cf Run through indent(1) so I can read the code without getting a headache.
The result isn't quite knf, but it's knfer than the original, and far
more consistent.
2004-03-16 21:30:41 +00:00
Matthew N. Dodd
e952fa39de De-register. 2004-03-14 00:44:11 +00:00
Robert Watson
fe5a02c927 Lock down IP-layer encapsulation library:
- Add encapmtx to protect ip_encap.c global variables (encapsulation
   list).
 - Unifdef #ifdef 0 pieces of encap_init() which was (and now really
   is) basically a no-op.
 - Lock encapmtx when walking encaptab, modifying it, comparing
   entries, etc.
 - Remove spl's.

Note that currently there's no facilite to make sure outstanding
use of encapsulation methods on a table entry have drained bfore
we allow a table entry to be removed.  As such, it's currently the
caller's responsibility to make sure that draining takes place.

Reviewed by:	mlaier
2004-03-10 02:48:50 +00:00
Robert Watson
846840ba95 Scrub unused variable zeroin_addr. 2004-03-10 01:01:04 +00:00
Jeffrey Hsu
a062038267 To comply with the spec, do not copy the TOS from the outer IP
header to the inner IP header of the PIM Register if this is a PIM
Null-Register message.

Submitted by:	Pavlin Radoslavov <pavlin@icir.org>
2004-03-08 07:47:27 +00:00
Jeffrey Hsu
4c9792f9d3 Include <sys/types.h> for autoconf/automake detection.
Submitted by:	Pavlin Radoslavov <pavlin@icir.org>
2004-03-08 07:45:32 +00:00
Max Laier
b81dae751b Add some missing DUMMYNET_UNLOCK() in config_pipe().
Noticed by:	Simon Coggins
Approved by:	bms(mentor)
2004-03-03 01:33:22 +00:00
Max Laier
4672d81921 Two minor follow-ups on the MT_TAG removal:
ifp is now passed explicitly to ether_demux; no need to look it up again.
Make mtag a global var in ip_input.

Noticed by:	rwatson
Approved by:	bms(mentor)
2004-03-02 14:37:23 +00:00
Robert Watson
6200a93f82 Rename NET_PICKUP_GIANT() to NET_LOCK_GIANT(), and NET_DROP_GIANT()
to NET_UNLOCK_GIANT().  While they are used in similar ways, the
semantics are quite different -- NET_LOCK_GIANT() and NET_UNLOCK_GIANT()
directly wrap mutex lock and unlock operations, whereas drop/pickup
special case the handling of Giant recursion.  Add a comment saying
as much.

Add NET_ASSERT_GIANT(), which conditionally asserts Giant based
on the value of debug_mpsafenet.
2004-03-01 22:37:01 +00:00
Hajimu UMEMOTO
04d3a45241 fix -O0 compilation without INET6.
Pointed out by:	ru
2004-03-01 19:10:31 +00:00
Robert Watson
768bbd68cc Remove unneeded {} originally used to hold local variables for dummynet
in a code block, as the variable is now gone.

Submitted by:	sam
2004-02-28 19:50:43 +00:00
Robert Watson
a7b6a14aee Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These
were needed by the MAC Framework until inpcbs gained labels.

Submitted by:	sam
2004-02-28 15:12:20 +00:00
Max Laier
25a4adcec4 Bring eventhandler callbacks for pf.
This enables pf to track dynamic address changes on interfaces (dailup) with
the "on (<ifname>)"-syntax. This also brings hooks in anticipation of
tracking cloned interfaces, which will be in future versions of pf.

Approved by: bms(mentor)
2004-02-26 04:27:55 +00:00
Max Laier
cc5934f5af Tweak existing header and other build infrastructure to be able to build
pf/pflog/pfsync as modules. Do not list them in NOTES or modules/Makefile
(i.e. do not connect it to any (automatic) builds - yet).

Approved by: bms(mentor)
2004-02-26 03:53:54 +00:00
Don Lewis
47934cef8f Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire().  Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails.  Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by:	bms
2004-02-26 00:27:04 +00:00
Max Laier
ac9d7e2618 Re-remove MT_TAGs. The problems with dummynet have been fixed now.
Tested by: -current, bms(mentor), me
Approved by: bms(mentor), sam
2004-02-25 19:55:29 +00:00
Bruce Evans
0613995bd0 Fixed namespace pollution in rev.1.74. Implementation of the syncache
increased <netinet/tcp_var>'s already large set of prerequisites, and
this was handled badly.  Just don't declare the complete syncache struct
unless <netinet/pcb.h> is included before <netinet/tcp_var.h>.

Approved by:	jlemon (years ago, for a more invasive fix)
2004-02-25 13:03:01 +00:00
Bruce Evans
a545b1dc4d Don't use the negatively-opaque type uma_zone_t or be chummy with
<vm/uma.h>'s idempotency indentifier or its misspelling.
2004-02-25 11:53:19 +00:00
Jeffrey Hsu
89c02376fc Relax a KASSERT condition to allow for a valid corner case where
the FIN on the last segment consumes an extra sequence number.

Spurious panic reported by Mike Silbersack <silby@silby.com>.
2004-02-25 08:53:17 +00:00
Andre Oppermann
12e2e97051 Convert the tcp segment reassembly queue to UMA and limit the maximum
amount of segments it will hold.

The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:

 net.inet.tcp.reass.maxsegments (loader tuneable)
  specifies the maximum number of segments all tcp reassemly queues can
  hold (defaults to 1/16 of nmbclusters).

 net.inet.tcp.reass.maxqlen
  specifies the maximum number of segments any individual tcp session queue
  can hold (defaults to 48).

 net.inet.tcp.reass.cursegments (readonly)
  counts the number of segments currently in all reassembly queues.

 net.inet.tcp.reass.overflows (readonly)
  counts how often either the global or local queue limit has been reached.

Tested by:	bms, silby
Reviewed by:	bms, silby
2004-02-24 15:27:41 +00:00
Pawel Jakub Dawidek
41fe0c8ad5 Fixed ucred structure leak.
Approved by:	scottl (mentor)
PR:		54163
MFC after:	3 days
2004-02-19 14:13:21 +00:00
Max Laier
36e8826ffb Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is
not working properly with the patch in place.

Approved by: bms(mentor)
2004-02-18 00:04:52 +00:00
Hajimu UMEMOTO
da0f40995d IPSEC and FAST_IPSEC have the same internal API now;
so merge these (IPSEC has an extra ipsecstat)

Submitted by:	"Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>
2004-02-17 14:02:37 +00:00
Bruce M Simpson
88f6b0435e Shorten the name of the socket option used to enable TCP-MD5 packet
treatment.

Submitted by:	Vincent Jardin
2004-02-16 22:21:16 +00:00
Hajimu UMEMOTO
70dbc6cbfc don't update outgoing ifp, if ipsec tunnel mode encapsulation
was not made.

Obtained from:	KAME
2004-02-16 17:05:06 +00:00
Bruce M Simpson
91179f796d Spell types consistently throughout this file. Do not use the __packed attribute, as we are often #include'd from userland without <sys/cdefs.h> in front of us, and it is not strictly necessary.
Noticed by:	Sascha Blank
2004-02-16 14:40:56 +00:00
Bruce M Simpson
32ff046639 Final brucification pass. Spell types consistently (u_int). Remove bogus
casts. Remove unnecessary parenthesis.

Submitted by:	bde
2004-02-14 21:49:48 +00:00
Max Laier
97075d0c0a Do not expose ip_dn_find_rule inline function to userland and unbreak world.
----------------------------------------------------------------------
2004-02-13 22:26:36 +00:00
Max Laier
189a0ba4e7 Do not check receive interface when pfil(9) hook changed address.
Approved by: bms(mentor)
2004-02-13 19:20:43 +00:00
Max Laier
1094bdca51 This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing
them mostly with packet tags (one case is handled by using an mbuf flag
since the linkage between "caller" and "callee" is direct and there's no
need to incur the overhead of a packet tag).

This is (mostly) work from: sam

Silence from: -arch
Approved by: bms(mentor), sam, rwatson
2004-02-13 19:14:16 +00:00
Bruce M Simpson
265ed01285 Brucification.
Submitted by:	bde
2004-02-13 18:21:45 +00:00
Hajimu UMEMOTO
efddf5c64d supported IPV6_RECVPATHMTU socket option.
Obtained from:	KAME
2004-02-13 14:50:01 +00:00
Bruce M Simpson
b30190b542 Update the prototype for tcpsignature_apply() to reflect the spelling of
the types used by m_apply()'s callback function, f, as documented in mbuf(9).

Noticed by:	njl
2004-02-12 20:16:09 +00:00
Bruce M Simpson
bca0e5bfc3 style(9) pass; whitespace and comments.
Submitted by:	njl
2004-02-12 20:12:48 +00:00
Bruce M Simpson
a0194ef1ea Remove an unnecessary initialization that crept in from the code which
verifies TCP-MD5 digests.

Noticed by:	njl
2004-02-12 20:08:28 +00:00
Bruce M Simpson
45d370ee8b Fix a typo; left out preprocessor conditional for sigoff variable, which
is only used by TCP_SIGNATURE code.

Noticed by:	Roop Nanuwa
2004-02-11 09:46:54 +00:00
Bruce M Simpson
1cfd4b5326 Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by:	sentex.net
2004-02-11 04:26:04 +00:00
Hajimu UMEMOTO
f073c60f73 pass pcb rather than so. it is expected that per socket policy
works again.
2004-02-03 18:20:55 +00:00
Andre Oppermann
b74d89bbbb Add sysctl net.inet.icmp.reply_src to specify the interface name
used for the ICMP reply source in reponse to packets which are not
directly addressed to us.  By default continue with with normal
source selection.

Reviewed by:	bms
2004-02-02 22:53:16 +00:00
Andre Oppermann
1488eac8ec More verbose description of the source ip address selection for ICMP replies.
Reviewed by:	bms
2004-02-02 22:17:09 +00:00
Poul-Henning Kamp
be8a62e821 Introduce the SO_BINTIME option which takes a high-resolution timestamp
at packet arrival.

For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead.  Simultaneous
use of the two options is possible and they will return consistent
timestamps.

This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
2004-01-31 10:40:25 +00:00
Maxim Sobolev
4c83789253 Remove NetBSD'isms (add FreeBSD'isms?), which makes gre(4) working again. 2004-01-30 09:03:01 +00:00
Ruslan Ermilov
0ca2861fc9 Correct the descriptions of the net.inet.{udp,raw}.recvspace sysctls. 2004-01-27 22:17:39 +00:00
Maxim Sobolev
7735aeb9bb Add support for WCCPv2. It should be enablem manually using link2
ifconfig(8) flag since header for version 2 is the same but IP payload
is prepended with additional 4-bytes field.

Inspired by:	Roman Synyuk <roman@univ.kiev.ua>
MFC after:	2 weeks
2004-01-26 12:33:56 +00:00
Maxim Sobolev
6e628b8187 (whilespace-only)
Kill trailing spaces.
2004-01-26 12:21:59 +00:00
Andre Oppermann
241f1e33b1 Remove leftover FREE() from changes in rev 1.50.
Noticed by:	Jun Kuriyama <kuriyama@imgsrc.co.jp>
2004-01-23 01:39:12 +00:00
Andre Oppermann
201d185b69 Split the overloaded variable 'win' into two for their specific purposes:
recwin and sendwin.  This removes a big source of confusion and makes
following the code much easier.

Reviewed by:	sam (mentor)
Obtained from:	DragonFlyBSD rev 1.6 (hsu)
2004-01-22 23:22:14 +00:00
Andre Oppermann
1ddba8d63e Move the reduction by one of the syncache limit after the zone has been
allocated.

Reviewed by:    sam (mentor)
Obtained from:  DragonFlyBSD rev 1.6 (hsu)
2004-01-22 23:14:48 +00:00
Andre Oppermann
73080de2be Remove an unused variable and put the sockaddr_in6 onto the stack instead
of malloc'ing it.

Reviewed by:	sam (mentor)
Obtained from:	DragonFlyBSD rev 1.6 (hsu)
2004-01-22 23:10:11 +00:00
Jeffrey Hsu
61a36e3dfc Merge from DragonFlyBSD rev 1.10:
date: 2003/09/02 10:04:47;  author: hsu;  state: Exp;  lines: +5 -6
Account for when Limited Transmit is not congestion window limited.

Obtained from:	DragonFlyBSD
2004-01-20 21:40:25 +00:00
Poul-Henning Kamp
5e289f9eb6 Mostly mechanical rework of libalias:
Makes it possible to have multiple packet aliasing instances in a
single process by moving all static and global variables into an
instance structure called "struct libalias".

Redefine a new API based on s/PacketAlias/LibAlias/g

Add new "instance" argument to all functions in the new API.

Implement old API in terms of the new API.
2004-01-17 10:52:21 +00:00
Hajimu UMEMOTO
548c676b32 do not deref freed pointer
Submitted by:	"Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>
Reviewed by:	itojun
2004-01-13 09:51:47 +00:00
Andre Oppermann
bed824fa90 Disable the minmssoverload connection drop by default until the detection
logic is refined.
2004-01-12 15:46:04 +00:00
Don Lewis
e29ef13f6c Check that sa_len is the appropriate value in tcp_usr_bind(),
tcp6_usr_bind(), tcp_usr_connect(), and tcp6_usr_connect() before checking
to see whether the address is multicast so that the proper errno value
will be returned if sa_len is incorrect.  The checks are identical to the
ones in in_pcbbind_setup(), in6_pcbbind(), and in6_pcbladdr(), which are
called after the multicast address check passes.

MFC after:	30 days
2004-01-10 08:53:00 +00:00
Andre Oppermann
1ddc17c1d5 Reduce TCP_MINMSS default to 216. The AX.25 protocol (packet radio)
is frequently used with an MTU of 256 because of slow speeds and a
high packet loss rate.
2004-01-09 14:14:10 +00:00
Andre Oppermann
53369ac9bb Limiters and sanity checks for TCP MSS (maximum segement size)
resource exhaustion attacks.

For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU.  This is done
dynamically based on feedback from the remote host and network
components along the packet path.  This information can be
abused to pretend an extremely low path MTU.

The resource exhaustion works in two ways:

 o during tcp connection setup the advertized local MSS is
   exchanged between the endpoints.  The remote endpoint can
   set this arbitrarily low (except for a minimum MTU of 64
   octets enforced in the BSD code).  When the local host is
   sending data it is forced to send many small IP packets
   instead of a large one.

   For example instead of the normal TCP payload size of 1448
   it forces TCP payload size of 12 (MTU 64) and thus we have
   a 120 times increase in workload and packets. On fast links
   this quickly saturates the local CPU and may also hit pps
   processing limites of network components along the path.

   This type of attack is particularly effective for servers
   where the attacker can download large files (WWW and FTP).

   We mitigate it by enforcing a minimum MTU settable by sysctl
   net.inet.tcp.minmss defaulting to 256 octets.

 o the local host is reveiving data on a TCP connection from
   the remote host.  The local host has no control over the
   packet size the remote host is sending.  The remote host
   may chose to do what is described in the first attack and
   send the data in packets with an TCP payload of at least
   one byte.  For each packet the tcp_input() function will
   be entered, the packet is processed and a sowakeup() is
   signalled to the connected process.

   For example an attack with 2 Mbit/s gives 4716 packets per
   second and the same amount of sowakeup()s to the process
   (and context switches).

   This type of attack is particularly effective for servers
   where the attacker can upload large amounts of data.
   Normally this is the case with WWW server where large POSTs
   can be made.

   We mitigate this by calculating the average MSS payload per
   second.  If it goes below 'net.inet.tcp.minmss' and the pps
   rate is above 'net.inet.tcp.minmssoverload' defaulting to
   1000 this particular TCP connection is resetted and dropped.

MITRE CVE:	CAN-2004-0002
Reviewed by:	sam (mentor)
MFC after:	1 day
2004-01-08 17:40:07 +00:00
Andre Oppermann
bf87c82ebb If path mtu discovery is enabled set the DF bit in all cases we
send packets on a tcp connection.

PR:		kern/60889
Tested by:	Richard Wendland <richard@wendland.org.uk>
Approved by:	re (scottl)
2004-01-08 11:17:11 +00:00
Andre Oppermann
e0f630ea7a Do not set the ip_id to zero when DF is set on packet and
restore the general pre-randomid behaviour.

Setting the ip_id to zero causes several problems with
packet reassembly when a device along the path removes
the DF bit for some reason.

Other BSD and Linux have found and fixed the same issues.

PR:		kern/60889
Tested by:	Richard Wendland <richard@wendland.org.uk>
Approved by:	re (scottl)
2004-01-08 11:13:40 +00:00
Andre Oppermann
dba7bc6a65 Enable the following TCP options by default to give it more exposure:
rfc3042  Limited retransmit
 rfc3390  Increasing TCP's initial congestion Window
 inflight TCP inflight bandwidth limiting

All my production server have it enabled and there have been no
issues.  I am confident about having them on by default and it gives
us better overall TCP performance.

Reviewed by:	sam (mentor)
2004-01-06 23:29:46 +00:00
Andre Oppermann
87c3bd2755 According to RFC1812 we have to ignore ICMP redirects when we
are acting as router (ipforwarding enabled).

This doesn't fix the problem that host routes from ICMP redirects
are never removed from the kernel routing table but removes the
problem for machines doing packet forwarding.

Reviewed by:	sam (mentor)
2004-01-06 23:20:07 +00:00
Ruslan Ermilov
3b95e1346a Document the net.inet.ip.subnets_are_local sysctl. 2003-12-30 16:05:03 +00:00
Maxim Sobolev
73d7ddbc56 Sync with NetBSD:
if_gre.c rev.1.41-1.49

 o Spell output with two ts.
 o Remove assigned-to but not used variable.
 o fix grammatical error in a diagnostic message.
 o u_short -> u_int16_t.
 o gi_len is ip_len, so it has to be network byteorder.

if_gre.h rev.1.11-1.13

 o prototype must not have variable name.
 o u_short -> u_int16_t.
 o Spell address with two d's.

ip_gre.c rev.1.22-1.29

 o KNF - return is not a function.
 o The "osrc" variable in gre_mobile_input() is only ever set but not
   referenced; remove it.
 o correct (false) assumptions on mbuf chain.  not sure if it really helps, but
   anyways, it is necessary to perform m_pullup.
 o correct arg to m_pullup (need to count IP header size as well).
 o remove redundant adjustment of m->m_pkthdr.len.
 o clear m_flags just for safety.
 o tabify.
 o u_short -> u_int16_t.

MFC after:	2 weeks
2003-12-30 11:41:43 +00:00
Sam Leffler
437ffe1823 o eliminate widespread on-stack mbuf use for bpf by introducing
a new bpf_mtap2 routine that does the right thing for an mbuf
  and a variable-length chunk of data that should be prepended.
o while we're sweeping the drivers, use u_int32_t uniformly when
  when prepending the address family (several places were assuming
  sizeof(int) was 4)
o return M_ASSERTVALID to BPF_MTAP* now that all stack-allocated
  mbufs have been eliminated; this may better be moved to the bpf
  routines

Reviewed by:	arch@ and several others
2003-12-28 03:56:00 +00:00
Maxim Konovalov
fad1d65260 o Fix a comment: softticks lives in sys/kern/kern_timeout.c.
PR:		kern/60613
Submitted by:	Gleb Smirnoff
MFC after:	3 days
2003-12-27 14:08:53 +00:00
Hajimu UMEMOTO
8b8a0cef40 NULL is not 0.
Submitted by:	"Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>
2003-12-24 18:22:04 +00:00
Ruslan Ermilov
3117579171 I didn't notice it right away, but check the right length too. 2003-12-23 14:08:50 +00:00
Ruslan Ermilov
78e2d2bd28 Fix a problem introduced in revision 1.84: m_pullup() does not
necessarily return the same mbuf chain so we need to recompute
mtod() consumers after pulling up.
2003-12-23 13:33:23 +00:00
Peter Wemm
a89ec05e3e Catch a few places where NULL (pointer) was used where 0 (integer) was
expected.
2003-12-23 02:36:43 +00:00
Sam Leffler
ededbec187 o move mutex init/destroy logic to the module load/unload hooks;
otherwise they are initialized twice when the code is statically
  configured in the kernel because the module load method gets
  invoked before the user application calls ip_mrouter_init
o add a mutex to synchronize the module init/done operations; this
  sort of was done using the value of ip_mroute but X_ip_mrouter_done
  sets it to NULL very early on which can lead to a race against
  ip_mrouter_init--using the additional mutex means this is safe now
o don't call ip_mrouter_reset from ip_mrouter_init; this now happens
  once at module load and X_ip_mrouter_done does the appropriate
  cleanup work to insure the data structures are in a consistent
  state so that a subsequent init operation inherits good state

Reviewed by:	juli
2003-12-20 18:32:48 +00:00
John Baldwin
a5b061f9d2 Fix some becuase -> because typos.
Reported by:	Marco Wertejuk <wertejuk@mwcis.com>
2003-12-17 16:12:01 +00:00
Robert Watson
2d92ec9858 Switch TCP over to using the inpcb label when responding in timed
wait, rather than the socket label.  This avoids reaching up to
the socket layer during connection close, which requires locking
changes.  To do this, introduce MAC Framework entry point
mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond()
instead of calling mac_create_mbuf_from_socket() or
mac_create_mbuf_netlayer().  Introduce MAC Policy entry point
mpo_create_mbuf_from_inpcb(), and implementations for various
policies, which generally just copy label data from the inpcb to
the mbuf.  Assert the inpcb lock in the entry point since we
require consistency for the inpcb label reference.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-12-17 14:55:11 +00:00
Maxim Konovalov
1c86761b2a o IN_MULTICAST wants an address in host byte order.
PR:		kern/60304
Submitted by:	demon
MFC after:	1 week
2003-12-16 18:21:47 +00:00
Maksim Yevmenkin
a6a66f5c4c Do not panic when flushing dummynet firewall rules
Reviewed by: andre
Approved by: re (scottl)
2003-12-06 09:01:25 +00:00
Andre Oppermann
f5bd8e9aff Swap destination and source arguments of two bcopy() calls.
Before committing the initial tcp_hostcache I changed them from memcpy()
to conform with FreeBSD style without realizing the difference in argument
definition.

This fixes hostcache operation for IPv6 (in general and explicitly IPv6
path mtu discovery) and T/TCP (RFC1644).

Submitted by:	Taku YAMAMOTO <taku@cent.saitama-u.ac.jp>
Approved by:	re (rwatson)
2003-12-02 21:25:12 +00:00
Sam Leffler
d559f5c3d8 Include opt_ipsec.h so IPSEC/FAST_IPSEC is defined and the appropriate
code is compiled in to support the O_IPSEC operator.  Previously no
support was included and ipsec rules were always matching.  Note that
we do not return an error when an ipsec rule is added and the kernel
does not have IPsec support compiled in; this is done intentionally
but we may want to revisit this (document this in the man page).

PR:		58899
Submitted by:	Bjoern A. Zeeb
Approved by:	re (rwatson)
2003-12-02 00:23:45 +00:00
Andre Oppermann
cd6c4060c8 Fix an optimization where I made an ifdef'd out section to broad.
When the hostcache bucket limit is reached the last bucket wasn't
removed from the bucket row but inserted a few lines later at the
bucket row head again.  This leads to infinite loop when the same
bucket row is accessed the next time for a lookup/insert or purge
action.

Tested by:	imp, Matt Smith
Approved by:	re (rwatson)
2003-11-28 16:33:03 +00:00
Andre Oppermann
623f556031 Fix verify_rev_path() function. The author of this function tried to
cut corners which completely broke down when the routing table locking
was introduced.

Reviewed by:	sam (mentor)
Approved by:	re (rwatson)
2003-11-27 09:40:13 +00:00
Andre Oppermann
0cfbbe3bde Make sure all uses of stack allocated struct route's are properly
zeroed.  Doing a bzero on the entire struct route is not more
expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst.

Reviewed by:	sam (mentor)
Approved by:	re  (scottl)
2003-11-26 20:31:13 +00:00
Sam Leffler
5bd311a566 Split the "inp" mutex class into separate classes for each of divert,
raw, tcp, udp, raw6, and udp6 sockets to avoid spurious witness
complaints.

Reviewed by:	rwatson
Approved by:	re (rwatson)
2003-11-26 01:40:44 +00:00
Andre Oppermann
943ae30252 Restructure a too broad ifdef which was disabling the setting of the
tcp flightsize sysctl value for local networks in the !INET6 case.

Approved by:	re (scottl)
2003-11-25 20:58:59 +00:00
Sam Leffler
6714d7c751 Correct a problem where ipfw-generated packets were being returned
for ipfw processing w/o an indication the packets were generated
by ipfw--and so should not be processed (this manifested itself
as a LOR.)  The flag bit in the mbuf that was used to mark the
packets was not listed in M_COPYFLAGS so if a packet had a header
prepended (as done by IPsec) the flag was lost.  Correct this by
defining a new M_PROTO6 flag and use it to mark packets that need
this processing.

Reviewed by:	bms
Approved by:	re (rwatson)
MFC after:	2 weeks
2003-11-24 03:57:03 +00:00
Sam Leffler
6a3ca7514d Use MPSAFE callouts only when debug.mpsafenet is 1. Both timer routines
potentially transmit packets that may enter KAME IPsec w/o Giant if the
callouts are marked MPSAFE.

Reviewed by:	ume
Approved by:	re (rwatson)
2003-11-23 18:13:41 +00:00
Thomas Moestl
1f831750b5 bzero() the the sockaddr used for the destination address for
rtalloc_ign() in in_pcbconnect_setup() before it is filled out.
Otherwise, stack junk would be left in sin_zero, which could
cause host routes to be ignored because they failed the comparison
in rn_match().
This should fix the wrong source address selection for connect() to
127.0.0.1, among other things.

Reviewed by:	sam
Approved by:	re (rwatson)
2003-11-23 03:02:00 +00:00
Andre Oppermann
97d8d152c2 Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table.  Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination.  Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by:	sam (mentor), bms
Reviewed by:	-net, -current, core@kame.net (IPv6 parts)
Approved by:	re (scottl)
2003-11-20 20:07:39 +00:00
Andre Oppermann
26d02ca7ba Remove RTF_PRCLONING from routing table and adjust users of it
accordingly.  The define is left intact for ABI compatibility
with userland.

This is a pre-step for the introduction of tcp_hostcache.  The
network stack remains fully useable with this change.

Reviewed by:	sam (mentor), bms
Reviewed by:	-net, -current, core@kame.net (IPv6 parts)
Approved by:	re (scottl)
2003-11-20 19:47:31 +00:00
Maxim Konovalov
dbf7b38125 Fix an arguments order in check_uidgid() call.
PR:		kern/59314
Submitted by:	Andrey V. Shytov
Approved by:	re (rwatson, jhb)
2003-11-20 10:28:33 +00:00
Robert Watson
a557af222b Introduce a MAC label reference in 'struct inpcb', which caches
the   MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols.  This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks.  Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by:	sam, bms
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-18 00:39:07 +00:00
Olivier Houchard
8c8268cb4f In rip_abort(), unlock the inpcb if we didn't detach it, or we may
recurse on the lock before destroying the mutex.

Submitted by:	sam
2003-11-17 19:21:53 +00:00
Brian Feldman
633461295a Fix a few cases where MT_TAG-type "fake mbufs" are created on the stack, but
do not have mh_nextpkt initialized.  Somtimes what's there is "1", and the
ip_input() code pukes trying to m_free() it, rendering divert sockets and
such broken.
This really underscores the need to get rid of MT_TAG.

Reviewed by:	rwatson
2003-11-17 03:17:49 +00:00
Andre Oppermann
be7e82e44a Make two casts correct for all types of 64bit platforms.
Explained by:	bde
2003-11-16 12:50:33 +00:00
Andre Oppermann
df903fee84 Correct a cast to make it compile on 64bit platforms (noticed by tinderbox)
and remove two unneccessary variable initializations.
Make the introduction comment more clear with regard which parts of
the packet are touched.

Requested by:	luigi
2003-11-15 17:03:37 +00:00
Andre Oppermann
c76ff7084f Make ipstealth global as we need it in ip_fastforward too. 2003-11-15 01:45:56 +00:00
Andre Oppermann
02c1c7070e Remove the global one-level rtcache variable and associated
complex locking and rework ip_rtaddr() to do its own rtlookup.
Adopt all its callers to this and make ip_output() callable
with NULL rt pointer.

Reviewed by:	sam (mentor)
2003-11-14 21:48:57 +00:00
Andre Oppermann
9188b4a169 Introduce ip_fastforward and remove ip_flow.
Short description of ip_fastforward:

 o adds full direct process-to-completion IPv4 forwarding code
 o handles ip fragmentation incl. hw support (ip_flow did not)
 o sends icmp needfrag to source if DF is set (ip_flow did not)
 o supports ipfw and ipfilter (ip_flow did not)
 o supports divert, ipfw fwd and ipfilter nat (ip_flow did not)
 o returns anything it can't handle back to normal ip_input

Enable with sysctl -w net.inet.ip.fastforwarding=1

Reviewed by:	sam (mentor)
2003-11-14 21:02:22 +00:00
Sam Leffler
f7bbe2c0f1 add missing inpcb lock before call to tcp_twclose (which reclaims the inpcb)
Supported by:	FreeBSD Foundation
2003-11-13 05:18:23 +00:00
Sam Leffler
1b73ca0bf1 o reorder some locking asserts to reflect the order of the locks
o correct a read-lock assert in in_pcblookup_local that should be
  a write-lock assert (since time wait close cleanups may alter state)

Supported by:	FreeBSD Foundation
2003-11-13 05:16:56 +00:00
Andre Oppermann
16d6c90f5d Move global variables for icmp_input() to its stack. With SMP or
preemption two CPUs can be in the same function at the same time
and clobber each others variables.  Remove register declaration
from local variables.

Reviewed by:	sam (mentor)
2003-11-13 00:32:13 +00:00
Andre Oppermann
2683ceb661 Do not fragment a packet with hardware assistance if it has the DF
bit set.

Reviewed by:	sam (mentor)
2003-11-12 23:35:40 +00:00
Bruce M Simpson
83453a06de Add a new sysctl knob, net.inet.udp.strict_mcast_mship, to the udp_input path.
This switch toggles between strict multicast delivery, and traditional
multicast delivery.

The traditional (default) behaviour is to deliver multicast datagrams to all
sockets which are members of that group, regardless of the network interface
where the datagrams were received.

The strict behaviour is to deliver multicast datagrams received on a
particular interface only to sockets whose membership is bound to that
interface.

Note that as a matter of course, multicast consumers specifying INADDR_ANY
for their interface get joined on the interface where the default route
happens to be bound. This switch has no effect if the interface which the
consumer specifies for IP_ADD_MEMBERSHIP is not UP and RUNNING.

The original patch has been cleaned up somewhat from that submitted. It has
been tested on a multihomed machine with multiple QuickTime RTP streams
running over the local switch, which doesn't do IGMP snooping.

PR:		kern/58359
Submitted by:	William A. Carrel
Reviewed by:	rwatson
MFC after:	1 week
2003-11-12 20:17:11 +00:00
Andre Oppermann
122aad88d5 dropwithreset is not needed in this case as tcp_drop() is already notifying
the other side. Before we were sending two RST packets.
2003-11-12 19:38:01 +00:00
Robert Watson
eca8a663d4 Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c).  This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE.  This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time.  This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change.  Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from:	bmilekic
Obtained from:		TrustedBSD Project
Sponsored by:		DARPA, Network Associates Laboratories
2003-11-12 03:14:31 +00:00
Sam Leffler
a0bf1601a7 correct typos
Pointed out by:	Mike Silbersack
2003-11-11 18:16:54 +00:00
Sam Leffler
3d0b255a9a o add missing inpcb locking in tcp_respond
o replace spl's with lock assertions

Supported by:	FreeBSD Foundation
2003-11-11 17:54:47 +00:00
Sam Leffler
383df78dc8 use Giant-less callouts when debug_mpsafenet is non-zero
Supported by:	FreeBSD Foundation
2003-11-10 23:29:33 +00:00
Ian Dowse
3ab2096b80 In in_pcbconnect_setup(), don't use the cached inp->inp_route unless
it is marked as RTF_UP. This appears to fix a crash that was sometimes
triggered when dhclient(8) tried to send a packet after an interface
had been detatched.

Reviewed by:	sam
2003-11-10 22:45:37 +00:00
Jeffrey Hsu
1ce43e2348 Mark TCP syncache timer as not Giant-free ready yet. 2003-11-10 20:42:04 +00:00
Sam Leffler
7138d65c3f replace explicit changes to rt_refcnt by RT_ADDREF and RT_REMREF
macros that expand to include assertions when the system is built
with INVARIANTS

Supported by:	FreeBSD Foundation
2003-11-08 23:36:32 +00:00
Sam Leffler
252f24a2cf divert socket fixups:
o pickup Giant in divert_packet to protect sbappendaddr since it
  can be entered through MPSAFE callouts or through ip_input when
  mpsafenet is 1
o add missing locking on output
o add locking to abort and shutdown
o add a ctlinput handler to invalidate held routing table references
  on an ICMP redirect (may not be needed)

Supported by:	FreeBSD Foundation
2003-11-08 23:09:42 +00:00
Sam Leffler
8484384564 assert optional inpcb is passed in locked
Supported by:	FreeBSD Foundation
2003-11-08 23:03:29 +00:00
Sam Leffler
59daba27d9 add locking assertions
Supported by:	FreeBSD Foundation
2003-11-08 23:02:36 +00:00
Sam Leffler
3c47a187b7 assert inpcb is locked in udp_output
Supported by:	FreeBSD Foundation
2003-11-08 23:00:48 +00:00
Sam Leffler
c29afad673 o correct locking problem: the inpcb must be held across tcp_respond
o add assertions in tcp_respond to validate inpcb locking assumptions
o use local variable instead of chasing pointers in tcp_respond

Supported by:	FreeBSD Foundation
2003-11-08 22:59:22 +00:00
Sam Leffler
2a0746208b use local values instead of chasing pointers
Supported by:	FreeBSD Foundation
2003-11-08 22:57:13 +00:00
Sam Leffler
fa286d7db2 replace mtx_assert by INP_LOCK_ASSERT
Supported by:	FreeBSD Foundation
2003-11-08 22:55:52 +00:00
Sam Leffler
50d7c061a3 add some missing locking
Supported by:	FreeBSD Foundation
2003-11-08 22:53:41 +00:00
Sam Leffler
1d78192b35 the sbappendaddr call in socket_send must be protected by Giant
because it can happen from an MPSAFE callout

Supported by:	FreeBSD Foundation
2003-11-08 22:51:18 +00:00
Sam Leffler
e3f268fc89 add locking assertions that turn into noops if INET6 is configured;
this is necessary because the ipv6 code shares the in_pcb code with
ipv4 but (presently) lacks proper locking

Supported by:	FreeBSD Foundation
2003-11-08 22:48:27 +00:00
Sam Leffler
7902224c6b o add a flags parameter to netisr_register that is used to specify
whether or not the isr needs to hold Giant when running; Giant-less
  operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
  calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
  Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
  have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive

Supported by:	FreeBSD Foundation
2003-11-08 22:28:40 +00:00
Sam Leffler
27a940c9a2 unbreak compilation of FAST_IPSEC
Supported by:	FreeBSD Foundation
2003-11-08 00:34:34 +00:00
Sam Leffler
aab621f060 MFp4: reminder that random id code is not reentrant
Supported by:	FreeBSD Foundation
2003-11-07 23:31:29 +00:00
Sam Leffler
8f1ee3683d Move uid/gid checking logic out of line and lock inpcb usage. This
has a LOR between IPFW inpcb locks but I'm committing it now as the
lesser of two evils (the other being unlocked use of in_pcblookup).

Supported by:	FreeBSD Foundation
2003-11-07 23:26:57 +00:00
Hajimu UMEMOTO
aef3a65eb7 use ipsec_getnhist() instead of obsoleted ipsec_gethist().
Submitted by:	"Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>
Reviewed by:	Ari Suutari <ari@suutari.iki.fi> (ipfw@)
2003-11-07 20:25:47 +00:00
Sam Leffler
ad67584665 Fix locking of the ip forwarding cache. We were holding a reference
to a routing table entry w/o bumping the reference count or locking
against the entry being free'd.  This caused major havoc (for some
reason it appeared most frequently for folks running natd).  Fix
is to bump the reference count whenever we copy the route cache
contents into a private copy so the entry cannot be reclaimed out
from under us.  This is a short term fix as the forthcoming routing
table changes will eliminate this cache entirely.

Supported by:	FreeBSD Foundation
2003-11-07 01:47:52 +00:00
Hajimu UMEMOTO
0f9ade718d - cleanup SP refcnt issue.
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all.  secpolicy no longer contain
  spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy.  assign ID field to
  all SPD entries.  make it possible for racoon to grab SPD entry on
  pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header.  a mode is always needed
  to compare them.
- fixed that the incorrect time was set to
  sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- an user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
  because key_spdacquire() had been implemented imcopletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
  alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
  XXX in theory refcnt should do the right thing, however, we have
  "spdflush" which would touch all SPs.  another solution would be to
  de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion.  ipsec_*_policy ->
  ipsec_*_pcbpolicy.
- avoid variable name confusion.
  (struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
  secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
  can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
  "src" of the spidx specifies ICMP type, and the port field in "dst"
  of the spidx specifies ICMP code.
- avoid from applying IPsec transport mode to the packets when the
  kernel forwards the packets.

Tested by:	nork
Obtained from:	KAME
2003-11-04 16:02:05 +00:00
Robert Watson
3de758d3e3 Note that when ip_output() is called from ip_forward(), it will already
have its options inserted, so the opt argument to ip_output()  must be
NULL.
2003-11-03 18:03:05 +00:00
Robert Watson
eecfe773aa Remove comment about desire for eventual explicit labeling of ICMP
header copy made on input path: this is now handled differently.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-03 18:01:38 +00:00
Sam Leffler
04df2fbbb8 Remove bogus RTFREE that was added in rev 1.47. The rmx code operates
directly on the radix tree and does not hold any routing table refernces.
This fixes the reference counting problems that manifested itself as a
panic during unmount of filesystems that were mounted by NFS over an
interface that had been removed.

Supported by:	FreeBSD Foundation
2003-11-03 06:11:44 +00:00
Sam Leffler
9ce7877897 Correct rev 1.56 which (incorrectly) reversed the test used to
decide if in_pcbpurgeif0 should be invoked.

Supported by:	FreeBSD Foundation
2003-11-03 03:22:39 +00:00
Mike Silbersack
4bd4fa3fe6 Add an additional check to the tcp_twrecycleable function; I had
previously only considered the send sequence space.  Unfortunately,
some OSes (windows) still use a random positive increments scheme for
their syn-ack ISNs, so I must consider receive sequence space as well.

The value of 250000 bytes / second for Microsoft's ISN rate of increase
was determined by testing with an XP machine.
2003-11-02 07:47:03 +00:00
Mike Silbersack
96af9ea52b - Add a new function tcp_twrecycleable, which tells us if the ISN which
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.

- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_Wait sockets which are hogging local ports can be safely
freed.

This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.
2003-11-01 07:30:08 +00:00
Brooks Davis
9bf40ede4a Replace the if_name and if_unit members of struct ifnet with new members
if_xname, if_dname, and if_dunit. if_xname is the name of the interface
and if_dname/unit are the driver name and instance.

This change paves the way for interface renaming and enhanced pseudo
device creation and configuration symantics.

Approved By:	re (in principle)
Reviewed By:	njl, imp
Tested On:	i386, amd64, sparc64
Obtained From:	NetBSD (if_xname)
2003-10-31 18:32:15 +00:00
Sam Leffler
9c63e9dbd7 Overhaul routing table entry cleanup by introducing a new rtexpunge
routine that takes a locked routing table reference and removes all
references to the entry in the various data structures. This
eliminates instances of recursive locking and also closes races
where the lock on the entry had to be dropped prior to calling
rtrequest(RTM_DELETE).  This also cleans up confusion where the
caller held a reference to an entry that might have been reclaimed
(and in some cases used that reference).

Supported by:	FreeBSD Foundation
2003-10-30 23:02:51 +00:00
Sam Leffler
d0402f1b73 Potential fix for races shutting down callouts when unloading
the module.  Previously we grabbed the mutex used by the callouts,
then stopped the callout with callout_stop, but if the callout
was already active and blocked by the mutex then it would continue
later and reference the mutex after it was destroyed.  Instead
stop the callout first then lock.

Supported by:	FreeBSD Foundation
2003-10-29 19:15:00 +00:00
Sam Leffler
3520e9d61d o add locking to protect routing table refcnt manipulations
o add some more debugging help for figuring out why folks are
  getting complaints about releasing routing table entries with
  a zero refcnt
o fix comment that talked about spl's
o remove duplicate define of DUMMYNET_DEBUG

Supported by:	FreeBSD Foundation
2003-10-29 19:03:58 +00:00
Hajimu UMEMOTO
59dfcba4aa add ECN support in layer-3.
- implement the tunnel egress rule in ip_ecn_egress() in ip_ecn.c.
   make ip{,6}_ecn_egress() return integer to tell the caller that
   this packet should be dropped.
 - handle ECN at fragment reassembly in ip_input.c and frag6.c.

Obtained from:	KAME
2003-10-29 15:07:04 +00:00
Hajimu UMEMOTO
11de19f44d ip6_savecontrol() argument is redundant 2003-10-29 12:52:28 +00:00
Sam Leffler
9c855a36c1 Introduce the notion of "persistent mbuf tags"; these are tags that stay
with an mbuf until it is reclaimed.  This is in contrast to tags that
vanish when an mbuf chain passes through an interface.  Persistent tags
are used, for example, by MAC labels.

Add an m_tag_delete_nonpersistent function to strip non-persistent tags
from mbufs and use it to strip such tags from packets as they pass through
the loopback interface and when turned around by icmp.  This fixes problems
with "tag leakage".

Pointed out by:	Jonathan Stone
Reviewed by:	Robert Watson
2003-10-29 05:40:07 +00:00
Sam Leffler
395bb18680 speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by:	ps/jayanth
Obtained from:	netbsd (jason thorpe)
2003-10-28 05:47:40 +00:00
Hajimu UMEMOTO
618d51bbdc revert following unwanted changes:
- __packed to __attribute__((__packed__)
  -  uintN_t back to u_intN_t

Reported by:	bde
2003-10-25 10:57:08 +00:00
Hajimu UMEMOTO
16cd67e933 correct namespace pollution.
Submitted by:	bde
2003-10-25 09:37:10 +00:00
Hajimu UMEMOTO
c302f5bc07 remove the ip6r0_addr and ip6r0_slmap members from ip6_rthdr0{}
according to rfc2292bis.

Obtained from:	KAME
2003-10-24 20:37:05 +00:00
Hajimu UMEMOTO
5434eaa208 correct tab and order. 2003-10-24 19:51:49 +00:00
Hajimu UMEMOTO
f95d46333d Switch Advanced Sockets API for IPv6 from RFC2292 to RFC3542
(aka RFC2292bis).  Though I believe this commit doesn't break
backward compatibility againt existing binaries, it breaks
backward compatibility of API.
Now, the applications which use Advanced Sockets API such as
telnet, ping6, mld6query and traceroute6 use RFC3542 API.

Obtained from:	KAME
2003-10-24 18:26:30 +00:00
Mike Silbersack
0709c23335 Reduce the number of tcp time_wait structs to maxsockets / 5; this ensures
that at most 20% of sockets can be in time_wait at one time, ensuring
that time_wait sockets do not starve real connections from inpcb
structures.

No implementation change is needed, jlemon already implemented a nice
LRU-ish algorithm for tcp_tw structure recycling.

This should reduce the need for sysadmins to lower the default msl on
busy servers.
2003-10-24 05:44:14 +00:00
Sam Leffler
ac6b0748be o restructure initialization code so data structures are setup
when loaded as a module
o cleanup data structures on module unload when no application has
  been started (i.e. kldload, kldunload w/o mrtd)
o remove extraneous unlocks immediately prior to destroying them

Supported by:	FreeBSD Foundation
2003-10-24 00:09:18 +00:00
Mike Silbersack
184dcdc7c8 Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.
2003-10-21 18:28:36 +00:00
Hajimu UMEMOTO
b339980338 enclose IPv6 part with ifdef INET6.
Obtained from:	KAME
2003-10-20 16:19:01 +00:00
Hajimu UMEMOTO
31b3783c8d correct linkmtu handling.
Obtained from:	KAME
2003-10-20 15:27:48 +00:00
Hajimu UMEMOTO
31b1bfe1b0 - add dom_if{attach,detach} framework.
- transition to use ifp->if_afdata.

Obtained from:	KAME
2003-10-17 15:46:31 +00:00
Sam Leffler
f51f805f7e pfil hooks can modify packet contents so check if the destination
address has been changed when PFIL_HOOKS is enabled and, if it has,
arrange for the proper action by ip*_forward.

Supported by:	FreeBSD Foundation
Submitted by:	Pyun YongHyeon
2003-10-16 16:25:25 +00:00
Sam Leffler
b15694110f Drop dummynet lock when calling back into the network stack to deliver
packets.  This eliminates a LOR with Giant that caused outbound pipes
to fail.

Supported by:	FreeBSD Foundation
2003-10-16 16:21:25 +00:00
Kirk McKusick
b03587f06a Malloc buckets of size 128 have been having their 64-byte offset
trashed after being freed. This has caused several panics including
kern/42277 related to soft updates. Jim Kuhn tracked the problem
down to ipfw limit rule processing.  In the expiry of dynamic rules,
it is possible for an O_LIMIT_PARENT rule to be removed when it still
has live children.  When the children eventually do expire, a pointer
to the (long gone) parent is dereferenced and a count decremented.
Since this memory can, and is, allocated for other purposes (in the
case of kern/42277 an inodedep structure), chaos ensues. The offset
in question in inodedep is the offset of the 16 bit count field in
the ipfw2 ipfw_dyn_rule.

Submitted by:	Jim Kuhn <jkuhn@sandvine.com>
Reviewed by:	"Evgueni V. Gavrilov" <aquatique@rusunix.org>
Reviewed by:	Ben Pfountz <netprince@vt.edu>
MFC after:	1 week
2003-10-16 02:00:12 +00:00
Sam Leffler
b35a1e5d66 purge extraneous ';'s
Supported by:	FreeBSD Foundation
Noticed by:	bde
2003-10-15 18:19:28 +00:00
Sam Leffler
929b31ddab Lock ip forwarding route cache. While we're at it, remove the global
variable ipforward_rt by introducing an ip_forward_cacheinval() call
to use to invalidate the cache.

Supported by:	FreeBSD Foundation
2003-10-14 19:19:12 +00:00
Sam Leffler
888c2a3c4e remove dangling ';'s` that were harmless
Supported by:	FreeBSD Foundation
2003-10-14 18:45:50 +00:00
Hajimu UMEMOTO
06cd0a3f97 - fix typo in comment.
- style.

Obtained from:	KAME
2003-10-07 17:46:18 +00:00
Hajimu UMEMOTO
1ae02d474a nuke unused ICMPV6CTL_NAMES and KEYCTL_NAMES macros. 2003-10-07 15:14:33 +00:00
Hajimu UMEMOTO
8c99329e89 return(code) -> return (code)
Obtained from:	KAME
2003-10-07 15:02:29 +00:00
Sam Leffler
d1dd20be6e Locking for updates to routing table entries. Each rtentry gets a mutex
that covers updates to the contents.  Note this is separate from holding
a reference and/or locking the routing table itself.

Other/related changes:

o rtredirect loses the final parameter by which an rtentry reference
  may be returned; this was never used and added unwarranted complexity
  for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
  we assume the parent will remain as long as the clone; doing this avoids
  a circularity in locking during delete
o convert some timeouts to MPSAFE callouts

Notes:

1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
   applications cannot/do-no know about mutex's.  Doing this requires
   that the mutex be the last element in the structure.  A better solution
   is to introduce an externalized version of struct rtentry but this is
   a major task because of the intertwining of rtentry and other data
   structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
   work to eliminate many held references.  If not these will be resolved
   prior to release.
3. ATM changes are untested.

Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS (partly)
2003-10-04 03:44:50 +00:00
Sam Leffler
87002f0dc1 hookup ctlinput for fast ipsec versions of esp+ah protocols
Supported by:	FreeBSD Foundation
2003-10-03 22:06:36 +00:00
Sam Leffler
12394d06d8 place some kernel-specific data structures under #ifdef _KERNEL
Sponsored by:	FreeBSD Foundation
2003-10-03 20:58:56 +00:00
Bruce M Simpson
c3b52d6499 Shorten 'bad gateway' AF_LINK message.
Submitted by:	green
2003-10-03 17:22:14 +00:00
Bruce M Simpson
beb2ced8ac Make arp_rtrequest()'s 'bad gateway' messages slightly more informative,
to aid me in tracking down LLINFO inconsistencies in the routing table.

Discussed with:	fenner
2003-10-03 17:21:17 +00:00
Bruce M Simpson
b75bead1f2 Only delete the route if arplookup() tried to create it. Do not delete
RTF_STATIC routes. Do not check for RTF_HOST so as to avoid being DoSed
when an RTF_GENMASK route exists in the table.

Add a more verbose comment about exactly what this code does.

Submitted by:	ru
2003-10-03 09:19:23 +00:00
Ruslan Ermilov
deb62e2887 By popular demand, added the "static ARP" per-interface option. 2003-10-01 08:32:37 +00:00
Hajimu UMEMOTO
5c6ebad8f6 add /*CONSTCOND*/ to reduce diffs against latest KAME.
Obtained from:	KAME
2003-09-25 13:40:06 +00:00
Bruce M Simpson
85cc199400 Fix a logic error in the check to see if arplookup() should free the route.
Noticed by:	Mike Hogsett
Reviewed by:	ru
2003-09-24 20:52:25 +00:00
Sam Leffler
134ea22494 o update PFIL_HOOKS support to current API used by netbsd
o revamp IPv4+IPv6+bridge usage to match API changes
o remove pfil_head instances from protosw entries (no longer used)
o add locking
o bump FreeBSD version for 3rd party modules

Heavy lifting by:	"Max Laier" <max@love2party.net>
Supported by:		FreeBSD Foundation
Obtained from:		NetBSD (bits of pfil.h and pfil.c)
2003-09-23 17:54:04 +00:00
Bruce M Simpson
fedf1d01a2 Fix a bug in arplookup(), whereby a hostile party on a locally
attached network could exhaust kernel memory, and cause a system
panic, by sending a flood of spoofed ARP requests.

Approved by:	jake (mentor)
Reported by:	Apple Product Security <product-security@apple.com>
2003-09-23 16:39:31 +00:00
Joe Marcus Clarke
68f1756b2a Grrr...add the Skinny alias code forgotten in the last commit. 2003-09-23 07:42:33 +00:00
Joe Marcus Clarke
b07fbc17e9 Add Cisco Skinny Station protocol support to libalias, natd, and ppp.
Skinny is the protocol used by Cisco IP phones to talk to Cisco Call
Managers.  With this code, one can use a Cisco IP phone behind a FreeBSD
NAT gateway.

Currently, having the Call Manager behind the NAT gateway is not supported.
More information on enabling Skinny support in libalias, natd, and ppp
can be found in those applications' manpages.

PR:		55843
Reviewed by:	ru
Approved by:	ru
MFC after:	30 days
2003-09-23 07:41:55 +00:00
Sam Leffler
598345da4b Bandaid locking change: mark static rule mutex recursive so re-entry when
sending an ICMP packet doesn't cause a panic.  A better solution is needed;
possibly defering the transmit to a dedicated thread.

Observed by:	"Aaron Wohl" <freebsd@soith.com>
2003-09-17 22:06:47 +00:00
Sam Leffler
f34f3a7097 shuffle code so we don't "continue" and miss a needed unlock operation
Observed by:	Wiktor Niesiobedzki <w@evip.pl>
2003-09-17 21:13:16 +00:00
Sam Leffler
293941a556 Add locking.
o change timeout to MPSAFE callout
o restructure rule deletion to deal with locking requirements
o replace static buffer used for ipfw control operations with malloc'd storage

Sponsored by:	FreeBSD Foundation
2003-09-17 00:56:50 +00:00
Sam Leffler
91176902bc Minor fixups + add locking.
o change time to MPSAFE callout
o make debug printfs conditional on DUMMYNET_DEBUG and runtime controllable
  by net.inet.ip.dummynet.debug
o make boot-time printf dependent on bootverbose

Sponsored by:	FreeBSD Foundation
2003-09-17 00:54:04 +00:00
Ruslan Ermilov
78f94aa951 Fix a bunch of off-by-one errors in the range checking code. 2003-09-11 21:40:21 +00:00
Ruslan Ermilov
8e75a37bb0 Fixed -Wpointer-arith warning.
Submitted by:	Stefan Farfeleder
PR:		bin/56653
2003-09-09 23:50:57 +00:00
Ruslan Ermilov
fe08efe680 mdoc(7): Use the new feature of the .In macro. 2003-09-08 19:57:22 +00:00
Sam Leffler
468cf6f61a Add locking.
Special thanks to Pavlin Radoslavov <pavlin@icir.org> for testing and
fixing numerous problems.

Sponsored by:	FreeBSD Foundation
Reviewed by:	Pavlin Radoslavov <pavlin@icir.org>
2003-09-06 04:53:43 +00:00
Sam Leffler
2fad1e931e lock ip fragment queues
Submitted by:	Robert Watson <rwatson@freebsd.org>
Obtained from:	BSD/OS
2003-09-05 00:10:33 +00:00
Sam Leffler
26f91065e7 o add locking
o move the global divsrc socket address to a local variable
  instead of locking it

Sponsored by:	FreeBSD Foundation
2003-09-05 00:00:51 +00:00
Bruce M Simpson
8a538743b5 PR: kern/56343
Reviewed by:	tjr
Approved by:	jake (mentor)
2003-09-03 02:19:29 +00:00
Mike Silbersack
3390d47670 Implement MBUF_STRESS_TEST mark II.
Changes from the original implementation:

- Fragmentation is handled by the function m_fragment, which can
be called from whereever fragmentation is needed.  Note that this
function is wrapped in #ifdef MBUF_STRESS_TEST to discourage non-testing
use.

- m_fragment works slightly differently from the old fragmentation
code in that it allocates a seperate mbuf cluster for each fragment.
This defeats dma_map_load_mbuf/buffer's feature of coalescing adjacent
fragments.  While that is a nice feature in practice, it nerfed the
usefulness of mbuf_stress_test.

- Add two modes of random fragmentation.  Chains with fragments all of
the same random length and chains with fragments that are each uniquely
random in length may now be requested.
2003-09-01 05:55:37 +00:00
Sam Leffler
638ed548b7 add locking
NB: There is a known LOR on the forwarding path; this needs to be resolved
    together with a similar issue in the bridge.  For the moment it is
    believed to be benign.

Sponsored by:	FreeBSD Fondation
2003-09-01 05:12:36 +00:00
Sam Leffler
611ceef62a remove warning about use of old divert sockets; this was marked
for removal before 5.2

Reviewed by:	silence on -net and -arch
2003-09-01 04:27:34 +00:00
Sam Leffler
3b6dd5a9d0 add locking
Sponsored by:	FreeBSD Foundation
2003-09-01 04:23:48 +00:00
Robert Watson
f19389746e Remove redundant initialization of rti; SLIST_FOREACH does that for
us.
2003-08-28 22:15:05 +00:00
Robert Watson
6b48911b00 M_PREPEND() with an argument of M_TRYWAIT can fail, meaning the
returned mbuf can be NULL.  Check for NULL in rip_output() when
prepending an IP header.  This prevents mbuf exhaustion from
causing a local kernel panic when sending raw IP packets.

PR:		kern/55886
Reported by:	Pawel Malachowski <pawmal-posting@freebsd.lublin.pl>
MFC after:	3 days
2003-08-26 14:11:48 +00:00
Jeffrey Hsu
578c5e1212 Remove redundant bzero.
Submitted by:	Pavlin Radoslavov <pavlin@icir.org>
2003-08-24 08:27:57 +00:00
Robert Watson
baee0c3e66 Introduce two new MAC Framework and MAC policy entry points:
mac_reflect_mbuf_icmp()
  mac_reflect_mbuf_tcp()

These entry points permit MAC policies to do "update in place"
changes to the labels on ICMP and TCP mbuf headers when an ICMP or
TCP response is generated to a packet outside of the context of
an existing socket.  For example, in respond to a ping or a RST
packet to a SYN on a closed port.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-08-21 18:39:16 +00:00
Robert Watson
b8ecbcd287 Before digging into IGMP locking, do a whitespace and prototype cleanup:
prefer tabs to 8 spaces, focus on consistent indentation, prefer modern
C function prototypes.  Not all the way to style(9), but substantially
closer.
2003-08-20 17:32:17 +00:00
Robert Watson
6c4b2ad305 Move from a custom-crafted singly-linked list to the SLIST_* macros
from queue(3).

Improve vertical compactness by using a IGMP_PRINTF() macro rather
than #ifdefing IGMP_DEBUG a large number of debugging printfs.

Reviewed by:	mdodd (SLIST changes)
2003-08-20 17:09:01 +00:00
Bruce M Simpson
8afa230470 Add the IP_ONESBCAST option, to enable undirected IP broadcasts to be sent on
specific interfaces. This is required by aodvd, and may in future help us
in getting rid of the requirement for BPF from our import of isc-dhcp.

Suggested by:   fenestro
Obtained from:  BSD/OS
Reviewed by:    mini, sam
Approved by:    jake (mentor)
2003-08-20 14:46:40 +00:00
Sam Leffler
c06eb4e293 Change instances of callout_init that specify MPSAFE behaviour to
use CALLOUT_MPSAFE instead of "1" for the second parameter.  This
does not change the behaviour; it just makes the intent more clear.
2003-08-19 17:51:11 +00:00
Jeffrey Hsu
9ba208b413 * Bug fix in bw_meter_process(): the periodically processed bins
of bw_meter entries were processed up to one second ahead.
  After an unappropriate rescheduling of some of the bw_meter
  entries, the upcalls weren't delivered.

* pim_register_prepare() uses the appropriate sw_csum flag to
  call ip_fragment() so the IP checksum is computed properly.

* Modify pim_register_prepare() to take care of IP packets that
  don't need fragmentation.

* Add-back in_delayed_cksum() to encap_send(), because it seems it
  should be there.

Submitted by:	Pavlin Radoslavov <pavlin@icir.org>
2003-08-19 17:22:51 +00:00
Sam Leffler
53b57cd1ab add missing unlock when in_pcballoc returns an error 2003-08-19 17:11:46 +00:00
David E. O'Brien
4f4a104ee8 style.Makefile(5) 2003-08-18 15:25:39 +00:00
Gordon Tetlow
41d8423f71 Stage 3 of dynamic root support. Make all the libraries needed to run
binaries in /bin and /sbin installed in /lib. Only the versioned files
reside in /lib, the .so symlink continues to live /usr/lib so the
toolchain doesn't need to be modified.
2003-08-17 08:28:46 +00:00
Hartmut Brandt
a9ca5bdbd0 The syncache has made use of TCPDEBUG problematic, because the SYN
segments are lost for the application. This broke, for example,
ports/benchmarks/dbs which needs the SYN segment to filter the
contents of the trace buffer for the connection it is interested in.

This patch makes the SYN segments available again. Unfortunately they
are now associated with the listening socket instead of the new one, so
a change to applications is required, but without this patch it wouldn't
work altogether.

PR:		kern/45966
2003-08-13 10:20:57 +00:00
Hartmut Brandt
91f467d592 The tcp_trace call needs the length of the header. Unfortunately the
code has rotten a bit so that the header length is not correct at
the point when tcp_trace is called. Temporarily compute the correct
value before the call and restore the old value after. This makes
ports/benchmarks/dbs to almost work.

This is a NOP unless you compile with TCPDEBUG.
2003-08-13 08:50:42 +00:00
Hartmut Brandt
3c653157a5 A number of patches in the last years have created new return paths
in tcp_input that leave the function before hitting the tcp_trace
function call for the TCPDEBUG option. This has made TCPDEBUG mostly
useless (and tools like ports/benchmarks/dbs not working). Add
tcp_trace calls to the return paths that could be identified in this
maze.

This is a NOP unless you compile with TCPDEBUG.
2003-08-13 08:46:54 +00:00
Hartmut Brandt
b24521d779 Change the code that enables/disables the ATM channel to use the
new ATMIOCOPENVCC/CLOSEVCC. This allows us to not only use UBR channels
for IP over ATM, but also CBR, VBR and ABR. Change the format of the
link layer address to specify the channel characteristics. The old
format is still supported and opens UBR channels.
2003-08-12 14:20:32 +00:00
Jeffrey Hsu
59ca77f4a1 New PIM header files.
Submitted by:	Pavlin Radoslavov <pavlin@icir.org>
2003-08-07 18:17:43 +00:00
Jeffrey Hsu
1e78ac216e 1. Basic PIM kernel support
Disabled by default. To enable it, the new "options PIM" must be
added to the kernel configuration file (in addition to MROUTING):

options	MROUTING		# Multicast routing
options	PIM			# Protocol Independent Multicast

2. Add support for advanced multicast API setup/configuration and
extensibility.

3. Add support for kernel-level PIM Register encapsulation.
Disabled by default.  Can be enabled by the advanced multicast API.

4. Implement a mechanism for "multicast bandwidth monitoring and upcalls".

Submitted by:	Pavlin Radoslavov <pavlin@icir.org>
2003-08-07 18:16:59 +00:00
John Baldwin
8b149b5131 Consistently use the BSD u_int and u_short instead of the SYSV uint and
ushort.  In most of these files, there was a mixture of both styles and
this change just makes them self-consistent.

Requested by:	bde (kern_ktrace.c)
2003-08-07 15:04:27 +00:00
Hartmut Brandt
20e57b1045 Ups. I forgot this one in the SIOCATMENA/SIOCATMDIS removal commit.
This change allows one to specify almost the complete traffic parameters
for IPoverATM channels through the routing table. Up to now we used
4 byte DL addresses (flag, vpi, vciH, vciL). This format is still allowed.
If the address is longer, however, the 5th byte is interpreted as the
traffic class (UBR, CBR, VBR or ABR) and the remaining bytes are the
parameters for this traffic class:

  UBR: 0 byte or 3 byte PCR
  CBR: 3 byte PCR
  VBR: 3 byte PCR, 3 byte SCR, 3 byte MBS
  ABR: 3 byte PCR, 3 byte MCR, 3 byte ICR, 3 byte TBE, 1 byte NRM,
       1 byte TRM, 2 bytes ADTF, 1 byte RIF, 1 byte RDF and 1 byte CDF

A script to generate the corresponding 'route add' arguments will follow soon.
2003-08-06 15:56:37 +00:00
Jeffrey Hsu
1b6002ec30 * makes mfc[MFCTBLSIZ] and vif[MAXVIFS] tables accessible via
sysctl:
  - sysctlbyname("net.inet.ip.mfctable", ...)
  - sysctlbyname("net.inet.ip.viftable", ...)

  This change is needed so netstat can use sysctlbyname() to read
  the data from those tables.
  Otherwise, in some cases "netstat -g" may fail to report the
  multicast forwarding information (e.g., if we run a multicast
  router on PicoBSD).

* Bug fix: when sending IGMPMSG_WRONGVIF upcall to the multicast
  routing daemon, set properly "im->im_vif" to the receiving
  incoming interface of the packet that triggered that upcall
  rather than to the expected incoming interface of that packet.

* Bug fix: add missing increment of counter "mrtstat.mrts_upcalls"

* Few formatting nits (e.g., replace extra spaces with TABs)

Submitted by:	Pavlin Radoslavov <pavlin@icir.org>
2003-08-05 17:01:33 +00:00
Hartmut Brandt
7e3d4432af When adding a channel for INET failed at the device level (ioctl) the
code used to call rtrequest(RTM_DELETE, ...). This is a problem, because
the function that just has called us (route_output)
is not really happy with the route it just is creating beeing ripped out
from under it. Unfortunately we also cannot return an error from
ifa_rtrequest. Therefore mark the route just as RTF_REJECT.
2003-08-05 14:59:06 +00:00
Hartmut Brandt
5246b4ff88 Make this file to conform more to style(9) before really touching it. 2003-08-05 13:58:04 +00:00
Maxim Konovalov
e1bd2f381a o Fix a typo in previous commit. 2003-07-31 10:24:36 +00:00
Maxim Konovalov
853af3f3f0 o Do not overwrite saved interrupt priority level by alloc_hash(),
use a separate variable.
o Restore interrupt priority level before return (no-op in HEAD).

Spotted by:	Don Bowman <don@sandvine.com>
MFC after:	5 days
2003-07-25 09:59:16 +00:00
Sam Leffler
1f76a5e218 add IPSEC_FILTERGIF suport for FAST_IPSEC
PR:		kern/51922
Submitted by:	Eric Masson <e-masson@kisoft-services.com>
MFC after:	1 week
2003-07-22 18:58:34 +00:00
Mike Silbersack
7dc7f0311e Minor fix to the MBUF_STRESS_TEST code so that it keeps
pkthdr.len consistant at all times.  (Some debugging
code I'm working on is tripped otherwise.)

MFC after:	3 days
2003-07-19 05:50:32 +00:00
Robert Watson
83503a9227 Add a comment above rip_ctloutput() documenting that the privilege
check for raw IP system management operations is often (although
not always) implicit due to the namespacing of raw IP sockets.  I.e.,
you have to have privilege to get a raw IP socket, so much of the
management code sitting on raw IP sockets assumes that any requests
on the socket should be granted privilege.

Obtained from:	TrustedBSD Project
Product of:	France
2003-07-18 16:10:36 +00:00
Jeffrey Hsu
a12569ec4f Drop Giant around syncache timer processing. 2003-07-17 11:19:25 +00:00
Luigi Rizzo
4805529cf8 Allow set 31 to be used for rules other than 65535.
Set 31 is still special because rules belonging to it are not deleted
by the "ipfw flush" command, but must be deleted explicitly with
"ipfw delete set 31" or by individual rule numbers.

This implement a flexible form of "persistent rules" which you might
want to have available even after an "ipfw flush".
Note that this change does not violate POLA, because you could not
use set 31 in a ruleset before this change.

sbin/ipfw changes to allow manipulation of set 31 will follow shortly.

Suggested by: Paul Richards
2003-07-15 23:07:34 +00:00
Jeffrey Hsu
9d11646de7 Unify the "send high" and "recover" variables as specified in the
lastest rev of the spec.  Use an explicit flag for Fast Recovery. [1]

Fix bug with exiting Fast Recovery on a retransmit timeout
diagnosed by Lu Guohan. [2]

Reviewed by:		Thomas Henderson <thomas.r.henderson@boeing.com>
Reported and tested by:	Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2]
Approved by:		Thomas Henderson <thomas.r.henderson@boeing.com>,
			Sally Floyd <floyd@acm.org> [1]
2003-07-15 21:49:53 +00:00
Luigi Rizzo
72e02d4dac Implement comments embedded into ipfw2 instructions.
Since we already had 'O_NOP' instructions which always match, all
I needed to do is allow the NOP command to have arbitrary length
(i.e. move its label in a different part of the switch() which
validates instructions).

The kernel must know nothing about comments, everything else is
done in userland (which will be described in the upcoming ipfw2.c
commit).
2003-07-12 05:54:17 +00:00
Luigi Rizzo
7a1dfbc0d3 Merge the handlers of O_IP_SRC_MASK and O_IP_DST_MASK opcodes, and
support matching a list of addr/mask pairs so one can write
more efficient rulesets which were not possible before e.g.

    add 100 skipto 1000 not src-ip 10.0.0.0/8,127.0.0.1/8,192.168.0.0/16

The change is fully backward compatible.
ipfw2 and manpage commit to follow.

MFC after: 3 days
2003-07-08 07:44:42 +00:00
Luigi Rizzo
c3e5b9f154 Implement the 'ipsec' option to match packets coming out of an ipsec tunnel.
Should work with both regular and fast ipsec (mutually exclusive).
See manpage for more details.

Submitted by: Ari Suutari (ari.suutari@syncrontech.com)
Revised by: sam
MFC after: 1 week
2003-07-04 21:42:32 +00:00
Luigi Rizzo
f030c1518d Correct some comments, add opcode O_IPSEC to match packets
coming out of an ipsec tunnel.
2003-07-04 21:39:51 +00:00
Luigi Rizzo
5d3b4c2480 Remove a stale comment, fix indentation. 2003-06-28 14:23:22 +00:00
Luigi Rizzo
b5f3c4cff3 whitespace fix 2003-06-28 14:16:53 +00:00
Luigi Rizzo
9c1cfc8650 remove unused file (ipfw2 is the default in RELENG_5 and above; the old
ipfw1 has been unused and unmaintained for a long time).
2003-06-24 07:12:11 +00:00
Luigi Rizzo
ec4270c021 Fix typo in a (commented out) debugging string.
Spotted by: diff
2003-06-23 21:38:21 +00:00
Luigi Rizzo
67ab48d1ae Remove whitespace at end of line. 2003-06-23 21:18:56 +00:00
Luigi Rizzo
44c884e134 Add support for multiple values and ranges for the "iplen", "ipttl",
"ipid" options. This feature has been requested by several users.
On passing, fix some minor bugs in the parser.  This change is fully
backward compatible so if you have an old /sbin/ipfw and a new
kernel you are not in trouble (but you need to update /sbin/ipfw
if you want to use the new features).

Document the changes in the manpage.

Now you can write things like

	ipfw add skipto 1000 iplen 0-500

which some people were asking to give preferential treatment to
short packets.

The 'MFC after' is just set as a reminder, because I still need
to merge the Alpha/Sparc64 fixes for ipfw2 (which unfortunately
change the size of certain kernel structures; not that it matters
a lot since ipfw2 is entirely optional and not the default...)

PR: bin/48015

MFC after: 1 week
2003-06-22 17:33:19 +00:00
Mike Silbersack
fcaf9f9146 Map icmp time exceeded responses to EHOSTUNREACH rather than 0 (no error);
this makes connect act more sensibly in these cases.

PR:				50839
Submitted by:			Barney Wolff <barney@pit.databus.com>
Patch delayed by laziness of:	silby
MFC after:			1 week
2003-06-17 06:21:08 +00:00
Ruslan Ermilov
ada24e690c In the PKT_ALIAS_PROXY_ONLY mode, make sure to preserve the
original source IP address, as promised in the manual page.

Spotted by:	Vaclav Petricek
2003-06-13 21:54:01 +00:00
Ruslan Ermilov
9c88dc8855 Removed a couple of .Xo/.Xc that are leftovers of the "ninth-argument
limit" mdoc(7) atavism.
2003-06-13 21:39:22 +00:00
Ruslan Ermilov
7176089886 Clarify that original address and port when doing transparent proxying
are _destination_ address and port.
2003-06-13 21:36:24 +00:00
Ruslan Ermilov
61de149d30 Added myself to the AUTHORS section. 2003-06-13 21:32:01 +00:00
Philippe Charnier
9703a107f2 The .Fn function 2003-06-08 09:53:08 +00:00
Robert Watson
042bbfa3b5 When setting fragment queue pointers to NULL, or comparing them with
NULL, use NULL rather than 0 to improve readability.
2003-06-06 19:32:48 +00:00
Jeffrey Hsu
f058535deb Compensate for decreasing the minimum retransmit timeout.
Reviewed by:	jlemon
2003-06-04 10:03:55 +00:00
Bernd Walter
330462a315 Change handling to support strong alignment architectures such as alpha and
sparc64.

PR:		alpha/50658
Submitted by:	rizzo
Tested on:	alpha
2003-06-04 01:17:37 +00:00
Kelly Yancey
ed7ea0e1ab Account for packets processed at layer-2 (i.e. net.link.ether.ipfw=1).
MFC after:	2 weeks
2003-06-02 23:54:09 +00:00
Ruslan Ermilov
234dfc904a A new API function PacketAliasRedirectDynamic() can be used
to mark a fully specified static link as dynamic; i.e. make
it a one-time link.
2003-06-01 23:15:00 +00:00
Ruslan Ermilov
f1a529f3da Make the PacketAliasSetAddress() function call optional. If it
is not called, and no static rules match an outgoing packet, the
latter retains its source IP address.  This is in support of the
"static NAT only" mode.
2003-06-01 22:49:59 +00:00
Poul-Henning Kamp
4df05d61bd Remove unused variables.
Found by:       FlexeLint
2003-06-01 09:20:38 +00:00
Poul-Henning Kamp
e4d2978dd8 Add /* FALLTHROUGH */
Found by:       FlexeLint
2003-05-31 19:07:22 +00:00
Garrett Wollman
6e49b1fe55 Don't generate an ip_id for packets with the DF bit set; ip_id is
only meaningful for fragments.  Also don't bother to byte-swap the
ip_id when we do generate it; it is only used at the receiver as a
nonce.  I tried several different permutations of this code with no
measurable difference to each other or to the unmodified version, so
I've settled on the one for which gcc seems to generate the best code.
(If anyone cares to microoptimize this differently for an architecture
where it actually matters, feel free.)

Suggested by:	Steve Bellovin's paper in IMW'02
2003-05-31 17:55:21 +00:00
Robert Watson
430c635447 Correct a bug introduced with reduced TCP state handling; make
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.

Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works arounds this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this provided to
complicate things too much.

Approved by:	re (scottl)
Reviewed by:	hsu
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-05-07 05:26:27 +00:00
Robert Watson
688fe1d954 Trim a call to mac_create_mbuf_from_mbuf() since m_tag meta-data
copying for mbuf headers now works properly in m_dup_pkthdr(), so
we don't need to do an explicit copy.

Approved by:	re (jhb)
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-05-06 20:34:04 +00:00
Matthew N. Dodd
e97c58c8cf Add definitions for IN6ADDR_LINKLOCAL_ALLMDNS_INIT and INADDR_ALLMDNS_GROUP. 2003-04-29 22:03:46 +00:00
Matthew N. Dodd
4957466b8e IP_RECVTTL socket option.
Reviewed by:	Stuart Cheshire <cheshire@apple.com>
2003-04-29 21:36:18 +00:00
Alexander Kabaev
104a9b7e3e Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on:	standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
2003-04-29 13:36:06 +00:00
David E. O'Brien
152385d122 Explicitly declare 'int' parameters. 2003-04-21 16:27:46 +00:00
David E. O'Brien
bfd738788b style.Makefile(5) 2003-04-20 18:38:59 +00:00
Mike Silbersack
53dcc544a8 Rename MBUF_FRAG_TEST to MBUF_STRESS_TEST as it will be extended
to include more than just frag tests.
2003-04-12 06:11:46 +00:00
Robert Watson
cacd79e2c9 Remove a potential panic condition introduced by reduced TCP wait
state.  Those changed attempted to work around the changed invariant
that inp->in_socket was sometimes now NULL, but the logic wasn't
quite right, meaning that inp->in_socket would be dereferenced by
cr_canseesocket() if security.bsd.see_other_uids, jail, or MAC
were in use.  Attempt to clarify and correct the logic.

Note: the work-around originally introduced with the reduced TCP
wait state handling to use cr_cansee() instead of cr_canseesocket()
in this case isn't really right, although it "Does the right thing"
for most of the cases in the base system.  We'll need to address
this at some point in the future.

Pointed out by:	dcs
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-04-10 20:33:10 +00:00
Dag-Erling Smørgrav
fe58453891 Introduce an M_ASSERTPKTHDR() macro which performs the very common task
of asserting that an mbuf has a packet header.  Use it instead of hand-
rolled versions wherever applicable.

Submitted by:	Hiten Pandya <hiten@unixdaemons.com>
2003-04-08 14:25:47 +00:00
Dag-Erling Smørgrav
212059bd83 Replace memcpy() and ovbcopy() with bcopy(); ditch some caddr_t usage. 2003-04-04 12:14:00 +00:00
Matthew N. Dodd
2c56e246fa Back out support for RFC3514.
RFC3514 poses an unacceptale risk to compliant systems.
2003-04-02 20:14:44 +00:00
Matthew N. Dodd
4f6425f7ae - Use the correct constant define.
- Add a missing break.
2003-04-02 18:02:58 +00:00
Matthew N. Dodd
8faf6df9b3 Sync constant define with NetBSD.
Requested by:	 Tom Spindler <dogcow@babymeat.com>
2003-04-02 10:28:47 +00:00
Jeffrey Hsu
48d2549c3e Observe conservation of packets when entering Fast Recovery while
doing Limited Transmit.  Only artificially inflate the congestion
window by 1 segment instead of the usual 3 to take into account
the 2 already sent by Limited Transmit.

Approved in principle by:	Mark Allman <mallman@grc.nasa.gov>,
Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>
2003-04-01 21:16:46 +00:00
Matthew N. Dodd
09139a4537 Implement support for RFC 3514 (The Security Flag in the IPv4 Header).
(See: ftp://ftp.rfc-editor.org/in-notes/rfc3514.txt)

This fulfills the host requirements for userland support by
way of the setsockopt() IP_EVIL_INTENT message.

There are three sysctl tunables provided to govern system behavior.

	net.inet.ip.rfc3514:

		Enables support for rfc3514.  As this is an
		Informational RFC and support is not yet widespread
		this option is disabled by default.

	net.inet.ip.hear_no_evil

		 If set the host will discard all received evil packets.

	net.inet.ip.speak_no_evil

		If set the host will discard all transmitted evil packets.

The IP statistics counter 'ips_evil' (available via 'netstat') provides
information on the number of 'evil' packets recieved.

For reference, the '-E' option to 'ping' has been provided to demonstrate
and test the implementation.
2003-04-01 08:21:44 +00:00
Maxim Konovalov
7778283b40 Fix indentation. 2003-03-27 15:00:10 +00:00
Maxim Konovalov
be1e4c5162 o Protect set_fs_param() by splimp(9).
Quote from kern/37573:

	There is an obvious race in netinet/ip_dummynet.c:config_pipe().
	Interrupts are not blocked when changing the params of an
	existing pipe.  The specific crash observed:

	... -> config_pipe -> set_fs_parms -> config_red

	malloc a new w_q_lookup table but take an interrupt before
	intializing it, interrupt handler does:

	... -> dummynet_io -> red_drops

	red_drops dereferences the uninitialized (zeroed) w_q_lookup
	table.

o Flush accumulated credits for idle pipes.
o Flush accumulated credits when change pipe characteristics.
o Change dn_flow_queue.numbytes type to unsigned long.

	Overlapping dn_flow_queue->numbytes in ready_event() leads to
	numbytes becomes negative and SET_TICKS() macro returns a very
	big value.  heap_insert() overlaps dn_key again and inserts a
	queue to a ready heap with a sched_time points to the past.
	That leads to an "infinity" loop.

PR:		kern/33234, kern/37573, misc/42459, kern/43133,
		kern/44045, kern/48099
Submitted by:	Mike Hibler <mike@cs.utah.edu> (kern/37573)
MFC after:	6 weeks
2003-03-27 14:56:36 +00:00
Robert Watson
5e7ce4785f Modify the mac_init_ipq() MAC Framework entry point to accept an
additional flags argument to indicate blocking disposition, and
pass in M_NOWAIT from the IP reassembly code to indicate that
blocking is not OK when labeling a new IP fragment reassembly
queue.  This should eliminate some of the WITNESS warnings that
have started popping up since fine-grained IP stack locking
started going in; if memory allocation fails, the creation of
the fragment queue will be aborted.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-03-26 15:12:03 +00:00
Maxime Henrion
511e01e2d6 Try to make the MBUF_FRAG_TEST code work better.
- Don't try to fragment the packet if it's smaller than mbuf_frag_size.
- Preserve the size of the mbuf chain which is modified by m_split().
- Check that m_split() didn't return NULL.
- Make it so we don't end up with two M_PKTHDR mbuf in the chain.
- Use m->m_pkthdr.len instead of m->m_len so that we fragment the whole
  chain and not just the first mbuf.
- Fix a nearby style bug and rework the logic of the loops so that it's
  more clear.

This is still not quite right, because we're clearly abusing m_split() to
do something it was not designed for, but at least it works now.  We
should probably move this code into a m_fragment() function when it's
correct.
2003-03-25 23:49:14 +00:00
Mike Silbersack
9d9edc5693 Add the MBUF_FRAG_TEST option. When compiled in, this option
allows you to tell ip_output to fragment all outgoing packets
into mbuf fragments of size net.inet.ip.mbuf_frag_size bytes.
This is an excellent way to test if network drivers can properly
handle long mbuf chains being passed to them.

net.inet.ip.mbuf_frag_size defaults to 0 (no fragmentation)
so that you can at least boot before your network driver dies. :)
2003-03-25 05:45:05 +00:00
Maxime Henrion
aecfcdb824 Use __packed instead of __attribute__((__packed__)). 2003-03-22 00:25:14 +00:00
Matthew N. Dodd
57842a38fd Add a sysctl node allowing the specification of an address mask to use
when replying to ICMP Address Mask Request packets.
2003-03-21 15:43:06 +00:00
Matthew N. Dodd
21150298bb Add comments regarding the ICMP timestamp fields. 2003-03-21 15:28:10 +00:00
Crist J. Clark
010dabb047 Add a 'verrevpath' option that verifies the interface that a packet
comes in on is the same interface that we would route out of to get to
the packet's source address. Essentially automates an anti-spoofing
check using the information in the routing table.

Experimental. The usage and rule format for the feature may still be
subject to change.
2003-03-15 01:13:00 +00:00
Jeffrey Hsu
7792ea2700 Greatly simplify the unlocking logic by holding the TCP protocol lock until
after FIN_WAIT_2 processing.

Helped with debugging:	Doug Barton
2003-03-13 11:46:57 +00:00
Jeffrey Hsu
da3a8a1a4f Add support for RFC 3390, which allows for a variable-sized
initial congestion window.
2003-03-13 01:43:45 +00:00
Jeffrey Hsu
582a954b00 Implement the Limited Transmit algorithm (RFC 3042). 2003-03-12 20:27:28 +00:00
Sam Leffler
4a692a1fc2 correct two more flag misuses; m_tag* use malloc flags 2003-03-12 14:45:22 +00:00