This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.
* Fix a typo in a comment in mbuf.h
* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have became a big problem.
messages send by routers when they deny our traffic, this causes
a timeout when trying to connect to TCP ports/services on a remote
host, which is blocked by routers or firewalls.
rfc1122 (Requirements for Internet Hosts) section 3.2.2.1 actually
requi re that we treat such a message for a TCP session, that we
treat it like if we had recieved a RST.
quote begin.
A Destination Unreachable message that is received MUST be
reported to the transport layer. The transport layer SHOULD
use the information appropriately; for example, see Sections
4.1.3.3, 4.2.3.9, and 4.2.4 below. A transport protocol
that has its own mechanism for notifying the sender that a
port is unreachable (e.g., TCP, which sends RST segments)
MUST nevertheless accept an ICMP Port Unreachable for the
same purpose.
quote end.
I've written a small extension that implement this, it also create
a sysctl "net.inet.tcp.icmp_admin_prohib_like_rst" to control if
this new behaviour is activated.
When it's activated (set to 1) we'll treat a ICMP administratively
prohibited message (icmp type 3 code 9, 10 and 13) for a TCP
sessions, as if we recived a TCP RST, but only if the TCP session
is in SYN_SENT state.
The reason for only reacting when in SYN_SENT state, is that this
will solve the problem, and at the same time minimize the risk of
this being abused.
I suggest that we enable this new behaviour by default, but it
would be a change of current behaviour, so if people prefer to
leave it disabled by default, at least for now, this would be ok
for me, the attached diff actually have the sysctl set to 0 by
default.
PR: 23086
Submitted by: Jesper Skriver <jesper@skriver.dk>
1. ICMP ECHO and TSTAMP replies are now rate limited.
2. RSTs generated due to packets sent to open and unopen ports
are now limited by seperate counters.
3. Each rate limiting queue now has its own description, as
follows:
Limiting icmp unreach response from 439 to 200 packets per second
Limiting closed port RST response from 283 to 200 packets per second
Limiting open port RST response from 18724 to 200 packets per second
Limiting icmp ping response from 211 to 200 packets per second
Limiting icmp tstamp response from 394 to 200 packets per second
Submitted by: Mike Silbersack <silby@silby.com>
before adding/removing packets from the queue. Also, the if_obytes and
if_omcasts fields should only be manipulated under protection of the mutex.
IF_ENQUEUE, IF_PREPEND, and IF_DEQUEUE perform all necessary locking on
the queue. An IF_LOCK macro is provided, as well as the old (mutex-less)
versions of the macros in the form _IF_ENQUEUE, _IF_QFULL, for code which
needs them, but their use is discouraged.
Two new macros are introduced: IF_DRAIN() to drain a queue, and IF_HANDOFF,
which takes care of locking/enqueue, and also statistics updating/start
if necessary.
* Some dummynet code incorrectly handled a malloc()-allocated pseudo-mbuf
header structure, called "pkt," and could consequently pollute the mbuf
free list if it was ever passed to m_freem(). The fix involved passing not
pkt, but essentially pkt->m_next (which is a real mbuf) to the mbuf
utility routines.
* Also, for dummynet, in bdg_forward(), made the code copy the ethernet header
back into the mbuf (prepended) because the dummynet code that follows expects
it to be there but it is, unfortunately for dummynet, passed to bdg_forward
as a seperate argument.
PRs: kern/19551 ; misc/21534 ; kern/23010
Submitted by: Thomas Moestl <tmoestl@gmx.net>
Reviewed by: bmilekic
Approved by: luigi
instead.
Also, fix a small set of "avail." If we're setting `avail,' we shouldn't
be re-checking whether m_flags is M_EXT, because we know that it is, as if
it wasn't, we would have already returned several lines above.
Reviewed by: jlemon
only be checked if the system is currently performing New Reno style
fast recovery. However, this value was being checked regardless of the
NR state, with the end result being that the congestion window was never
opened.
Change the logic to check t_dupack instead; the only code path that
allows it to be nonzero at this point is NewReno, so if it is nonzero,
we are in fast recovery mode and should not touch the congestion window.
Tested by: phk
PPTP links are no longer dropped by simple (and inappropriate in this
case) "inactivity timeout" procedure, only when requested through the
control connection.
It is now possible to have multiple PPTP servers running behind NAT.
Just redirect the incoming TCP traffic to port 1723, everything else
is done transparently.
Problems were reported and the fix was tested by:
Michael Adler <Michael.Adler@compaq.com>,
David Andersen <dga@lcs.mit.edu>
<sys/proc.h> to <sys/systm.h>.
Correctly document the #includes needed in the manpage.
Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.
because it only takes a struct tag which makes it impossible to
use unions, typedefs etc.
Define __offsetof() in <machine/ansi.h>
Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h>
Remove myriad of local offsetof() definitions.
Remove includes of <stddef.h> in kernel code.
NB: Kernelcode should *never* include from /usr/include !
Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API.
Deprecate <struct.h> with a warning. The warning turns into an error on
01-12-2000 and the file gets removed entirely on 01-01-2001.
Paritials reviews by: various.
Significant brucifications by: bde
of IP datagram. This fixes the problem when firewall denied fragmented
packets whose last fragment was less than minimum protocol header size.
Found by: Harti Brandt <brandt@fokus.gmd.de>
PR: kern/22309
in the code enforces this. So, do not check for and attempt a
false reassembly if only IP_RF is set.
Also, removed the dead code, since we no longer use dtom() on
return from ip_reass().
statistics on a per network address basis.
Teach the IPv4 and IPv6 input/output routines to log packets/bytes
against the network address connected to the flow.
Teach netstat to display the per-address stats for IP protocols
when 'netstat -i' is evoked, instead of displaying the per-interface
stats.
and instead reapply the revision 1.49 of mbuf.h, i.e.
Fixed regression of the type of the `header' member of struct pkthdr from
`void *' to caddr_t in rev.1.51. This mainly caused an annoying warning
for compiling ip_input.c.
Requested by: bde
enough into the mbuf data area. Solve this problem once and for all
by pulling up the entire (standard) header for TCP and UDP, and four
bytes of header for ICMP (enough for type, code and cksum fields).
the IP_FW_IF_IPID rule. (We have recently decided to keep the
ip_id field in network byte order inside the kernel, see revision
1.140 of src/sys/netinet/ip_input.c).
I did not like to have the conversion happen in userland, and I
think that the similar conversions for fw_tcp(seq|ack|win) should
be moved out of userland (src/sbin/ipfw/ipfw.c) into the kernel.
in favor of the new-style per-vif socket.
this does not affect the behavior of the ISI rsvpd but allows
another rsvp implementation (e.g., KOM rsvp) to take advantage
of the new style for particular sockets while using the old style
for others.
in the future, rsvp supporn should be replaced by more generic
router-alert support.
PR: kern/20984
Submitted by: Martin Karsten <Martin.Karsten@KOM.tu-darmstadt.de>
Reviewed by: kjc
but have a network interrupt arrive and deactivate the timeout before
the callout routine runs. Check for this case in the callout routine;
it should only run if the callout is active and not on the wheel.
It causes a panic when/if snd_una is incremented elsewhere (this
is a conservative change, because originally no rollback occurred
for any packets at all).
Submitted by: Vivek Sadananda Pai <vivek@imimic.com>
Update copyrights.
Introduce a new sysctl node:
net.inet.accf
Although acceptfilters need refcounting to be properly (safely) unloaded
as a temporary hack allow them to be unloaded if the sysctl
net.inet.accf.unloadable is set, this is really for developers who want
to work on thier own filters.
A near complete re-write of the accf_http filter:
1) Parse check if the request is HTTP/1.0 or HTTP/1.1 if not dump
to the application.
Because of the performance implications of this there is a sysctl
'net.inet.accf.http.parsehttpversion' that when set to non-zero
parses the HTTP version.
The default is to parse the version.
2) Check if a socket has filled and dump to the listener
3) optimize the way that mbuf boundries are handled using some voodoo
4) even though you'd expect accept filters to only be used on TCP
connections that don't use m_nextpkt I've fixed the accept filter
for socket connections that use this.
This rewrite of accf_http should allow someone to use them and maintain
full HTTP compliance as long as net.inet.accf.http.parsehttpversion is
set.
for them does not belong in the IP_FW_F_COMMAND switch, that mask doesn't even
apply to them(!).
2. You cannot add a uid/gid rule to something that isn't TCP, UDP, or IP.
XXX - this should be handled in ipfw(8) as well (for more diagnostic output),
but this at least protects bogus rules from being added.
Pointy hat: green
datagram embedded into ICMP error message, not with protocol
field of ICMP message itself (which is always IPPROTO_ICMP).
Pointed by: Erik Salander <erik@whistle.com>
fields between host and network byte order. The details:
o icmp_error() now does not add IP header length. This fixes the problem
when icmp_error() is called from ip_forward(). In this case the ip_len
of the original IP datagram returned with ICMP error was wrong.
o icmp_error() expects all three fields, ip_len, ip_id and ip_off in host
byte order, so DTRT and convert these fields back to network byte order
before sending a message. This fixes the problem described in PR 16240
and PR 20877 (ip_id field was returned in host byte order).
o ip_ttl decrement operation in ip_forward() was moved down to make sure
that it does not corrupt the copy of original IP datagram passed later
to icmp_error().
o A copy of original IP datagram in ip_forward() was made a read-write,
independent copy. This fixes the problem I first reported to Garrett
Wollman and Bill Fenner and later put in audit trail of PR 16240:
ip_output() (not always) converts fields of original datagram to network
byte order, but because copy (mcopy) and its original (m) most likely
share the same mbuf cluster, ip_output()'s manipulations on original
also corrupted the copy.
o ip_output() now expects all three fields, ip_len, ip_off and (what is
significant) ip_id in host byte order. It was a headache for years that
ip_id was handled differently. The only compatibility issue here is the
raw IP socket interface with IP_HDRINCL socket option set and a non-zero
ip_id field, but ip.4 manual page was unclear on whether in this case
ip_id field should be in host or network byte order.
not alias `ip_src' unless it comes from the host an original
datagram that triggered this error message was destined for.
PR: 20712
Reviewed by: brian, Charles Mott <cmott@scientech.com>
When this happens, we know for sure that the packet data was not
received by the peer. Therefore, back out any advancing of the
transmit sequence number so that we send the same data the next
time we transmit a packet, avoiding a guaranteed missed packet and
its resulting TCP transmit slowdown.
In most systems ip_output() probably never returns an error, and
so this problem is never seen. However, it is more likely to occur
with device drivers having short output queues (causing ENOBUFS to
be returned when they are full), not to mention low memory situations.
Moreover, because of this problem writers of slow devices were
required to make an unfortunate choice between (a) having a relatively
short output queue (with low latency but low TCP bandwidth because
of this problem) or (b) a long output queue (with high latency and
high TCP bandwidth). In my particular application (ISDN) it took
an output queue equal to ~5 seconds of transmission to avoid ENOBUFS.
A more reasonable output queue of 0.5 seconds resulted in only about
50% TCP throughput. With this patch full throughput was restored in
the latter case.
Reviewed by: freebsd-net
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.
Reviewed by: Peter Wemm,asmodai
reply if the requesting machine isn't on the interface we believe
it should be. Prevents arp wars when you plug cables in the wrong
way around.
PR: 9848
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
Not objected to by: wollman
- Multiple PPTP clients behind NAT to the same or different servers.
- Single PPTP server behind NAT -- you just need to redirect TCP
port 1723 to a local machine. Multiple servers behind NAT is
possible but would require a simple API change.
- No API changes!
For more information on how this works see comments at the start of
the alias_pptp.c.
PacketAliasPptp() is no longer necessary and will be removed soon.
Submitted by: Erik Salander <erik@whistle.com>
Reviewed by: ru
Rewritten by: ru
Reviewed by: Erik Salander <erik@whistle.com>
accept filters are now loadable as well as able to be compiled into
the kernel.
two accept filters are provided, one that returns sockets when data
arrives the other when an http request is completed (doesn't work
with 0.9 requests)
Reviewed by: jmg
It does mean that it is now possible to run passive-mode FTP
server behind NAT.
- SECURITY: FTP aliasing engine now ensures that:
o the segment preceding a PORT/227 segment terminates with a \r\n;
o the IP address in the PORT/227 matches the source IP address of
the packet;
o the port number in the PORT command or 277 reply is greater than
or equal to 1024.
Submitted by: Erik Salander <erik@whistle.com>
Reviewed by: ru
It also squashes 99% of packet kiddie synflood orgies. For example, to
rate syn packets without MSS,
ipfw pipe 10 config 56Kbit/s queue 10Packets
ipfw add pipe 10 tcp from any to any in setup tcpoptions !mss
Submitted by: Richard A. Steenbergen <ras@e-gerbil.net>
a mbuf, it may return without setting any timers. If no more data is
scheduled to be transmitted (this was a FIN) the system will sit in
LAST_ACK state forever.
Thus, when mbuf allocation fails, set the retransmit timer if neither
the retransmit or persist timer is already pending.
Problem discovered by: Mike Silbersack (silby@silby.com)
Pushed for a fix by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: jayanth
down as a result of a reset. Returning EINVAL in that case makes no
sense at all and just confuses people as to what happened. It could be
argued that we should save the original address somewhere so that
getsockname() etc can tell us what it used to be so we know where the
problem connection attempts are coming from.
integer expression. Otherwise the sizeof() call will force the expression
to be evaluated as unsigned, which is not the intended behavior.
Obtained from: NetBSD (in a different form)
code retransmitting data from the wrong offset.
As a footnote, the newreno code was partially derived from NetBSD
and Tom Henderson <tomh@cs.berkeley.edu>
of the individual drivers and into the common routine ether_input().
Also, remove the (incomplete) hack for matching ethernet headers
in the ip_fw code.
The good news: net result of 1016 lines removed, and this should make
bridging now work with *all* Ethernet drivers.
The bad news: it's nearly impossible to test every driver, especially
for bridging, and I was unable to get much testing help on the mailing
lists.
Reviewed by: freebsd-net
better recovery for multiple packet losses in a single window.
The algorithm can be toggled via the sysctl net.inet.tcp.newreno,
which defaults to "on".
Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>
calling in_pcbbind so that in_pcbbind sees a valid address if no
address was specified (since divert sockets ignore them).
PR: 17552
Reviewed by: Brian
to PPTP) with more generic PacketAliasRedirectProto().
Major number is not bumped because it is believed that noone
has started using PacketAliasRedirectPptp() yet.
LSNAT links are first created by either PacketAliasRedirectPort() or
PacketAliasRedirectAddress() and then set up by one or more calls to
PacketAliasAddServer().
Without this fix, all IPv6 TCP RST packet has wrong cksum value,
so IPv6 connect() trial to 5.0 machine won't fail until tcp connect timeout,
when they should fail soon.
Thanks to haro@tk.kubota.co.jp (Munehiro Matsuda) for his much debugging
help and detailed info.
connections, after SYN packets were seen from both ends. Before this,
it would get applied right after the first SYN packet was seen (either
from client or server). With broken TCP connection attempts, when the
remote end does not respond with SYNACK nor with RST, this resulted in
having a useless (ie, no actual TCP connection associated with it) TCP
link with 86400 seconds TTL, wasting system memory. With high rate of
such broken connection attempts (for example, remote end simply blocks
these connection attempts with ipfw(8) without sending RST back), this
could result in a denial-of-service.
PR: bin/17963
but with `dst_port' work for outgoing packets.
This case was not handled properly when I first fixed this
in revision 1.17.
This change is also required for the upcoming improved PPTP
support patches -- that is how I found the problem.
Before this change:
# natd -v -a aliasIP \
-redirect_port tcp localIP:localPORT publicIP:publicPORT 0:remotePORT
Out [TCP] [TCP] localIP:localPORT -> remoteIP:remotePORT aliased to
[TCP] aliasIP:localPORT -> remoteIP:remotePORT
After this change:
# natd -v -a aliasIP \
-redirect_port tcp localIP:localPORT publicIP:publicPORT 0:remotePORT
Out [TCP] [TCP] localIP:localPORT -> remoteIP:remotePORT aliased to
[TCP] publicIP:publicPORT -> remoteIP:remotePORT
INADDR_NONE: Incoming packets go to the alias address (the default)
INADDR_ANY: Incoming packets are not NAT'd (direct access to the
internal network from outside)
anything else: Incoming packets go to the specified address
Change a few inaddr::s_addr == 0 to inaddr::s_addr == INADDR_ANY
while I'm there.
redirected and when no target address has been specified, NAT
the destination address to the alias address rather than
allowing people direct access to your internal network from
outside.
mbuf is marked for delayed checksums, then additionally mark the
packet as having it's checksums computed. This allows us to bypass
computing/checking the checksum entirely, which isn't really needeed
as the packet has never hit the wire.
Reviewed by: green
Reported in Usenet by: locke@mcs.net (Peter Johnson)
While i was at it, prepended a 0x to the %D output, to make it clear that
the printed value is in hex (i assume %D has been chosen over %#x to
obey network byte order).
improperly doing the equivalent of (m = (function() == NULL)) instead
of ((m = function()) == NULL).
This fixes a NULL pointer dereference panic with runt arp packets.
Remove a bogus (redundant, just weird, etc.) key_freeso(so).
There are no consumers of it now, nor does it seem there
ever will be.
in6?_pcb.c:
Add an if (inp->in6?p_sp != NULL) before the call to
ipsec[46]_delete_pcbpolicy(inp). In low-memory conditions
this can cause a crash because in6?_sp can be NULL...
from iso88025.h.
o Add minimal llc support to iso88025_input.
o Clean up most of the source routing code.
* Submitted by: Nikolai Saoukh <nms@otdel-1.org>
Now most big problem of IPv6 is getting IPv6 address
assignment.
6to4 solve the problem. 6to4 addr is defined like below,
2002: 4byte v4 addr : 2byte SLA ID : 8byte interface ID
The most important point of the address format is that an IPv4 addr
is embeded in it. So any user who has IPv4 addr can get IPv6 address
block with 2byte subnet space. Also, the IPv4 addr is used for
semi-automatic IPv6 over IPv4 tunneling.
With 6to4, getting IPv6 addr become dramatically easy.
The attached patch enable 6to4 extension, and confirmed to work,
between "Richard Seaman, Jr." <dick@tar.com> and me.
Approved by: jkh
Reviewed by: itojun
ARP packets. This can incorrectly reject complete frames since the frame
could be stored in more than one mbuf.
The following patches fix the length comparisson, and add several
diagnostic log messages to the interrupt handler for out-of-the-norm ARP
packets. This should make ARP problems easier to detect, diagnose and
fix.
Submitted by: C. Stephen Gunn <csg@waterspout.com>
Approved by: jkh
Reviewed by: rwatson
Without this, kernel will panic at getsockopt() of IPSEC_POLICY.
Also make compilable libipsec/test-policy.c which tries getsockopt() of
IPSEC_POLICY.
Approved by: jkh
Submitted by: sakane@kame.net
KAME put INET6 related stuff into sys/netinet6 dir, but IPv6
standard API(RFC2553) require following files to be under sys/netinet.
netinet/ip6.h
netinet/icmp6.h
Now those header files just include each following files.
netinet6/ip6.h
netinet6/icmp6.h
Also KAME has netinet6/in6.h for easy INET6 common defs
sharing between different BSDs, but RFC2553 requires only
netinet/in.h should be included from userland.
So netinet/in.h also includes netinet6/in6.h inside.
To keep apps portability, apps should not directly include
above files from netinet6 dir.
Ideally, all contents of,
netinet6/ip6.h
netinet6/icmp6.h
netinet6/in6.h
should be moved into
netinet/ip6.h
netinet/icmp6.h
netinet/in.h
but to avoid big changes in this stage, add some hack, that
-Put some special macro define into those files under neitnet
-Let files under netinet6 cause error if it is included
from some apps, and, if the specifal macro define is not
defined.
(which should have been defined if files under netinet is
included)
-And let them print an error message which tells the
correct name of the include file to be included.
Also fix apps which includes invalid header files.
Approved by: jkh
Obtained from: KAME project
at the same time.
When rfc1323 and rfc1644 option are enabled by sysctl,
and tcp over IPv6 is tried, kernel panic happens by the
following check in tcp_output(), because now hdrlen is bigger
in such case than before.
/*#ifdef DIAGNOSTIC*/
if (max_linkhdr + hdrlen > MHLEN)
panic("tcphdr too big");
/*#endif*/
So change the above check to compare with MCLBYTES in #ifdef INET6 case.
Also, allocate a mbuf cluster for the header mbuf, in that case.
Bug reported at KAME environment.
Approved by: jkh
Reviewed by: sumikawa
Obtained from: KAME project
This is fix to usr.sbin/trpt and tcp_debug.[ch]
I think of putting this after 4.0 but,,,
-There was bug that when INET6 is defined,
IPv4 socket is not traced by trpt.
-I received request from a person who distribute a program
which use tcp_debug interface and print performance statistics,
that
-leave comptibility with old program as much as possible
-use same interface with other OSes
So, I talked with itojun, and synced API with netbsd IPv6 extension.
makeworld check, kernel build check(includes GENERIC) is done.
But if there happen to any problem, please let me know and
I soon backout this change.
o Drop all broadcast and multicast source addresses in tcp_input.
o Enable ICMP_BANDLIM in GENERIC.
o Change default to 200/s from 100/s. This will still stop the attack, but
is conservative enough to do this close to code freeze.
This is not the optimal patch for the problem, but is likely the least
intrusive patch that can be made for this.
Obtained from: Don Lewis and Matt Dillon.
Reviewed by: freebsd-security
for IPv4 communication.(IPv4 mapped IPv6 addr.)
Also removed IPv6 hoplimit initialization because it is alway done at
tcp_output.
Confirmed by: Bernd Walter <ticso@cicely5.cicely.de>
include this in all kernels. Declare some const *intrq_present
variables that can be checked by a module prior to using *intrq
to queue data.
Make the if_tun module capable of processing atm, ip, ip6, ipx,
natm and netatalk packets when TUNSIFHEAD is ioctl()d on.
Review not required by: freebsd-hackers
-opt_ipsec.h was missing on some tcp files (sorry for basic mistake)
-made buildable as above fix
-also added some missing IPv4 mapped IPv6 addr consideration into
ipsec4_getpolicybysock
This must be one of the reason why connections over IPsec hangs for
bigger packets.(which was reported on freebsd-current@freebsd.org)
But there still seems to be another bug and the problem is not yet fixed.
is very likely to become consensus as recent ietf/ipng mailing list
discussion. Also recent KAME repository and other KAME patched BSDs
also applied it.
s/__ss_family/ss_family/
s/__ss_len/ss_len/
Makeworld is confirmed, and no application should be affected by this change
yet.
now you can dynamically create rate-limited queues for different
flows using masks on dst/src IP, port and protocols.
Read the ipfw(8) manpage for details and examples.
Restructure the internals of the traffic shaper to use heaps,
so that it manages efficiently large number of queues.
Fix a bug which was present in the previous versions which could
cause, under certain unfrequent conditions, to send out very large
bursts of traffic.
All in all, this new code is much cleaner than the previous one and
should also perform better.
Work supported by Akamba Corp.
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.
desperation measure in low-memory situations), walk the tcpbs and
flush the reassembly queues.
This behaviour is currently controlled by the debug.do_tcpdrain sysctl
(defaults to on).
Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: wollman
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
to print out protocol specific pcb info.
A patch submitted by guido@gvr.org, and asmodai@wxs.nl also reported
the problem.
Thanks and sorry for your troubles.
Submitted by: guido@gvr.org
Reviewed by: shin
packet divert at kernel for IPv6/IPv4 translater daemon
This includes queue related patch submitted by jburkhol@home.com.
Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
the old one: an unnecessary define (KLD_MODULE) has been deleted and
the initialisation of the module is done after domaininit was called
to be sure inet is running.
Some slight changed were made to ip_auth.c and ip_state.c in order
to assure including of sys/systm.h in case we make a kld
Make sure ip_fil does nmot include osreldate in kernel mode
Remove mlfk_ipl.c from here: no sources allowed in these directories!
- Implement 'ipfw tee' (finally)
- Divert packets by calling new function divert_packet() directly instead
of going through protosw[].
- Replace kludgey global variable 'ip_divert_port' with a function parameter
to divert_packet()
- Replace kludgey global variable 'frag_divert_port' with a function parameter
to ip_reass()
- style(9) fixes
Reviewed by: julian, green
This results in closer behavior to earlier versions, where the fixed
200ms timer actually resulted in a delay anywhere from 1..200ms, with
the average delay being 100ms.
Pointed out by: dg
for IPv6 yet)
With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
to be dangerous. It will better serve us as a port building a KLD,
ala SKIP.
The hooks are staying although it would be better to port and use
the NetBSD pfil interface rather than have custom hooks.
the link are equal to the default aliasing address. Do not zero them!
This will fix the problem with non-working links added with the source
and/or aliasing address equal to the default aliasing address, but the
default aliasing address is set later, after the link has been set up,
like both natd(8) and ppp(8) do (for objective reasons).
Reviewed by: Brian Somers <brian@FreeBSD.org>,
Eivind Eklund <eivind@FreeBSD.org>,
Charles Mott <cmott@srv.net>
have been there in the first place. A GENERIC kernel shrinks almost 1k.
Add a slightly different safetybelt under nostop for tty drivers.
Add some missing FreeBSD tags
`dst_port') work for outgoing packets.
- Make permanent links whose `alias_addr' matches the primary aliasing
address `aliasAddress' work for incoming packets.
- Typo fixes.
Reviewed by: brian, eivind
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.
In the words of originator:
:If an incoming connection is initiated through natd and deny_incoming is
:not set, then a new alias_link structure is created to handle the link.
:If there is nothing listening for the incoming connection, then the kernel
:responds with a RST for the connection. However, this is not processed
:correctly in libalias/alias.c:TcpMonitor{In,Out} and
:libalias/alias_db.c:SetState{In,Out} as it thinks a connection
:has been established and therefore applies a timeout of 86400 seconds
:to the link.
:
:If many of these half-connections are initiated (during, for example, a
:port scan of the host), then many thousands of unnecessary links are
:created and the resident size of natd balloons to 20MB or more.
PR: 13639
Reviewed by: brian
- eliminate the fast/slow timeout lists for TCP and instead use a
callout entry for each timer.
- increase the TCP timer granularity to HZ
- implement "bad retransmit" recovery, as presented in
"On Estimating End-to-End Network Path Properties", by Allman and Paxson.
Submitted by: jlemon, wollmann
using syslog(3) (log(9)) for its various purposes! This long-awaited
change also includes such nice things as:
* macros expanding into _two_ comma-delimited arguments!
* snprintf!
* more snprintf!
* linting and criticism by more people than you can shake a stick at!
* a slightly more uniform message style than before!
and last but not least
* no less than 5 rewrites!
Reviewed by: committers
drop any segment arriving at a closed port.
tcp.blackhole=1 - only drop SYN without RST
tcp.blackhole=2 - drop everything without RST
tcp.blackhole=0 - always send RST - default behaviour
This confuses nmap -sF or -sX or -sN quite badly.
sysctl knobs.
With these knobs on, refused connection attempts are dropped
without sending a RST, or Port unreachable in the UDP case.
In the TCP case, sending of RST is inhibited iff the incoming
segment was a SYN.
Docs and rc.conf settings to follow.
- Sort xrefs
- FreeBSD.ORG -> FreeBSD.org
- Be consistent with section names as outlines in mdoc(7)
- Other misc mdoc cleanup.
PR: doc/13144
Submitted by: Alexy M. Zelkin <phantom@cris.net>
with a match probability to achieve non-deterministic behaviour of
the firewall. This can be extremely useful for testing purposes
such as simulating random packet drop without having to use dummynet
(which already does the same thing), and simulating multipath effects
and the associated out-of-order delivery (this time in conjunction
with dummynet).
The overhead on normal rules is just one comparison with 0.
Since it would have been trivial to implement this by just adding
a field to the ip_fw structure, I decided to do it in a
backward-compatible way (i.e. struct ip_fw is unchanged, and as a
consequence you don't need to recompile ipfw if you don't want to
use this feature), since this was also useful for -STABLE.
When, at some point, someone decides to change struct ip_fw, please
add a length field and a version number at the beginning, so userland
apps can keep working even if they are out of sync with the kernel.
respectively logging and dropping ICMP REDIRECT packets.
Note that there is no rate limiting on the log messages, so log_redirect
should be used with caution (preferrably only for debugging purposes).
_or_ you may specify "log logamount number" to set logging specifically
the rule.
In addition, "ipfw resetlog" has been added, which will reset the
logging counters on any/all rule(s). ipfw resetlog does not affect
the packet/byte counters (as ipfw reset does), and is the only "set"
command that can be run at securelevel >= 3.
This should address complaints about not being able to set logging
amounts, not being able to restart logging at a high securelevel,
and not being able to just reset logging without resetting all of the
counters in a rule.
would make a difference. However, my previous diff _did_ change the
behavior in some way (not necessarily break it), so I'm fixing it.
Found by: bde
Submitted by: bde
exit on errors.
If we don't, in_pcbrehash() is called without a preceeding
in_pcbinshash(), causing a crash.
There are apparently several conditions that could cause the crash;
PR misc/12256 is only one of these.
PR: misc/12256
This is the change to struct sockets that gets rid of so_uid and replaces
it with a much more useful struct pcred *so_cred. This is here to be able
to do socket-level credential checks (i.e. IPFW uid/gid support, to be added
to HEAD soon). Along with this comes an update to pidentd which greatly
simplifies the code necessary to get a uid from a socket. Soon to come:
a sysctl() interface to finding individual sockets' credentials.
to either enqueue or free their mbuf chains, but tcp_usr_send() was
dropping them on the floor if the tcpcb/inpcb has been torn down in the
middle of a send/write attempt. This has been responsible for a wide
variety of mbuf leak patterns, ranging from slow gradual leakage to rather
rapid exhaustion. This has been a problem since before 2.2 was branched
and appears to have been fixed in rev 1.16 and lost in 1.23/1.28.
Thanks to Jayanth Vijayaraghavan <jayanth@yahoo-inc.com> for checking
(extensively) into this on a live production 2.2.x system and that it
was the actual cause of the leak and looks like it fixes it. The machine
in question was loosing (from memory) about 150 mbufs per hour under
load and a change similar to this stopped it. (Don't blame Jayanth
for this patch though)
An alternative approach to this would be to recheck SS_CANTSENDMORE etc
inside the splnet() right before calling pru_send() after all the potential
sleeps, interrupts and delays have happened. However, this would mean
exposing knowledge of the tcp stack's reset handling and removal of the
pcb to the generic code. There are other things that call pru_send()
directly though.
Problem originally noted by: John Plevyak <jplevyak@inktomi.com>
The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.
cdevsw_add() will print an message if the d_maj field looks bogus.
Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.
Move bdevsw() and devsw() functions to kern/kern_conf.c
Bump __FreeBSD_version to 400006
This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions
if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.
I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.
Reformat and initialize correctly all "struct cdevsw".
Initialize the d_maj and d_bmaj fields.
The d_reset field was not removed, although it is never used.
I used a program to do most of this, so all the files now use the
same consistent format. Please keep it that way.
Vinum and i4b not modified, patches emailed to respective authors.
(default 1) disables PMTUD globally. Although PMTUD can be disabled in
the standard case by locking the MTU on a static route (including the
default route), this method doesn't work in the face of dynamic routing
protocols like gated.
+ add a missing call to dn_rule_delete() when flushing firewall
rules, thus preventing possible panics due to dangling pointers
(this was already done for single rule deletes).
+ improve "usage" output in ipfw(8)
+ add a few checks to ipfw pipe parameters and make it a bit more
tolerant of common mistakes (such as specifying kbit instead of Kbit)
PR: kern/10889
Submitted by: Ruslan Ermilov
routines. The descriptor contains parameters which could be used
within those routines (eg. ip_output() ).
On passing, add IPPROTO_PGM entry to netinet/in.h
+ plug an mbuf leak when dummynet used with bridging
+ make prototype of dummynet_io consistent with usage
+ code cleanup so that now bandwidth regulation is precise to the
bit/s and not to (8*HZ) bit/s as before.
This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.
For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".
Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.
Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.
It generally does what one would expect, but setting up a jail
still takes a little knowledge.
A few notes:
I have no scripts for setting up a jail, don't ask me for them.
The IP number should be an alias on one of the interfaces.
mount a /proc in each jail, it will make ps more useable.
/proc/<pid>/status tells the hostname of the prison for
jailed processes.
Quotas are only sensible if you have a mountpoint per prison.
There are no privisions for stopping resource-hogging.
Some "#ifdef INET" and similar may be missing (send patches!)
If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!
Tools, comments, patches & documentation most welcome.
Have fun...
Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/
1:
s/suser/suser_xxx/
2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.
3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/
The remaining suser_xxx() calls will be scrutinized and dealt with
later.
There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.
More changes to the suser() API will come along with the "jail" code.
- unifdef -DCOMPAT_IPFW (this was on by default already)
- remove traces of in-kernel ip_nat package, it was never committed.
- Make IPFW and DUMMYNET initialize themselves rather than depend on
compiled-in hooks in ip_init(). This means they initialize the same
way both in-kernel and as kld modules. (IPFW initializes now :-)
the IP header (this would not work for bridged packets).
This has been fixed long ago in the 2.2 branch.
Problem noticed by: a few people
Fix suggested by: Remy Nonnenmacher
also rely less on other modules clearing static values, and clear them
in a few cases we missed before.
Submitted by: Matthew Reimer <mreimer@vpop.net>
Move the Olicom token ring driver to the officially sanctionned location of
/sys/contrib. Also fix some brokenness in the generic token ring support.
Be warned that if_dl.h has been changed and SOME programs might
like recompilation.
- Transparent proxying support added.
- PPTP redirecting support added based on patches
contributed by Dru Nelson <dnelson@redwoodsoft.com>.
Submitted by: Charles Mott <cmott@srv.net>
their ttl). This can be used - in combination with the proper ipfw
incantations - to make a firewall or router invisible to traceroute
and other exploration tools.
This behaviour is controlled by a sysctl variable (net.inet.ip.stealth)
and hidden behind a kernel option (IPSTEALTH).
Reviewed by: eivind, bde
This is for various Olicom cards. An IBM driver is following.
This patch also adds support to tcpdump to decode packets on tokenring.
Congratulations to the proud father.. (below)
Submitted by: Larry Lile <lile@stdio.com>
This makes it possible to change the sysctl tree at runtime.
* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.
Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)
convince myself that nothing will break if we permit IP input while
interface addresses are unconfigured. (At worst, they will hit some
ULP's PCB scan and fail if nobody is listening.) So, remove the restriction
that addresses must be configured before packets can be input. Assume
that any unicast packet we receive while unconfigured is potentially ours.
Divert was not feeding clean data to ifa_ifwithaddr() so it was
giving bad results.
Submitted by: kseel <kseel@utcorp.com>, Ruslan Ermilov <ru@ucb.crimea.ua>
This was missed in the 4.4-Lite2 merge.
Noticed by: Mohan Parthasarathy <Mohan.Parthasarathy@eng.Sun.COM> and
jayanth@loc201.tandem.com (vijayaraghavan_jayanth)
on the tcp-impl mailing list.
flag means that there is more data to be put into the socket buffer.
Use it in TCP to reduce the interaction between mbuf sizes and the
Nagle algorithm.
Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's
fix for this problem.
state.
Note: this requires a recompilation of netstat (but netstat has been
broken since rev 1.52 of ip_mroute.c anyway)
Obtained from: Significantly based on Steve McCanne's
<mccanne@cs.berkeley.edu> work for BSD/OS
arplookup() to try again. This gets rid of at least one user's
"arpresolve: can't allocate llinfo" errors, and arplookup() gives
better error messages to help track down the problem if there really
is a problem with the routing table.
problems reported recently (the rtentry pointer in the dummynet
queue was not initialized in all cases, resulting in spurious
rt_refcnt decreases in the lucky cases, and memory trashing in
other cases.
have all fields in network order, whereas ipfw expects some to be
in host order. This resulted in some incorrect matching, e.g. some
packets being identified as fragments, or bandwidth not being
correctly enforced.
NOTE: this only affects bridge+ipfw, normal ipfw usage was already
correct).
Reported-By: Dave Alden and others.
Add bounds checking to netbios NS packet resolving code. This should
prevent natd from crashing on badly formed netbios packets (as might be
heard when the machine is sitting on a cable modem or certain DSL
networks), and also closes potential security holes that might have
exploited the lack of bounds checking in the previous version of the
code.
If timer calculation results in degenerate value (0), force it to 1
to avoid divide-by-zero panic later on in calls to IGMP_RANDOM_DELAY().
I considered simply adding 1 to the timer calculation, but was unsure
if the calculation was part of the IGMP standard or not so did not want
to mess with it for all cases.
for possible buffer overflow problems. Replaced most sprintf()'s
with snprintf(); for others cases, added terminating NUL bytes where
appropriate, replaced constants like "16" with sizeof(), etc.
These changes include several bug fixes, but most changes are for
maintainability's sake. Any instance where it wasn't "immediately
obvious" that a buffer overflow could not occur was made safer.
Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Mike Spengler <mks@networkcs.com>
option not defined the sysctl int value is set to -1 and read-only.
#ifdef KERNEL's added appropriately to wall off visibility of kernel
routines from user code.
Add ICMP_BANDLIM option and 'net.inet.icmp.icmplim' sysctl. If option
is specified in kernel config, icmplim defaults to 100 pps. Setting it
to 0 will disable the feature. This feature limits ICMP error responses
for packets sent to bad tcp or udp ports, which does a lot to help the
machine handle network D.O.S. attacks.
The kernel will report packet rates that exceed the limit at a rate of
one kernel printf per second. There is one issue in regards to the
'tail end' of an attack... the kernel will not output the last report
until some unrelated and valid icmp error packet is return at some
point after the attack is over. This is a minor reporting issue only.
when a TCP "stealth" scan is directed at a *BSD box by ensuring the window
is 0 for all RST packets generated through tcp_respond()
Reviewed by: Don Lewis <Don.Lewis@tsc.tdk.com>
Obtained from: Bugtraq (from: Darren Reed <avalon@COOMBS.ANU.EDU.AU>)
This is the bulk of the support for doing kld modules. Two linker_sets
were replaced by SYSINIT()'s. VFS's and exec handlers are self registered.
kld is now a superset of lkm. I have converted most of them, they will
follow as a seperate commit as samples.
This all still works as a static a.out kernel using LKM's.
- Don't bother checking for conflicting sockets if we're binding to a
multicast address.
- Don't return an error if we're binding to INADDR_ANY, the conflicting
socket is bound to INADDR_ANY, and the conflicting socket has SO_REUSEPORT
set.
PR: kern/7713
addresses by default.
Add a knob "icmp_bmcastecho" to "rc.network" to allow this
behaviour to be controlled from "rc.conf".
Document the controlling sysctl variable "net.inet.icmp.bmcastecho"
in sysctl(3).
Reviewed by: dg, jkh
Reminded on -hackers by: Steinar Haug <sthaug@nethelp.no>
4.1.4. Experimental Protocol
A system should not implement an experimental protocol unless it
is participating in the experiment and has coordinated its use of
the protocol with the developer of the protocol.
Pointed out by: Steinar Haug <sthaug@nethelp.no>
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.
several new features are added:
- support vc/vp shaping
- support pvc shadow interface
code cleanup:
- remove WMAYBE related code. ENI WMAYBE DMA doen't work.
- remove updating if_lastchange for every packet.
- BPF related code is moved to midway.c as it should be.
(bpfwrite should work if atm_pseudohdr and LLC/SNAP are
prepended.)
- BPF link type is changed to DLT_ATM_RFC1483.
BPF now understands only LLC/SNAP!! (because bpf can't
handle variable link header length.)
It is recommended to use LLC/SNAP instead of NULL
encapsulation for various reasons. (BPF, IPv6,
interoperability, etc.)
the code has been used for months in ALTQ and KAME IPv6.
OKed by phk long time ago.
`len = min(so->so_snd.sb_cc, win) - off;'. min() has type u_int
and `off' has type int, so when min() is 0 and `off' is 1, the RHS
overflows to 0U - 1 = UINT_MAX. `len' has type long, so when
sizeof(long) == sizeof(int), the LHS normally overflows to to the
correct value of -1, but when sizeof(long) > sizeof(int), the LHS
is UINT_MAX.
Fixed some u_long's that should have been fixed-sized types.
mismatches exposed by this (the prototype for tcp_respond() didn't
match the function definition lexically, and still depends on a
gcc feature to match if ints have more than 32 bits).
Any packet that can be matched by a ipfw rule can be redirected
transparently to another port or machine. Redirection to another port
mostly makes sense with tcp, where a session can be set up
between a proxy and an unsuspecting client. Redirection to another machine
requires that the other machine also be expecting to receive the forwarded
packets, as their headers will not have been modified.
/sbin/ipfw must be recompiled!!!
Reviewed by: Peter Wemm <peter@freebsd.org>
Submitted by: Chrisy Luke <chrisy@flix.net>
Remove lots'o'hacks.
looutput is now static.
Other callers who want to use loopback to allow shortcutting
should call the special entrypoint for this, if_simloop(), which is
specifically designed for this purpose. Using looutput for this purpose
was problematic, particularly with bpf and trying to keep track
of whether one should be using the charateristics of the loopback interface
or the interface (e.g. if_ethersubr.c) that was requesting the loopback.
There was a whole class of errors due to this mis-use each of which had
hacks to cover them up.
Consists largly of hack removal :-)
or unsigned int (this doesn't change the struct layout, size or
alignment in any of the files changed in this commit, at least for
gcc on i386's. Using bitfields of type u_char may affect size and
alignment but not packing)).
ifdef tangle. The previous commit to ip_fil.h didn't change the
one that actually applies to the current FreeBSD kernel, of course.
Fixed.
Fixed style bugs in previous commit to ip_fil.h.
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.
The prototype FreeBSD/alpha machdep will follow in a couple of days
time.
so that the new behaviour is now default.
Solves the "infinite loop in diversion" problem when more than one diversion
is active.
Man page changes follow.
The new code is in -stable as the NON default option.
net.inet.ip.icmp.bmcastecho = 0 by removing the extra check for the
address being a multicast address. The test now relies on the link
layer flags that indicate it was received via multicast. The previous
logic was broken and replied to ICMP echo/timestamp broadcasts even
when the sysctl option disallowed them.
Reviewed by: wollman
Prior to this change, Accidental recursion protection was done by
the diverted daemon feeding back the divert port number it got
the packet on, as the port number on a sendto(). IPFW knew not to
redivert a packet to this port (again). Processing of the ruleset
started at the beginning again, skipping that divert port.
The new semantic (which is how we should have done it the first time)
is that the port number in the sendto() is the rule number AFTER which
processing should restart, and on a recvfrom(), the port number is the
rule number which caused the diversion. This is much more flexible,
and also more intuitive. If the user uses the same sockaddr received
when resending, processing resumes at the rule number following that
that caused the diversion. The user can however select to resume rule
processing at any rule. (0 is restart at the beginning)
To enable the new code use
option IPFW_DIVERT_RESTART
This should become the default as soon as people have looked at it a bit
passed to the user process for incoming packets. When the sockaddr_in
is passed back to the divert socket later, use thi sas the primary
interface lookup and only revert to the IP address when the name fails.
This solves a long standing bug with divert sockets:
When two interfaces had the same address (P2P for example) the interface
"assigned" to the reinjected packet was sometimes incorect.
Probably we should define a "sockaddr_div" to officially hold this
extended information in teh same manner as sockaddr_dl.
of the TCP payload. See RFC1122 section 4.2.2.6 . This allows
Path MTU discovery to be used along with IP options.
PR: problem discovered by Kevin Lahey <kml@nas.nasa.gov>
so it must be adjusted (minus 1) before using it to do the length check.
I'm not sure who to give the credit to, but the bug was reported by
Jennifer Dawn Myers <jdm@enteract.com>, who also supplied a patch. It
was also fixed in OpenBSD previously by andreas.gunnarsson@emw.ericsson.se,
and of course I did the homework to verify that the fix was correct per
the specification.
PR: 6738
NetBSD, ported to FreeBSD by Pierre Beyssac <pb@fasterix.freenix.org> and
minorly tweaked by me.
This is a standard part of FreeBSD, but must be enabled with:
"sysctl -w net.inet.ip.fastforwarding=1" ...and of course forwarding must
also be enabled. This should probably be modified to use the zone
allocator for speed and space efficiency. The current algorithm also
appears to lose if the number of active paths exceeds IPFLOW_MAX (256),
in which case it wastes lots of time trying to figure out which cache
entry to drop.
Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.
Convert PF_LOCAL PCB structures to be type-stable and add a version number.
Define an external format for infomation about socket structures and use
it in several places.
Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.
It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).
---------
Make callers of namei() responsible for releasing references or locks
instead of having the underlying filesystems do it. This eliminates
redundancy in all terminal filesystems and makes it possible for stacked
transport layers such as umapfs or nullfs to operate correctly.
Quality testing was done with testvn, and lat_fs from the lmbench suite.
Some NFS client testing courtesy of Patrik Kudo.
vop_mknod and vop_symlink still release the returned vpp. vop_rename
still releases 4 vnode arguments before it returns. These remaining cases
will be corrected in the next set of patches.
---------
Submitted by: Michael Hancock <michaelh@cet.co.jp>
is believed to have been broken with the Brakmo/Peterson srtt
calculation changes. The result of this bug is that TCP connections
could time out extremely quickly (in 12 seconds).
Also backed out jdp's partial fix for this problem in rev 1.17 of
tcp_timer.c as it is obsoleted by this commit.
Bug was pointed out by Kevin Lehey <kml@roller.nas.nasa.gov>.
PR: 6068
ftp://ftp.isi.edu/in-notes/iana/assignments/port-numbers
port numbers are divided into three ranges:
0 - 1023 Well Known Ports
1024 - 49151 Registered Ports
49152 - 65535 Dynamic and/or Private Ports
This patch changes the "local port range" from 40000-44999
to the range shown above (plus fix the comment in in_pcb.c).
WARNING: This may have an impact on firewall configurations!
PR: 5402
Reviewed by: phk
Submitted by: Stephen J. Roznowski <sjr@home.net>
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.
Most uses of time.tv_sec now uses the new variable time_second instead.
gettime() changed to getmicrotime(0.
Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).
A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.
Add a new nfs_curusec() function.
Mark a couple of bogosities involving the now disappeard time variable.
Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.
Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.
Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.
Reviewed by: bde
all the LKM load/unload junk, and don't forget to register the SYSINIT
so that the cdevsw entry is attached.
BTW: I think the way it builds it's /dev nodes on the fly as an LKM with
vnode ops is kinda cute - I guess that'd be one way to solve the devfs
persistance problems.. :-) (ie: have the drivers make the nodes in /dev
on disk directly if they are missing, but leave them alone if present).
its own zone; this is used particularly by TCP which allocates both inpcb and
tcpcb in a single allocation. (Some hackery ensures that the tcpcb is
reasonably aligned.) Also keep track of the number of pcbs of each type
allocated, and keep a generation count (instance version number) for future
use.
connect. This check was added as part of the defense against the "land"
attack, to prevent attacks which guess the ISS from going into ESTABLISHED.
The "src == dst" check will still prevent the single-homed case of the
"land" attack, and guessing ISS's should be hard anyway.
Submitted by: David Borman <dab@bsdi.com>
since there might be permanent entries still left after
calls to DeleteLink (it will be nullified by DeleteLink
if all entries are deleted, won't it ?)
2) in PacketAliasSetAddress, set the aliasing address
even when PKT_ALIAS_RESET_ON_ADDR_CHANGE is in effect.
Just don't clean up links in this case.
Submitted by: Ari Suutari <ari@suutari.iki.fi>
via: Charles Mott <cmott@srv.net>
PR: 5041
It controls if the system is to accept source routed packets.
It used to be such that, no matter if the setting of net.inet.ip.sourceroute,
source routed packets destined at us would be accepted. Now it is
controllable with eth default set to NOT accept those.
offset is non-zero:
- Do not match fragmented packets if the rule specifies a port or
TCP flags
- Match fragmented packets if the rule does not specify a port and
TCP flags
Since ipfw cannot examine port numbers or TCP flags for such packets,
it is now illegal to specify the 'frag' option with either ports or
tcpflags. Both kernel and ipfw userland utility will reject rules
containing a combination of these options.
BEWARE: packets that were previously passed may now be rejected, and
vice versa.
Reviewed by: Archie Cobbs <archie@whistle.com>
This means that a FreeBSD will only forward source routed packets
when both net.inet.ip.forwarding and net.inet.ip.sourceroute are set
to 1.
You can hit me now ;-)
Submitted by: Thomas Ptacek
TCP and UDP port numbers in fragmented packets when IP offset != 0.
2.2.6 candidate.
Discovered by: Marc Slemko <marcs@znep.com>
Submitted by: Archie Cobbs <archie@whistle.com> w/fix from me
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
rev 1.66. This fix contains both belt and suspenders.
Belt: ignore packets where src == dst and srcport == dstport in TCPS_LISTEN.
These packets can only legitimately occur when connecting a socket to itself,
which doesn't go through TCPS_LISTEN (it goes CLOSED->SYN_SENT->SYN_RCVD->
ESTABLISHED). This prevents the "standard" "land" attack, although doesn't
prevent the multi-homed variation.
Suspenders: send a RST in response to a SYN/ACK in SYN_RECEIVED state.
The only packets we should get in SYN_RECEIVED are
1. A retransmitted SYN, or
2. An ack of our SYN/ACK.
The "land" attack depends on us accepting our own SYN/ACK as an ACK;
in SYN_RECEIVED state; this should prevent all "land" attacks.
We also move up the sequence number check for the ACK in SYN_RECEIVED.
This neither helps nor hurts with respect to the "land" attack, but
puts more of the validation checking in one spot.
PR: kern/5103
This will not make any of object files that LINT create change; there
might be differences with INET disabled, but hardly anything compiled
before without INET anyway. Now the 'obvious' things will give a
proper error if compiled without inet - ipx_ip, ipfw, tcp_debug. The
only thing that _should_ work (but can't be made to compile reasonably
easily) is sppp :-(
This commit move struct arpcom from <netinet/if_ether.h> to
<net/if_arp.h>.
consequence, ipfw's list command now adjusts its output at runtime
based on the largest packet/byte counter values.
NOTE:
o The ipfw struct has changed requiring a recompile of both kernel
and userland ipfw utility.
o This probably should not be brought into 2.2.
PR: 3738
fix PR#3618 weren't sufficient since malloc() can block - allowing the
net interrupts in and leading to the same problem mentioned in the
PR (a panic). The order of operations has been changed so that this
is no longer a problem.
Needs to be brought into the 2.2.x branch.
PR: 3618
where if you are using the "reset tcp" firewall command,
the kernel would write ethernet headers onto random kernel stack locations.
Fought to the death by: terry, julian, archie.
fix valid for 2.2 series as well.
The #ifdef IPXIP in netipx/ipx_if.h is OK (used from ipx_usrreq.c and
ifconfig.c only).
I also fixed a typo IPXTUNNEL -> IPTUNNEL (and #ifdef'ed out the code
inside, as it never could have compiled - doh.)
close small security hole where an atacker could sendpackets with
IPDIVERT protocol, and select how it would be diverted thus bypassing
the ipfirewall. Discovered by inspection rather than attack.
(you'd have to know how the firewall was configured (EXACTLY) to
make use of this but..)
hope i've found out all files that actually depend on this dependancy.
IMHO, it's not very good practice to change the size of internal
structs depending on kernel options.
Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.
A couple of finer points by: bde
represent in the TCP header. The old code did effectively:
win = min(win, MAX_ALLOWED);
win = max(win, what_i_think_i_advertised_last_time);
so if what_i_think_i_advertised_last_time is bigger than can be
represented in the header (e.g. large buffers and no window scaling)
then we stuff a too-big number into a short. This fix reverses the
order of the comparisons.
PR: kern/4712
RST's being ignored, keeping a connection around until it times out, and
thus has the opposite effect of what was intended (which is to make the
system more robust to DoS attacks).
you don't want this (and the documentation explains why), but if you
use ipfw as an as-needed casual filter as needed which normally runs as
'allow all' then having the kernel and /sbin/ipfw get out of sync is a
*MAJOR* pain in the behind.
PR: 4141
Submitted by: Heikki Suonsivu <hsu@mail.clinet.fi>
potential problems with other automatic-reply ICMPs, but some of them may
depend on broadcast/multicast to operate. (This code can simply be
moved to the `reflect' label to generalize it.)
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.
accommodate the expanded name, the ICMP types bitmap has been
reduced from 256 bits to 32.
A recompile of kernel and user level ipfw is required.
To be merged into 2.2 after a brief period in -current.
PR: bin/4209
Reviewed by: Archie Cobbs <archie@whistle.com>
be dropped when it has an unusual traffic pattern. For full details
as well as a test case that demonstrates the failure, see the
referenced PR.
Under certain circumstances involving the persist state, it is
possible for the receive side's tp->rcv_nxt to advance beyond its
tp->rcv_adv. This causes (tp->rcv_adv - tp->rcv_nxt) to become
negative. However, in the code affected by this fix, that difference
was interpreted as an unsigned number by max(). Since it was
negative, it was taken as a huge unsigned number. The effect was
to cause the receiver to believe that its receive window had negative
size, thereby rejecting all received segments including ACKs. As
the test case shows, this led to fruitless retransmissions and
eventually to a dropped connection. Even connections using the
loopback interface could be dropped. The fix substitutes the signed
imax() for the unsigned max() function.
PR: closes kern/3998
Reviewed by: davidg, fenner, wollman
these are quite extensive additions to the ipfw code.
they include a change to the API because the old method was
broken, but the user view is kept the same.
The new code allows a particular match to skip forward to a particular
line number, so that blocks of rules can be
used without checking all the intervening rules.
There are also many more ways of rejecting
connections especially TCP related, and
many many more ...
see the man page for a complete description.
switch. I needed 'LINT' to compile for other reasons so I kinda got the
blood on my hands. Note: I don't know how to test this, I don't know if
it works correctly.
Don't search for interface addresses matching interface "NULL"
it's likely to cause a page fault..
this can be triggered by the ipfw code rejecting a locally generated
packet (e.g. you decide to make some network unreachable by local users)
ppp (or will be shortly). Natd can now be updated to use
this library rather than carrying its own version of the code.
Submitted by: Charles Mott <cmott@srv.net>
in_setsockaddr and in_setpeeraddr.
Handle the case where the socket was disconnected before the network
interrupts were disabled.
Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
to fill in the nfs_diskless structure, at the cost of some kernel
bloat. The advantage is that this code works on a wider range of
network adapters than netboot. Several new kernel options are
documented in LINT.
Obtained from: parts of the code comes from NetBSD.
This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.
2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.
3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.
4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.
5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.
As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.
Use the name argument almost the same in all LKM types. Maintain
the current behavior for the external (e.g., modstat) name for DEV,
EXEC, and MISC types being #name ## "_mod" and SYCALL and VFS only
#name. This is a candidate for change and I vote just the name without
the "_mod".
Change the DISPATCH macro to MOD_DISPATCH for consistency with the
other macros.
Add an LKM_ANON #define to eliminate the magic -1 and associated
signed/unsigned warnings.
Add MOD_PRIVATE to support wcd.c's poking around in the lkm structure.
Change source in tree to use the new interface.
Reviewed by: Bruce Evans
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).
Discussed-with: wollman
the quality of the hash distribution. This does not fix a problem dealing
with poor distribution when using lots of IP aliases and listening
on the same port on every one of them...some other day perhaps; fixing
that requires significant code changes.
The use of xor was inspired by David S. Miller <davem@jenolan.rutgers.edu>
connect in TCP while sending urgent data. It is not clear what
purpose is served by doing this, but there's no good reason why it
shouldn't work.
Submitted by: tjevans@raleigh.ibm.com via wpaul
pr_usrreqs. Collapse duplicates with udp_usrreq.c and
tcp_usrreq.c (calling the generic routines in uipc_socket2.c and
in_pcb.c). Calling sockaddr()_ or peeraddr() on a detached
socket now traps, rather than harmlessly returning an error; this
should never happen. Allow the raw IP buffer sizes to be
controlled via sysctl.
in the route. This allows us to remove the unconditional setting of the
pipesize in the route, which should mean that SO_SNDBUF and SO_RCVBUF
should actually work again. While we're at it:
- Convert udp_usrreq from `mondo switch statement from Hell' to new-style.
- Delete old TCP mondo switch statement from Hell, which had previously
been diked out.
is administratively downed, all routes to that interface (including the
interface route itself) which are not static will be deleted. When
it comes back up, and addresses remaining will have their interface routes
re-added. This solves the problem where, for example, an Ethernet interface
is downed by traffic continues to flow by way of ARP entries.
affect programs that sit on top of divert(4) sockets. The
multicast routing code already unconditionally zeros the sum
before recalculating.
Any code that unconditionaly sums a packet without first zeroing
the sum (assuming that it's already zero'd) will break. No such
code seems to exist.
set it in the first place, independent of whether sin->sin_port
is set.
The result is that diverted packets that are being forwarded
will be diverted once and only once on the way in (ip_input())
and again, once and only once on the way out (ip_output()) -
twice in total. ICMP packets that don't contain a port will
now also be diverted.
to -current.
Thanks goes to Ulrike Nitzsche <ulrike@ifw-dresden.de> for giving me
a chance to test this. Only the PCI driver is tested though.
One final patch will follow in a separate commit. This is so that
everything up to here can be dragged into 2.2, if we decide so.
Reviewed by: joerg
Submitted by: Matt Thomas <matt@3am-software.com>
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
previous hackery involving struct in_ifaddr and arpcom. Get rid of the
abominable multi_kludge. Update all network interfaces to use the
new machanism. Distressingly few Ethernet drivers program the multicast
filter properly (assuming the hardware has one, which it usually does).
to TAILQs. Fix places which referenced these for no good reason
that I can see (the references remain, but were fixed to compile
again; they are still questionable).
The rest of the code was treating it as a header mbuf, but it was
allocated as a normal mbuf.
This fixes the panic: ip_output no HDR when you have a multicast
tunnel configured.
duplicate ip address 204.162.228.7! sent from ethernet address: 08:00:20:09:7b:1d
changed to
arp: 08:00:20:09:7b:1d is using my IP address 204.162.228.7!
and
arp info overwritten for 204.162.228.2 by 08:00:20:09:7b:1d
changed to
arp: 204.162.228.2 moved from 08:00:20:07:b6:a0 to 08:00:20:09:7b:1d
I think the new wordings are more clear and could save some support
questions.
using a sockaddr_dl.
Fix the other packet-information socket options (SO_TIMESTAMP, IP_RECVDSTADDR)
to work for multicast UDP and raw sockets as well. (They previously only
worked for unicast UDP).
"high" and "secure"), we can't use a single variable to track the most
recently used port in all three ranges.. :-] This caused the next
transient port to be allocated from the start of the range more often than
it should.
<net/if_arp.h> and fixed the things that depended on it. The nested
include just allowed unportable programs to compile and made my
simple #include checking program report that networking code doesn't
need to include <sys/socket.h>.
(yes I had tested the hell out of this).
I've also temporarily disabled the code so that it behaves as it previously
did (tail drop's the syns) pending discussion with fenner about some socket
state flags that I don't fully understand.
Submitted by: fenner
callers of it to take advantage of this. This reduces new connection
request overhead in the face of a large number of PCBs in the system.
Thanks to David Filo <filo@yahoo.com> for suggesting this and providing
a sample implementation (which wasn't used, but showed that it could be
done).
Reviewed by: wollman
drop the oldest entry in the queue.
There was a fair bit of discussion as to whether or not the
proper action is to drop a random entry in the queue. It's
my conclusion that a random drop is better than a head drop,
however profiling this section of code (done by John Capo)
shows that a head-drop results in a significant performance
increase.
There are scenarios where a random drop is more appropriate.
If I find one in reality, I'll add the random drop code under
a conditional.
Obtained from: discussions and code done by Vernon Schryver (vjs@sgi.com).
time, in seconds, that state for non-established TCP sessions stays about)
a sysctl modifyable variable.
[part 1 of two commits, I just realized I can't play with the indices as
I was typing this commit message.]
to "keepidle". this should not occur unless the connection has
been established via the 3-way handshake which requires an ACK
Submitted by: jmb
Obtained from: problem discussed in Stevens vol. 3
IPPORT_RESERVED that is used for selection when bind() is told to allocate
a reserved port.
Also, implement simple sanity checking for all the addresses set, to make
it a little harder for a user/sysadmin to shoot themselves in the feet.
pulled up already. This bug can cause the first packet from a source
to a group to be corrupted when it is delivered to a process listening
on the mrouter.
pr_usrreq mechanism which was poorly designed and error-prone. This
commit renames pr_usrreq to pr_ousrreq so that old code which depended on it
would break in an obvious manner. This commit also implements the new
interface for TCP, although the old function is left as an example
(#ifdef'ed out). This commit ALSO fixes a longstanding bug in the
TCP timer processing (introduced by davidg on 1995/04/12) which caused
timer processing on a TCB to always stop after a single timer had
expired (because it misinterpreted the return value from tcp_usrreq()
to indicate that the TCB had been deleted). Finally, some code
related to polling has been deleted from if.c because it is not
relevant t -current and doesn't look at all like my current code.
This stuff should not be too destructive if the IPDIVERT is not compiled in..
be aware that this changes the size of the ip_fw struct
so ipfw needs to be recompiled to use it.. more changes coming to clean this up.
- State when we've reached the limit on a particular rule in the kernel logfile
- State when a rule or all rules have been zero'd.
This gives a log of all actions that occur w/regard to the firewall
occurances, and can explain why a particular break-in attempt might not
get logged due to the limit being reached.
Reviewed by: alex
Reviewed by: phk
Reject the addition of rules that will never match (for example,
1.2.3.4:255.255.255.0). User level utilities specify the policy by either
masking the IP address for the user (as ipfw(8) does) or rejecting the
entry with an error. In either case, the kernel should not modify chain
entries to make them work.
LKM'ness. ACTUALLY_LKM_NOT_KERNEL is supposed to be so ugly that it
only gets used until <machine/conf.h> goes away. bsd.kmod.mk should
define a better-named general macro for this. Some places use
PSEUDO_LKM. This is another bad name.
Makefile:
Added IPFIREWALL_VERBOSE_LIMIT option (commented out).
broadcast and multicast routes, otherwise they will be expired by
arptimeout after a few minutes, reverting to " (incomplete)". This makes
the work done by rev 1.27 stay around until the route itself is deleted.
This is mainly cosmetic for 'arp' and 'netstat -r'.
for ARP requests.
The NetBSD version of this patch (see NetBSD PR kern/2381) has this change
already. This should close our PR kern/1140 .
Although it's not quite what he submitted, I got the idea from him so
Submitted by: Jin Guojun <jin@george.lbl.gov>
/kernel: in_rtqtimo: adjusted rtq_reallyold to 1066
/kernel: in_rtqtimo: adjusted rtq_reallyold to 710
inside of #ifdef DIAGNOSTIC to avoid the support questions from folks
asking what this means.
- Log ICMP type during verbose output.
- Added IPFIREWALL_VERBOSE_LIMIT option to prevent denial of service
attacks via syslog flooding.
- Filter based on ICMP type.
- Timestamp chain entries when they are matched.
- Interfaces can now be matched with a wildcard specification (i.e.
will match any interface unit for a given name).
- Prevent the firewall chain from being manipulated when securelevel
is greater than 2.
- Fixed bug that allowed the default policy to be deleted.
- Ability to zero individual accounting entries.
- Remove definitions of old_chk_ptr and old_ctl_ptr when compiling
ipfw as a lkm.
- Remove some redundant code shared between ip_fw_init and ipfw_load.
Closes PRs: 1192, 1219, and 1267.
gcc only inlines memcpy()'s whose count is constant and didn't inline
these. I want memcpy() in the kernel go away so that it's obvious that
it doesn't need to be optimized. Now it is only used for one struct
copy in si.c.
circumstances, caused perfectly good connections to be dropped. This
happened for connections over a LAN, where the retransmit timer
calculation TCP_REXMTVAL(tp) returned 0. If sending was blocked by flow
control for long enough, the old code dropped the connection, even
though timely replies were being received for all window probes.
Reviewed by: W. Richard Stevens <rstevens@noao.edu>
unconventionally:
If COMPAT_IPFW is not defined, or if it is defined to 1, enable;
otherwise, disable.
This means that these changes actually have no effect on anyone at the
moment. (It just makes it easier for me to keep my code in sync.)
In the future, the `not defined' part of the hack should be eliminated,
but doing this now would require everyone to change their config files.
The same conditionals need to be made in ip_input.c as well for this to
ave any useful effect, but I'm not ready to do that right now.
(PR #1178).
Define a new SO_TIMESTAMP socket option for datagram sockets to return
packet-arrival timestamps as control information (PR #1179).
Submitted by: Louis Mamakos <loiue@TransSys.com>
the destination represents. For IP:
- Iff it is a host route, RTF_LOCAL and RTF_BROADCAST indicate local
(belongs to this host) and broadcast addresses, respectively.
- For all routes, RTF_MULTICAST is set if the destination is multicast.
The RTF_BROADCAST flag is used by ip_output() to eliminate a call to
in_broadcast() in a common case; this gives about 1% in our packet-generation
experiments. All three flags might be used (although they aren't now)
to determine whether a packet can be forwarded; a given host route can
represent a forwardable address if:
(rt->rt_flags & (RTF_HOST | RTF_LOCAL | RTF_BROADCAST | RTF_MULTICAST))
== RTF_HOST
Obviously, one still has to do all the work if a host route is not present,
but this code allows one to cache the results of such a lookup if rtalloc1()
is called without masking RTF_PRCLONING.
1) Require all callers to pass a valid route pointer to ip_output()
so that we don't have to check and allocate one off the stack
as was done before. This eliminates one test and some stack
bloat from the common (UDP and TCP) case.
2) Perform the IP header checksum in-line if it's of the usual length.
This results in about a 5% speed-up in my packet-generation test.
3) Use ip_vhl field rather than ip_v and ip_hl bitfields.
1) Set the persist timer to help time-out connections in the CLOSING state.
2) Honor the keep-alive timer in the CLOSING state.
This fixes problems with connections getting "stuck" due to incompletion
of the final connection shutdown which can be a BIG problem on busy WWW
servers.
common labels for LINT. There are still some common declarations for the
!KERNEL case in tcp_debug.h and spx_debug.h. trpt depends on the ones in
tcp_debug.h.
This fixes a panic that occurs when ifconfig ioctl(s) were interrupted
by IP traffic at the wrong time - resulting in a NULL pointer dereference.
This was originally noticed on a FreeBSD 1.0 system, but the problem still
exists in current sources.
keepalive on all tcp sessions. Setsockopt(2) cannot override this setting.
Maybe another one is needed that just changes the default for SO_KEEPALIVE ?
Requested by: Joe Greco <jgreco@brasil.moneng.mei.com>
Move ipip_input() and rsvp_input() prototypes to ip_var.h
Remove unused prototype for rip_ip_input() from ip_var.h
Remove unused variable *opts from rip_output()
Get rid of ac->ac_ipaddr and arpwhohas() since they assume that
an interface has only one address.
Obtained from: BSD/OS 2.1, via Rich Stevens <rstevens@noao.edu>
from Larry Peterson &co. at Arizona:
- Header prediction for ACKs did not exclude Fast Retransmit/Recovery.
- srtt calculation tended to get ``stuck'' and could never decrease
when below 8. It still can't, but the scaling factors are adjusted
so that this artifact does not cause as bad an effect on the RTO
value as it used to.
The paper also points out the incr/8 error that has been long since fixed,
and the problems with ACKing frequency resulting from the use of options
which I suspect to be fixed already as well (as part of the T/TCP work).
Obtained from: Brakmo & Peterson, ``Performance Problems in BSD4.4 TCP''
between ignoring options specified in the setsockopt call if IP_HDRINCL is set
(the UCB choice when VJ's code was brought in) vs allowing them (what everyone
else did, and what is assumed by programs everywhere...sigh).
Also perform some checking of the passed down packet to avoid running off
the end of a mbuf chain.
Reviewed by: fenner
Make a copy of the header of a packet that gets queued due to
lack of forwarding cache entry, so that nobody else can step
on it. Thanks to Mike Karels <karels@bsdi.com> for pointing
this one out.
Close the ip-fragment hole.
Waste less memory.
Rewrite to contemporary more readable style.
Kill separate IPACCT facility, use "accept" rules in IPFIREWALL.
Filter incoming >and< outgoing packets.
Replace "policy" by sticky "deny all" rule.
Rules have numbers used for ordering and deletion.
Remove "rerorder" code entirely.
Count packet & bytecount matches for rules.
Code in -current & -stable is now the same.
systems (my last change did not mix well with some firewall
configurations). As much as I dislike firewalls, this is one thing I
I was not prepared to break by default.. :-)
Allow the user to nominate one of three ranges of port numbers as
candidates for selecting a local address to replace a zero port number.
The ranges are selected via a setsockopt(s, IPPROTO_IP, IP_PORTRANGE, &arg)
call. The three ranges are: default, high (to bypass firewalls) and
low (to get a port below 1024).
The default and high port ranges are sysctl settable under sysctl
net.inet.ip.portrange.*
This code also fixes a potential deadlock if the system accidently ran out
of local port addresses. It'd drop into an infinite while loop.
The secure port selection (for root) should reduce overheads and increase
reliability of rlogin/rlogind/rsh/rshd if they are modified to take
advantage of it.
Partly suggested by: pst
Reviewed by: wollman
when a connection enters the ESTBLS state using T/TCP, then window
scaling wasn't properly handled. The fix is twofold.
1) When the 3WHS completes, make sure that we update our window
scaling state variables.
2) When setting the `virtual advertized window', then make sure
that we do not try to offer a window that is larger than the maximum
window without scaling (TCP_MAXWIN).
Reviewed by: davidg
Reported by: Jerry Chen <chen@Ipsilon.COM>
to 20000 through 30000. These numbers are used for local IP port numbers
when an explicit address is not specified.
The values are sysctl modifiable under: net.inet.ip.port_{first|last}_auto
These numbers do not overlap with any known server addresses, without going
above 32768 which are "negative" on some other implementations.
20000 through 30000 is 2.5 times larger than the old range, but some have
suggested even that may not be enough... (gasp!) Setting a low address
of 10000 should be plenty.. :-)
local address, that was assigned with ifconfig alias and netmask
0xffffffff, would receive duplictae udp packets.
This behaviour can easily be seen by having named run, and using the alias
address as the name server.
This solution is not the pretiest one, but after talk with Garreth, it
is seen as the most easy one.
to enable IP forwarding, use sysctl(8). Also did the same for IPX,
which involved inventing a completely new MIB from whole cloth (which
I may not quite have correct); be aware of this if you use IPX forwarding.
(The two should never have been controlled by the same option anyway.)
than separate ip_v and ip_hl members. Should have no effect on current code,
but I'd eventually like to get rid of those obnoxious bitfields completely.
others: start to populate the link-layer branch of the net mib, by
moving ARP to its proper place. (ARP is not a protocol family, it's an
interface layer between a medium-access layer and a protocol family.)
sysctl(8) needs to be taught about the structure of this branch, unless
Poul-Henning implements dynamic MIB exploration soon.
*' instead of caddr_t and it isn't optional (it never was). Most of the
netipx (and netns) pr_ctlinput functions abuse the second arg instead of
using the third arg but fixing this is beyond the scope of this round
of changes.
Add five sysctl variables that you should probably never tweak.
net.arp.t_prune: 300
net.arp.t_keep: 1200
net.arp.t_down: 20
net.arp.maxtries: 5
net.arp.useloopback: 1
net.arp.proxyall: 0
(It's net.arp because arp isn't limited to inet, though our present
implementation surely is).
Removed ifnet.if_init and ifnet.if_reset as they are generally unused.
Change the parameter passed to if_watchdog to be a ifnet * rather than
a unit number. All of this is an attempt to move toward not needing an
array of softc pointers (which is usually static in size) to point to
the driver softc.
if_ed.c:
Changed some of the argument passing to some functions to make a little
more sense.
if_ep.c, if_vx.c:
Killed completely bogus use of if_timer. It was being set in such a way
that the interface was being reset once per second (blech!).
- remove a redundant condition;
- complete all validity checks on segment before calling
soisconnected(so).
Reviewed by: Richard Stevens, davidg, wollman
have to decide whether to send a CC or CCnew option in our SYN segment
depending on the contents of our TAO cache. This decision has to be
made once when the connection starts. The earlier code delayed this
decision until the segment was assembled in tcp_output() and
retransmitted SYN segments could have different CC options.
Reviewed by: Richard Stevens, davidg, wollman
net.inet.ip.intr-queue-maxlen (=== ipintrq.ifq_maxlen)
and net.inet.ip.intr-queue-drops (=== ipintrq.ifq_drops)
There should probably be a standard way of getting the same information
going the other way.
in the FIN_WAIT_2 state in order to prevent the conn. hanging there
forever.
Reviewed by: davidg, olah
Submitted by: Arne Henrik Juul <arnej@imf.unit.no>
Obtained from: bugs@netbsd.org
Submitted by: Mike Mitchell, supervisor@alb.asctmd.com
This is a bulk mport of Mike's IPX/SPX protocol stacks and all the
related gunf that goes with it..
it is not guaranteed to work 100% correctly at this time
but as we had several people trying to work on it
I figured it would be better to get it checked in so
they could all get teh same thing to work on..
Mikes been using it for a year or so
but on 2.0
more changes and stuff will be merged in from other developers now that this is in.
Mike Mitchell, Network Engineer
AMTECH Systems Corporation, Technology and Manufacturing
8600 Jefferson Street, Albuquerque, New Mexico 87113 (505) 856-8000
supervisor@alb.asctmd.com
a few new wrinkles for MTU discovery which tcp_output() had better
be prepared to handle. ip_output() is also modified to do something
helpful in this case, since it has already calculated the information
we need.
capacity of the link, even if the route's MTU indicates that we cannot
send that much in their direction. (This might actually make it possible
to test Path MTU discovery in a useful variety of cases.)
turned out not to be necessary; simply watching for MTU decreases (which
we already did) automagically eliminates all the cases we were trying to
protect against.
middle of a fully-open window. Also, keep track of how many retransmits
we do as a result of MTU discovery. This may actually do more work than
necessary, but it's an unusual condition...
Suggested by: Janey Hoe <janey@lcs.mit.edu>
we're at it, eliminate obsolete exposure of `struct llinfo_arp' to
the world. (This dates back to when ARP entries were not stored in
the routing table, and there was no other way for the `arp' program
to read the whole table than to grovel around in /dev/kmem.)
to be no ill effects, and so far as Iknow none of the variables in
question depend on 16-bit wraparound behavior. (The sizes are in
many cases relics from when a PCB had to fit inside a 128-byte mbuf. PCBs
are no longer stored in that way, and the old structure would not have
fit, either.)
matching IP options..Check and test this - i made only a couple
of rough tests and this could be buggy.. Ipaccounting can't use
IP Options (and i don't see any need to cound packets with specific
options either..)
More to come...
time ago. I left in Garrett's one, because his was in the 4.4-Lite-2
location, making any diffs just that little bit smaller.
I presume this choice means that netstat needs to be recompiled before
"netstat -s" will give a meaningful answer on tcp stats.
and gated on `options MTUDISC' in the source. It is also practically
untested becausse (sniff!) I don't have easy access to a network with
an MTU of less than an Ethernet. If you have a small MTU network,
please try it and tell me if it works!
to be sent, just clean up and return ENOBUFS rather than silently
proceeding without sending any of the data. This makes it consistent
with the `#ifdef notyet' case immediately above.
Reviewed by: Andras Olah <olah@freebsd.org>
Obtained from: Lite-2
Garrett,
Here are some patches for the rate limiting code. It should be faster,
and in particular it doesn't leak malloc'd memory any more when rate_limit'ing
a phyint.
It now uses an mbuf chain at each vif, instead of the static queue array.
This means that the MAXQSIZE is now variable per vif (although there is no
interface to change it other than a debugger); this is an area for more
experimentation.
Bill
Submitted by: Bill Fenner <fenner@parc.xerox.com>
case, multicast options are not passed to ip_mforward().) The previous
version had a wrong test, thus causing RSVP mrouters to forward RSVP messages
in violation of the spec.
or ssthresh that we were able to use
tcp_var.h - declare tcpstat entries for above; declare tcp_{send,recv}space
in_rmx.c - fill in the MTU and pipe sizes with the defaults TCP would have
used anyway in the absence of values here
incorrect indents, a variety of poor coding practices such as comparing
pointers to constants ('0'), poor code structuring, etc, etc. This brings
the code up to the minimum standards for inclusion in FreeBSD.
2) Rewrote "bad_packet" code to be less buggy and more readable.
3) Removed a pile of goto's; the code is now somewhat less reminiscent
of a certain Italian pasta.
4) Changed all boolean returns of "0" and "1" to FALSE/TRUE.
know better when to cache values in the route, rather than relying on a
heuristic involving sequence numbers that broke when tcp_sendspace
was increased to 16k.