kernel module. LibAlias is not aware about checksum offloading,
so the caller should provide checksum calculation. (The only
current consumer is ng_nat(4)). When TCP packet internals has
been changed and it requires checksum recalculation, a cookie
is set in th_x2 field of TCP packet, to inform caller that it
needs to recalculate checksum. This ugly hack would be removed
when LibAlias is made more kernel friendly.
Incremental checksum updates are left as is, since they don't
conflict with offloading.
Approved by: re (scottl)
a DLT_NULL interface. In particular:
1) Consistently use type u_int32_t for the header of a
DLT_NULL device - it continues to represent the address
family as always.
2) In the DLT_NULL case get bpf_movein to store the u_int32_t
in a sockaddr rather than in the mbuf, to be consistent
with all the DLT types.
3) Consequently fix a bug in bpf_movein/bpfwrite which
only permitted packets up to 4 bytes less than the MTU
to be written.
4) Fix all DLT_NULL devices to have the code required to
allow writing to their bpf devices.
5) Move the code to allow writing to if_lo from if_simloop
to looutput, because it only applies to DLT_NULL devices
but was being applied to other devices that use if_simloop
possibly incorrectly.
PR: 82157
Submitted by: Matthew Luckie <mjl@luckie.org.nz>
Approved by: re (scottl)
to the statement in ip_mroute.h, as well as being the same as what
OpenBSD has done with this file. It matches the copyright in NetBSD's
1.1 through 1.14 versions of the file as well, which they subsequently
added back.
It appears to have been lost in the 4.4-lite1 import for FreeBSD 2.0,
but where and why I've not investigated further. OpenBSD had the same
problem. NetBSD had a copyright notice until Multicast 3.5 was
integrated verbatim back in 1995. This appears to be the version that
made it into 4.4-lite1.
Approved by: re (scottl)
MFC after: 3 days
- do not use static memory as we are under a shared lock only
- properly rtfree routes allocated with rtalloc
- rename to verify_path6()
- implement the full functionality of the IPv4 version
Also make O_ANTISPOOF work with IPv6.
Reviewed by: gnn
Approved by: re (blanket)
on an IPv4 packet as these variables are uninitialized if not. This used to
allow arbitrary IPv6 packets depending on the value in the uninitialized
variables.
Some opcodes (most noteably O_REJECT) do not support IPv6 at all right now.
Reviewed by: brooks, glebius
Security: IPFW might pass IPv6 packets depending on stack contents.
Approved by: re (blanket)
struct ifnet or the layer 2 common structure it was embedded in have
been replaced with a struct ifnet pointer to be filled by a call to the
new function, if_alloc(). The layer 2 common structure is also allocated
via if_alloc() based on the interface type. It is hung off the new
struct ifnet member, if_l2com.
This change removes the size of these structures from the kernel ABI and
will allow us to better manage them as interfaces come and go.
Other changes of note:
- Struct arpcom is no longer referenced in normal interface code.
Instead the Ethernet address is accessed via the IFP2ENADDR() macro.
To enforce this ac_enaddr has been renamed to _ac_enaddr.
- The second argument to ether_ifattach is now always the mac address
from driver private storage rather than sometimes being ac_enaddr.
Reviewed by: sobomax, sam
do the subsequent ip_output() in IPFW. In ipfw_tick(), the keep-alive
packets must be generated from the data that resides under the
stateful lock, but they must not be sent at that time, as this would
cause a lock order reversal with the normal ordering (interface's
lock, then locks belonging to the pfil hooks).
In practice, this caused deadlocks when using IPFW and if_bridge(4)
together to do stateful transparent filtering.
MFC after: 1 week
the tail (in tcp_sack_option()). The bug was caused by incorrect
accounting of the retransmitted bytes in the sackhint.
Reported by: Kris Kennaway.
Submitted by: Noritoshi Demizu.
policy. It may be used to provide more detailed classification of
traffic without actually having to decide its fate at the time of
classification.
MFC after: 1 week
- Walks the scoreboard backwards from the tail to reduce the number of
comparisons for each sack option received.
- Introduce functions to add/remove sack scoreboard elements, making
the code more readable.
Submitted by: Noritoshi Demizu
Reviewed by: Raja Mukerji, Mohan Srinivasan
This is the last requirement before we can retire ip6fw.
Reviewed by: dwhite, brooks(earlier version)
Submitted by: dwhite (manpage)
Silence from: -ipfw
if_ioctl routine. This should fix a number of code paths through
soo_ioctl() that could call into Giant-locked network drivers without
first acquiring Giant.
Assert tcbinfo lock in tcp_close() due to its call to in{,6}_detach()
Assert tcbinfo lock in tcp_drop_syn_sent() due to its call to tcp_drop()
MFC after: 7 days
that if we sort the incoming SACK blocks, we can update the scoreboard
in one pass of the scoreboard. The added overhead of sorting upto 4
sack blocks is much lower than traversing (potentially) large
scoreboards multiple times. The code was updating the scoreboard with
multiple passes over it (once for each sack option). The rewrite fixes
that, reducing the complexity of the main loop from O(n^2) to O(n).
Submitted by: Mohan Srinivasan, Noritoshi Demizu.
Reviewed by: Raja Mukerji.
real next hole to retransmit from the scoreboard, caused by a bug
which did not update the "nexthole" hint in one case in
tcp_sack_option().
Reported by: Daniel Eriksson
Submitted by: Mohan Srinivasan
or to compute the total retransmitted bytes in this sack recovery
episode, the scoreboard is traversed. While in sack recovery, this
traversal occurs on every call to tcp_output(), every dupack and
every partial ack. The scoreboard could potentially get quite large,
making this traversal expensive.
This change optimizes this by storing hints (for the next hole to
retransmit and the total retransmitted bytes in this sack recovery
episode) reducing the complexity to find these values from O(n) to
constant time.
The debug code that sanity checks the hints against the computed
value will be removed eventually.
Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.
1. Copy a NULL-terminated string into a fixed-length buffer, and
2. copyout that buffer to userland,
we really ought to
0. Zero the entire buffer
first.
Security: FreeBSD-SA-05:08.kmem
- kernel module declarations and handler.
- macros to map malloc(3) calls to malloc(9) ones.
- malloc(9) declarations.
- call finishoff() from module handler MOD_UNLOAD case
instead of atexit(3).
- use panic(9) instead of abort(3)
- take time from time_second instead of gettimeofday(2)
- define INADDR_NONE
look up the packet size of the packet that generated the
response, step down the MTU by one step through ip_next_mtu()
and try again.
Suggested by: dwmalone
needed only for implicit connect cases. Under load, especially on SMP,
this can greatly reduce contention on the tcbinfo lock.
NB: Ambiguities about the state of so_pcb need to be resolved so that
all use of the tcbinfo lock in non-implicit connection cases can be
eliminated.
Submited by: Kazuaki Oda <kaakun at highway dot ne dot jp>
fields of an ICMP packet.
Use this to allow ipfw to pullup only these values since it does not use
the rest of the packet and it was failed on ICMP packets because they
were not long enough.
struct icmp should probably be modified to use these at some point, but
that will break a fair bit of code so it can wait for another day.
On the off chance that adding this struct breaks something in ports,
bump __FreeBSD_version.
Reported by: Randy Bush <randy at psg dot com>
Tested by: Randy Bush <randy at psg dot com>
If TCP Signatures are enabled, the maximum allowed sack blocks aren't
going to fit. The fix is to compute how many sack blocks fit and tack
these on last. Also on SYNs, defer padding until after the SACK
PERMITTED option has been added.
Found by: Mohan Srinivasan.
Submitted by: Mohan Srinivasan, Noritoshi Demizu.
Reviewed by: Raja Mukerji.
code readability and facilitates some anticipated optimizations in
tcp_sack_option().
- Remove tcp_print_holes() and TCP_SACK_DEBUG.
Submitted by: Raja Mukerji.
Reviewed by: Mohan Srinivasan, Noritoshi Demizu.
- If the peer sends the Signature option in the SYN, use of Timestamps
and Window Scaling were disabled (even if the peer supports them).
- The sender must not disable signatures if the option is absent in
the received SYN. (See comment in syncache_add()).
Found, Submitted by: Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>.
Reviewed by: Mohan Srinivasan <mohans at yahoo-inc dot com>.
tcp_ctlinput() and subject it to active tcpcb and sequence
number checking. Previously any ICMP unreachable/needfrag
message would cause an update to the TCP hostcache. Now only
ICMP PMTU messages belonging to an active TCP session with
the correct src/dst/port and sequence number will update the
hostcache and complete the path MTU discovery process.
Note that we don't entirely implement the recommended counter
measures of Section 7.2 of the paper. However we close down
the possible degradation vector from trivially easy to really
complex and resource intensive. In addition we have limited
the smallest acceptable MTU with net.inet.tcp.minmss sysctl
for some time already, further reducing the effect of any
degradation due to an attack.
Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.2
MFC after: 3 days
ineffective, depreciated and can be abused to degrade the performance
of active TCP sessions if spoofed.
Replace a bogus call to tcp_quench() in tcp_output() with the direct
equivalent tcpcb variable assignment.
Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
MFC after: 3 days
IPv6 support. The header in IPv6 is more complex then in IPv4 so we
want to handle skipping over it in one location.
Submitted by: Mariano Tortoriello and Raffaele De Lorenzo (via luigi)
in flight in SACK recovery.
Found by: Noritoshi Demizu
Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com>
Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>
Raja Mukerji <raja at moselle dot com>
setting ts_recent to an arbitrary value, stopping further
communication between the two hosts.
- If the Echoed Timestamp is greater than the current time,
fall back to the non RFC 1323 RTT calculation.
Submitted by: Raja Mukerji (raja at moselle dot com)
Reviewed by: Noritoshi Demizu, Mohan Srinivasan
a reassembly queue state structure, don't update (receiver) sack
report.
- Similarly, if tcp_drain() is called, freeing up all items on the
reassembly queue, clean the sack report.
Found, Submitted by: Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com),
Raja Mukerji (raja at moselle dot com).
(Fix for kern/78226).
Submitted by : Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by : Mohan Srinivasan (mohans at yahoo-inc dot com),
Raja Mukerji (raja at moselle dot com).
try to reasseble the packet from the fragments queue with the only
fragment, finish with the first fragment as soon as we create a queue.
Spotted by: Vijay Singh
o Drop the fragment if maxfragsperpacket == 0, no chances we
will be able to reassemble the packet in future.
Reviewed by: silby
ip.portrange.last and there is the only port for that because:
a) it is not wise; b) it leads to a panic in the random ip port
allocation code. In general we need to disable ip port allocation
randomization if the last - first delta is ridiculous small.
PR: kern/79342
Spotted by: Anjali Kulkarni
Glanced at by: silby
MFC after: 2 weeks
we have a non-NULL args.rule. If the same packet later is subject to "tee"
rule, its original is sent again into ipfw_chk() and it reenters at the same
rule. This leads to infinite loop and frozen router.
Assign args.rule to NULL, any time we are going to send packet back to
ipfw_chk() after a tee rule. This is a temporary workaround, which we
will leave for RELENG_5. In HEAD we are going to make divert(4) save
next rule the same way as dummynet(4) does.
PR: kern/79546
Submitted by: Oleg Bulyzhin
Reviewed by: maxim, andre
MFC after: 3 days
libalias.
In /usr/src/lib/libalias/alias.c, the functions LibAliasIn and
LibAliasOutTry call the legacy PacketAliasIn/PacketAliasOut instead
of LibAliasIn/LibAliasOut when the PKT_ALIAS_REVERSE option is set.
In this case, the context variable "la" gets lost because the legacy
compatibility routines expect "la" to be global. This was obviously
an oversight when rewriting the PacketAlias* functions to the
LibAlias* functions.
The fix (as shown in the patch below) is to remove the legacy
subroutine calls and replace with the new ones using the "la" struct
as the first arg.
Submitted by: Gil Kloepfer <fgil@kloepfer.org>
Confirmed by: <nicolai@catpipe.net>
PR: 76839
MFC after: 3 days
carp_carpdev_state_locked() is called every time carp interface is attached.
The first call backs up flags of the first interface, and the second
call backs up them again, erasing correct values.
To solve this, a carp_sc_state_locked() function is introduced. It is
called when interface is attached to parent, instead of calling
carp_carpdev_state_locked. carp_carpdev_state_locked() calls
carp_sc_state_locked() for each sc in chain.
Reported by: Yuriy N. Shkandybin, sem
returns error. In this case mbuf has already been freed. [1]
- Remove redundant declaration.
PR: kern/78893 [1]
Submitted by: Liang Yi [1]
Reviewed by: sam
MFC after: 1 day
Add two another workarounds for carp(4) interfaces:
- do not add connected route when address is assigned to carp(4) interface
- do not add connected route when other interface goes down
Embrace workarounds with #ifdef DEV_CARP
per-connection and globally. This eliminates potential DoS attacks
where SACK scoreboard elements tie up too much memory.
Submitted by: Raja Mukerji (raja at moselle dot com).
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com).
a libalias application (e.g. natd, ppp, etc.) to crash. Note: Skinny support
is not enabled in natd or ppp by default.
Approved by: secteam (nectar)
MFC after: 1 day
Secuiryt: This fixes a remote DoS exploit
attached to a parent interface we use its mutex to lock the softc. This
means that in several places like carp_ioctl() we lock softc conditionaly.
This should be redesigned.
To avoid LORs when MII announces us a link state change, we schedule
a quick callout and call carp_carpdev_state_locked() from it.
Initialize callouts using NET_CALLOUT_MPSAFE.
Sponsored by: Rambler
Reviewed by: mlaier
- In carp_send_ad_all() walk through list of all carp interfaces
instead of walking through list of all interfaces.
Sponsored by: Rambler
Reviewed by: mlaier
- Use our loop DLT type, not OpenBSD. [1]
- The fields that are converted to network byte order are not 32-bit
fields but 16-bit fields, so htons should be used in htonl. [1]
- Secondly, ip_input changes ip->ip_len into its value without
the ip-header length. So, restore the length to make bpf happy. [1]
- Use bpf_mtap2(), use temporary af1, since bpf_mtap2 doesn't
understand uint8_t af identifier.
Submitted by: Frank Volf [1]
ignore the sack options in that segment. Else we'd end up
corrupting the scoreboard.
Found by: Raja Mukerji (raja at moselle dot com)
Submitted by: Mohan Srinivasan
- Simplify CARP_LOG() and making it working (we don't have addlog in FreeBSD).
- Introduce CARP_DEBUG() which logs with LOG_DEBUG severity when
net.inet.carp.log > 1
- Use CARP_DEBUG to log state changes of carp interfaces.
After CARP_LOG() cleanup it appeared that carp_input_c() does not need sc
argument. Remove it.
Sponsored by: Rambler
mastering on all other interfaces:
- call carp_carpdev_state() on initialize instead of just setting to INIT
- in carp_carpdev_state() check that interface is UP, instead of checking
that it is not DOWN, because a rebooted machine may have interface in
UNKNOWN state.
Sponsored by: Rambler
Obtained from: OpenBSD (partially)
with the kernel compile time option:
options IPFIREWALL_FORWARD_EXTENDED
This option has to be specified in addition to IPFIRWALL_FORWARD.
With this option even packets targeted for an IP address local
to the host can be redirected. All restrictions to ensure proper
behaviour for locally generated packets are turned off. Firewall
rules have to be carefully crafted to make sure that things like
PMTU discovery do not break.
Document the two kernel options.
PR: kern/71910
PR: kern/73129
MFC after: 1 week
so that parent interface is not left in promiscous mode after carp
interface is destroyed.
This is not perfect, since promisc counter is added when carp
interface is assigned an IP address. However, when address is removed
parent interface is still in promiscuous mode. Only removal of
carp interface removes promisc from parent. Same way in OpenBSD.
Sponsored by: Rambler
hosts to share an IP address, providing high availability and load
balancing.
Original work on CARP done by Michael Shalayeff, with many
additions by Marco Pfatschbacher and Ryan McBride.
FreeBSD port done solely by Max Laier.
Patch by: mlaier
Obtained from: OpenBSD (mickey, mcbride)
address is not supplied, then jail IP is choosed and in_pcbbind() is called.
Since udp_output() does not save local addr after call to in_pcbconnect_setup(),
in_pcbbind() is called for each packet, and this is incorrect.
So, we shall treat jailed sockets specially in udp_output(), we will save
their local address.
This fixes a long standing bug with broken sendto() system call in jails.
PR: kern/26506
Reviewed by: rwatson
MFC after: 2 weeks
loopback interface. Nobody have explained me sense of this check.
It breaks connect() system call to a destination address which is
loopback routed (e.g. blackholed).
Reviewed by: silence on net@
MFC after: 2 weeks
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.
This change does the following:
- Pushes the socket state transition from the socket layer solisten() to
to socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().
- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket
layer.
This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely require
elsewhere in the socket/protocol code.
Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn
reported to the sender - in the case where the sender sends data
outside the window (as WinXP does :().
Reported by: Sam Jensen <sam at wand dot net dot nz>
Submitted by: Mohan Srinivasan
Remove the SACK "initburst" sysctl.
- Fix bugs in SACK dupack and partialack handling that can cause
large bursts while in SACK recovery.
Submitted by: Mohan Srinivasan
o Use SYSCTL_IN() macro instead of direct call of copyin(9).
Submitted by: ume
o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where
most of tcp sysctls live.
o There are net.inet[6].tcp[6].getcred sysctls already, no needs in
a separate struct tcp_ident_mapping.
Suggested by: ume
utility:
The tcpdrop command drops the TCP connection specified by the
local address laddr, port lport and the foreign address faddr,
port fport.
Obtained from: OpenBSD
Reviewed by: rwatson (locking), ru (man page), -current
MFC after: 1 month
a UMA zone instead. This should eliminate a bit of the locking
overhead associated with with malloc and reduce the memory
consumption associated with each new state.
Reviewed by: rwatson, andre
Silence on: ipfw@
MFC after: 1 week
second; since the default hz has changed to 1000 times a second,
this resulted in unecessary work being performed.
MFC after: 2 weeks
Discussed with: phk, cperciva
General head nod: silby
- ip_fw_chk() returns action as function return value. Field retval is
removed from args structure. Action is not flag any more. It is one
of integer constants.
- Any action-specific cookies are returned either in new "cookie" field
in args structure (dummynet, future netgraph glue), or in mbuf tag
attached to packet (divert, tee, some future action).
o Convert parsing of return value from ip_fw_chk() in ipfw_check_{in,out}()
to a switch structure, so that the functions are more readable, and a future
actions can be added with less modifications.
Approved by: andre
MFC after: 2 months
of len in tcp_output(), in the case where the FIN has already been
transmitted. The mis-computation of len is because of a gcc
optimization issue, which this change works around.
Submitted by: Mohan Srinivasan
that the RFC 793 specification for accepting RST packets should be
following. When followed, this makes one vulnerable to the attacks
described in "slipping in the window", but it may be necessary in
some odd circumstances.
connection rates, which is causing problems for some users.
To retain the security advantage of random ports and ensure
correct operation for high connection rate users, disable
port randomization during periods of high connection rates.
Whenever the connection rate exceeds randomcps (10 by default),
randomization will be disabled for randomtime (45 by default)
seconds. These thresholds may be tuned via sysctl.
Many thanks to Igor Sysoev, who proved the necessity of this
change and tested many preliminary versions of the patch.
MFC After: 20 seconds
cases for tcp_input():
While it is true that the pcbinfo lock provides a pseudo-reference to
inpcbs, both the inpcb and pcbinfo locks are required to free an
un-referenced inpcb. As such, we can release the pcbinfo lock as
long as the inpcb remains locked with the confidence that it will not
be garbage-collected. This leads to a less conservative locking
strategy that should reduce contention on the TCP pcbinfo lock.
Discussed with: sam
multiple MIB entries using sysctl in short order, which might
result in unexpected values for tcp_maxidle being generated by
tcp_slowtimo. In practice, this will not happen, or at least,
doesn't require an explicit comment.
MFC after: 2 weeks
Andre:
First lets get major new features into the kernel in a clean and nice way,
and then start optimizing. In this case we don't have any obfusication that
makes later profiling and/or optimizing difficult in any way.
Requested by: csjp, sam
mechanism used by pfil. This shared locking mechanism will remove
a nasty lock order reversal which occurs when ucred based rules
are used which results in hard locks while mpsafenet=1.
So this removes the debug.mpsafenet=0 requirement when using
ucred based rules with IPFW.
It should be noted that this locking mechanism does not guarantee
fairness between read and write locks, and that it will favor
firewall chain readers over writers. This seemed acceptable since
write operations to firewall chains protected by this lock tend to
be less frequent than reads.
Reviewed by: andre, rwatson
Tested by: myself, seanc
Silence on: ipfw@
MFC after: 1 month
tcpip_fillheaders()
tcp_discardcb()
tcp_close()
tcp_notify()
tcp_new_isn()
tcp_xmit_bandwidth_limit()
Fix a locking comment in tcp_twstart(): the pcbinfo will be locked (and
is asserted).
MFC after: 2 weeks
inp->inp_moptions pointer, so that ip_getmoptions() can perform
necessary locking when doing non-atomic reads.
Lock the inpcb by default to copy any data to local variables, then
unlock before performing sooptcopyout().
MFC after: 2 weeks
modifications to the inpcb IP options mbuf:
- Lock the inpcb before passing it into ip_pcbopts() in order to prevent
simulatenous reads and read-modify-writes that could result in races.
- Pass the inpcb reference into ip_pcbopts() instead of the option chain
pointer in the inpcb.
- Assert the inpcb lock in ip_pcbots.
- Convert one or two uses of a pointer as a boolean or an integer
comparison to a comparison with NULL for readability.
pointer updates: test available space while holding the socket buffer
mutex, and continue to hold until until the pointer update has been
performed.
MFC after: 2 weeks
This socket option allows processes query a TCP socket for some low
level transmission details, such as the current send, bandwidth, and
congestion windows. Linux provides a 'struct tcpinfo' structure
containing various variables, rather than separate socket options;
this makes the API somewhat fragile as it makes it dificult to add
new entries of interest as requirements and implementation evolve.
As such, I've included a large pad at the end of the structure.
Right now, relatively few of the Linux API fields are filled in, and
some contain no logical equivilent on FreeBSD. I've include __'d
entries in the structure to make it easier to figure ou what is and
isn't omitted. This API/ABI should be considered unstable for the
time being.
window was 0 bytes in size. This may have been the cause of unsolved
"connection not closing" reports over the years.
Thanks to Michiel Boland for providing the fix and providing a concise
test program for the problem.
Submitted by: Michiel Boland
MFC after: 2 weeks
contents of the tcpcb are read and modified in volume.
In tcp_input(), replace th comparison with 0 with a comparison with
NULL.
At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in
tcp_input(), assert 'headlocked'. Try to improve consistency between
various assertions regarding headlocked to be more informative.
MFC after: 2 weeks
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_2msl_rest(), which require it. Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.
In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest()
is called, which requires that lock.
In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.
Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).
In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.
In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.
In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.
MFC after: 2 weeks
but unlikely races that could be corrected by having tcp_keepcnt
and tcp_keepintvl modifications go through handler functions via
sysctl, but probably is not worth doing. Updates to multiple
sysctls within evaluation of a single addition are unlikely.
Annotate that tcp_canceltimers() is currently unused.
De-spl tcp_timer_delack().
De-spl tcp_timer_2msl().
MFC after: 2 weeks
on the tcpcb, but also calls into tcp_close() and tcp_twrespond().
Annotate that tcp_twrecycleable() requires the inpcb lock because it does
a series of non-atomic reads of the tcpcb, but is currently called
without the inpcb lock by the caller. This is a bug.
Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write
of the timewait structure/inpcb, and calls in_pcbdetach() which requires
the lock.
Assert the inpcb lock in tcp_twrespond(), as it performs multiple
non-atomic reads of the tcptw and inpcb structures, as well as calling
mac_create_mbuf_from_inpcb(), tcpip_fillheaders(), which require the
inpcb lock.
MFC after: 2 weeks
protects access to the ISN state variables.
Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize
timer-driven isn bumping.
Staticize internal ISN variables since they're not used outside of
tcp_subr.c.
MFC after: 2 weeks
from divert sockets.
- Remove div_disconnect() method, since it shouldn't be called now.
- Remove div_abort() method. It was never called directly, since protocol
doesn't have listen queue. It was called only from div_disconnect(),
which is removed now.
Reviewed by: rwatson, maxim
Approved by: julian (mentor)
MT5 after: 1 week
MT4 after: 1 month
after allowing more than one address with the same prefix.
Reported by: Vladimir Grebenschikov <vova NO fbsd SPAM ru>
Submitted by: ru (also NetBSD rev. 1.83)
Pointyhat to: mlaier
and has been broken twice:
- in the beginning of div_output() replace KASSERT with assignment, as
it was in rev. 1.83. [1] [to be MFCed]
- refactor changes introduced in rev. 1.100: do not prepend a new tag
unconditionally. Before doing this check whether we have one. [2]
A small note for all hacking in this area:
when divert socket is not a real userland, but ng_ksocket(4), we receive
_the same_ mbufs, that we transmitted to socket. These mbufs have rcvif,
the tags we've put on them. And we should treat them correctly.
Discussed with: mlaier [1]
Silence from: green [2]
Reviewed by: maxim
Approved by: julian (mentor)
MFC after: 1 week
This makes it possible to have more than one address with the same prefix.
The first address added is used for the route. On deletion of an address
with IFA_ROUTE set, we try to find a "fallback" address and hand over the
route if possible.
I plan to MFC this in 4 weeks, hence I keep the - now obsolete - argument to
in_ifscrub as it must be considered KAPI as it is not static in in.c. I will
clean this after the MFC.
Discussed on: arch, net
Tested by: many testers of the CARP patches
Nits from: ru, Andrea Campi <andrea+freebsd_arch webcom it>
Obtained from: WIDE via OpenBSD
MFC after: 1 month
retain the pcbinfo lock until we're done using a pcb in the in-bound
path, as the pcbinfo lock acts as a pseuo-reference to prevent the pcb
from potentially being recycled. Clean up assertions and make sure to
assert that the pcbinfo is locked at the head of code subsections where
it is needed. Free the mbuf at the end of tcp_input after releasing
any held locks to reduce the time the locks are held.
MFC after: 3 weeks
destined for a blackhole route.
This also means that blackhole routes do not need to be bound to lo(4)
or disc(4) interfaces for the net.inet.ip.fastforwarding=1 case.
Submitted by: james at towardex dot com
Sponsored by: eXtensible Open Router Project <URL:http://www.xorp.org/>
MFC after: 3 weeks
udp_in6, and udp_ip6 to pass socket address state between udp_input(),
udp_append(), and soappendaddr_locked(). While file in the default
configuration, when running with multiple netisrs or direct ithread
dispatch, this can result in races wherein user processes using
recvmsg() get back the wrong source IP/port. To correct this and
related races:
- Eliminate udp_ip6, which is believed to be generated but then never
used. Eliminate ip_2_ip6_hdr() as it is now unneeded.
- Eliminate setting, testing, and existence of 'init' status fields
for the IPv6 structures. While with multiple UDP delivery this
could lead to amortization of IPv4 -> IPv6 conversion when
delivering an IPv4 UDP packet to an IPv6 socket, it added
substantial complexity and side effects.
- Move global structures into the stack, declaring udp_in in
udp_input(), and udp_in6 in udp_append() to be used if a conversion
is required. Pass &udp_in into udp_append().
- Re-annotate comments to reflect updates.
With this change, UDP appears to operate correctly in the presence of
substantial inbound processing parallelism. This solution avoids
introducing additional synchronization, but does increase the
potential stack depth.
Discovered by: kris (Bug Magnet)
MFC after: 3 weeks
A complete rationale and discussion is given in this message
and the resulting discussion:
http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706
Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.
Discussed on: -arch
under high load: only set function state to loop and continuing sending
if there is no data left to send.
RELENG_5_3 candidate.
Feet provided: Peter Losher <Peter underscore Losher at isc dot org>
Diagnosed by: Aniel Hartmeier <daniel at benzedrine dot cx>
Submitted by: mohan <mohans at yahoo-inc dot com>
protocols: it is possible for sockets to be created and attached
to the divert protocol between the test for sockets present and
successful unload of the registration handler. We will need to
explore more mature APIs for unregistering the protocol and then
draining consumers, or an atomic test-and-unregister mechanism.
of protocols. The call to divert_packet() is done through a function pointer. All
semantics of IPDIVERT remain intact. If IPDIVERT is not loaded ipfw will refuse to
install divert rules and natd will complain about 'protocol not supported'. Once
it is loaded both will work and accept rules and open the divert socket. The module
can only be unloaded if no divert sockets are open. It does not close any divert
sockets when an unload is requested but will return EBUSY instead.
protocols in inetsw[] and define initially eight spacer slots.
Remove conflicting declaration 'struct pr_usrreqs nousrreqs'. It is
now declared and initialized in kern/uipc_domain.c.
With pr_proto_register() it has become possible to dynamically load protocols
within the PF_INET domain. However the PF_INET domain has a second important
structure called ip_protox[] that is derived from the 'struct protosw inetsw[]'
and takes care of the de-multiplexing of the various protocols that ride on
top of IP packets.
The functions ipproto_[un]register() allow to dynamically adjust the ip_protox[]
array mux in a consistent and easy way. To register a protocol within
ip_protox[] the existence of a corresponding and matching protocol definition
in inetsw[] is required. The function does not allow to overwrite an already
registered protocol. The unregister function simply replaces the mux slot with
the default index pointer to IPPROTO_RAW as it was previously.
(sorele()/sotryfree()):
- This permits the caller to acquire the accept mutex before the socket
mutex, avoiding sofree() having to drop the socket mutex and re-order,
which could lead to races permitting more than one thread to enter
sofree() after a socket is ready to be free'd.
- This also covers clearing of the so_pcb weak socket reference from
the protocol to the socket, preventing races in clearing and
evaluation of the reference such that sofree() might be called more
than once on the same socket.
This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket. The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets. The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.
RELENG_5_3 candidate.
MFC after: 3 days
Reviewed by: dwhite
Discussed with: gnn, dwhite, green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>
in udp_input(), since the udbinfo lock is used to prevent removal of
the inpcb while in use (i.e., as a form of reference count) in the
in-bound path.
RELENG_5 candidate.
it isn't printed if the IP address in question is '0.0.0.0', which is
used by nodes performing DHCP lookup, and so constitute a false
positive as a report of misconfiguration.
processes in jail could create raw sockets, additional access control
checks were added to raw IP sockets to limit the ways in which those
sockets could be used. Specifically, only the socket option IP_HDRINCL
was permitted in rip_ctloutput(). Other socket options were protected
by a call to suser(). This change was required to prevent processes
in a Jail from modifying system properties such as multicast routing
and firewall rule sets.
However, it also introduced a regression: processes that create a raw
socket with root privilege, but then downgraded credential (i.e., a
daemon giving up root, or a setuid process switching back to the real
uid) could no longer issue other unprivileged generic IP socket option
operations, such as IP_TOS, IP_TTL, and the multicast group membership
options, which prevented multicast routing daemons (and some other
tools) from operating correctly.
This change pushes the access control decision down to the granularity
of individual socket options, rather than all socket options, on raw
IP sockets. When rip_ctloutput() doesn't implement an option, it will
now pass the request directly to in_control() without an access
control check. This should restore the functionality of the generic
IP socket options for raw sockets in the above-described scenarios,
which may be confirmed with the ipsockopt regression test.
RELENG_5 candidate.
Reviewed by: csjp
reaching into the socket buffer. This prevents a number of potential
races, including dereferencing of sb_mb while unlocked leading to
a NULL pointer deref (how I found it). Potentially this might also
explain other "odd" TCP behavior on SMP boxes (although haven't
seen it reported).
RELENG_5 candidate.
callouts as non-CALLOUT_MPSAFE. Otherwise, they may trigger an
assertion regarding Giant if they enter other parts of the stack from
the callout.
MFC after: 3 days
Reported by: Dikshie < dikshie at ppk dot itb dot ac dot id >
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.
Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>
passing along socket information. This is required to work around a LOR with
the socket code which results in an easy reproducible hard lockup with
debug.mpsafenet=1. This commit does *not* fix the LOR, but enables us to do
so later. The missing piece is to turn the filter locking into a leaf lock
and will follow in a seperate (later) commit.
This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in
forseeable future.
Suggested by: rwatson
A lot of work by: csjp (he'd be even more helpful w/o mentor-reviews ;)
Reviewed by: rwatson, csjp
Tested by: -pf, -ipfw, LINT, csjp and myself
MFC after: 3 days
LOR IDs: 14 - 17 (not fixed yet)
When net.inet.ip.check_interface was MFCed to RELENG_4 3+ years ago in
rev. 1.130.2.17 ip_input.c it was 1 by default but shortly changed to
0 (accidently?) in rev. 1.130.2.20 in RELENG_4 only. Among with the
fact this knob is not documented it breaks POLA especially in bridge
environment.
OK'ed by: andre
Reviewed by: -current
family to the ip_protox[] array. The protocol number of IPPROTO_DIVERT is
larger than IPPROTO_MAX and was initializing memory beyond the array.
Catch all these kinds of errors by ignoring protocols that are higher than
IPPROTO_MAX or 0 (zero).
Add more comments ip_init().
it travels through the IP stack. This wasn't much of a problem because IP
source routing is disabled by default but when enabled together with SMP and
preemption it would have very likely cross-corrupted the IP options in transit.
The IP source route options of a packet are now stored in a mtag instead of the
global variable.
filters). After the ipfw to pfil move ip_input() expects M_FASTFWD_OURS
tagged packets to have ip_len and ip_off in host byte order instead of
network byte order.
PR: kern/71652
Submitted by: mlaier (patch)
and sent to the DIVERT socket while the original packet continues with the
next rule. Unlike a normally diverted packet no IP reassembly attemts are
made on tee'd packets and they are passed upwards totally unmodified.
Note: This will not be MFC'd to 4.x because of major infrastucture changes.
PR: kern/64240 (and many others collapsed into that one)
its users.
netisr_queue() now returns (0) on success and ERRNO on failure. At the
moment ENXIO (netisr queue not functional) and ENOBUFS (netisr queue full)
are supported.
Previously it would return (1) on success but the return value of IF_HANDOFF()
was interpreted wrongly and (0) was actually returned on success. Due to this
schednetisr() was never called to kick the scheduling of the isr. However this
was masked by other normal packets coming through netisr_dispatch() causing the
dequeueing of waiting packets.
PR: kern/70988
Found by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after: 3 days
to point to a local IP address; and the packet was sourced from this host
we fill in the m_pkthdr.rcvif with a pointer to the loopback interface.
Before the function ifunit("lo0") was used to obtain the ifp. However
this is sub-optimal from a performance point of view and might be dangerous
if the loopback interface has been renamed. Use the global variable 'loif'
instead which always points to the loopback interface.
Submitted by: brooks
compile option. All FreeBSD packet filters now use the PFIL_HOOKS API and
thus it becomes a standard part of the network stack.
If no hooks are connected the entire packet filter hooks section and related
activities are jumped over. This removes any performance impact if no hooks
are active.
Both OpenBSD and DragonFlyBSD have integrated PFIL_HOOKS permanently as well.
On a system with huge number of pipes, M_NOWAIT failes almost always,
because of memory fragmentation.
My fix is different than the patch proposed by Pawel Malachowski,
because in FreeBSD 5.x we cannot sleep while holding dummynet mutex
(in 4.x there is no such lock).
My fix is also ugly, but there is no easy way to prepare nice and clean fix.
PR: kern/46557
Submitted by: Eugene Grosbein <eugen@grosbein.pp.ru>
Reviewed by: mlaier
Previously the early drop was disabled unconditionally for ALTQ-enabled
kernels.
This should give some benefit for the normal gateway + LAN-server case with
a busy LAN leg and an ALTQ managed uplink.
Reviewed and style help from: cperciva, pjd
as m_len, or the pkthdr length will be inconsistent with the actual
length of data in the mbuf chain. The symptom of this occuring was
"out of data" warnings from in_cksum_skip() on large UDP packets sent
via the loopback interface.
Foot shot: green
security.jail.allow_raw_sockets sysctl MIB is set to 1) where privileged
access to jails is given out, it is possible for prison root to manipulate
various network parameters which effect the host environment. This commit
plugs a number of security holes associated with the use of raw sockets
and prisons.
This commit makes the following changes:
- Add a comment to rtioctl warning developers that if they add
any ioctl commands, they should use super-user checks where necessary,
as it is possible for PRISON root to make it this far in execution.
- Add super-user checks for the execution of the SIOCGETVIFCNT
and SIOCGETSGCNT IP multicast ioctl commands.
- Add a super-user check to rip_ctloutput(). If the calling cred
is PRISON root, make sure the socket option name is IP_HDRINCL,
otherwise deny the request.
Although this patch corrects a number of security problems associated
with raw sockets and prisons, the warning in jail(8) should still
apply, and by default we should keep the default value of
security.jail.allow_raw_sockets MIB to 0 (or disabled) until
we are certain that we have tracked down all the problems.
Looking forward, we will probably want to eliminate the
references to curthread.
This may be a MFC candidate for RELENG_5.
Reviewed by: rwatson
Approved by: bmilekic (mentor)
UDP/IP header, make sure that space is also allocated for the link
layer header. If an mbuf must be allocated to hold the UDP/IP header
(very likely), then this will avoid an additional mbuf allocation at
the link layer. This trick is also used by TCP and other protocols to
avoid extra calls to the mbuf allocator in the ethernet (and related)
output routines.
the ipfw KLD.
For IPFIREWALL_FORWARD this does not have any side effects. If the module
has it but not the kernel it just doesn't do anything.
For IPDIVERT the KLD will be unloadable if the kernel doesn't have IPDIVERT
compiled in too. However this is the least disturbing behaviour. The user
can just recompile either module or the kernel to match the other one. The
access to the machine is not denied if ipfw refuses to load.
This provides greater context for the locking and allows us to avoid
locking the pcbinfo structure if not binding operations will take
place (i.e., already bound, connected, and no expliti sendto()
address).
and preserves the ipfw ABI. The ipfw core packet inspection and filtering
functions have not been changed, only how ipfw is invoked is different.
However there are many changes how ipfw is and its add-on's are handled:
In general ipfw is now called through the PFIL_HOOKS and most associated
magic, that was in ip_input() or ip_output() previously, is now done in
ipfw_check_[in|out]() in the ipfw PFIL handler.
IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to
be diverted is checked if it is fragmented, if yes, ip_reass() gets in for
reassembly. If not, or all fragments arrived and the packet is complete,
divert_packet is called directly. For 'tee' no reassembly attempt is made
and a copy of the packet is sent to the divert socket unmodified. The
original packet continues its way through ip_input/output().
ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet
with the new destination sockaddr_in. A check if the new destination is a
local IP address is made and the m_flags are set appropriately. ip_input()
and ip_output() have some more work to do here. For ip_input() the m_flags
are checked and a packet for us is directly sent to the 'ours' section for
further processing. Destination changes on the input path are only tagged
and the 'srcrt' flag to ip_forward() is set to disable destination checks
and ICMP replies at this stage. The tag is going to be handled on output.
ip_output() again checks for m_flags and the 'ours' tag. If found, the
packet will be dropped back to the IP netisr where it is going to be picked
up by ip_input() again and the directly sent to the 'ours' section. When
only the destination changes, the route's 'dst' is overwritten with the
new destination from the forward m_tag. Then it jumps back at the route
lookup again and skips the firewall check because it has been marked with
M_SKIP_FIREWALL. ipfw 'forward' has to be compiled into the kernel with
'option IPFIREWALL_FORWARD' to enable it.
DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for
a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will
then inject it back into ip_input/ip_output() after it has served its time.
Dummynet packets are tagged and will continue from the next rule when they
hit the ipfw PFIL handlers again after re-injection.
BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as
they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS.
More detailed changes to the code:
conf/files
Add netinet/ip_fw_pfil.c.
conf/options
Add IPFIREWALL_FORWARD option.
modules/ipfw/Makefile
Add ip_fw_pfil.c.
net/bridge.c
Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw
is still directly invoked to handle layer2 headers and packets would
get a double ipfw when run through PFIL_HOOKS as well.
netinet/ip_divert.c
Removed divert_clone() function. It is no longer used.
netinet/ip_dummynet.[ch]
Neither the route 'ro' nor the destination 'dst' need to be stored
while in dummynet transit. Structure members and associated macros
are removed.
netinet/ip_fastfwd.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code.
netinet/ip_fw.h
Removed 'ro' and 'dst' from struct ip_fw_args.
netinet/ip_fw2.c
(Re)moved some global variables and the module handling.
netinet/ip_fw_pfil.c
New file containing the ipfw PFIL handlers and module initialization.
netinet/ip_input.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code. ip_forward() does not longer require
the 'next_hop' struct sockaddr_in argument. Disable early checks
if 'srcrt' is set.
netinet/ip_output.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code.
netinet/ip_var.h
Add ip_reass() as general function. (Used from ipfw PFIL handlers
for IPDIVERT.)
netinet/raw_ip.c
Directly check if ipfw and dummynet control pointers are active.
netinet/tcp_input.c
Rework the 'ipfw forward' to local code to work with the new way of
forward tags.
netinet/tcp_sack.c
Remove include 'opt_ipfw.h' which is not needed here.
sys/mbuf.h
Remove m_claim_next() macro which was exclusively for ipfw 'forward'
and is no longer needed.
Approved by: re (scottl)
- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs
This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.
Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>
have already done this, so I have styled the patch on their work:
1) introduce a ip_newid() static inline function that checks
the sysctl and then decides if it should return a sequential
or random IP ID.
2) named the sysctl net.inet.ip.random_id
3) IPv6 flow IDs and fragment IDs are now always random.
Flow IDs and frag IDs are significantly less common in the
IPv6 world (ie. rarely generated per-packet), so there should
be smaller performance concerns.
The sysctl defaults to 0 (sequential IP IDs).
Reviewed by: andre, silby, mlaier, ume
Based on: NetBSD
MFC after: 2 months
Since the only thing truly unique about a prison is it's ID, I figured
this would be the most granular way of handling this.
This commit makes the following changes:
- Adds tokenizing and parsing for the ``jail'' command line option
to the ipfw(8) userspace utility.
- Append the ipfw opcode list with O_JAIL.
- While Iam here, add a comment informing others that if they
want to add additional opcodes, they should append them to the end
of the list to avoid ABI breakage.
- Add ``fw_prid'' to the ipfw ucred cache structure.
- When initializing ucred cache, if the process is jailed,
set fw_prid to the prison ID, otherwise set it to -1.
- Update man page to reflect these changes.
This change was a strong motivator behind the ucred caching
mechanism in ipfw.
A sample usage of this new functionality could be:
ipfw add count ip from any to any jail 2
It should be noted that because ucred based constraints
are only implemented for TCP and UDP packets, the same
applies for jail associations.
Conceptual head nod by: pjd
Reviewed by: rwatson
Approved by: bmilekic (mentor)
The first one was going to 'dropfrag', which unlocks the IPQ, before the lock
was aquired; The second one doing a unlock and then a 'goto dropfrag' which
led to a double-unlock.
Tripped over by: des
for structures with timers in them. It might be that a timer might fire
even when the associated structure has already been free'd. Having type-
stable storage in this case is beneficial for graceful failure handling and
debugging.
Discussed with: bosko, tegge, rwatson
For incoming packets, the packet's source address is checked if it
belongs to a directly connected network. If the network is directly
connected, then the interface the packet came on in is compared to
the interface the network is connected to. When incoming interface
and directly connected interface are not the same, the packet does
not match.
Usage example:
ipfw add deny ip from any to any not antispoof in
Manpage education by: ru
structures, allowing in6_pcbnotify() to lock the pcbinfo and each
inpcb that it notifies of ICMPv6 events. This prevents inpcb
assertions from firing when IPv6 generates and delievers event
notifications for inpcbs.
Reported by: kuriyama
Tested by: kuriyama
or multicast packet, we don't need to acquire the inpcb mutex
unless we are actually using inpcb fields other than the bound port
and address. Since we hold the pcbinfo lock already, these can't
change. Defer acquiring the inpcb mutex until we have a high
chance of a match. This avoids about 120 mutex operations per UDP
broadcast packet received on one of my work systems.
Reviewed by: sam
lock assertions even if IPv6 is compiled into the kernel. Previously,
inclusion of IPv6 and locking assertions would result in a rapid
assertion failure as IPv6 was not properly locking inpcbs.
functions. Basically, the ip_next() function was used to get the PPTP and
Skinny headers when tcp_next() should have been used instead. Symptoms of
this included a segfault in natd when trying to process a PPTP or Skinny
packet.
Approved by: des
make it fully self-contained.
o ip_reass() now returns a new mbuf with the reassembled packet and ip->ip_len
including the IP header.
o Computation of the delayed checksum is moved into divert_packet().
Reviewed by: silby
Alice is too lazy to write a server application in PF-independent
manner. Therefore she knocks up the server using PF_INET6 only
and allows the IPv6 socket to accept mapped IPv4 as well. An evil
hacker known on IRC as cheshire_cat has an account in the same
system. He starts a process listening on the same port as used
by Alice's server, but in PF_INET. As a consequence, cheshire_cat
will distract all IPv4 traffic supposed to go to Alice's server.
Such sort of port theft was initially enabled by copying the code that
implemented the RFC 2553 semantics on IPv4/6 sockets (see inet6(4)) for
the implied case of the same owner for both connections. After this
change, the above scenario will be impossible. In the same setting,
the user who attempts to start his server last will get EADDRINUSE.
Of course, using IPv4 mapped to IPv6 leads to security complications
in the first place, but there is no reason to make it even more unsafe.
This change doesn't apply to KAME since it affects a FreeBSD-specific
part of the code. It doesn't modify the out-of-box behaviour of the
TCP/IP stack either as long as mapping IPv4 to IPv6 is off by default.
MFC after: 1 month
with the FIN bit set for all segments, if a FIN has already been sent before.
The fix will allow the FIN bit to be set for only the last segment, in case
it has to be retransmitted.
Fix another bug that would have caused snd_nxt to be pulled by len if
there was an error from ip_output. snd_nxt should not be touched
during sack retransmissions.
when inpcb is NULL, this is no longer invalid since jlemon added the
tcp_twstart function... this prevents close "failing" w/ EINVAL when it
really was successful...
Reviewed by: jeremy (NetBSD)
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.
The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)
Discussed with: rwatson, scottl
Requested by: jhb
RTF_BLACKHOLE as well.
To quote the submitter:
The uRPF loose-check implementation by the industry vendors, at least on Cisco
and possibly Juniper, will fail the check if the route of the source address
is pointed to Null0 (on Juniper, discard or reject route). What this means is,
even if uRPF Loose-check finds the route, if the route is pointed to blackhole,
uRPF loose-check must fail. This allows people to utilize uRPF loose-check mode
as a pseudo-packet-firewall without using any manual filtering configuration --
one can simply inject a IGP or BGP prefix with next-hop set to a static route
that directs to null/discard facility. This results in uRPF Loose-check failing
on all packets with source addresses that are within the range of the nullroute.
Submitted by: James Jun <james@towardex.com>
1) data to be sent to the right of snd_recover.
2) send more data then whats in the send buffer.
The fix is to postpone sack retransmit to a subsequent recovery episode
if the current retransmit pointer is beyond snd_recover.
Thanks to Mohan Srinivasan for helping fix the bug.
Submitted by:Daniel Lang
for the SYN|ACK packet and then letting in6_pcbconnect set the
flowlabel later. Arange for the syncache/syncookie code to set and
recall the flow label so that the flowlabel used for the SYN|ACK
is consistent. This is done by using some of the cookie (when tcp
cookies are enabeled) and by stashing the flowlabel in syncache.
Tested and Discovered by: Orla McGann <orly@cnri.dit.ie>
Approved by: ume, silby
MFC after: 1 month
icmp_error() packets. While here retire PACKET_TAG_PF_GENERATED (which
served the same purpose) and use M_SKIP_FIREWALL in pf as well. This should
speed up things a bit as we get rid of the tag allocations.
Discussed with: juli
using M_PROTO6 and possibly shooting someone's foot, as well as allowing the
firewall to be used in multiple passes, or with a packet classifier frontend,
that may need to explicitly allow a certain packet. Presently this is handled
in the ipfw_chk code as before, though I have run with it moved to upper
layers, and possibly it should apply to ipfilter and pf as well, though this
has not been investigated.
Discussed with: luigi, rwatson
for unknown events.
A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".
bootp -> BOOTP
bootp.nfsroot -> BOOTP_NFSROOT
bootp.nfsv3 -> BOOTP_NFSV3
bootp.compat -> BOOTP_COMPAT
bootp.wired_to -> BOOTP_WIRED_TO
- i.e. back out the previous commit. It's already possible to
pxeboot(8) with a GENERIC kernel.
Pointed out by: dwmalone
BOOTP -> bootp
BOOTP_NFSROOT -> bootp.nfsroot
BOOTP_NFSV3 -> bootp.nfsv3
BOOTP_COMPAT -> bootp.compat
BOOTP_WIRED_TO -> bootp.wired_to
This lets you PXE boot with a GENERIC kernel by putting this sort of thing
in loader.conf:
bootp="YES"
bootp.nfsroot="YES"
bootp.nfsv3="YES"
bootp.wired_to="bge1"
or even setting the variables manually from the OK prompt.
{ip,udp,tcp} header and return a void * pointing to the payload (i.e. the
first byte past the end of the header and any required padding). Use them
consistently throughout libalias to a) reduce code duplication, b) improve
code legibility, c) get rid of a bunch of alignment warnings.
a short pointer. The previous implementation seems to be in a gray zone
of the C standard, and GCC generates incorrect code for it at -O2 or
higher on some platforms.
named link, foo_link or link_foo to lnk, foo_lnk or lnk_foo, fixing
signed / unsigned comparisons, and shoving unused function arguments
under the carpet.
I was hoping WARNS?=6 might reveal more serious problems, and perhaps
the source of the -O2 breakage, but found no smoking gun.
Fix this problem by separating out the SACK and the newreno cases. Also, check
if we are in FASTRECOVERY for the sack case and if so, turn off dupacks.
Fix an issue where the congestion window was not being incremented by ssthresh.
Thanks to Mohan Srinivasan for finding this problem.
associated with performing a wakeup on the socket buffer:
- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
acquire the socket buffer lock and use the _locked() variants of both
calls. Note that the _locked() sowakeup() versions unlock the mutex on
return. This is done in uipc_send(), divert_packet(), mroute
socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().
- When the socket buffer lock is dropped before a sowakeup(), remove the
explicit unlock and use the _locked() sowakeup() variant. This is done
in soisdisconnecting(), soisdisconnected() when setting the can't send/
receive flags and dropping data, and in uipc_rcvd() which adjusting
back-pressure on the sockets.
For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduce mutex costs.
locking in tcp_input() for TCP packets with urgent data pointers to
hold the socket buffer lock across testing and updating oobmark
from just protecting sb_state.
Update socket locking annotations
Giant if debug.mpsafenet=0, as any points that require synchronization
in the SMPng world also required it in the Giant-world:
- inpcb locks (including IPv6)
- inpcbinfo locks (including IPv6)
- dummynet subsystem lock
- ipfw2 subsystem lock
the socket buffer having its limits adjusted. sbreserve() now acquires
the lock before calling sbreserve_locked(). In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order. In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.
originated on RELENG_4 and was ported to -CURRENT.
The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.
You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.
Reviewed by: gnn
Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
ip_ctloutput(), as it may need to perform blocking memory allocations.
This also improves consistency with locking relative to other points
that call into ip_ctloutput().
Bumped into by: Grover Lines <grover@ceribus.net>
was received on a broadcast address on the input path. Under certain
circumstances this could result in a panic, notably for locally-generated
packets which do not have m_pkthdr.rcvif set.
This is a similar situation to that which is solved by
src/sys/netinet/ip_icmp.c rev 1.66.
PR: kern/52935
encapsulated within an IPv6 datagram, do not abuse the 'ipov' pointer
when registering trace records. 'ipov' is specific to IPv4, and
will therefore be uninitialized.
[This fandango is only necessary in the first place because of our
host-byte-order IP field pessimization.]
PR: kern/60856
Submitted by: Galois Zheng
unless the segment really contains the last of the data for the stream.
PR: kern/34619
Obtained from: OpenBSD (tcp_output.c rev 1.47)
Noticed by: Joseph Ishac
Reviewed by: George Neville-Neil
Version 3.5 brings:
- Atomic commits of ruleset changes (reduce the chance of ending up in an
inconsistent state).
- A 30% reduction in the size of state table entries.
- Source-tracking (limit number of clients and states per client).
- Sticky-address (the flexibility of round-robin with the benefits of
source-hash).
- Significant improvements to interface handling.
- and many more ...
- Remove pflog and pfsync modules. Things will change in such a fashion
that there will be one module with pf+pflog that can be loaded into
GENERIC without problems (which is what most people want). pfsync is no
longer possible as a module.
- Add multicast address for in-kernel multicast pfsync protocol. Protocol
glue will follow once the import is done.
- Add one more mbuf tag
do not pick up the first local ip address for the source
ip address, return ENETUNREACH instead.
Submitted by: Gleb Smirnoff
Reviewed by: -current (silence)
mode tunnel, take the per-route MTU into account, *if* and *only if* it
is non-zero (as found in struct rt_metrics/rt_metrics_lite).
PR: kern/42727
Obtained from: NetBSD (ip_input.c rev 1.151)
fixes the problem of UDP sockets getting wedged in a connected state (and
bound to their destination) under heavy load.
Temporary bind/connect should probably be deleted in future
as an optimization, as described in "A Faster UDP" [Partridge/Pink 1993].
Notes:
- INP_LOCK() is already held in udp_output(). The connection is in effect
happening at a layer lower than the socket layer, therefore in theory
socket locking should not be needed.
- Inlining the in_pcbdisconnect() operation buys us nothing (in the case
of the current state of the code), as laddr is not part of the
inpcb hash or the udbinfo hash. Therefore there should be no need
to rehash after restoring laddr in the error case (this was a
concern of the original author of the patch).
PR: kern/41765
Requested by: gnn
Submitted by: Jinmei Tatuya (with cleanups)
Tested by: spray(8)
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:
SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)
Rename respectively to:
SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)
This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
your (network) modules as well as any userland that might make sense of
sizeof(struct ifnet).
This does not change the queueing yet. These changes will follow in a
seperate commit. Same with the driver changes, which need case by case
evaluation.
__FreeBSD_version bump will follow.
Tested-by: (i386)LINT
conform to the rfc2734 and rfc3146 standard for IP over firewire and
should eventually supercede the fwe driver. Right now the broadcast
channel number is hardwired and we don't support MCAP for multicast
channel allocation - more infrastructure is required in the firewire
code itself to fix these problems.
SOCK_LOCK(so):
- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.
- Assert socket lock in MAC entry point implementations.
- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.
reference count:
- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().
- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().
- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.
- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.
- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.
Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
ruleset, the pcb is looked up once per ipfw_chk() activation.
This is done by extracting the required information out of the PCB
and caching it to the ipfw_chk() stack. This should greatly reduce
PCB looking contention and speed up the processing of UID/GID based
firewall rules (especially with large UID/GID rulesets).
Some very basic benchmarks were taken which compares the number
of in_pcblookup_hash(9) activations to the number of firewall
rules containing UID/GID based contraints before and after this patch.
The results can be viewed here:
o http://people.freebsd.org/~csjp/ip_fw_pcb.png
Reviewed by: andre, luigi, rwatson
Approved by: bmilekic (mentor)
versions of various routers seen:
- Introduce igmp_mtx.
- Protect global variable 'router_info_head' and list fields
in struct router_info with this mutex, as well as
igmp_timers_are_running.
- find_rti() asserts that the caller acquires igmp_mtx.
- Annotate a failure to check the return value of
MALLOC(..., M_NOWAIT).
that m_prepend() is not called with possibility to wait while the
pcb lock is held. What still needs revisiting is whether the
ripcbinfo lock is really required here.
Discussed with: rwatson
process is a non-prison root. The security.jail.allow_raw_sockets
sysctl variable is disabled by default, however if the user enables
raw sockets in prisons, prison-root should not be able to interact
with firewall rule sets.
Approved by: rwatson, bmilekic (mentor)
unless it's in the closed or listening state (remote address
== INADDR_ANY).
If a TCP inpcb is in any other state, it's impossible to steal
its local port or use it for port theft. And if there are
both closed/listening and connected TCP inpcbs on the same
localIP:port couple, the call to in_pcblookup_local() will
find the former due to the design of that function.
No objections raised in: -net, -arch
MFC after: 1 month
of IP options.
net.inet.ip.process_options=0 Ignore IP options and pass packets unmodified.
net.inet.ip.process_options=1 Process all IP options (default).
net.inet.ip.process_options=2 Reject all packets with IP options with ICMP
filter prohibited message.
This sysctl affects packets destined for the local host as well as those
only transiting through the host (routing).
IP options do not have any legitimate purpose anymore and are only used
to circumvent firewalls or to exploit certain behaviours or bugs in TCP/IP
stacks.
Reviewed by: sam (mentor)
labeling new mbufs created from sockets/inpcbs in IPv4. This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed. In
particular:
- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().
While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research
o New function ip_findroute() to reduce code duplication for the
route lookup cases. (luigi)
o Store ip_len in host byte order on the stack instead of using
it via indirection from the mbuf. This allows to defer the host
byte conversion to a later point and makes a quicker fallback to
normal ip_input() processing. (luigi)
o Check if route is dampned with RTF_REJECT flag and drop packet
already here when ARP is unable to resolve destination address.
An ICMP unreachable is sent to inform the sender.
o Check if interface output queue is full and drop packet already
here. No ICMP notification is sent because signalling source quench
is depreciated.
o Check if media_state is down (used for ethernet type interfaces)
and drop the packet already here. An ICMP unreachable is sent to
inform the sender.
o Do not account sent packets to the interface address counters. They
are only for packets with that 'ia' as source address.
o Update and clarify some comments.
Submitted by: luigi (most of it)
uncommitted):
Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h
alongside all of the other macros that work ok mbuf's and tag's.
jail, which is less restrictive but allows for more flexible
jail usage (for those who are willing to make the sacrifice).
The default is off, but allowing raw sockets within jails can
now be accomplished by tuning security.jail.allow_raw_sockets
to 1.
Turning this on will allow you to use things like ping(8)
or traceroute(8) from within a jail.
The patch being committed is not identical to the patch
in the PR. The committed version is more friendly to
APIs which pjd is working on, so it should integrate
into his work quite nicely. This change has also been
presented and addressed on the freebsd-hackers mailing
list.
Submitted by: Christian S.J. Peron <maneo@bsdpro.com>
PR: kern/65800
possible while maintaining compatibility with the widest range of TCP stacks.
The algorithm is as follows:
---
For connections in the ESTABLISHED state, only resets with
sequence numbers exactly matching last_ack_sent will cause a reset,
all other segments will be silently dropped.
For connections in all other states, a reset anywhere in the window
will cause the connection to be reset. All other segments will be
silently dropped.
---
The necessity of accepting all in-window resets was discovered
by jayanth and jlemon, both of whom have seen TCP stacks that
will respond to FIN-ACK packets with resets not meeting the
strict last_ack_sent check.
Idea by: Darren Reed
Reviewed by: truckman, jlemon, others(?)
1. rt_check() cleanup:
rt_check() is only necessary for some address families to gain access
to the corresponding arp entry, so call it only in/near the *resolve()
routines where it is actually used -- at the moment this is
arpresolve(), nd6_storelladdr() (the call is embedded here),
and atmresolve() (the call is just before atmresolve to reduce
the number of changes).
This change will make it a lot easier to decouple the arp table
from the routing table.
There is an extra call to rt_check() in if_iso88025subr.c to
determine the routing info length. I have left it alone for
the time being.
The interface of arpresolve() and nd6_storelladdr() now changes slightly:
+ the 'rtentry' parameter (really a hint from the upper level layer)
is now passed unchanged from *_output(), so it becomes the route
to the final destination and not to the gateway.
+ the routines will return 0 if resolution is possible, non-zero
otherwise.
+ arpresolve() returns EWOULDBLOCK in case the mbuf is being held
waiting for an arp reply -- in this case the error code is masked
in the caller so the upper layer protocol will not see a failure.
2. arpcom untangling
Where possible, use 'struct ifnet' instead of 'struct arpcom' variables,
and use the IFP2AC macro to access arpcom fields.
This mostly affects the netatalk code.
=== Detailed changes: ===
net/if_arcsubr.c
rt_check() cleanup, remove a useless variable
net/if_atmsubr.c
rt_check() cleanup
net/if_ethersubr.c
rt_check() cleanup, arpcom untangling
net/if_fddisubr.c
rt_check() cleanup, arpcom untangling
net/if_iso88025subr.c
rt_check() cleanup
netatalk/aarp.c
arpcom untangling, remove a block of duplicated code
netatalk/at_extern.h
arpcom untangling
netinet/if_ether.c
rt_check() cleanup (change arpresolve)
netinet6/nd6.c
rt_check() cleanup (change nd6_storelladdr)
from tcp_hostcache would have overridden a (now) lower MTU of
an interface or route that changed since first PMTU discovery.
The bug would have caused TCP to redo the PMTU discovery when
not strictly necessary.
Make a comment about already pre-initialized default values
more clear.
Reviewed by: sam
source address of a packet exists in the routing table. The
default route is ignored because it would match everything and
render the check pointless.
This option is very useful for routers with a complete view of
the Internet (BGP) in the routing table to reject packets with
spoofed or unrouteable source addresses.
Example:
ipfw add 1000 deny ip from any to any not versrcreach
also known in Cisco-speak as:
ip verify unicast source reachable-via any
Reviewed by: luigi
implementation taken directly from OpenBSD.
I've resisted committing this for quite some time because of concern over
TIME_WAIT recycling breakage (sequential allocation ensures that there is a
long time before ports are recycled), but recent testing has shown me that
my fears were unwarranted.
TIME_WAIT recycling cases I was able to generate with http testing tools.
In short, as the old algorithm relied on ticks to create the time offset
component of an ISN, two connections with the exact same host, port pair
that were generated between timer ticks would have the exact same sequence
number. As a result, the second connection would fail to pass the TIME_WAIT
check on the server side, and the SYN would never be acknowledged.
I've "fixed" this by adding random positive increments to the time component
between clock ticks so that ISNs will *always* be increasing, no matter how
quickly the port is recycled.
Except in such contrived benchmarking situations, this problem should never
come up in normal usage... until networks get faster.
No MFC planned, 4.x is missing other optimizations that are needed to even
create the situation in which such quick port recycling will occur.
in favour of rtalloc_ign(), which is what would end up being called
anyways.
There are 25 more instances of rtalloc() in net*/ and
about 10 instances of rtalloc_ign()
we convert ip_len into a network byte order; in_delayed_cksum() still
expects it in host byte order.
The symtom was the ``in_cksum_skip: out of data by %d'' complaints
from the kernel.
To add to the previous commit log. These fixes make tcpdump(1) happy
by not complaining about UDP/TCP checksum being bad for looped back
IP multicast when multicast router is deactivated.
Reported by: Vsevolod Lobko
to implement this mistake.
Fixed some nearby style bugs (initialization in declaration, misformatting
of this initialization, missing blank line after the declaration, and
comparision of the non-boolean result of the initialization with 0 using
"!". In KNF, "!" is not even used to compare booleans with 0).
It was fixed by moving problemetic checks, as well as checks that
doesn't need locking before locks are acquired.
Submitted by: Ryan Sommers <ryans@gamersimpact.com>
In co-operation with: cperciva, maxim, mlaier, sam
Tested by: submitter (previous patch), me (current patch)
Reviewed by: cperciva, mlaier (previous patch), sam (current patch)
Approved by: sam
Dedicated to: enough!
+ struct ifnet: remove unused fields, move ipv6-related field close
to each other, add a pointer to l3<->l2 translation tables (arp,nd6,
etc.) for future use.
+ struct route: remove an unused field, move close to each
other some fields that might likely go away in the future
Previously, Giant would be grabbed at entry to the IP local delivery code
when debug.mpsafenet was set to true, as that implied Giant wouldn't be
grabbed in the driver path. Now, we will use this primitive to
conditionally grab Giant in the event the entire network stack isn't
running MPSAFE (debug.mpsafenet == 0).
Compute the payload checksum for a locally originated IP multicast where
God intended, in ip_mloopback(), rather than doing it in ip_output() and
only when multicast router is active. This is more correct as we do not
fool ip_input() that the packet has the correct payload checksum when in
fact it does not (when multicast router is inactive). This is also more
efficient if we don't join the multicast group we send to, thus allowing
the hardware to checksum the payload.
- Add gre_mtx to protect global softc list.
- Hold gre_mtx over various list operations (insert, delete).
- Centralize if_gre interface teardown in gre_destroy(), and call this
from modevent unload and gre_clone_destroy().
- Export gre_mtx to ip_gre.c, which walks the gre list to look up gre
interfaces during encapsulation. Add a wonking comment on how we need
some sort of drain/reference count mechanism to keep gre references
alive while in use and simultaneous destroy.
This commit does not lockdown softc data, which follows in a future
commit.
- Add encapmtx to protect ip_encap.c global variables (encapsulation
list).
- Unifdef #ifdef 0 pieces of encap_init() which was (and now really
is) basically a no-op.
- Lock encapmtx when walking encaptab, modifying it, comparing
entries, etc.
- Remove spl's.
Note that currently there's no facilite to make sure outstanding
use of encapsulation methods on a table entry have drained bfore
we allow a table entry to be removed. As such, it's currently the
caller's responsibility to make sure that draining takes place.
Reviewed by: mlaier
ifp is now passed explicitly to ether_demux; no need to look it up again.
Make mtag a global var in ip_input.
Noticed by: rwatson
Approved by: bms(mentor)
to NET_UNLOCK_GIANT(). While they are used in similar ways, the
semantics are quite different -- NET_LOCK_GIANT() and NET_UNLOCK_GIANT()
directly wrap mutex lock and unlock operations, whereas drop/pickup
special case the handling of Giant recursion. Add a comment saying
as much.
Add NET_ASSERT_GIANT(), which conditionally asserts Giant based
on the value of debug_mpsafenet.
This enables pf to track dynamic address changes on interfaces (dailup) with
the "on (<ifname>)"-syntax. This also brings hooks in anticipation of
tracking cloned interfaces, which will be in future versions of pf.
Approved by: bms(mentor)
pf/pflog/pfsync as modules. Do not list them in NOTES or modules/Makefile
(i.e. do not connect it to any (automatic) builds - yet).
Approved by: bms(mentor)
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.
Enable the RLIMIT_MEMLOCK checking code in kern_mlock().
Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.
Nuke the vslock() and vsunlock() implementations, which are no
longer used.
Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.
Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.
Modify the callers of sysctl_wire_old_buffer() to look for the
error return.
Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.
Reviewed by: bms
increased <netinet/tcp_var>'s already large set of prerequisites, and
this was handled badly. Just don't declare the complete syncache struct
unless <netinet/pcb.h> is included before <netinet/tcp_var.h>.
Approved by: jlemon (years ago, for a more invasive fix)
amount of segments it will hold.
The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:
net.inet.tcp.reass.maxsegments (loader tuneable)
specifies the maximum number of segments all tcp reassemly queues can
hold (defaults to 1/16 of nmbclusters).
net.inet.tcp.reass.maxqlen
specifies the maximum number of segments any individual tcp session queue
can hold (defaults to 48).
net.inet.tcp.reass.cursegments (readonly)
counts the number of segments currently in all reassembly queues.
net.inet.tcp.reass.overflows (readonly)
counts how often either the global or local queue limit has been reached.
Tested by: bms, silby
Reviewed by: bms, silby
them mostly with packet tags (one case is handled by using an mbuf flag
since the linkage between "caller" and "callee" is direct and there's no
need to incur the overhead of a packet tag).
This is (mostly) work from: sam
Silence from: -arch
Approved by: bms(mentor), sam, rwatson
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
used for the ICMP reply source in reponse to packets which are not
directly addressed to us. By default continue with with normal
source selection.
Reviewed by: bms
at packet arrival.
For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead. Simultaneous
use of the two options is possible and they will return consistent
timestamps.
This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
ifconfig(8) flag since header for version 2 is the same but IP payload
is prepended with additional 4-bytes field.
Inspired by: Roman Synyuk <roman@univ.kiev.ua>
MFC after: 2 weeks
recwin and sendwin. This removes a big source of confusion and makes
following the code much easier.
Reviewed by: sam (mentor)
Obtained from: DragonFlyBSD rev 1.6 (hsu)
Makes it possible to have multiple packet aliasing instances in a
single process by moving all static and global variables into an
instance structure called "struct libalias".
Redefine a new API based on s/PacketAlias/LibAlias/g
Add new "instance" argument to all functions in the new API.
Implement old API in terms of the new API.
tcp6_usr_bind(), tcp_usr_connect(), and tcp6_usr_connect() before checking
to see whether the address is multicast so that the proper errno value
will be returned if sa_len is incorrect. The checks are identical to the
ones in in_pcbbind_setup(), in6_pcbbind(), and in6_pcbladdr(), which are
called after the multicast address check passes.
MFC after: 30 days
resource exhaustion attacks.
For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU. This is done
dynamically based on feedback from the remote host and network
components along the packet path. This information can be
abused to pretend an extremely low path MTU.
The resource exhaustion works in two ways:
o during tcp connection setup the advertized local MSS is
exchanged between the endpoints. The remote endpoint can
set this arbitrarily low (except for a minimum MTU of 64
octets enforced in the BSD code). When the local host is
sending data it is forced to send many small IP packets
instead of a large one.
For example instead of the normal TCP payload size of 1448
it forces TCP payload size of 12 (MTU 64) and thus we have
a 120 times increase in workload and packets. On fast links
this quickly saturates the local CPU and may also hit pps
processing limites of network components along the path.
This type of attack is particularly effective for servers
where the attacker can download large files (WWW and FTP).
We mitigate it by enforcing a minimum MTU settable by sysctl
net.inet.tcp.minmss defaulting to 256 octets.
o the local host is reveiving data on a TCP connection from
the remote host. The local host has no control over the
packet size the remote host is sending. The remote host
may chose to do what is described in the first attack and
send the data in packets with an TCP payload of at least
one byte. For each packet the tcp_input() function will
be entered, the packet is processed and a sowakeup() is
signalled to the connected process.
For example an attack with 2 Mbit/s gives 4716 packets per
second and the same amount of sowakeup()s to the process
(and context switches).
This type of attack is particularly effective for servers
where the attacker can upload large amounts of data.
Normally this is the case with WWW server where large POSTs
can be made.
We mitigate this by calculating the average MSS payload per
second. If it goes below 'net.inet.tcp.minmss' and the pps
rate is above 'net.inet.tcp.minmssoverload' defaulting to
1000 this particular TCP connection is resetted and dropped.
MITRE CVE: CAN-2004-0002
Reviewed by: sam (mentor)
MFC after: 1 day
restore the general pre-randomid behaviour.
Setting the ip_id to zero causes several problems with
packet reassembly when a device along the path removes
the DF bit for some reason.
Other BSD and Linux have found and fixed the same issues.
PR: kern/60889
Tested by: Richard Wendland <richard@wendland.org.uk>
Approved by: re (scottl)
rfc3042 Limited retransmit
rfc3390 Increasing TCP's initial congestion Window
inflight TCP inflight bandwidth limiting
All my production server have it enabled and there have been no
issues. I am confident about having them on by default and it gives
us better overall TCP performance.
Reviewed by: sam (mentor)
are acting as router (ipforwarding enabled).
This doesn't fix the problem that host routes from ICMP redirects
are never removed from the kernel routing table but removes the
problem for machines doing packet forwarding.
Reviewed by: sam (mentor)
if_gre.c rev.1.41-1.49
o Spell output with two ts.
o Remove assigned-to but not used variable.
o fix grammatical error in a diagnostic message.
o u_short -> u_int16_t.
o gi_len is ip_len, so it has to be network byteorder.
if_gre.h rev.1.11-1.13
o prototype must not have variable name.
o u_short -> u_int16_t.
o Spell address with two d's.
ip_gre.c rev.1.22-1.29
o KNF - return is not a function.
o The "osrc" variable in gre_mobile_input() is only ever set but not
referenced; remove it.
o correct (false) assumptions on mbuf chain. not sure if it really helps, but
anyways, it is necessary to perform m_pullup.
o correct arg to m_pullup (need to count IP header size as well).
o remove redundant adjustment of m->m_pkthdr.len.
o clear m_flags just for safety.
o tabify.
o u_short -> u_int16_t.
MFC after: 2 weeks
a new bpf_mtap2 routine that does the right thing for an mbuf
and a variable-length chunk of data that should be prepended.
o while we're sweeping the drivers, use u_int32_t uniformly when
when prepending the address family (several places were assuming
sizeof(int) was 4)
o return M_ASSERTVALID to BPF_MTAP* now that all stack-allocated
mbufs have been eliminated; this may better be moved to the bpf
routines
Reviewed by: arch@ and several others
otherwise they are initialized twice when the code is statically
configured in the kernel because the module load method gets
invoked before the user application calls ip_mrouter_init
o add a mutex to synchronize the module init/done operations; this
sort of was done using the value of ip_mroute but X_ip_mrouter_done
sets it to NULL very early on which can lead to a race against
ip_mrouter_init--using the additional mutex means this is safe now
o don't call ip_mrouter_reset from ip_mrouter_init; this now happens
once at module load and X_ip_mrouter_done does the appropriate
cleanup work to insure the data structures are in a consistent
state so that a subsequent init operation inherits good state
Reviewed by: juli
wait, rather than the socket label. This avoids reaching up to
the socket layer during connection close, which requires locking
changes. To do this, introduce MAC Framework entry point
mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond()
instead of calling mac_create_mbuf_from_socket() or
mac_create_mbuf_netlayer(). Introduce MAC Policy entry point
mpo_create_mbuf_from_inpcb(), and implementations for various
policies, which generally just copy label data from the inpcb to
the mbuf. Assert the inpcb lock in the entry point since we
require consistency for the inpcb label reference.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Before committing the initial tcp_hostcache I changed them from memcpy()
to conform with FreeBSD style without realizing the difference in argument
definition.
This fixes hostcache operation for IPv6 (in general and explicitly IPv6
path mtu discovery) and T/TCP (RFC1644).
Submitted by: Taku YAMAMOTO <taku@cent.saitama-u.ac.jp>
Approved by: re (rwatson)
code is compiled in to support the O_IPSEC operator. Previously no
support was included and ipsec rules were always matching. Note that
we do not return an error when an ipsec rule is added and the kernel
does not have IPsec support compiled in; this is done intentionally
but we may want to revisit this (document this in the man page).
PR: 58899
Submitted by: Bjoern A. Zeeb
Approved by: re (rwatson)
When the hostcache bucket limit is reached the last bucket wasn't
removed from the bucket row but inserted a few lines later at the
bucket row head again. This leads to infinite loop when the same
bucket row is accessed the next time for a lookup/insert or purge
action.
Tested by: imp, Matt Smith
Approved by: re (rwatson)
zeroed. Doing a bzero on the entire struct route is not more
expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst.
Reviewed by: sam (mentor)
Approved by: re (scottl)
for ipfw processing w/o an indication the packets were generated
by ipfw--and so should not be processed (this manifested itself
as a LOR.) The flag bit in the mbuf that was used to mark the
packets was not listed in M_COPYFLAGS so if a packet had a header
prepended (as done by IPsec) the flag was lost. Correct this by
defining a new M_PROTO6 flag and use it to mark packets that need
this processing.
Reviewed by: bms
Approved by: re (rwatson)
MFC after: 2 weeks
rtalloc_ign() in in_pcbconnect_setup() before it is filled out.
Otherwise, stack junk would be left in sin_zero, which could
cause host routes to be ignored because they failed the comparison
in rn_match().
This should fix the wrong source address selection for connect() to
127.0.0.1, among other things.
Reviewed by: sam
Approved by: re (rwatson)
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.
It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.
tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.
It removes significant locking requirements from the tcp stack with
regard to the routing table.
Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)
accordingly. The define is left intact for ABI compatibility
with userland.
This is a pre-step for the introduction of tcp_hostcache. The
network stack remains fully useable with this change.
Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.
For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.
Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
do not have mh_nextpkt initialized. Somtimes what's there is "1", and the
ip_input() code pukes trying to m_free() it, rendering divert sockets and
such broken.
This really underscores the need to get rid of MT_TAG.
Reviewed by: rwatson
and remove two unneccessary variable initializations.
Make the introduction comment more clear with regard which parts of
the packet are touched.
Requested by: luigi
complex locking and rework ip_rtaddr() to do its own rtlookup.
Adopt all its callers to this and make ip_output() callable
with NULL rt pointer.
Reviewed by: sam (mentor)
Short description of ip_fastforward:
o adds full direct process-to-completion IPv4 forwarding code
o handles ip fragmentation incl. hw support (ip_flow did not)
o sends icmp needfrag to source if DF is set (ip_flow did not)
o supports ipfw and ipfilter (ip_flow did not)
o supports divert, ipfw fwd and ipfilter nat (ip_flow did not)
o returns anything it can't handle back to normal ip_input
Enable with sysctl -w net.inet.ip.fastforwarding=1
Reviewed by: sam (mentor)
o correct a read-lock assert in in_pcblookup_local that should be
a write-lock assert (since time wait close cleanups may alter state)
Supported by: FreeBSD Foundation
preemption two CPUs can be in the same function at the same time
and clobber each others variables. Remove register declaration
from local variables.
Reviewed by: sam (mentor)
This switch toggles between strict multicast delivery, and traditional
multicast delivery.
The traditional (default) behaviour is to deliver multicast datagrams to all
sockets which are members of that group, regardless of the network interface
where the datagrams were received.
The strict behaviour is to deliver multicast datagrams received on a
particular interface only to sockets whose membership is bound to that
interface.
Note that as a matter of course, multicast consumers specifying INADDR_ANY
for their interface get joined on the interface where the default route
happens to be bound. This switch has no effect if the interface which the
consumer specifies for IP_ADD_MEMBERSHIP is not UP and RUNNING.
The original patch has been cleaned up somewhat from that submitted. It has
been tested on a multihomed machine with multiple QuickTime RTP streams
running over the local switch, which doesn't do IGMP snooping.
PR: kern/58359
Submitted by: William A. Carrel
Reviewed by: rwatson
MFC after: 1 week
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.
This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.
While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.
NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.
Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
it is marked as RTF_UP. This appears to fix a crash that was sometimes
triggered when dhclient(8) tried to send a packet after an interface
had been detatched.
Reviewed by: sam
o pickup Giant in divert_packet to protect sbappendaddr since it
can be entered through MPSAFE callouts or through ip_input when
mpsafenet is 1
o add missing locking on output
o add locking to abort and shutdown
o add a ctlinput handler to invalidate held routing table references
on an ICMP redirect (may not be needed)
Supported by: FreeBSD Foundation
o add assertions in tcp_respond to validate inpcb locking assumptions
o use local variable instead of chasing pointers in tcp_respond
Supported by: FreeBSD Foundation
whether or not the isr needs to hold Giant when running; Giant-less
operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive
Supported by: FreeBSD Foundation
has a LOR between IPFW inpcb locks but I'm committing it now as the
lesser of two evils (the other being unlocked use of in_pcblookup).
Supported by: FreeBSD Foundation
to a routing table entry w/o bumping the reference count or locking
against the entry being free'd. This caused major havoc (for some
reason it appeared most frequently for folks running natd). Fix
is to bump the reference count whenever we copy the route cache
contents into a private copy so the entry cannot be reclaimed out
from under us. This is a short term fix as the forthcoming routing
table changes will eliminate this cache entirely.
Supported by: FreeBSD Foundation
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all. secpolicy no longer contain
spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy. assign ID field to
all SPD entries. make it possible for racoon to grab SPD entry on
pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header. a mode is always needed
to compare them.
- fixed that the incorrect time was set to
sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- an user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
because key_spdacquire() had been implemented imcopletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
XXX in theory refcnt should do the right thing, however, we have
"spdflush" which would touch all SPs. another solution would be to
de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion. ipsec_*_policy ->
ipsec_*_pcbpolicy.
- avoid variable name confusion.
(struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
"src" of the spidx specifies ICMP type, and the port field in "dst"
of the spidx specifies ICMP code.
- avoid from applying IPsec transport mode to the packets when the
kernel forwards the packets.
Tested by: nork
Obtained from: KAME
header copy made on input path: this is now handled differently.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
directly on the radix tree and does not hold any routing table refernces.
This fixes the reference counting problems that manifested itself as a
panic during unmount of filesystems that were mounted by NFS over an
interface that had been removed.
Supported by: FreeBSD Foundation
previously only considered the send sequence space. Unfortunately,
some OSes (windows) still use a random positive increments scheme for
their syn-ack ISNs, so I must consider receive sequence space as well.
The value of 250000 bytes / second for Microsoft's ISN rate of increase
was determined by testing with an XP machine.
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.
- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_Wait sockets which are hogging local ports can be safely
freed.
This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.
if_xname, if_dname, and if_dunit. if_xname is the name of the interface
and if_dname/unit are the driver name and instance.
This change paves the way for interface renaming and enhanced pseudo
device creation and configuration symantics.
Approved By: re (in principle)
Reviewed By: njl, imp
Tested On: i386, amd64, sparc64
Obtained From: NetBSD (if_xname)
routine that takes a locked routing table reference and removes all
references to the entry in the various data structures. This
eliminates instances of recursive locking and also closes races
where the lock on the entry had to be dropped prior to calling
rtrequest(RTM_DELETE). This also cleans up confusion where the
caller held a reference to an entry that might have been reclaimed
(and in some cases used that reference).
Supported by: FreeBSD Foundation
the module. Previously we grabbed the mutex used by the callouts,
then stopped the callout with callout_stop, but if the callout
was already active and blocked by the mutex then it would continue
later and reference the mutex after it was destroyed. Instead
stop the callout first then lock.
Supported by: FreeBSD Foundation
o add some more debugging help for figuring out why folks are
getting complaints about releasing routing table entries with
a zero refcnt
o fix comment that talked about spl's
o remove duplicate define of DUMMYNET_DEBUG
Supported by: FreeBSD Foundation
- implement the tunnel egress rule in ip_ecn_egress() in ip_ecn.c.
make ip{,6}_ecn_egress() return integer to tell the caller that
this packet should be dropped.
- handle ECN at fragment reassembly in ip_input.c and frag6.c.
Obtained from: KAME
with an mbuf until it is reclaimed. This is in contrast to tags that
vanish when an mbuf chain passes through an interface. Persistent tags
are used, for example, by MAC labels.
Add an m_tag_delete_nonpersistent function to strip non-persistent tags
from mbufs and use it to strip such tags from packets as they pass through
the loopback interface and when turned around by icmp. This fixes problems
with "tag leakage".
Pointed out by: Jonathan Stone
Reviewed by: Robert Watson
(aka RFC2292bis). Though I believe this commit doesn't break
backward compatibility againt existing binaries, it breaks
backward compatibility of API.
Now, the applications which use Advanced Sockets API such as
telnet, ping6, mld6query and traceroute6 use RFC3542 API.
Obtained from: KAME
that at most 20% of sockets can be in time_wait at one time, ensuring
that time_wait sockets do not starve real connections from inpcb
structures.
No implementation change is needed, jlemon already implemented a nice
LRU-ish algorithm for tcp_tw structure recycling.
This should reduce the need for sysadmins to lower the default msl on
busy servers.
when loaded as a module
o cleanup data structures on module unload when no application has
been started (i.e. kldload, kldunload w/o mrtd)
o remove extraneous unlocks immediately prior to destroying them
Supported by: FreeBSD Foundation
address has been changed when PFIL_HOOKS is enabled and, if it has,
arrange for the proper action by ip*_forward.
Supported by: FreeBSD Foundation
Submitted by: Pyun YongHyeon
trashed after being freed. This has caused several panics including
kern/42277 related to soft updates. Jim Kuhn tracked the problem
down to ipfw limit rule processing. In the expiry of dynamic rules,
it is possible for an O_LIMIT_PARENT rule to be removed when it still
has live children. When the children eventually do expire, a pointer
to the (long gone) parent is dereferenced and a count decremented.
Since this memory can, and is, allocated for other purposes (in the
case of kern/42277 an inodedep structure), chaos ensues. The offset
in question in inodedep is the offset of the 16 bit count field in
the ipfw2 ipfw_dyn_rule.
Submitted by: Jim Kuhn <jkuhn@sandvine.com>
Reviewed by: "Evgueni V. Gavrilov" <aquatique@rusunix.org>
Reviewed by: Ben Pfountz <netprince@vt.edu>
MFC after: 1 week
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.
Other/related changes:
o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts
Notes:
1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)
RTF_STATIC routes. Do not check for RTF_HOST so as to avoid being DoSed
when an RTF_GENMASK route exists in the table.
Add a more verbose comment about exactly what this code does.
Submitted by: ru
o revamp IPv4+IPv6+bridge usage to match API changes
o remove pfil_head instances from protosw entries (no longer used)
o add locking
o bump FreeBSD version for 3rd party modules
Heavy lifting by: "Max Laier" <max@love2party.net>
Supported by: FreeBSD Foundation
Obtained from: NetBSD (bits of pfil.h and pfil.c)
attached network could exhaust kernel memory, and cause a system
panic, by sending a flood of spoofed ARP requests.
Approved by: jake (mentor)
Reported by: Apple Product Security <product-security@apple.com>
Skinny is the protocol used by Cisco IP phones to talk to Cisco Call
Managers. With this code, one can use a Cisco IP phone behind a FreeBSD
NAT gateway.
Currently, having the Call Manager behind the NAT gateway is not supported.
More information on enabling Skinny support in libalias, natd, and ppp
can be found in those applications' manpages.
PR: 55843
Reviewed by: ru
Approved by: ru
MFC after: 30 days
sending an ICMP packet doesn't cause a panic. A better solution is needed;
possibly defering the transmit to a dedicated thread.
Observed by: "Aaron Wohl" <freebsd@soith.com>
o change timeout to MPSAFE callout
o restructure rule deletion to deal with locking requirements
o replace static buffer used for ipfw control operations with malloc'd storage
Sponsored by: FreeBSD Foundation
o change time to MPSAFE callout
o make debug printfs conditional on DUMMYNET_DEBUG and runtime controllable
by net.inet.ip.dummynet.debug
o make boot-time printf dependent on bootverbose
Sponsored by: FreeBSD Foundation
Special thanks to Pavlin Radoslavov <pavlin@icir.org> for testing and
fixing numerous problems.
Sponsored by: FreeBSD Foundation
Reviewed by: Pavlin Radoslavov <pavlin@icir.org>
Changes from the original implementation:
- Fragmentation is handled by the function m_fragment, which can
be called from whereever fragmentation is needed. Note that this
function is wrapped in #ifdef MBUF_STRESS_TEST to discourage non-testing
use.
- m_fragment works slightly differently from the old fragmentation
code in that it allocates a seperate mbuf cluster for each fragment.
This defeats dma_map_load_mbuf/buffer's feature of coalescing adjacent
fragments. While that is a nice feature in practice, it nerfed the
usefulness of mbuf_stress_test.
- Add two modes of random fragmentation. Chains with fragments all of
the same random length and chains with fragments that are each uniquely
random in length may now be requested.
NB: There is a known LOR on the forwarding path; this needs to be resolved
together with a similar issue in the bridge. For the moment it is
believed to be benign.
Sponsored by: FreeBSD Fondation
returned mbuf can be NULL. Check for NULL in rip_output() when
prepending an IP header. This prevents mbuf exhaustion from
causing a local kernel panic when sending raw IP packets.
PR: kern/55886
Reported by: Pawel Malachowski <pawmal-posting@freebsd.lublin.pl>
MFC after: 3 days
mac_reflect_mbuf_icmp()
mac_reflect_mbuf_tcp()
These entry points permit MAC policies to do "update in place"
changes to the labels on ICMP and TCP mbuf headers when an ICMP or
TCP response is generated to a packet outside of the context of
an existing socket. For example, in respond to a ping or a RST
packet to a SYN on a closed port.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
from queue(3).
Improve vertical compactness by using a IGMP_PRINTF() macro rather
than #ifdefing IGMP_DEBUG a large number of debugging printfs.
Reviewed by: mdodd (SLIST changes)
specific interfaces. This is required by aodvd, and may in future help us
in getting rid of the requirement for BPF from our import of isc-dhcp.
Suggested by: fenestro
Obtained from: BSD/OS
Reviewed by: mini, sam
Approved by: jake (mentor)
of bw_meter entries were processed up to one second ahead.
After an unappropriate rescheduling of some of the bw_meter
entries, the upcalls weren't delivered.
* pim_register_prepare() uses the appropriate sw_csum flag to
call ip_fragment() so the IP checksum is computed properly.
* Modify pim_register_prepare() to take care of IP packets that
don't need fragmentation.
* Add-back in_delayed_cksum() to encap_send(), because it seems it
should be there.
Submitted by: Pavlin Radoslavov <pavlin@icir.org>
binaries in /bin and /sbin installed in /lib. Only the versioned files
reside in /lib, the .so symlink continues to live /usr/lib so the
toolchain doesn't need to be modified.
segments are lost for the application. This broke, for example,
ports/benchmarks/dbs which needs the SYN segment to filter the
contents of the trace buffer for the connection it is interested in.
This patch makes the SYN segments available again. Unfortunately they
are now associated with the listening socket instead of the new one, so
a change to applications is required, but without this patch it wouldn't
work altogether.
PR: kern/45966
code has rotten a bit so that the header length is not correct at
the point when tcp_trace is called. Temporarily compute the correct
value before the call and restore the old value after. This makes
ports/benchmarks/dbs to almost work.
This is a NOP unless you compile with TCPDEBUG.
in tcp_input that leave the function before hitting the tcp_trace
function call for the TCPDEBUG option. This has made TCPDEBUG mostly
useless (and tools like ports/benchmarks/dbs not working). Add
tcp_trace calls to the return paths that could be identified in this
maze.
This is a NOP unless you compile with TCPDEBUG.
new ATMIOCOPENVCC/CLOSEVCC. This allows us to not only use UBR channels
for IP over ATM, but also CBR, VBR and ABR. Change the format of the
link layer address to specify the channel characteristics. The old
format is still supported and opens UBR channels.
Disabled by default. To enable it, the new "options PIM" must be
added to the kernel configuration file (in addition to MROUTING):
options MROUTING # Multicast routing
options PIM # Protocol Independent Multicast
2. Add support for advanced multicast API setup/configuration and
extensibility.
3. Add support for kernel-level PIM Register encapsulation.
Disabled by default. Can be enabled by the advanced multicast API.
4. Implement a mechanism for "multicast bandwidth monitoring and upcalls".
Submitted by: Pavlin Radoslavov <pavlin@icir.org>
This change allows one to specify almost the complete traffic parameters
for IPoverATM channels through the routing table. Up to now we used
4 byte DL addresses (flag, vpi, vciH, vciL). This format is still allowed.
If the address is longer, however, the 5th byte is interpreted as the
traffic class (UBR, CBR, VBR or ABR) and the remaining bytes are the
parameters for this traffic class:
UBR: 0 byte or 3 byte PCR
CBR: 3 byte PCR
VBR: 3 byte PCR, 3 byte SCR, 3 byte MBS
ABR: 3 byte PCR, 3 byte MCR, 3 byte ICR, 3 byte TBE, 1 byte NRM,
1 byte TRM, 2 bytes ADTF, 1 byte RIF, 1 byte RDF and 1 byte CDF
A script to generate the corresponding 'route add' arguments will follow soon.
sysctl:
- sysctlbyname("net.inet.ip.mfctable", ...)
- sysctlbyname("net.inet.ip.viftable", ...)
This change is needed so netstat can use sysctlbyname() to read
the data from those tables.
Otherwise, in some cases "netstat -g" may fail to report the
multicast forwarding information (e.g., if we run a multicast
router on PicoBSD).
* Bug fix: when sending IGMPMSG_WRONGVIF upcall to the multicast
routing daemon, set properly "im->im_vif" to the receiving
incoming interface of the packet that triggered that upcall
rather than to the expected incoming interface of that packet.
* Bug fix: add missing increment of counter "mrtstat.mrts_upcalls"
* Few formatting nits (e.g., replace extra spaces with TABs)
Submitted by: Pavlin Radoslavov <pavlin@icir.org>
code used to call rtrequest(RTM_DELETE, ...). This is a problem, because
the function that just has called us (route_output)
is not really happy with the route it just is creating beeing ripped out
from under it. Unfortunately we also cannot return an error from
ifa_rtrequest. Therefore mark the route just as RTF_REJECT.
check for raw IP system management operations is often (although
not always) implicit due to the namespacing of raw IP sockets. I.e.,
you have to have privilege to get a raw IP socket, so much of the
management code sitting on raw IP sockets assumes that any requests
on the socket should be granted privilege.
Obtained from: TrustedBSD Project
Product of: France
Set 31 is still special because rules belonging to it are not deleted
by the "ipfw flush" command, but must be deleted explicitly with
"ipfw delete set 31" or by individual rule numbers.
This implement a flexible form of "persistent rules" which you might
want to have available even after an "ipfw flush".
Note that this change does not violate POLA, because you could not
use set 31 in a ruleset before this change.
sbin/ipfw changes to allow manipulation of set 31 will follow shortly.
Suggested by: Paul Richards
lastest rev of the spec. Use an explicit flag for Fast Recovery. [1]
Fix bug with exiting Fast Recovery on a retransmit timeout
diagnosed by Lu Guohan. [2]
Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com>
Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2]
Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>,
Sally Floyd <floyd@acm.org> [1]
Since we already had 'O_NOP' instructions which always match, all
I needed to do is allow the NOP command to have arbitrary length
(i.e. move its label in a different part of the switch() which
validates instructions).
The kernel must know nothing about comments, everything else is
done in userland (which will be described in the upcoming ipfw2.c
commit).
support matching a list of addr/mask pairs so one can write
more efficient rulesets which were not possible before e.g.
add 100 skipto 1000 not src-ip 10.0.0.0/8,127.0.0.1/8,192.168.0.0/16
The change is fully backward compatible.
ipfw2 and manpage commit to follow.
MFC after: 3 days
Should work with both regular and fast ipsec (mutually exclusive).
See manpage for more details.
Submitted by: Ari Suutari (ari.suutari@syncrontech.com)
Revised by: sam
MFC after: 1 week
"ipid" options. This feature has been requested by several users.
On passing, fix some minor bugs in the parser. This change is fully
backward compatible so if you have an old /sbin/ipfw and a new
kernel you are not in trouble (but you need to update /sbin/ipfw
if you want to use the new features).
Document the changes in the manpage.
Now you can write things like
ipfw add skipto 1000 iplen 0-500
which some people were asking to give preferential treatment to
short packets.
The 'MFC after' is just set as a reminder, because I still need
to merge the Alpha/Sparc64 fixes for ipfw2 (which unfortunately
change the size of certain kernel structures; not that it matters
a lot since ipfw2 is entirely optional and not the default...)
PR: bin/48015
MFC after: 1 week
this makes connect act more sensibly in these cases.
PR: 50839
Submitted by: Barney Wolff <barney@pit.databus.com>
Patch delayed by laziness of: silby
MFC after: 1 week
is not called, and no static rules match an outgoing packet, the
latter retains its source IP address. This is in support of the
"static NAT only" mode.
only meaningful for fragments. Also don't bother to byte-swap the
ip_id when we do generate it; it is only used at the receiver as a
nonce. I tried several different permutations of this code with no
measurable difference to each other or to the unmodified version, so
I've settled on the one for which gcc seems to generate the best code.
(If anyone cares to microoptimize this differently for an architecture
where it actually matters, feel free.)
Suggested by: Steve Bellovin's paper in IMW'02
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.
Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works arounds this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this provided to
complicate things too much.
Approved by: re (scottl)
Reviewed by: hsu
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
copying for mbuf headers now works properly in m_dup_pkthdr(), so
we don't need to do an explicit copy.
Approved by: re (jhb)
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
state. Those changed attempted to work around the changed invariant
that inp->in_socket was sometimes now NULL, but the logic wasn't
quite right, meaning that inp->in_socket would be dereferenced by
cr_canseesocket() if security.bsd.see_other_uids, jail, or MAC
were in use. Attempt to clarify and correct the logic.
Note: the work-around originally introduced with the reduced TCP
wait state handling to use cr_cansee() instead of cr_canseesocket()
in this case isn't really right, although it "Does the right thing"
for most of the cases in the base system. We'll need to address
this at some point in the future.
Pointed out by: dcs
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
of asserting that an mbuf has a packet header. Use it instead of hand-
rolled versions wherever applicable.
Submitted by: Hiten Pandya <hiten@unixdaemons.com>
doing Limited Transmit. Only artificially inflate the congestion
window by 1 segment instead of the usual 3 to take into account
the 2 already sent by Limited Transmit.
Approved in principle by: Mark Allman <mallman@grc.nasa.gov>,
Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>
(See: ftp://ftp.rfc-editor.org/in-notes/rfc3514.txt)
This fulfills the host requirements for userland support by
way of the setsockopt() IP_EVIL_INTENT message.
There are three sysctl tunables provided to govern system behavior.
net.inet.ip.rfc3514:
Enables support for rfc3514. As this is an
Informational RFC and support is not yet widespread
this option is disabled by default.
net.inet.ip.hear_no_evil
If set the host will discard all received evil packets.
net.inet.ip.speak_no_evil
If set the host will discard all transmitted evil packets.
The IP statistics counter 'ips_evil' (available via 'netstat') provides
information on the number of 'evil' packets recieved.
For reference, the '-E' option to 'ping' has been provided to demonstrate
and test the implementation.
Quote from kern/37573:
There is an obvious race in netinet/ip_dummynet.c:config_pipe().
Interrupts are not blocked when changing the params of an
existing pipe. The specific crash observed:
... -> config_pipe -> set_fs_parms -> config_red
malloc a new w_q_lookup table but take an interrupt before
intializing it, interrupt handler does:
... -> dummynet_io -> red_drops
red_drops dereferences the uninitialized (zeroed) w_q_lookup
table.
o Flush accumulated credits for idle pipes.
o Flush accumulated credits when change pipe characteristics.
o Change dn_flow_queue.numbytes type to unsigned long.
Overlapping dn_flow_queue->numbytes in ready_event() leads to
numbytes becomes negative and SET_TICKS() macro returns a very
big value. heap_insert() overlaps dn_key again and inserts a
queue to a ready heap with a sched_time points to the past.
That leads to an "infinity" loop.
PR: kern/33234, kern/37573, misc/42459, kern/43133,
kern/44045, kern/48099
Submitted by: Mike Hibler <mike@cs.utah.edu> (kern/37573)
MFC after: 6 weeks
additional flags argument to indicate blocking disposition, and
pass in M_NOWAIT from the IP reassembly code to indicate that
blocking is not OK when labeling a new IP fragment reassembly
queue. This should eliminate some of the WITNESS warnings that
have started popping up since fine-grained IP stack locking
started going in; if memory allocation fails, the creation of
the fragment queue will be aborted.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
- Don't try to fragment the packet if it's smaller than mbuf_frag_size.
- Preserve the size of the mbuf chain which is modified by m_split().
- Check that m_split() didn't return NULL.
- Make it so we don't end up with two M_PKTHDR mbuf in the chain.
- Use m->m_pkthdr.len instead of m->m_len so that we fragment the whole
chain and not just the first mbuf.
- Fix a nearby style bug and rework the logic of the loops so that it's
more clear.
This is still not quite right, because we're clearly abusing m_split() to
do something it was not designed for, but at least it works now. We
should probably move this code into a m_fragment() function when it's
correct.
allows you to tell ip_output to fragment all outgoing packets
into mbuf fragments of size net.inet.ip.mbuf_frag_size bytes.
This is an excellent way to test if network drivers can properly
handle long mbuf chains being passed to them.
net.inet.ip.mbuf_frag_size defaults to 0 (no fragmentation)
so that you can at least boot before your network driver dies. :)
comes in on is the same interface that we would route out of to get to
the packet's source address. Essentially automates an anti-spoofing
check using the information in the routing table.
Experimental. The usage and rule format for the feature may still be
subject to change.
drain routines are done by swi_net, which allows for better queue control
at some future point. Packets may also be directly dispatched to a netisr
instead of queued, this may be of interest at some installations, but
currently defaults to off.
Reviewed by: hsu, silby, jayanth, sam
Sponsored by: DARPA, NAI Labs
that matches snd_max, then do not respond with an ack, just drop the
segment. This fixes a problem where a simultaneous close results in
an ack loop between two time-wait states.
Test case supplied by: Tim Robbins <tjr@FreeBSD.ORG>
Sponsored by: DARPA, NAI Labs
tcpcb is NULL, but also its connected inpcb, since we now allow
elements of a TCP connection to hang around after other state, such
as the socket, has been recycled.
Tested by: dcs
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Security improvements:
- Increase the size of each syncookie secret from 32 to 128 bits
in order to make brute force attacks on the secrets much more
difficult.
- Always return the lowest order dword from the MD5 hash; this
allows us to expose 2 more bits of the cookie and makes ACK
floods which seek to guess the cookie value more difficult.
Performance improvements:
- Increase the lifetime of each syncookie from 4 seconds to 16
seconds. This increases the usefulness of syncookies during
an attack.
- From Yahoo!: Reduce the number of calls to MD5Update; this
results in a ~17% increase in cookie generation time here.
Reviewed by: hsu, jayanth, jlemon, nectar
MFC After: 15 seconds
packets coming out of a GIF tunnel are re-processed by ipfw, et. al.
By default they are not reprocessed. With the option they are.
This reverts 1.214. Prior to that change packets were not re-processed.
After they were which caused problems because packets do not have
distinguishing characteristics (like a special network if) that allows
them to be filtered specially.
This is really a stopgap measure designed for immediate MFC so that
4.8 has consistent handling to what was in 4.7.
PR: 48159
Reviewed by: Guido van Rooij <guido@gvr.org>
MFC after: 1 day
and enable it by default, with a limit of 16.
At the same time, tweak maxfragpackets downward so that in the worst
possible case, IP reassembly can use only 1/2 of all mbuf clusters.
MFC after: 3 days
Reviewed by: hsu
Liked by: bmah
OSes has probably caused more problems than it ever solved. Allow the
user to retire the old behavior by specifying their own privileged
range with,
net.inet.ip.portrange.reservedhigh default = IPPORT_RESERVED - 1
net.inet.ip.portrange.reservedlo default = 0
Now you can run that webserver without ever needing root at all. Or
just imagine, an ftpd that can really drop privileges, rather than
just set the euid, and still do PORT data transfers from 20/tcp.
Two edge cases to note,
# sysctl net.inet.ip.portrange.reservedhigh=0
Opens all ports to everyone, and,
# sysctl net.inet.ip.portrange.reservedhigh=65535
Locks all network activity to root only (which could actually have
been achieved before with ipfw(8), but is somewhat more
complicated).
For those who stick to the old religion that 0-1023 belong to root and
root alone, don't touch the knobs (or even lock them by raising
securelevel(8)), and nothing changes.
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.
Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs
routine does not require a tcpcb to operate. Since we no longer keep
template mbufs around, move pseudo checksum out of this routine, and
merge it with the length update.
Sponsored by: DARPA, NAI Labs
- delay acks for T/TCP regardless of delack setting
- fix bug where a single pass through tcp_input might not delay acks
- use callout_active() instead of callout_pending()
Sponsored by: DARPA, NAI Labs
cr_uid.
Note: we do not have socheckuid() in RELENG_4, ip_fw2.c uses its
own macro for a similar purpose that is why ipfw2 in RELENG_4 processes
uid rules correctly. I will MFC the diff for code consistency.
Reported by: Oleg Baranov <ol@csa.ru>
Reviewed by: luigi
MFC after: 1 month
ipsec4_process_packet; they happen when a packet is dropped because
an SA acquire is initiated
Submitted by: Doug Ambrisko <ambrisko@verniernetworks.com>
you still don't want to use the two together, but it's ok to have
them in the same kernel (the problem that initiated this bandaid
has long since been fixed)
initialized until after a syncookie was generated. As a result,
all connections resulting from a returned cookie would end up using
a MSS of ~512 bytes. Now larger packets will be used where possible.
MFC after: 5 days
- Honor the previous behavior of maxfragpackets = 0 or -1
- Take a better stab at fragment statistics
- Move / correct a comment
Suggested by: maxim@
MFC after: 7 days
functions implemented approximately the same limits on fragment memory
usage, but in different fashions.)
End user visible changes:
- Fragment reassembly queues are freed in a FIFO manner when maxfragpackets
has been reached, rather than all reassembly stopping.
MFC after: 5 days
in addition to secure level 1. The mask supports up to a secure level of 8
but only add defines through CTLFLAG_SECURE3 for now.
As per the missif in the log entry for 1.11 of ip_fw2.c which added the
secure flag to the IPFW sysctl's in the first place, change the secure
level requirement from 1 to 3 now that we have support for it.
Reviewed by: imp
With Design Suggestions by: imp
were sometimes propagated using M_COPY_PKTHDR which actually did
something between a "move" and a "copy" operation. This is replaced
by M_MOVE_PKTHDR (which copies the pkthdr contents and "removes" it
from the source mbuf) and m_dup_pkthdr which copies the packet
header contents including any m_tag chain. This corrects numerous
problems whereby mbuf tags could be lost during packet manipulations.
These changes also introduce arguments to m_tag_copy and m_tag_copy_chain
to specify if the tag copy work should potentially block. This
introduces an incompatibility with openbsd which we may want to revisit.
Note that move/dup of packet headers does not handle target mbufs
that have a cluster bound to them. We may want to support this;
for now we watch for it with an assert.
Finally, M_COPYFLAGS was updated to include M_FIRSTFRAG|M_LASTFRAG.
Supported by: Vernier Networks
Reviewed by: Robert Watson <rwatson@FreeBSD.org>
Note that the original RFC 1323 (PAWS) says in 4.2.1 that the out of
order / reverse-time-indexed packet should be acknowledged as specified
in RFC-793 page 69 then dropped. The original PAWS code in FreeBSD (1994)
simply acknowledged the segment unconditionally, which is incorrect, and
was fixed in 1.183 (2002). At the moment we do not do checks for SYN or FIN
in addition to (tlen != 0), which may or may not be correct, but the
worst that ought to happen should be a retry by the sender.
in network byte order, but icmp_error() expects the IP header to
be in host order and the code here did not perform the necessary
swapping for the bridged case. This bug causes an "icmp_error: bad
length" panic when certain length IP packets (e.g. ip_len == 0x100)
are rejected by the firewall with an ICMP response.
MFC after: 3 days
associated with the syncache entry: in case tcp_close() has been
called on the corresponding listening socket, the lock has been
destroyed as a side effect of in_pcbdetach(), causing a panic when
we attempt to lock on it.
Reviewed by: hsu
the mbuf allocator flags {M_TRYWAIT, M_DONTWAIT}.
o Fix a bpf_compat issue where malloc() was defined to just call
bpf_alloc() and pass the 'canwait' flag(s) along. It's been changed
to call bpf_alloc() but pass the corresponding M_TRYWAIT or M_DONTWAIT
flag (and only one of those two).
Submitted by: Hiten Pandya <hiten@unixdaemons.com> (hiten->commit_count++)
apparent ack-on-ack problem with FreeBSD. Prof. Jacobson noticed a
case in our TCP stack which would acknowledge a received ack-only packet,
which is not legal in TCP.
Submitted by: Van Jacobson <van@packetdesign.com>,
bmah@packetdesign.com (Bruce A. Mah)
MFC after: 7 days
bridge.c nor if_ethersubr.c depend on IPFIREWALL.
Restore the use of fw_one_pass in if_ethersubr.c
ipfw.8 will be updated with a separate commit.
Approved by: re
so that it can be reused elsewhere (there is a number of places
where it can be useful). This also trims some 200 lines from
the body of ip_output(), which helps readability a bit.
(This change was discussed a few weeks ago on the mailing lists,
Julian agreed, silence from others. It is not a functional change,
so i expect it to be ok to commit it now but i am happy to back it
out if there are objections).
While at it, fix some function headers and replace m_copy() with
m_copypacket() where applicable.
MFC after: 1 week
Replace m_copy() with m_copypacket() where applicable.
Replace "if (a.s_addr ...)" with "if (a.s_addr != INADDR_ANY ...)"
to make it clear what the code means.
While at it, fix some function headers and remove 'register' from
variable declarations.
MFC after: 3 days
No functional changes, but:
+ the mrouting module now should behave the same as the compiled-in
version (it did not before, some of the rsvp code was not loaded
properly);
+ netinet/ip_mroute.c is now truly optional;
+ removed some redundant/unused code;
+ changed many instances of '0' to NULL and INADDR_ANY as appropriate;
+ removed several static variables to make the code more SMP-friendly;
+ fixed some minor bugs in the mrouting code (mostly, incorrect return
values from functions).
This commit is also a prerequisite to the addition of support for PIM,
which i would like to put in before DP2 (it does not change any of
the existing APIs, anyways).
Note, in the process we found out that some device drivers fail to
properly handle changes in IFF_ALLMULTI, leading to interesting
behaviour when a multicast router is started. This bug is not
corrected by this commit, and will be fixed with a separate commit.
Detailed changes:
--------------------
netinet/ip_mroute.c all the above.
conf/files make ip_mroute.c optional
net/route.c fix mrt_ioctl hook
netinet/ip_input.c fix ip_mforward hook, move rsvp_input() here
together with other rsvp code, and a couple
of indentation fixes.
netinet/ip_output.c fix ip_mforward and ip_mcast_src hooks
netinet/ip_var.h rsvp function hooks
netinet/raw_ip.c hooks for mrouting and rsvp functions, plus
interface cleanup.
netinet/ip_mroute.h remove an unused and optional field from a struct
Most of the code is from Pavlin Radoslavov and the XORP project
Reviewed by: sam
MFC after: 1 week
ipfw_flow_id structure actual size and bcmp(3) may fail to compare
them properly. Compare members of these structures instead.
PR: kern/44078
Submitted by: Oleg Bulyzhin <oleg@rinet.ru>
Reviewed by: luigi
MFC after: 2 weeks
o fix #ifdef typo
o must use "bounce functions" when dispatched from the protosw table
don't know how this stuff was missed in my testing; must've committed
the wrong bits
Pointy hat: sam
Submitted by: "Doug Ambrisko" <ambrisko@verniernetworks.com>
prediction code. Previously, 2GB worth of header predicted data
could leave these variables too far out of sequence which would cause
problems after receiving a packet that did not match the header
prediction.
Submitted by: Bill Baumann <bbaumann@isilon.com>
Sponsored by: Isilon Systems, Inc.
Reviewed by: hsu, pete@isilon.com, neal@isilon.com, aaronp@isilon.com
This allows socket() to return an error when the kernel is not built
with IPDIVERT, and doesn't prevent future applications from using the
"borrowed" IP protocol number. The sysctl net.inet.raw.olddiverterror
controls whether opening a socket with the "borrowed" IP protocol
fails with an accompanying kernel printf; this code should last only a
couple of releases.
Approved by: re
Quoting luigi:
In order to make the userland code fully 64-bit clean it may
be necessary to commit other changes that may or may not cause
a minor change in the ABI.
Reviewed by: luigi
to the primary local IP address when doing a TCP connect(). The
tcp_connect() code was relying on in_pcbconnect (actually in_pcbladdr)
modifying the passed-in sockaddr, and I failed to notice this in
the recent change that added in_pcbconnect_setup(). As a result,
tcp_connect() was ending up using the unmodified sockaddr address
instead of the munged version.
There are two cases to handle: if in_pcbconnect_setup() succeeds,
then the PCB has already been updated with the correct destination
address as we pass it pointers to inp_faddr and inp_fport directly.
If in_pcbconnect_setup() fails due to an existing but dead connection,
then copy the destination address from the old connection.
a server process bound to a wildcard UDP socket to select the IP
address from which outgoing packets are sent on a per-datagram
basis. When combined with IP_RECVDSTADDR, such a server process can
guarantee to reply to an incoming request using the same source IP
address as the destination IP address of the request, without having
to open one socket per server IP address.
Discussed on: -net
Approved by: re
to send datagrams from an unconnected socket, we used to first block
input, then connect the socket to the sendmsg/sendto destination,
send the datagram, and finally disconnect the socket and unblock
input.
We now use in_pcbconnect_setup() to check if a connect() would have
succeeded, but we never record the connection in the PCB (local
anonymous port allocation is still recorded, though). The result
from in_pcbconnect_setup() authorises the sending of the datagram
and selects the local address and port to use, so we just construct
the header and call ip_output().
Discussed on: -net
Approved by: re
in_pcbconnect() called in_pcbconnect_setup(). This version performs
all of the functions of in_pcbconnect() except for the final
committing of changes to the PCB. In the case of an EADDRINUSE error
it can also provide to the caller the PCB of the duplicate connection,
avoiding an extra in_pcblookup_hash() lookup in tcp_connect().
This change will allow the "temporary connect" hack in udp_output()
to be removed and is part of the preparation for adding the
IP_SENDSRCADDR control message.
Discussed on: -net
Approved by: re
Remove the never completed _IP_VHL version, it has not caught on
anywhere and it would make us incompatible with other BSD netstacks
to retain this version.
Add a CTASSERT protecting sizeof(struct ip) == 20.
Don't let the size of struct ipq depend on the IPDIVERT option.
This is a functional no-op commit.
Approved by: re
called in_pcbbind_setup() that does everything except commit the
changes to the PCB. There should be no functional change here, but
in_pcbbind_setup() will be used by the soon-to-appear IP_SENDSRCADDR
control message implementation to check or allocate the source
address and port.
Discussed on: -net
Approved by: re
- set IFF_UP on SIOCSIFADDR. be consistent with others.
- set if_addrlen explicitly (just in case)
- multi destination mode is long gone.
- missing break statement
- add gif_set_tunnel(), so that we can set tunnel address from within the
kernel at ease.
- encap_attach/detach dynamically on ioctls
- move encap_attach() to dedicated function in in*_gif.c
Obtained from: KAME
MFC after: 3 weeks
supposed to be checked by the firewall rules twice. However, because the
various ipsec handlers never call ip_input(), this never happens anyway.
This fixes the situation where a gif tunnel is encrypted with IPsec. In
such a case, after IPsec processing, the unencrypted contents from the
GIF tunnel are fed back to the ipintrq and subsequently handeld by
ip_input(). Yet, since there still is IPSec history attached, the
packets coming out from the gif device are never fed into the filtering
code.
This fix was sent to Itojun, and he pointed towartds
http://www.netbsd.org/Documentation/network/ipsec/#ipf-interaction.
This patch actually implements what is stated there (specifically:
Packet came from tunnel devices (gif(4) and ipip(4)) will still
go through ipf(4). You may need to identify these packets by
using interface name directive in ipf.conf(5).
Reviewed by: rwatson
MFC after: 3 weeks
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).
As noted previously, don't use FAST_IPSEC with INET6 at the moment.
Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks
o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version
Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month
timestamped TCP packets where FreeBSD will send DATA+FIN and
A W2K box will ack just the DATA portion. If this occurs
after FreeBSD has done a (NewReno) fast-retransmit and is
recovering it (dupacks > threshold) it triggers a case in
tcp_newreno_partial_ack() (tcp_newreno() in stable) where
tcp_output() is called with the expectation that the retransmit
timer will be reloaded. But tcp_output() falls through and
returns without doing anything, causing the persist timer to be
loaded instead. This causes the connection to hang until W2K gives up.
This occurs because in the case where only the FIN must be acked, the
'len' calculation in tcp_output() will be 0, a lot of checks will be
skipped, and the FIN check will also be skipped because it is designed
to handle FIN retransmits, not forced transmits from tcp_newreno().
The solution is to simply set TF_ACKNOW before calling tcp_output()
to absolute guarentee that it will run the send code and reset the
retransmit timer. TF_ACKNOW is already used for this purpose in other
cases.
For some unknown reason this patch also seems to greatly reduce
the number of duplicate acks received when Guido runs his tests over
a lossy network. It is quite possible that there are other
tcp_newreno{_partial_ack()} cases which were not generating the expected
output which this patch also fixes.
X-MFC after: Will be MFC'd after the freeze is over
o Move len initialization closer to place of its first usage.
o Compare len with 0 to improve readability.
o Explicitly zero out phlen in ip_insertoptions() in failure case.
Suggested by: jhb
Reviewed by: jhb
MFC after: 2 weeks
Windows 2000 box and a FreeBSD box could stall. The problem turned out
to be a timestamp reply bug in the W2K TCP stack. FreeBSD sends a
timestamp with the SYN, W2K returns a timestamp of 0 in the SYN+ACK
causing FreeBSD to calculate an insane SRTT and RTT, resulting in
a maximal retransmit timeout (60 seconds). If there is any packet
loss on the connection for the first six or so packets the retransmit
case may be hit (the window will still be too small for fast-retransmit),
causing a 60+ second pause. The W2K box gives up and closes the
connection.
This commit works around the W2K bug.
15:04:59.374588 FREEBSD.20 > W2K.1036: S 1420807004:1420807004(0) win 65535 <mss 1460,nop,wscale 2,nop,nop,timestamp 188297344 0> (DF) [tos 0x8]
15:04:59.377558 W2K.1036 > FREEBSD.20: S 4134611565:4134611565(0) ack 1420807005 win 17520 <mss 1460,nop,wscale 0,nop,nop,timestamp 0 0> (DF)
Bug reported by: Guido van Rooij <guido@gvr.org>
packets in addition to IPPROTO_IPV4 and IPPROTO_IPV6, explicitly specify
IPPROTO_IPV4 or IPPROTO_IPV6 instead of -1 when calling encap_attach().
MFC after: 28 days
(along with other if_gre changes)
- use `struct uma_zone *' instead of uma_zone_t, so that <sys/uma.h> isn't
a prerequisite.
- don't include <sys/uma.h>.
Namespace pollution makes "opaque" types like uma_zone_t perfectly
non-opaque. Such types should never be used (see style(9)).
Fixed subsequently grwon dependencies of this header on its own pollution:
- include <sys/_mutex.h> and its prerequisite <sys/_lock.h> instead of
depending on namespace pollution 2 layers deep in <sys/uma.h>.
firewall logging on and off when at elevated securelevel(8). It would
be nice to be able to only lock these at securelevel >= 3, like rules
are, but there is no such functionality at present. I don't see reason
to be adding features to securelevel(8) with MAC being merged into 5.0.
PR: kern/39396
Reviewed by: luigi
MFC after: 1 week
called <machine/_types.h>.
o <machine/ansi.h> will continue to live so it can define MD clock
macros, which are only MD because of gratuitous differences between
architectures.
o Change all headers to make use of this. This mainly involves
changing:
#ifdef _BSD_FOO_T_
typedef _BSD_FOO_T_ foo_t;
#undef _BSD_FOO_T_
#endif
to:
#ifndef _FOO_T_DECLARED
typedef __foo_t foo_t;
#define _FOO_T_DECLARED
#endif
Concept by: bde
Reviewed by: jake, obrien
in6_v4mapsin6_sockaddr() which allocate the appropriate sockaddr_in*
structure and initialize it with the address and port information passed
as arguments. Use calls to these new functions to replace code that is
replicated multiple times in in_setsockaddr(), in_setpeeraddr(),
in6_setsockaddr(), in6_setpeeraddr(), in6_mapped_sockaddr(), and
in6_mapped_peeraddr(). Inline COMMON_END in tcp_usr_accept() so that
we can call in_sockaddr() with temporary copies of the address and port
after the PCB is unlocked.
Fix the lock violation in tcp6_usr_accept() (caused by calling MALLOC()
inside in6_mapped_peeraddr() while the PCB is locked) by changing
the implementation of tcp6_usr_accept() to match tcp_usr_accept().
Reviewed by: suz
not meant to duplicate) TCP/Vegas. Add four sysctls and default the
implementation to 'off'.
net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug debugging (defaults to 1=on)
net.inet.tcp.inflight_min minimum window limit
net.inet.tcp.inflight_max maximum window limit
MFC after: 1 week
Implement the M_SKIP_FIREWALL bit in m_flags to avoid loops
for firewall-generated packets (the constant has to go in sys/mbuf.h).
Better comments on keepalive generation, and enforce dyn_rst_lifetime
and dyn_fin_lifetime to be less than dyn_keepalive_period.
Enforce limits (up to 64k) on the number of dynamic buckets, and
retry allocation with smaller sizes.
Raise default number of dynamic rules to 4096.
Improved handling of set of rules -- now you can atomically
enable/disable multiple sets, move rules from one set to another,
and swap sets.
sbin/ipfw/ipfw2.c:
userland support for "noerror" pipe attribute.
userland support for sets of rules.
minor improvements on rule parsing and printing.
sbin/ipfw/ipfw.8:
more documentation on ipfw2 extensions, differences from ipfw1
(so we can use the same manpage for both), stateful rules,
and some additional examples.
Feedback and more examples needed here.
we can use the names _receive() and _send() for the receive() and send()
checks. Rename related constants, policy implementations, etc.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
When a pipe or queue has the "noerror" attribute, do not report
drops to the caller (ip_output() and friends).
(2 lines to implement it, 2 lines to document it.)
This will let you simulate losses on the sender side as if they
happened in the middle of the network, i.e. with no explicit feedback
to the sender.
manpage and ipfw2.c changes to follow shortly, together with other
ipfw2 changes.
Requested by: silby
MFC after: 3 days
satisfy consumers of ip_var.h that need a complete definition of
struct ipq and don't include mac.h.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
The bugfix (ipfw2.c) makes the handling of port numbers with
a dash in the name, e.g. ftp-data, consistent with old ipfw:
use \\ before the - to consider it as part of the name and not
a range separator.
The new feature (all this description will go in the manpage):
each rule now belongs to one of 32 different sets, which can
be optionally specified in the following form:
ipfw add 100 set 23 allow ip from any to any
If "set N" is not specified, the rule belongs to set 0.
Individual sets can be disabled, enabled, and deleted with the commands:
ipfw disable set N
ipfw enable set N
ipfw delete set N
Enabling/disabling of a set is atomic. Rules belonging to a disabled
set are skipped during packet matching, and they are not listed
unless you use the '-S' flag in the show/list commands.
Note that dynamic rules, once created, are always active until
they expire or their parent rule is deleted.
Set 31 is reserved for the default rule and cannot be disabled.
All sets are enabled by default. The enable/disable status of the sets
can be shown with the command
ipfw show sets
Hopefully, this feature will make life easier to those who want to
have atomic ruleset addition/deletion/tests. Examples:
To add a set of rules atomically:
ipfw disable set 18
ipfw add ... set 18 ... # repeat as needed
ipfw enable set 18
To delete a set of rules atomically
ipfw disable set 18
ipfw delete set 18
ipfw enable set 18
To test a ruleset and disable it and regain control if something
goes wrong:
ipfw disable set 18
ipfw add ... set 18 ... # repeat as needed
ipfw enable set 18 ; echo "done "; sleep 30 && ipfw disable set 18
here if everything goes well, you press control-C before
the "sleep" terminates, and your ruleset will be left
active. Otherwise, e.g. if you cannot access your box,
the ruleset will be disabled after the sleep terminates.
I think there is only one more thing that one might want, namely
a command to assign all rules in set X to set Y, so one can
test a ruleset using the above mechanisms, and once it is
considered acceptable, make it part of an existing ruleset.
case, also preserve the MAC label. Note that this mbuf allocation
is fairly non-optimal, but not my fault.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
Add MAC support for the UDP protocol. Invoke appropriate MAC entry
points to label packets that are generated by local UDP sockets,
and to authorize delivery of mbufs to local sockets both in the
multicast/broadcast case and the unicast case.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
pointer and incoming mbuf pointer will be non-NULL in tcp_respond().
This is relied on by the MAC code for correctness, as well as
existing code.
Obtained from: TrustedBSD PRoject
Sponsored by: DARPA, NAI Labs
kernel access control.
Add support for labeling most out-going ICMP messages using an
appropriate MAC entry point. Currently, we do not explicitly
label packet reflect (timestamp, echo request) ICMP events,
implicitly using the originating packet label since the mbuf is
reused. This will be made explicit at some point.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
Instrument the TCP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check socket and
mbuf labels before permitting delivery to a socket. Assign labels
to newly accepted connections when the syncache/cookie code has done
its business. Also set peer labels as convenient. Currently,
MAC policies cannot influence the PCB matching algorithm, so cannot
implement polyinstantiation. Note that there is at least one case
where a PCB is not available due to the TCP packet not being associated
with any socket, so we don't label in that case, but need to handle
it in a special manner.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
Instrument the raw IP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check the
socket and mbuf labels before permitting delivery to a socket,
permitting MAC policies to selectively allow delivery of raw IP mbufs
to various raw IP sockets that may be open. Restructure the policy
checking code to compose IPsec and MAC results in a more readable
manner.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
When fragmenting an IP datagram, invoke an appropriate MAC entry
point so that MAC labels may be copied (...) to the individual
IP fragment mbufs by MAC policies.
When IP options are inserted into an IP datagram when leaving a
host, preserve the label if we need to reallocate the mbuf for
alignment or size reasons.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
Instrument the code managing IP fragment reassembly queues (struct ipq)
to invoke appropriate MAC entry points to maintain a MAC label on
each queue. Permit MAC policies to associate information with a queue
based on the mbuf that caused it to be created, update that information
based on further mbufs accepted by the queue, influence the decision
making process by which mbufs are accepted to the queue, and set the
label of the mbuf holding the reassembled datagram following reassembly
completetion.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
When generating an IGMP message, invoke a MAC entry point to permit
the MAC framework to label its mbuf appropriately for the target
interface.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
When generating an ARP query, invoke a MAC entry point to permit the
MAC framework to label its mbuf appropriately for the interface.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
Invoke the MAC framework to label mbuf created using divert sockets.
These labels may later be used for access control on delivery to
another socket, or to an interface.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI LAbs
kernel access control.
Label IP fragment reassembly queues, permitting security features to
be maintained on those objects. ipq_label will be used to manage
the reassembly of fragments into IP datagrams using security
properties. This permits policies to deny the reassembly of fragments,
as well as influence the resulting label of a datagram following
reassembly.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.
data structures pick up security and synchronization primitives, it
becomes increasingly desirable not to arbitrarily export them via
include files to userland, as the userland applications pick up new
#include dependencies.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
net.inet.tcp.rexmit_min (default 3 ticks equiv)
This sysctl is the retransmit timer RTO minimum,
specified in milliseconds. This value is
designed for algorithmic stability only.
net.inet.tcp.rexmit_slop (default 200ms)
This sysctl is the retransmit timer RTO slop
which is added to every retransmit timeout and
is designed to handle protocol stack overheads
and delayed ack issues.
Note that the *original* code applied a 1-second
RTO minimum but never applied real slop to the RTO
calculation, so any RTO calculation over one second
would have no slop and thus not account for
protocol stack overheads (TCP timestamps are not
a measure of protocol turnaround!). Essentially,
the original code made the RTO calculation almost
completely irrelevant.
Please note that the 200ms slop is debateable.
This commit is not meant to be a line in the sand,
and if the community winds up deciding that increasing
it is the correct solution then it's easy to do.
Note that larger values will destroy performance
on lossy networks while smaller values may result in
a greater number of unnecessary retransmits.
one second but it badly breaks throughput on networks with minor packet
loss.
Complaints by: at least two people tracked down to this.
MFC after: 3 days
just because you leave your session idle.
Also, put in a fix for 64-bit architectures (to be revised).
In detail:
ip_fw.h
* Reorder fields in struct ip_fw to avoid alignment problems on
64-bit machines. This only masks the problem, I am still not
sure whether I am doing something wrong in the code or there
is a problem elsewhere (e.g. different aligmnent of structures
between userland and kernel because of pragmas etc.)
* added fields in dyn_rule to store ack numbers, so we can
generate keepalives when the dynamic rule is about to expire
ip_fw2.c
* use a local function, send_pkt(), to generate TCP RST for Reset rules;
* save about 250 bytes by cleaning up the various snprintf()
in ipfw_log() ...
* ... and use twice as many bytes to implement keepalives
(this seems to be working, but i have not tested it extensively).
Keepalives are generated once every 5 seconds for the last 20 seconds
of the lifetime of a dynamic rule for an established TCP flow. The
packets are sent to both sides, so if at least one of the endpoints
is responding, the timeout is refreshed and the rule will not expire.
You can disable this feature with
sysctl net.inet.ip.fw.dyn_keepalive=0
(the default is 1, to have them enabled).
MFC after: 1 day
(just kidding... I will supply an updated version of ipfw2 for
RELENG_4 tomorrow).
This was always broken in HEAD (the offending statement was introduced
in rev. 1.123 for HEAD, while RELENG_4 included this fix (in rev.
1.99.2.12 for RELENG_4) and I inadvertently deleted it in 1.99.2.30.
So I am also restoring these two lines in RELENG_4 now.
We might need another few things from 1.99.2.30.
no punch_fw was used.
Fix another couple of bugs which prevented rules from being
installed properly.
On passing, use IPFW2 instead of NEW_IPFW to compile the new code,
and slightly simplify the instruction generation code.
Following Darren's suggestion, make Dijkstra happy and rewrite the
ipfw_chk() main loop removing a lot of goto's and using instead a
variable to store match status.
Add a lot of comments to explain what instructions are supposed to
do and how -- this should ease auditing of the code and make people
more confident with it.
In terms of code size: the entire file takes about 12700 bytes of text,
about 3K of which are for the main function, ipfw_chk(), and 2K (ouch!)
for ipfw_log().
now it should support all the instructions of the old ipfw.
Fix some bugs in the user interface, /sbin/ipfw.
Please check this code against your rulesets, so i can fix the
remaining bugs (if any, i think they will be mostly in /sbin/ipfw).
Once we have done a bit of testing, this code is ready to be MFC'ed,
together with a bunch of other changes (glue to ipfw, and also the
removal of some global variables) which have been in -current for
a couple of weeks now.
MFC after: 7 days
so that, if we recieve a ICMP "time to live exceeded in transit",
(type 11, code 0) for a TCP connection on SYN-SENT state, close
the connection.
MFC after: 2 weeks
syncache_respond(A), ip_output(), ip_input(), tcp_input(), syncache_badack(B)
Which winds up deleting a different entry from the syncache. Handle
this by not utilizing the next entry in the timer chain until after
syncache_respond() completes. The case of A == B should not be possible.
Problem found by: Don Bowman <don@sandvine.com>
This code makes use of variable-size kernel representation of rules
(exactly the same concept of BPF instructions, as used in the BSDI's
firewall), which makes firewall operation a lot faster, and the
code more readable and easier to extend and debug.
The interface with the rest of the system is unchanged, as witnessed
by this commit. The only extra kernel files that I am touching
are if_fw.h and ip_dummynet.c, which is quite tied to ipfw. In
userland I only had to touch those programs which manipulate the
internal representation of firewall rules).
The code is almost entirely new (and I believe I have written the
vast majority of those sections which were taken from the former
ip_fw.c), so rather than modifying the old ip_fw.c I decided to
create a new file, sys/netinet/ip_fw2.c . Same for the user
interface, which is in sbin/ipfw/ipfw2.c (it still compiles to
/sbin/ipfw). The old files are still there, and will be removed
in due time.
I have not renamed the header file because it would have required
touching a one-line change to a number of kernel files.
In terms of user interface, the new "ipfw" is supposed to accepts
the old syntax for ipfw rules (and produce the same output with
"ipfw show". Only a couple of the old options (out of some 30 of
them) has not been implemented, but they will be soon.
On the other hand, the new code has some very powerful extensions.
First, you can put "or" connectives between match fields (and soon
also between options), and write things like
ipfw add allow ip from { 1.2.3.4/27 or 5.6.7.8/30 } 10-23,25,1024-3000 to any
This should make rulesets slightly more compact (and lines longer!),
by condensing 2 or more of the old rules into single ones.
Also, as an example of how easy the rules can be extended, I have
implemented an 'address set' match pattern, where you can specify
an IP address in a format like this:
10.20.30.0/26{18,44,33,22,9}
which will match the set of hosts listed in braces belonging to the
subnet 10.20.30.0/26 . The match is done using a bitmap, so it is
essentially a constant time operation requiring a handful of CPU
instructions (and a very small amount of memmory -- for a full /24
subnet, the instruction only consumes 40 bytes).
Again, in this commit I have focused on functionality and tried
to minimize changes to the other parts of the system. Some performance
improvement can be achieved with minor changes to the interface of
ip_fw_chk_t. This will be done later when this code is settled.
The code is meant to compile unmodified on RELENG_4 (once the
PACKET_TAG_* changes have been merged), for this reason
you will see #ifdef __FreeBSD_version in a couple of places.
This should minimize errors when (hopefully soon) it will be time
to do the MFC.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.
jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.
zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.
uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing
parameters to more useful defaults.
Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in
sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.
ufs_readwrite.c:Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.
This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault
routines.
vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
Add XXX comments to mark places which need to be taken care of
if we want to remove this part of the kernel from Giant.
Add a comment on a potential performance problem with ip_forward()
packet forwarding state ("annotations") during ip processing.
The code is considerably cleaner now.
The variables removed by this change are:
ip_divert_cookie used by divert sockets
ip_fw_fwd_addr used for transparent ip redirection
last_pkt used by dynamic pipes in dummynet
Removal of the first two has been done by carrying the annotations
into volatile structs prepended to the mbuf chains, and adding
appropriate code to add/remove annotations in the routines which
make use of them, i.e. ip_input(), ip_output(), tcp_input(),
bdg_forward(), ether_demux(), ether_output_frame(), div_output().
On passing, remove a bug in divert handling of fragmented packet.
Now it is the fragment at offset 0 which sets the divert status of
the whole packet, whereas formerly it was the last incoming fragment
to decide.
Removal of last_pkt required a change in the interface of ip_fw_chk()
and dummynet_io(). On passing, use the same mechanism for dummynet
annotations and for divert/forward annotations.
option IPFIREWALL_FORWARD is effectively useless, the code to
implement it is very small and is now in by default to avoid the
obfuscation of conditionally compiled code.
NOTES:
* there is at least one global variable left, sro_fwd, in ip_output().
I am not sure if/how this can be removed.
* I have deliberately avoided gratuitous style changes in this commit
to avoid cluttering the diffs. Minor stule cleanup will likely be
necessary
* this commit only focused on the IP layer. I am sure there is a
number of global variables used in the TCP and maybe UDP stack.
* despite the number of files touched, there are absolutely no API's
or data structures changed by this commit (except the interfaces of
ip_fw_chk() and dummynet_io(), which are internal anyways), so
an MFC is quite safe and unintrusive (and desirable, given the
improved readability of the code).
MFC after: 10 days
Register the ISR early, but do not actually kick off the timer until we
see some activity. This still saves us from running the arp timers on
a system with no network cards.
indication of whether this happenned so the calling function
knows whether or not to unlock the pcb.
Submitted by: Jennifer Yang (yangjihui@yahoo.com)
Bug reported by: Sid Carter (sidcarter@symonds.net)
Ensure that the syn cache's syn-ack packets contain the same
ip_tos, ip_ttl, and DF bits as all other tcp packets.
PR: 39141
MFC after: 2 weeks
This time, make sure that ipv4 specific code (aka all of the above)
is only run in the ipv4 case.
o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.
o Determine the lock strategy for each members in struct socket.
o Lock down the following members:
- so_count
- so_options
- so_linger
- so_state
o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:
- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()
Reviewed by: alfred
results in the syncache entry being turned into a socket. While it's
not used in the main tree, this is required in the MAC tree so that
labels can be propagated from the mbuf to the socket. This is also
useful if you're doing things like transparent IP connection hijacking
and you want to use the syncache/cookie mechanism, but we won't go
there.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
Ipfw processing of frames at layer 2 can be enabled by the sysctl variable
net.link.ether.ipfw=1
Consider this feature experimental, because right now, the firewall
is invoked in the places indicated below, and controlled by the
sysctl variables listed on the right. As a consequence, a packet
can be filtered from 1 to 4 times depending on the path it follows,
which might make a ruleset a bit hard to follow.
I will add an ipfw option to tell if we want a given rule to apply
to ether_demux() and ether_output_frame(), but we have run out of
flags in the struct ip_fw so i need to think a bit on how to implement
this.
to upper layers
| |
+----------->-----------+
^ V
[ip_input] [ip_output] net.inet.ip.fw.enable=1
| |
^ V
[ether_demux] [ether_output_frame] net.link.ether.ipfw=1
| |
+->- [bdg_forward]-->---+ net.link.ether.bridge_ipfw=1
^ V
| |
to devices
bridged packets only, soon to come also for packets on ordinary
ether_input() and ether_output() paths. The syntax is
ipfw add <action> MAC dst src type
where dst and src can be "any" or a MAC address optionallyfollowed
by a mask, e.g.
10:20:30:40:50
10:20:30:40:50/32
10:20:30:40:50&ff:ff:ff:f0:ff:0f
and type can be a single ethernet type, a range, or a type followed by
a mask (values are always in hexadecimal) e.g.
0800
0800-0806
0800/8
0800&03ff
Note, I am still uncertain on what is the best format for inputting
these values, having the values in hexadecimal is convenient in most
cases but can be confusing sometimes. Suggestions welcome.
Implement suggestion from PR 37778 to allow "not me" on destination
and source IP. The code in the PR was slightly wrong and interfered
with the normal handling of IP addresses. This version hopefully is
correct.
Minor cleanup of the code, in some places moving the indentation to 4
spaces because the code was becoming too deep. Eventually, in a
separate commit, I will move the whole file to 4 space indent.
were totally useless and have been removed.
ip_input.c, ip_output.c:
Properly initialize the "ip" pointer in case the firewall does an
m_pullup() on the packet.
Remove some debugging code forgotten long ago.
ip_fw.[ch], bridge.c:
Prepare the grounds for matching MAC header fields in bridged packets,
so we can have 'etherfw' functionality without a lot of kernel and
userland bloat.
field. This returns the sdl_data field to a variable-length field. More
importantly, this prevents a easily-reproduceable data-corruption bug when
the interface name plus the hardware address exceed the sdl_data field's
original 12 byte limit. However, token-ring interfaces may still overflow
the new sdl_data field's 46 byte limit if the interface name exceeds 6
characters (since 6 characters for interface name plus 6 for hardware
address plus 34 for source routing = the size of sdl_data). Further
refinements could overcome this limitation but would break binary
compatibility; this commit only addresses fixing the bug for
commonly-occuring cases without breaking binary compatibility with the
intention that the functionality can be MFC'ed to -stable.
See message ID's (both send to -arch):
20020421013332.F87395-100000@gateway.posi.net20020430181359.G11009-300000@gateway.posi.net
for a more thorough description of the bug addressed and how to
reproduce it.
Approved by: silence on -arch and -net
Sponsored by: NTT Multimedia Communications Labs
MFC after: 1 week
- Used mld_xxx and MLD_xxx instead of mld6_xxx and MLD6_xxx according
to the official defintions in rfc2292bis
(macro definitions for backward compatibility were provided)
- Changed the first member of mld_hdr{} from mld_hdr to mld_icmp6_hdr
to avoid name space conflict in C++
This change makes ports/net/pchar compilable again under -CURRENT.
Obtained from: KAME
Turn the sigio sx into a mutex.
Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.
In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.
more on how ipfw(8) deals with tiny fragments. While we're at it, add
a quick log message to even let people know we dropped a packet. (Note
that the second FINE POINT is somewhat redundant given the first, but
since the code is there, leave the docs for it.)
MFC after: 1 day
Requested by: bde
Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to
sys/signalvar.h.
While I am here, sort include files alphabetically, where possible.
of a socket. This avoids lock order reversal caused by locking a
process in pgsigio().
sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked. Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.
sections for various standards. Conditionalize sections for various
standards. Use standards conforming spelling for types in the
sockaddr_in structure.
calling ioctl(SIOC[AS]IFADDR).
This allows the following:
ifconfig xx0 inet 1.2.3.1 netmask 0xffffff00
ifconfig xx0 inet 1.2.3.17 netmask 0xfffffff0 alias
ifconfig xx0 inet 1.2.3.25 netmask 0xfffffff8 alias
ifconfig xx0 inet 1.2.3.26 netmask 0xffffffff alias
but would (given the above) reject this:
ifconfig xx0 inet 1.2.3.27 netmask 0xfffffff8 alias
due to the conflicting netmasks. I would assert that it's wrong
to mask the EEXIST returned from rtinit() as in the above scenario, the
deletion of the 1.2.3.25 address will leave the 1.2.3.27 address
as unroutable as it was in the first place.
Offered for review on: -arch, -net
Discussed with: stephen macmanus <stephenm@bayarea.net>
MFC after: 3 weeks
This change allows bootp to work with more than one interface, at the
expense of some rather ``wrong'' looking code. I plan to MFC this in
place of luigi's recent #ifdef BOOTP stuff that was committed to this
file in -stable, as that's slightly more wrong that this is.
Offered for review on: -arch, -net
MFC after: 2 weeks
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
MI, not required to be a fixed size, and used in multiple headers.
This will grow in time, as more things move here from <sys/types.h>
and <machine/ansi.h>.
o Add missing type definitions (uint16_t and uint32_t) to
<arpa/inet.h> and <netinet/in.h>.
o Reduce pollution in <sys/types.h> by using `#if _FOO_T_DECLARED'
widgets to avoid including <sys/stdint.h>.
o Add some missing type definitions to <unistd.h> and note the ones
that still need to be added.
o Make use of <sys/_types.h> primitives in <grp.h> and <sys/types.h>.
Reviewed by: bde
Move the network code from using cr_cansee() to check whether a
socket is visible to a requesting credential to using a new
function, cr_canseesocket(), which accepts a subject credential
and object socket. Implement cr_canseesocket() so that it does a
prison check, a uid check, and add a comment where shortly a MAC
hook will go. This will allow MAC policies to seperately
instrument the visibility of sockets from the visibility of
processes.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
pointer which will then result in the allocated route's reference
count never being decremented. Just flood ping the localhost and
watch refcnt of the 127.0.0.1 route with netstat(1).
Submitted by: jayanth
Back out ip_output.c,v 1.143 and ip_mroute.c,v 1.69 that allowed
ip_output() to be called with a NULL route pointer. The previous
paragraph shows why this was a bad idea in the first place.
MFC after: 0 days
This increases the number of concurrent outgoing connections from ~4000
to ~16000. Other OSes (Solaris, OS X, NetBSD) and many other NAT
products have already made this change without ill effects, so we
should not run into any problems.
MFC after: 1 week
to are about to expire. This prevents high packet rate flows from
experiencing packet drops at the sender following ARP cache entry
timeout.
PR: kern/25517
Reviewed by: luigi
MFC after: 7 days
for POSIX.1-2001 conformance.
o Add magic to <netinet/in.h> and <netinet6/in6.h> to prevent
redefining INET_ADDRSTRLEN and INET6_ADDRSTRLEN.
o Add a note about missing typedefs in <arpa/inet.h>.
o In i386's <machine/endian.h>, macros have some advantages over
inlines, so change some inlines to macros.
o In i386's <machine/endian.h>, ungarbage collect word_swap_int()
(previously __uint16_swap_uint32), it has some uses on i386's with
PDP endianness.
Submitted by: bde
o Move a comment up in <machine/endian.h> that was accidentially moved
down a few revisions ago.
o Reenable userland's use of optimized inline-asm versions of
byteorder(3) functions.
o Fix ordering of prototypes vs. redefinition of byteorder(3)
functions, so that the non-GCC (libc asm) case has proper
prototypes.
o Add proper prototypes for byteorder(3) functions in <sys/param.h>.
o Prevent redundant duplicate prototypes by making use of the
_BYTEORDER_PROTOTYPED define.
o Move the bswap16(), bswap32(), bswap64() C functions into MD space
for platforms in which asm versions don't exist. This significantly
reduces the complexity of some things at the cost of duplicate code.
Reviewed by: bde
spares (the size of the field was changed from u_short to u_int to
reflect what it really ends up being). Accordingly, change users of
xucred to set and check this field as appropriate. In the kernel,
this is being done inside the new cru2x() routine which takes a
`struct ucred' and fills out a `struct xucred' according to the
former. This also has the pleasant sideaffect of removing some
duplicate code.
Reviewed by: rwatson
were destined for a broadcast IP address. All TCP packets with a
broadcast destination must be ignored. The system only ignored packets
that were _link-layer_ broadcasts or multicast. We need to check the
IP address too since it is quite possible for a broadcast IP address
to come in with a unicast link-layer address.
Note that the check existed prior to CSRG revision 7.35, but was
removed. This commit effectively backs out that nine-year-old change.
PR: misc/35022
so that after the first time we can follow the pointer instead
of having to scan the list.
This was the intended behaviour from day one.
PR: 34639
MFC-after: 3 days
deprecated in favor of the POSIX-defined lowercase variants.
o Change all occurrences of NTOHL() and associated marcros in the
source tree to use the lowercase function variants.
o Add missing license bits to sparc64's <machine/endian.h>.
Approved by: jake
o Clean up <machine/endian.h> files.
o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>.
o Remove prototypes for non-existent bswapXX() functions.
o Include <machine/endian.h> in <arpa/inet.h> to define the
POSIX-required ntohl() family of functions.
o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>,
and <sys/param.h>.
o Prepend underscores to the ntohl() family to help deal with
complexities associated with having MD (asm and inline) versions, and
having to prevent exposure of these functions in other headers that
happen to make use of endian-specific defines.
o Create weak aliases to the canonical function name to help deal with
third-party software forgetting to include an appropriate header.
o Remove some now unneeded pollution from <sys/types.h>.
o Add missing <arpa/inet.h> includes in userland.
Tested on: alpha, i386
Reviewed by: bde, jake, tmm
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.
Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
some time. _All_ packets, regardless of destination, were accepted by
the machine as if addressed to it.
Jump back to 'pass' processing for a teed packet instead of falling
through as if it was ours.
PR: kern/31130
Reviewed by: -net, luigi
MFC after: 2 weeks
- Clear the cached destination before getting another cached route.
Otherwise, garbage in the padding space (which might be filled in if it was
used for IPv4) could annoy rtalloc.
Obtained from: KAME
(We should be able to handle locally originated IP packets, and
these do not have m_pkthdr.rcvif set.)
PR: kern/32806, kern/33766
Reviewed by: luigi
Fix tested by: Maxim Konovalov <maxim@macomnet.ru>,
Erwin Lansing <erwin@lansing.dk>
The following steps are involved:
a) the IP options related to routing (LSRR and SSRR) are processed
as though the router were a host,
b) the other IP options are processed as usual only if the packet
is destined for the router; otherwise they are ignored.
PR: kern/23123
Discussed in: freebsd-hackers
An old route will be NULL at that point if a packet were initially
routed to an interface (using the IP_ROUTETOIF flag.)
Submitted by: Igor Timkin <ivt@gamma.ru>
All TCP ISNs that are sent out are valid cookies, which allows entries
in the syncache to be dropped and still have the ACK accepted later.
As all entries pass through the syncache, there is no sudden switchover
from cache -> cookies when the cache is full; instead, syncache entries
simply have a reduced lifetime. More details may be found in the
"Resisting DoS attacks with a SYN cache" paper in the Usenix BSDCon 2002
conference proceedings.
Sponsored by: DARPA, NAI Labs
Now that we've increased the size of our send / receive buffers, bursting
an entire window onto the network may cause congestion. As a result,
we will slow start beginning with a flightsize of 4 packets.
Problem reported by: Thomas Zenker <thz@Lennartz-electronic.de>
MFC after: 3 days
Easily exploitable by flood pinging the target
host over an interface with the IFF_NOARP flag
set (all you need to know is the target host's
MAC address).
MFC after: 0 days
mutable contents of struct prison (hostname, securelevel, refcount,
pr_linux, ...)
o Generally introduce mtx_lock()/mtx_unlock() calls throughout kern/
so as to enforce these protections, in particular, in kern_mib.c
protection sysctl access to the hostname and securelevel, as well as
kern_prot.c access to the securelevel for access control purposes.
o Rewrite linux emulator abstractions for accessing per-jail linux
mib entries (osname, osrelease, osversion) so that they don't return
a pointer to the text in the struct linux_prison, rather, a copy
to an array passed into the calls. Likewise, update linprocfs to
use these primitives.
o Update in_pcb.c to always use prison_getip() rather than directly
accessing struct prison.
Reviewed by: jhb
receiver was not sending an immediate ack with delayed acks turned on
when the input buffer is drained, preventing the transmitter from
restarting immediately.
Propogate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and
is a good idea anyway).
Some cleanup. Identify additonal issues in comments.
MFC after: 1 day
o Hide nonstandard functions and types in <netinet/in.h> when
_POSIX_SOURCE is defined.
o Add some missing types (required by POSIX.1-200x) to <netinet/in.h>.
o Restore vendor ID from Rev 1.1 in <netinet/in.h> and make use of new
__FBSDID() macro.
o Fix some miscellaneous issues in <arpa/inet.h>.
o Correct final argument for the inet_ntop() function (POSIX.1-200x).
o Get rid of the namespace pollution from <sys/types.h> in
<arpa/inet.h>.
Reviewed by: fenner
Partially submitted by: bde
interface address, blow the address away again before returning the
error.
In in_ifinit(), if we get an error from rtinit() and we've also got
a destination address, return the error rather than masking EEXISTS.
Failing to create a host route when configuring an interface should
be treated as an error.
received on an interface without an IP address, try to find a
non-loopback AF_INET address to use. If that fails, drop it.
Previously, we used the address at the top of the in_ifaddrhead list,
which didn't make much sense, and would cause a panic if there were no
AF_INET addresses configured on the system.
PR: 29337, 30524
Reviewed by: ru, jlemon
Obtained from: NetBSD
for passive mode data connections (PASV/EPSV -> 227/229). Well,
the actual punching happens a bit later, when the aliasing link
becomes fully specified.
Prodded by: Danny Carroll <dannycarroll@hotmail.com>
MFC after: 1 week
to be followed by nfsnodehashtbl, so bzeroing callouts beyond the end of
tcp_syncache soon caused a null pointer panic when nfsnodehashtbl was
accessed.
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.
sysctl_req', which describes in-progress sysctl requests. This permits
sysctl handlers to have access to the current thread, permitting work
on implementing td->td_ucred, migration of suser() to using struct
thread to derive the appropriate ucred, and allowing struct thread to be
passed down to other code, such as network code where td is not currently
available (and curproc is used).
o Note: netncp and netsmb are not updated to reflect this change, as they
are not currently KSE-adapted.
Reviewed by: julian
Obtained from: TrustedBSD Project
"[...] and removes the hostcache code from standard kernels---the
code that depends on it is not going to happen any time soon,
I'm afraid."
Time to clean up.
called and ip_output() encounters an error and bails (i.e. host
unreachable), we will leak an mbuf. This is because the code calls
m_freem(m0) after jumping to the bad: label at the end of the function,
when it should be calling m_freem(m). (m0 is the original mbuf list
_without_ the options mbuf prepended.)
Obtained from: NetBSD
ifconfig, which expects the address returned by the SIOCGIFNETMASK ioctl
to have a valid sa_family. Similar changes may be necessary for IPv6.
While we're here, get rid of an unnecessary temp variable.
MFC after: 2 weeks
is calculated. this caused some trouble in the code which the ip header
is not modified. for example, inbound policy lookup failed.
Obtained from: KAME
MFC after: 1 week
Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *''
as the argument. Pass rt_addrinfo all the way down to rtrequest1
and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now
``rt_addrinfo *'' instead of ``sockaddr *'' (almost noone is
using it anyways).
Benefit: the following command now works. Previously we needed
two route(8) invocations, "add" then "change".
# route add -inet6 default ::1 -ifp gif0
Remove unsafe typecast in rtrequest(), from ``rtentry *'' to
``sockaddr *''. It was introduced by 4.3BSD-Reno and never
corrected.
Obtained from: BSD/OS, NetBSD
MFC after: 1 month
PR: kern/28360
a single kern.security.seeotheruids_permitted, describes as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().
Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.
Reviewed by: ps, billf
Obtained from: TrustedBSD Project
to send all its data, especially when the data is less than one MSS.
This fixes an issue where the stack was delaying the sending
of data, eventhough there was enough window to send all the data and
the sending of data was emptying the socket buffer.
Problem found by Yoshihiro Tsuchiya (tsuchiya@flab.fujitsu.co.jp)
Submitted by: Jayanth Vijayaraghavan
kern.ipc.showallsockets is set to 0.
Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks
+ implement "limit" rules, which permit to limit the number of sessions
between certain host pairs (according to masks). These are a special
type of stateful rules, which might be of interest in some cases.
See the ipfw manpage for details.
+ merge the list pointers and ipfw rule descriptors in the kernel, so
the code is smaller, faster and more readable. This patch basically
consists in replacing "foo->rule->bar" with "rule->bar" all over
the place.
I have been willing to do this for ages!
MFC after: 1 week
not referenced in Stevens, and does not compile with g++.
There is an equivalent structure, struct ipoption in ip_var.h
which is actually used in various parts of the kernel, and also referenced
in Stevens.
Bill Fenner also says:
... if you want the trivia, struct ip_opts was introduced
in in.h SCCS revision 7.9, on 6/28/1990, by Mike Karels.
struct ipoption was introduced in ip_var.h SCCS revision 6.5,
on 9/16/1985, by... Mike Karels.
MFC-after: 3 days
NAT in extended passive mode if the server's public IP address was
different from the main NAT address. This caused a wrong aliasing
link to be created that did not route the incoming packets back to
the original IP address of the server.
natd -v -n pub0 -redirect_address localFTP publicFTP
Note that even if localFTP == publicFTP, one still needs to supply
the -redirect_address directive. It is needed as a helper because
extended passive mode's 229 reply does not contain the IP address.
MFC after: 1 week
and speed. No new functionality added (yet) apart from a bugfix.
MFC will occur in due time and probably in stages.
BUGFIX: fix a problem in old code which prevented reallocation of
the hash table for dynamic rules (there is a PR on this).
OTHER CHANGES: minor changes to the internal struct for static and dynamic rules.
Requires rebuild of ipfw binary.
Add comments to show how data structures are linked together.
(It probably makes no sense to keep the chain pointers separate
from actual rule descriptors. They will be hopefully merged soon.
keep a (sysctl-readable) counter for the number of static rules,
to speed up IP_FW_GET operations
initial support for a "grace time" for expired connections, so we
can set timeouts for closing connections to much shorter times.
merge zero_entry() and resetlog_entry(), they use basically the
same code.
clean up and reduce replication of code for removing rules,
both for readability and code size.
introduce a separate lifetime for dynamic UDP rules.
fix a problem in old code which prevented reallocation of
the hash table for dynamic rules (PR ...)
restructure dynamic rule descriptors
introduce some local variables to avoid multiple dereferencing of
pointer chains (reduces code size and hopefully increases speed).
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
new data is acknowledged, reset the dupacks to 0.
The problem was spotted when a connection had its send buffer full
because the congestion window was only 1 MSS and was not being incremented
because dupacks was not reset to 0.
Obtained from: Yahoo!
to the application as a RST would, this way we're compatible with the most
applications.
MFC candidate.
Submitted by: Scott Renfro <scott@renfro.org>
Reviewed by: Mike Silbersack <silby@silby.com>
about rules and dynamic rules. it later fills this buffer with these
rules.
it also takes the opporunity to compare the expiration of the dynamic
rules with the current time and either marks them for deletion or simply
charges the countdown.
unfortunatly it does this all (the sizing, the buffer copying, and the
expiration GC) with no spl protection whatsoever. it was possible for
the dynamic rule(s) to be ripped out from under the request before it
had completed, resulting in corrupt memory dereferencing.
Reviewed by: ps
MFC before: 4.4-RELEASE, hopefully.
In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented. Barring
any major crypographic breakthroughs, this algorithm should be
unbreakable. In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.
Reviewed by: jesper
cdevsw entries have been for a long time.
Discover that we now have two version sof the same structure.
I will shoot one of them shortly when I figure out why someone thinks
they need it. (And I can prove they don't)
(netinet/ipprotosw.h should GO AWAY)
Avoid using parenthesis enclosure macros (.Pq and .Po/.Pc) with plain text.
Not only this slows down the mdoc(7) processing significantly, but it also
has an undesired (in this case) effect of disabling hyphenation within the
entire enclosed block.
making pcbs available to the outside world. otherwise, we will see
inpcb without ipsec security policy attached (-> panic() in ipsec.c).
Obtained from: KAME
MFC after: 3 days
- Use sysctl to export stats
- Use ip_encap.c's encapsulation support
- Update lkm to kld (is 6 years a record for a broken module?)
- Remove some unused cruft
This macro was supposed to only match local IP addresses of
interfaces, and all consumers of this macro assume this as
well. (See IP_MULTICAST_IF and IP_ADD_MEMBERSHIP socket
options in the ip(4) manpage.)
This fixes a major security breach in IPFW-based firewalls
where the `me' keyword would match the other end of a P2P
link.
PR: kern/28567
This should help us in nieve benchmark "tests".
It seems a wide number of people think 32k buffers would not cause major
issues, and is in fact in use by many other OS's at this time. The
receive buffers can be bumped higher as buffers are hardly used and several
research papers indicate that receive buffers rarely use much space at all.
Submitted by: Leo Bicknell <bicknell@ufp.org>
<20010713101107.B9559@ussenterprise.ufp.org>
Agreed to in principle by: dillon (at the 32k level)
generation scheme. Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.
While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.
To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments,
1 = the OpenBSD algorithm. 1 is still the default.
Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.
Reviewed by: jlemon
Tested by: numerous subscribers of -net
RTF_DYNAMIC route, it got freed twice). I am not sure what was
the actual problem in 1992, but the current behavior is memory
leak if PCB holds a reference to a dynamically created/modified
routing table entry. (rt_refcnt>0 and we don't call rtfree().)
My test bed was:
1. Set net.inet.tcp.msl to a low value (for test purposes), e.g.,
5 seconds, to speed up the transition of TCP connection to a
"closed" state.
2. Add a network route which causes ICMP redirect from the gateway.
3. ping(8) host H that matches this route; this creates RTF_DYNAMIC
RTF_HOST route to H. (I was forced to use ICMP to cause gateway
to generate ICMP host redirect, because gateway in question is a
4.2-STABLE system vulnerable to a problem that was fixed later in
ip_icmp.c,v 1.39.2.6, and TCP packets with DF bit set were
triggering this bug.)
4. telnet(1) to H
5. Block access to H with ipfw(8)
6. Send something in telnet(1) session; this causes EPERM, followed
by an in_losing() call in a few seconds.
7. Delete ipfw(8) rule blocking access to H, and wait for TCP
connection moving to a CLOSED state; PCB is freed.
8. Delete host route to H.
9. Watch with netstat(1) that `rttrash' increased.
10. Repeat steps 3-9, and watch `rttrash' increases.
PR: kern/25421
MFC after: 2 weeks
only do getcred calls for sockets which were created in the same jail.
This should allow the ident to work in a reasonable way within jails.
PR: 28107
Approved by: des, rwatson
connection. The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.
Previously, tcp templates required the allocation of one
mbuf per connection. On large systems, this change should
free up a large number of mbufs.
Reviewed by: bmilekic, jlemon, ru
MFC after: 2 weeks
are duplicated by newly defined types/options in RFC3121
- We have no backward compatibility issue. There is no apps in our
distribution which use the above types/options.
Obtained from: KAME
MFC after: 2 weeks
sizeof(ro_dst) is not necessarily the correct one.
this change would also fix the recent path MTU discovery problem for the
destination of an incoming TCP connection.
Submitted by: JINMEI Tatuya <jinmei@kame.net>
Obtained from: KAME
MFC after: 2 weeks
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.
TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.
Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks
around, use a common function for looking up and extracting the tunables
from the kernel environment. This saves duplicating the same function
over and over again. This way typically has an overhead of 8 bytes + the
path string, versus about 26 bytes + the path string.
One way we can reduce the amount of traffic we send in response to a SYN
flood is to eliminate the RST we send when removing a connection from
the listen queue. Since we are being flooded, we can assume that the
majority of connections in the queue are bogus. Our RST is unwanted
by these hosts, just as our SYN-ACK was. Genuine connection attempts
will result in hosts responding to our SYN-ACK with an ACK packet. We
will automatically return a RST response to their ACK when it gets to us
if the connection has been dropped, so the early RST doesn't serve the
genuine class of connections much. In summary, we can reduce the number
of packets we send by a factor of two without any loss in functionality
by ensuring that RST packets are not sent when dropping a connection
from the listen queue.
Submitted by: Mike Silbersack <silby@silby.com>
Reviewed by: jesper
MFC after: 2 weeks
A attacker sending a lot of bogus fragmented packets to the target
(with different IPv4 identification field - ip_id), may be able
to put the target machine into mbuf starvation state.
By setting a upper limit on the number of reassembly queues we
prevent this situation.
This upper limit is controlled by the new sysctl
net.inet.ip.maxfragpackets which defaults to 200,
as the IPv6 case, this should be sufficient for most
systmes, but you might want to increase it if you have
lots of TCP sessions.
I'm working on making the default value dependent on
nmbclusters.
If you want old behaviour (no upper limit) set this sysctl
to a negative value.
If you don't want to accept any fragments (not recommended)
set the sysctl to 0 (zero).
Obtained from: NetBSD
MFC after: 1 week
This closes a minor information leak which allows a remote observer to
determine the rate at which the machine is generating packets, since the
default behaviour is to increment a counter for each packet sent.
Reviewed by: -net
Obtained from: OpenBSD
A attacker sending a lot of bogus fragmented packets to the target
(with different IPv4 identification field - ip_id), may be able
to put the target machine into mbuf starvation state.
By setting a upper limit on the number of reassembly queues we
prevent this situation.
This upper limit is controlled by the new sysctl
net.inet.ip.maxfragpackets which defaults to NMBCLUSTERS/4
If you want old behaviour (no upper limit) set this sysctl
to a negative value.
If you don't want to accept any fragments (not recommended)
set the sysctl to 0 (zero)
Obtained from: NetBSD (partially)
MFC after: 1 week
any response to our third SYN to work-around some broken
terminal servers (most of which have hopefully been retired)
that have bad VJ header compression code which trashes TCP
segments containing unknown-to-them TCP options.
PR: kern/1689
Submitted by: jesper
Reviewed by: wollman
MFC after: 2 weeks
For FTP control connection, keep the CRLF end-of-line termination
status in there.
Fixed the bug when the first FTP command in a session was ignored.
PR: 24048
MFC after: 1 week
other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
Change code from PRC_UNREACH_ADMIN_PROHIB to PRC_UNREACH_PORT for
ICMP_UNREACH_PROTOCOL and ICMP_UNREACH_PORT
And let TCP treat PRC_UNREACH_PORT like PRC_UNREACH_ADMIN_PROHIB
This should fix the case where port unreachables for udp returned
ENETRESET instead of ECONNREFUSED
Problem found by: Bill Fenner <fenner@research.att.com>
Reviewed by: jlemon
sysctl, net.inet.ip.fw.permanent_rules.
This allows you to install rules that are persistent across flushes,
which is very useful if you want a default set of rules that
maintains your access to remote machines while you're reconfiguring
the other rules.
Reviewed by: Mark Murray <markm@FreeBSD.org>