Improve consistency of the field names with tcpsinfo_t:
* Use mss instead of max_seg_size.
* Use lport and rport instead of tcp_localport and tcp_foreignport.
Use t_flags instead of flags to improve consistency with t_flags2.
Add laddr and raddr, since the addresses were missing when compared
to the output of siftr.
Reviewed by: cc
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D40834
Summary:
(1) use inp_flowid or a new packet hash for a flow identification
(2) cache constant connection info into struct flow_hash_node
(3) use compressed notation for IPv6 address representation
Reviewers: rscheff, tuexen
Approved by: tuexen (mentor)
Subscribers: imp, melifaro, glebius
Differential Revision: https://reviews.freebsd.org/D40302
The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.
Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix
The 0b70e3e78b changed the original design of a single entry point
into pfil(9) chains providing separate functions for the filtering
points that always provide mbufs and know the direction of a flow.
The motivation was to reduce branching. The logical continuation
would be to do the same for the filtering points that always provide
a memory pointer and retire the single entry point.
o Hooks now provide two functions: one for mbufs and optional for
memory pointers.
o pfil_hook_args() has a new member and pfil_add_hook() has a
requirement to zero out uninitialized data. Bump PFIL_VERSION.
o As it was before, a hook function for a memory pointer may realloc
into an mbuf. Such mbuf would be returned via a pointer that must
be provided in argument.
o The only hook that supports memory pointers is ipfw:default-link.
It is rewritten to provide two functions.
o All remaining uses of pfil_run_hooks() are converted to
pfil_mem_in().
o Transparent union of pfil_packet_t and tricks to fix pointer
alignment are retired. Internal pfil_realloc() reduces down to
m_devget() and thus is retired, too.
Reviewed by: mjg, ocochard
Differential revision: https://reviews.freebsd.org/D37977
Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193a, this commit shall not cause any functional changes.
Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.
Differential revision: https://reviews.freebsd.org/D36400
Replace the old snd_bwnd field which was kept for compatibility with the
t_flags2 field from the tcpcb. This exposes in siftr logs interesting
things such as ECN, PLPMTUD, Accurate ECN and if first bytes are
complete.
Reviewed by: rscheff (transport), chengc_netapp.com, debdrup (manpages)
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
X-NetApp-PR: #73
Differential Revision: https://reviews.freebsd.org/D34672
A few pieces of the SIFTR code that are behind #ifdef SIFTR_IPV6 have
not been updated as APIs have changed, etc.
Reported by: Alexander Sideropoulos <Alexander.Sideropoulos@netapp.com>
Reviewed by: rscheff, lstewart
Sponsored by: NetApp
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D32698
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
The new sysctl was not added to the siftr.4 man page at the time.
This updates the man page, and removes one left over trailing whitespace.
Submitted by: Richard Scheffenegger
Reviewed by: bcr@
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21619
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.
In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.
New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.
Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.
Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.
Differential Revision: https://reviews.freebsd.org/D18951
SIFTR does not allow any kind of filtering, but captures every packet
processed by the TCP stack.
Often, only a specific session or service is of interest, and doing the
filtering in post-processing of the log adds to the overhead of SIFTR.
This adds a new sysctl net.inet.siftr.port_filter. When set to zero, all
packets get captured as previously. If set to any other value, only
packets where either the source or the destination ports match, are
captured in the log file.
Submitted by: Richard Scheffenegger
Reviewed by: Cheng Cui
Differential Revision: https://reviews.freebsd.org/D18897
Correct a logic error.
Only disable when already enabled or enable when disabled.
Submitted by: Richard Scheffenegger
Reviewed by: Cheng Cui
Obtained from: Cheng Cui
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D18885
Use the sysctl_handle_int() handler to write out the old value and read
the new value into a temporary variable. Use the temporary variable
for any checks of values rather than using the CAST_PTR_INT() macro on
req->newptr. The prior usage read directly from userspace memory if the
sysctl() was called correctly. This is unsafe and doesn't work at all on
some architectures (at least i386.)
In some cases, the code could also be tricked into reading from kernel
memory and leaking limited information about the contents or crashing
the system. This was true for CDG, newreno, and siftr on all platforms
and true for i386 in all cases. The impact of this bug is largest in
VIMAGE jails which have been configured to allow writing to these
sysctls.
Per discussion with the security officer, we will not be issuing an
advisory for this issue as root access and a non-default config are
required to be impacted.
Reviewed by: markj, bz
Discussed with: gordon (security officer)
MFC after: 3 days
Security: kernel information leak, local DoS (both require root)
Differential Revision: https://reviews.freebsd.org/D18443
INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic
it is possible, that the code is running in net_epoch_preempt section,
and INP_INFO_UNLOCK_ASSERT() is very strict assertion for such case.
PR: 231428
Reviewed by: mmacy, tuexen
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17335
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.
In preparation to fix this create two macros depending on if the data is
global or static.
Reviewed by: bz, emaste, markj
Sponsored by: ABT Systems Ltd
Differential Revision: https://reviews.freebsd.org/D16140
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
No functional change intended.
sb_cc member of struct sockbuf to a couple of inline functions:
sbavail() and sbused()
Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.
Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Fix a siftr(4) potential memory leak and INVARIANTS triggered kernel panic in
hashdestroy() by ensuring the last array index in the flow counter hash table is
flushed of entries.
MFC after: 3 days
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.
After this change a packet processed by the stack isn't
modified at all[2] except for TTL.
After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.
[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.
[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.
Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.
Changes reverted:
------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines
Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.
------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines
Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
the SIFTR pfil(9) hook functions to or from all network stacks. This patch
allows packets inbound or outbound from a vnet to be "seen" by SIFTR.
Additional work is required to allow SIFTR to actually generate log messages for
all vnet related packets because the siftr_findinpcb() function does not yet
search for inpcbs across all vnets. This issue will be fixed separately.
Reported and tested by: David Hayes <dahayes at swin edu au>
MFC after: 3 days
tree in preparation for another large code import. Swinburne University is the
legal entity that owns copyright and the 2-clause BSD licence is acceptable.