Given nodes 1 and 2, where node 1 has an advskew of 0 and node 2 has an
advskew of 100, making them master and backup respectively.
If net.inet.carp.demotion is set to a negative value on node 1, node 2
might become master while node 1 still retains it master status. Wether
or not node 2 becomes master seems to depend on the nodes advskew and
what the demotion sysctl was set to on node 1.
The reason for node 2 becoming master seems to be that the calculated
advskew taking demotion into account is truncated to a single unsigned
byte when copied into the carp header for sending, and node 1 stays
master since it takes uses the whole non-truncated calculated advskew
when deciding wether to stay master.
PR: 259528
Reviewed by: donner, glebius
MFC after: 3 weeks
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D32759
(cherry picked from commit 1019354b54)
Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty
V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind
did the same for empty V_in6_ifaddrhead (no IPv6 addresses).
An equivalent test has existed since 4.4-Lite. It was presumably done
to avoid extra work (assuming the address isn't going to be found
later).
In normal system operation *_ifaddrhead will not be empty: they will
at least have the loopback address(es). In practice no work will be
avoided.
Further, this case caused net/dhcpd to fail when run early in boot
before assignment of any addresses. It should be possible to bind the
unspecified address even if no addresses have been configured yet, so
just remove the tests.
The now-removed "XXX broken" comments were added in 59562606b9,
which converted the ifaddr lists to TAILQs. As far as I (emaste) can
tell the brokenness is the issue described above, not some aspect of
the TAILQ conversion.
PR: 253166
Reviewed by: ae, bz, donner, emaste, glebius
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D32563
(cherry picked from commit 5c5340108e)
Before passing an IPv6 packet to application apply delayed checksum
calculation. Mbuf flags will be lost when divert listener will return a
packet back, so we will not be able to do delayed checksum calculation
later. Also an application will get a packet with correct checksum.
Reviewed by: donner
Differential Revision: https://reviews.freebsd.org/D32807
(cherry picked from commit 4a9e95286c)
The address with a host part of all zeros was used as a broadcast long
ago, but the default has been all ones since 4.3BSD and RFC1122. Until
now, we would broadcast the host zero address as well as the configured
address. Change to not broadcasting that address by default, but add a
sysctl (net.inet.ip.broadcast_lowest) to re-enable it. Note that the
correct way to use the zero address for broadcast would be to configure
it as the broadcast address for the network.
See https:/datatracker.ietf.org/doc/draft-schoen-intarea-lowest-address/
and the discussion in https://reviews.freebsd.org/D19316. Note, Linux
now implements this.
Reviewed by: rgrimes, tuexen; melifaro (previous version)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D31861
(cherry picked from commit fd0765933c)
Tracking the number of unused holes in the trie and the range table
was a bad metric based on which full trie and / or range rebuilds
were triggered, which would happen in vain by far too frequently,
particularly with live BGP feeds.
Instead, track the total unused space inside the trie and range table
structures, and trigger rebuilds if the percentage of unused space
exceeds a sysctl-tunable threshold.
MFC after: 3 days
PR: 257965
In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a
socket rather than a socket buffer. Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.
When locking for I/O, return an error if the socket is a listening
socket. Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.
Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before. In
particular, add checks to sendfile() and sorflush().
Reviewed by: tuexen, gallatin
Sponsored by: The FreeBSD Foundation
(cherry picked from commit f94acf52a4)
Traversing a single list of unused range chunks in search for a block
of optimal size was suboptimal.
The experience with real-world BGP workloads has shown that on average
unused range chunks are tiny, mostly in length from 1 to 4 or 5, when
DXR is configured with K = 20 which is the current default (D16X4R).
Therefore, introduce a limited amount of buckets to accomodate descriptors
of empty blocks of fixed (small) size, so that those can be found in O(1)
time. If no empty chunks of the requested size can be found in fixed-size
buckets, the search continues in an unsorted list of empty chunks of
variable lengths, which should only happen infrequently.
This change should permit us to manage significantly more empty range
chunks without sacrifying the speed of incremental range table updating.
MFC after: 3 days
There are two flags to request a non-blocking receive on a socket:
MSG_NBIO and MSG_DONTWAIT. They are handled a bit differently in that
soreceive_generic() and soreceive_stream() will block on the socket I/O
lock when MSG_NBIO is set, but not if MSG_DONTWAIT is set. In general,
MSG_NBIO seems to mean, "don't block if there is no data to receive" and
MSG_DONTWAIT means "don't go to sleep for any reason".
SCTP's soreceive implementation did not allow blocking on the I/O lock
if either flag is set, but this violates an assumption in
aio_process_sb(), which specifies MSG_NBIO but nonetheless
expects to make progress if data is available to read. Change
sctp_sorecvmsg() to block on the I/O lock only if MSG_DONTWAIT
is not set.
Reported by: syzbot+c7d22dbbb9aef509421d@syzkaller.appspotmail.com
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit e6c19aa94d)
A division by zero would occur if DXR would be activated on a vnet
with no IP addresses configured on any interfaces.
PR: 257965
MFC after: 3 days
Reported by: Raul Munoz
(cherry picked from commit eb3148cc4d)
Don't rebuild in vain trie parts unaffected by accumulated incremental
RIB updates.
PR: 257965
Tested by: Konrad Kreciwilk
MFC after: 3 days
(cherry picked from commit b51f8bae57)
Free memory before return from arprequest_internal(). In in_arpinput(),
if arp_fillheader() fails, it should use goto drop.
Reviewed by: melifaro, imp, markj
Pull Request: https://github.com/freebsd/freebsd-src/pull/534
(cherry picked from commit f5777c123a)
- The SCTP_PCB_FLAGS_SND_ITERATOR_UP check was racy, since two threads
could observe that the flag is not set and then both set it. I'm not
sure if this is actually a problem in practice, i.e., maybe there's no
problem having multiple sends for a single PCB in the iterator list?
- sctp_sendall() was modifying sctp_flags without the inp lock held.
The change simply acquires the PCB write lock before toggling the flag,
fixing both problems.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 173a7a4ee4)
This appears to be unused in usrsctp as well. No functional change
intended.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit e8e23ec127)
sctp_close() and sctp_abort() disassociate the PCB from its socket.
As a part of this, they attempt to free the PCB, which may end up
lingering. Fix some bugs in this area:
- For some reason, sctp_close() and sctp_abort() set
SCTP_PCB_FLAGS_SOCKET_GONE using an atomic compare-and-set without the
PCB lock held. This is racy since sctp_flags is normally updated
without atomics, using the PCB lock to synchronize. So, the update
can be lost, which can cause all sort of races with other SCTP
components which look for the _GONE flag. Fix the problem simply by
acquiring the PCB lock in order to set the flag. Note that we have to
drop and re-acquire the lock again in sctp_inpcb_free(), but I don't
see a good way around that for now. If it's a real problem, the _GONE
flag could be split out of sctp_flags and into a dedicated sctp_inpcb
field.
- In sctp_inpcb_free(), load sctp_socket after acquiring the PCB lock,
to avoid possible races with parallel sctp_inpcb_free() calls.
- Add an assertion sctp_inpcb_free() to verify that _ALLGONE is not set.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit c17b531bed)
It appears that maliciously crafted ifaliasreq can lead to NULL
pointer dereference in in_aifaddr_ioctl(). In order to replicate
that, one needs to
1. Ensure that carp(4) is not loaded
2. Issue SIOCAIFADDR call setting ifra_vhid field of the request
to a negative value.
A repro code would look like this.
int main() {
struct ifaliasreq req;
struct sockaddr_in sin, mask;
int fd, error;
bzero(&sin, sizeof(struct sockaddr_in));
bzero(&mask, sizeof(struct sockaddr_in));
sin.sin_len = sizeof(struct sockaddr_in);
sin.sin_family = AF_INET;
sin.sin_addr.s_addr = inet_addr("192.168.88.2");
mask.sin_len = sizeof(struct sockaddr_in);
mask.sin_family = AF_INET;
mask.sin_addr.s_addr = inet_addr("255.255.255.0");
fd = socket(AF_INET, SOCK_DGRAM, 0);
if (fd < 0)
return (-1);
memset(&req, 0, sizeof(struct ifaliasreq));
strlcpy(req.ifra_name, "lo0", sizeof(req.ifra_name));
memcpy(&req.ifra_addr, &sin, sin.sin_len);
memcpy(&req.ifra_mask, &mask, mask.sin_len);
req.ifra_vhid = -1;
return ioctl(fd, SIOCAIFADDR, (char *)&req);
}
To fix, discard both positive and negative vhid values in
in_aifaddr_ioctl, if carp(4) is not loaded. This prevents NULL pointer
dereference and kernel panic.
Reviewed by: imp@
Pull Request: https://github.com/freebsd/freebsd-src/pull/530
(cherry picked from commit 620cf65c2b)
This will be used by sctp_listen() to avoid dropping locks when
performing an implicit bind. No functional change intended.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 457abbb857)
Later in sctp_free_assoc(), when we clean up chunk lists,
sctp_free_spbufspace() is used to reset the byte count in the socket
send buffer. However, if the PCB is going away, the socket may already
have been detached from the PCB, in which case this becomes a use-after
free. Clear the socket reference from the association before detaching
it from the PCB, if the PCB has already lost its socket reference.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 65f30a39e1)
At this point we do not hold the inpcb lock, so the only thing holding
the socket reference live is the TCB lock, which needs to be acquired by
sctp_inpcb_free() in order to destroy associations. Defer the unlock to
until after we dereference the socket reference.
Reported by: syzbot+1d0f2c4675de76a4cf1e@syzkaller.appspotmail.com
Reported by: syzbot+fabee77954fe69d3a5ad@syzkaller.appspotmail.com
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit d35be50f57)
Implement kernel support for RFC 5549/8950.
* Relax control plane restrictions and allow specifying IPv6 gateways
for IPv4 routes. This behavior is controlled by the
net.route.rib_route_ipv6_nexthop sysctl (on by default).
* Always pass final destination in ro->ro_dst in ip_forward().
* Use ro->ro_dst to exract packet family inside if_output() routines.
Consistently use RO_GET_FAMILY() macro to handle ro=NULL case.
* Pass extracted family to nd6_resolve() to get the LLE with proper encap.
It leverages recent lltable changes committed in c541bd368f.
Presence of the functionality can be checked using ipv4_rfc5549_support feature(3).
Example usage:
route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0
Differential Revision: https://reviews.freebsd.org/D30398
(cherry picked from commit 62e1a437f3)
Currently we use pre-calculated headers inside LLE entries as prepend data
for `if_output` functions. Using these headers allows saving some
CPU cycles/memory accesses on the fast path.
However, this approach makes adding L2 header for IPv4 traffic with IPv6
nexthops more complex, as it is not possible to store multiple
pre-calculated headers inside lle. Additionally, the solution space is
limited by the fact that PCB caching saves LLEs in addition to the nexthop.
Thus, add support for creating special "child" LLEs for the purpose of holding
custom family encaps and store mbufs pending resolution. To simplify handling
of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE.
Such LLEs are not visible when iterating LLE table. Their lifecycle is bound
to the "parent" LLE - it is not possible to delete "child" when parent is alive.
Furthermore, "child" LLEs are static (RTF_STATIC), avoding complex state
machine used by the standard LLEs.
nd6_lookup() and nd6_resolve() now accepts an additional argument, family,
allowing to return such child LLEs. This change uses `LLE_SF()` macro which
packs family and flags in a single int field. This is done to simplify merging
back to stable/. Once this code lands, most of the cases will be converted to
use a dedicated `family` parameter.
Differential Revision: https://reviews.freebsd.org/D31379
(cherry picked from commit c541bd368f)
Consistently use `nh` instead of always dereferencing
ro->ro_nh inside the if block.
Always use nexthop mtu, as it provides guarantee that mtu is accurate.
Pass `nh` pointer to rt_update_ro_flags() to allow upcoming uses
of updating ro flags based on different nexthop.
Differential Revision: https://reviews.freebsd.org/D31451
Reviewed by: kp
(cherry picked from commit 9748eb7427)
Use newly-create llentry_request_feedback(),
llentry_mark_used() and llentry_get_hittime() to
request datapatch usage check and fetch the results
in the same fashion both in IPv4 and IPv6.
While here, simplify llentry_provide_feedback() wrapper
by eliminating 1 condition check.
Differential Revision: https://reviews.freebsd.org/D31390
(cherry picked from commit f3a3b06121)
SCTP needs to avoid binding a given socket twice. The check used to
avoid this is racy since neither the inpcb lock nor the global info lock
is held. Fix it by synchronizing using the global info lock. In
particular, sctp_inpcb_bind() may drop the inpcb lock in some cases, but
the info lock is sufficient to prevent double insertion into PCB hash
tables.
Reported by: syzbot+548a8560d959669d0e12@syzkaller.appspotmail.com
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 4a36122b1d)
Eliminate a flag variable and reduce indentation. No functional change
intended.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 2496d812a9)
We only drop the inp lock when binding to a specific port. So, only
acquire an extra reference when required. This simplifies error
handling a bit.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 93908fce72)
This allows the maximum value of 4294967295 (~4Gb/s) instead of previous
value of 2147483647 (~2Gb/s).
Reviewed by: np, scottl
Obtained from: pfSense
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31582
(cherry picked from commit 20ffd88ed5)
The keyword adds nothing as all operations on the var are performed
through atomic_*
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31526
(cherry picked from commit d2b95af1c2)
We hold the SOCKBUF_LOCK so use soroverflow_locked here.
This bug may manifest as a non-killable process stuck in [*so_rcv].
Approved by: scottl
Reviewed by: Roy Marples <roy@marples.name>
Fixes: 7045b1603b
MFC after: 10 days
Differential Revision: https://reviews.freebsd.org/D31374
(cherry picked from commit a61c24ddb7)
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.
This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.
Reviewed by: philip (network), kbowling (transport), gbe (manpages)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26652
(cherry picked from commit 7045b1603b)
If the socket is configured such that the sender is expected to supply
the IP header, then we need to verify that it actually did so.
Reported by: syzkaller+KMSAN
Reviewed by: donner
Sponsored by: The FreeBSD Foundation
(cherry picked from commit ba21825202)
This completes PRR cwnd reduction in all circumstances
for the base TCP stack (SACK loss recovery, ECN window reduction,
non-SACK loss recovery), preventing the arriving ACKs to
clock out new data at the old, too high rate. This
reduces the chance to induce additional losses while
recovering from loss (during congested network conditions).
For non-SACK loss recovery, each ACK is assumed to have
one MSS delivered. In order to prevent ACK-split attacks,
only one window worth of ACKs is considered to actually
have delivered new data.
MFC after: 6 weeks
Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29441
(cherry picked from commit 74d7fc8753)
Import OpenBSD's syncookie support for pf. This feature help pf resist
TCP SYN floods by only creating states once the remote host completes
the TCP handshake rather than when the initial SYN packet is received.
This is accomplished by using the initial sequence numbers to encode a
cookie (hence the name) in the SYN+ACK response and verifying this on
receipt of the client ACK.
Reviewed by: kbowling
Obtained from: OpenBSD
MFC after: 1 week
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D31138
(cherry picked from commit 8e1864ed07)
Fix a bug in VNET handling, which occurs when using specific NICs.
PR: 257195
Reviewed by: rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D31212
(cherry picked from commit a730d82378)
The packet_limit can fall to 0, leading to a divide by zero abort in
the "packets % packet_limit".
An possible solution would be to apply a lower limit of 1 after the
calculation of packet_limit, but since any number modulo 1 gives 0,
the more efficient solution is to skip the modulo operation for
packet_limit <= 1.
Reported by: Karl Denninger <karl@denninger.net>
(cherry picked from commit 58080fbca0)
When fixing another bug, I noticed that the alternate
TCP stacks do not build when various combinations of
ipv4 and ipv6 are disabled.
Reviewed by: rrs, tuexen
Differential Revision: https://reviews.freebsd.org/D31094
Sponsored by: Netflix
(cherry picked from commit b1e806c0ed)
This fixes the incorrect use of a sysctl add to u64. It
was for a useconds time, but on 32 bit platforms its
not a u64. Instead use the long directive.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31107
(cherry picked from commit 7312e4e5cf)
HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.
Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083
(cherry picked from commit d7955cc0ff)