icinga2/lib
Julian Brost 3302c9b0a8 Fix that NewClientHandler() could hang indefinitely, preventing new connection attempts
There is a race condition: if the `async_write()`/`async_flush()` operation
for the `icinga::Hello` message fails (connection reset by peer, for example)
around the same time the connect timeout fires and calls `cancel()` on the
stream, the following call to `async_shutdown()` may block indefinitely. If
that happens, the endpoint remains in the connecting state and no new
connection attempts are initiated.

This commit fixes the issue by removing the `Defer` containing the
`async_shutdown()`. The purpose of `async_shutdown()` is to signal a clean
termination of the connection to the peer, which doesn't make sense to do in
a `Defer` block that is also executed in case of errors. For the one
situation where doing a clean TLS shutdown makes some sense (closing
anonymous client connections), a call to `GracefulShutdown()` is added to
that specific code path.
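
The structural difference can be illustrated with a small Go model. This is
only a sketch, not the actual C++ code: `conn`, `gracefulShutdown`,
`handleDeferred` and `handleExplicit` are hypothetical names, and the shutdown
merely counts calls instead of performing a real (potentially blocking) TLS
shutdown.

```go
package main

import (
	"errors"
	"fmt"
)

// conn is a hypothetical stand-in for the TLS stream; it only counts calls.
type conn struct {
	shutdownCalls int
}

// gracefulShutdown models a clean TLS shutdown. In the scenario described
// above, this is the step that could block indefinitely.
func (c *conn) gracefulShutdown() { c.shutdownCalls++ }

var errReset = errors.New("connection reset by peer")

// handleDeferred models the old structure: the deferred shutdown runs on
// every exit path, including the one where sending the hello failed.
func handleDeferred(c *conn, sendHello func() error) error {
	defer c.gracefulShutdown()
	return sendHello()
}

// handleExplicit models the new structure: there is no deferred shutdown,
// only the path that closes an anonymous client shuts down gracefully.
func handleExplicit(c *conn, sendHello func() error, anonymous bool) error {
	if err := sendHello(); err != nil {
		return err // broken connection: nothing to signal to the peer
	}
	if anonymous {
		c.gracefulShutdown()
	}
	return nil
}

func main() {
	failing := func() error { return errReset }

	oldConn := &conn{}
	oldErr := handleDeferred(oldConn, failing)
	fmt.Println("deferred:", oldErr, "- shutdown calls:", oldConn.shutdownCalls)

	newConn := &conn{}
	newErr := handleExplicit(newConn, failing, true)
	fmt.Println("explicit:", newErr, "- shutdown calls:", newConn.shutdownCalls)
}
```

In this model, only the deferred variant attempts a shutdown on the connection
that has already failed, which is the path that could hang.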

A large part of the change is just re-indentation of the code, given that a
now-unnecessary `try`/`catch` block is removed.

The following Go code creates a TLS server that can be used to demonstrate the
issue. Since a race condition is involved, the reproduction is not reliable and
the sleep duration may need some fine-tuning. For this to work,
`ApiListener.tls_handshake_timeout` needs to be set to a large enough value,
such as 60s, to effectively disable the timeout for `async_handshake()` itself
so that the overall connect timeout is the one that fires. However, changing
the timeout is not a prerequisite for the problem; it just makes it easier to
reproduce. The error can also happen with the default timeouts if the TCP
connect takes long enough that the handshake starts so late that its timeout
expires after the connect timeout.

    package main

    import (
        "crypto/tls"
        "log"
        "net"
        "time"
    )

    func main() {
        cert, err := tls.LoadX509KeyPair("bad-agent.crt", "bad-agent.key")
        if err != nil {
            panic(err)
        }

        listener, err := tls.Listen("tcp", ":1337", &tls.Config{
            Certificates: []tls.Certificate{cert},
        })
        if err != nil {
            panic(err)
        }

        log.Println("Listening on", listener.Addr())

        for {
            conn, err := listener.Accept()
            if err != nil {
                panic(err)
            }

            go handle(conn.(*tls.Conn))
        }
    }

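    func handle(conn *tls.Conn) {
        addr := conn.RemoteAddr().String()
        log.Println(addr, "new connection")

        // Stall until just before the client's connect timeout fires (15s in
        // the log below). This offset is what may need fine-tuning.
        time.Sleep(15*time.Second - 10*time.Millisecond)

        // With linger set to 0, closing the socket discards unsent data and
        // typically sends a TCP RST instead of a FIN.
        log.Println(addr, "SetLinger(0)", conn.NetConn().(*net.TCPConn).SetLinger(0))
        // Complete the TLS handshake so that the client proceeds to send its
        // icinga::Hello message.
        log.Println(addr, "Handshake()", conn.Handshake())
        // Close the raw TCP connection without a TLS shutdown so that the
        // client's write can fail with "connection reset by peer".
        log.Println(addr, "conn.NetConn().Close()", conn.NetConn().Close())
    }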

With additional logging in the `catch` block for `boost::system::system_error`
and in `Defer shutdownSslConn` (both removed by this commit), the reproduction
produced the following log. Note in particular that `async_shutdown()` never
returned, indicating that it hangs.

    [2026-04-24 17:32:56 +0200] information/ApiListener: Reconnecting to endpoint 'bad-agent' via host 'host.docker.internal' and port '1337'
    [2026-04-24 17:33:11 +0200] critical/ApiListener: Timeout while reconnecting to endpoint 'bad-agent' via host 'host.docker.internal' and port '1337', cancelling attempt
    [2026-04-24 17:33:11 +0200] information/ApiListener: New client connection for identity 'bad-agent' to [172.17.0.1]:1337
    [2026-04-24 17:33:12 +0200] information/ApiListener: rethrowing for bad-agent: Error: Connection reset by peer [system:104 at /usr/include/boost/asio/detail/reactive_socket_send_op.hpp:137 in function 'do_complete']
    [2026-04-24 17:33:12 +0200] information/ApiListener: doing async_shutdown for bad-agent
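
The timing condition mentioned earlier (the handshake timeout expiring only
after the connect timeout) can be checked with a small calculation. The
following Go sketch rests on assumptions: the 15s connect timeout is read off
the log timestamps above, the 10s handshake timeout is an assumed default that
should be checked against the actual configuration, and
`handshakeOutlivesConnect` is a hypothetical helper that treats the handshake
timer as starting once the TCP connect has completed.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// connectTimeout is taken from the log above (17:32:56 -> 17:33:11).
	connectTimeout = 15 * time.Second
	// assumedHandshakeTimeout is an assumption, not taken from the commit.
	assumedHandshakeTimeout = 10 * time.Second
	// raisedHandshakeTimeout is the value suggested for the reproduction.
	raisedHandshakeTimeout = 60 * time.Second
)

// handshakeOutlivesConnect reports whether the handshake timeout would expire
// only after the overall connect timeout, assuming the handshake timer starts
// once the TCP connect has completed after tcpConnect.
func handshakeOutlivesConnect(tcpConnect, handshakeTimeout, connectTimeout time.Duration) bool {
	return tcpConnect+handshakeTimeout > connectTimeout
}

func main() {
	cases := []struct {
		tcpConnect       time.Duration
		handshakeTimeout time.Duration
	}{
		{0, assumedHandshakeTimeout},
		{6 * time.Second, assumedHandshakeTimeout},
		{0, raisedHandshakeTimeout},
	}

	for _, c := range cases {
		fmt.Printf("tcp connect %v, handshake timeout %v: connect timeout fires first = %v\n",
			c.tcpConnect, c.handshakeTimeout,
			handshakeOutlivesConnect(c.tcpConnect, c.handshakeTimeout, connectTimeout))
	}
}
```

Under these assumptions, a TCP connect that takes more than 5s is enough for
the connect timeout to fire first without changing any setting.
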
2026-05-06 09:54:21 +02:00
base Merge pull request #9719 from Icinga/execvp 2026-04-23 14:04:31 +02:00
checker Revert "CheckerComponent#CheckThreadProc(): also propagate next check update to Icinga DB" 2026-04-02 16:37:57 +02:00
cli Merge pull request #10028 from RincewindsHat/node_setup_no_globals 2026-02-12 14:44:50 +01:00
compat Add warnings to deprecated features indicating removal in v2.18 2026-03-27 14:20:55 +01:00
config Merge pull request #10734 from Icinga/deprecate-everything-we-dont-like 2026-03-31 10:25:44 +02:00
db_ido Replace all existing copyright headers with SPDX headers 2026-02-04 14:00:05 +01:00
db_ido_mysql Add warnings to deprecated features indicating removal in v2.18 2026-03-27 14:20:55 +01:00
db_ido_pgsql Add warnings to deprecated features indicating removal in v2.18 2026-03-27 14:20:55 +01:00
icinga Silence -Wunnecessary-virtual-specifier warning on clang 2026-04-20 12:46:50 +02:00
icingadb Merge pull request #10619 from Icinga/efficient-config-and-state-update-queue 2026-04-17 11:19:50 +02:00
livestatus Add warnings to deprecated features indicating removal in v2.18 2026-03-27 14:20:55 +01:00
methods Replace all existing copyright headers with SPDX headers 2026-02-04 14:00:05 +01:00
mysql_shim Replace all existing copyright headers with SPDX headers 2026-02-04 14:00:05 +01:00
notification Replace all existing copyright headers with SPDX headers 2026-02-04 14:00:05 +01:00
otel OTel: downgrade broken_pipe errors to debug log 2026-04-15 17:25:14 +02:00
perfdata Merge pull request #10799 from Icinga/fix-pdwc-tls-host-check 2026-04-22 11:22:27 +02:00
pgsql_shim Replace all existing copyright headers with SPDX headers 2026-02-04 14:00:05 +01:00
remote Fix that NewClientHandler() could hang indefinitely, preventing new connection attempts 2026-05-06 09:54:21 +02:00
CMakeLists.txt Add common OTel type/lib 2026-04-01 12:18:21 +02:00