TODO:
- [ ] keep a global op in-flight counter? (might need locking)
- [-] scheduling (who does what, more than one select thread? How does the
  proxy work get distributed between threads?)
- [ ] managing timeouts?
- [X] outline locking policy: seems like there might be a lock inversion
  looming in the design: when working with an op, we might need a lock on both
  client and upstream, but depending on where we started, we might want to
  lock one first, then the other
- [ ] how to deal with the balancer running out of fds? Especially when we hit
  the limit, then lose an upstream connection and accept() a client, we
  wouldn't be able to initiate a new one. A bit of a DoS... But probably not
  a concern for Ericsson
- [ ] non-Linux? No idea how anything other than poll works (moot if building
  a libevent/libuv-based load balancer since they take care of that, except
  edge-triggered I/O?)
- [-] rootDSE? Controls and exops might have different semantics and need
  binding to the same upstream connection.
- [ ] just piggyback on OpenLDAP as a module? Would still need some updates in
  the core and the module/subsystem would be a very invasive one. On the
  other hand, it allows exposing live configuration and monitoring over LDAP
  over the current slapd listeners without re-inventing the wheel.

Expecting to handle only LDAPv3.

Terms:
- server - a configured target
- upstream - a single connection to a server
- client - an incoming connection

To maintain fairness `G( requested => ( F( progressed | failed ) ) )` (every
request eventually progresses or fails), use queues and enforce timeouts.

Runtime organisation
------

- main thread with its own event base handling signals
- one thread (later possibly more) listening on the rendezvous sockets,
  handing the new sockets to worker threads
- n worker threads dealing with client and server I/O (dispatching actual work
  to the thread pool most likely)
- a thread pool to handle actual work (a sketch of this layout follows the
  list)

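A minimal sketch of how this layout might be wired up if the balancer is built
on libevent 2.1 (as floated in the TODO); `worker_main`, `N_WORKERS` and the
omitted listener hand-off are illustrative, not part of the design:

```c
#include <pthread.h>
#include <signal.h>
#include <event2/event.h>

#define N_WORKERS 4

/* Each worker owns an event base and runs its own event loop. */
static struct event_base *worker_bases[N_WORKERS];

static void *
worker_main( void *arg )
{
    struct event_base *base = arg;
    /* keep the loop alive even while it has no registered events */
    event_base_loop( base, EVLOOP_NO_EXIT_ON_EMPTY );
    return NULL;
}

static void
sigint_cb( evutil_socket_t sig, short what, void *arg )
{
    event_base_loopexit( arg, NULL );
}

int
main( void )
{
    /* main thread: its own event base, handling signals only */
    struct event_base *main_base = event_base_new();
    struct event *sigev = evsignal_new( main_base, SIGINT, sigint_cb,
            main_base );
    evsignal_add( sigev, NULL );

    /* n worker threads for client/upstream I/O */
    pthread_t workers[N_WORKERS];
    for ( int i = 0; i < N_WORKERS; i++ ) {
        worker_bases[i] = event_base_new();
        pthread_create( &workers[i], NULL, worker_main, worker_bases[i] );
    }

    /* A listener thread would accept() and hand each new fd to one of the
     * worker bases via event_new()/event_add() - omitted here. */

    event_base_dispatch( main_base );
    return 0;
}
```
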
Operational behaviour
------

- client read -> upstream write:
  - client read:
    - if TLS_SETUP, keep processing, set state back when finished and note
      that we're under TLS
    - ber_get_next(), if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back among the ones we're
      waiting for)
    - peek at op tag:
      - unbind:
        - with a single lock, mark all pending ops in upstreams abandoned,
          clear the client link (would it be fast enough if we removed them
          from the upstream map instead?)
        - locked per op:
          - remove op from upstream map
          - check the upstream is not write-suspended, if it is ...
          - try to write the abandon op to the upstream, suspend the upstream
            if not fully sent
          - remove op from client map (how, if we're in avl_apply? another
            pass?)
        - would be nice if we could wipe the complete client map then,
          otherwise we need to queue it to have it freed when all abandons
          get passed onto the upstream (just dropping them might put extra
          strain on upstreams, we will probably have a queue on each
          client/upstream anyway, not just a single Ber)
      - bind:
        - check the mechanism is not EXTERNAL (or implement it)
        - abandon existing ops (see unbind)
        - set state to BINDING, put DN into authzid
        - pick an upstream, create PDU and send
      - abandon:
        - find op, mark for abandon, send to the appropriate upstream
      - Exop:
        - check not BINDING (unless it's a cancel?)
        - check OID:
          - STARTTLS:
            - check we don't have TLS yet
            - abandon all
            - set state to TLS_SETUP
            - send the hello
          - VC(?):
            - similar to bind except for the abandons/state change
      - other:
        - check not BINDING
        - pick an upstream
        - create a PDU, send (marking the upstream suspended if not written
          in full)
    - check if we should read again (keep a counter of the number of requests
      to read off a connection in a single pass so that we maintain fairness;
      see the sketch after this list)
    - if we have read enough requests and can still read, re-queue ourselves
      (if we don't have true edge-triggered I/O, we can just register the fd
      again)
  - upstream write (only when suspended):
    - flush the current BER
    - there shouldn't be anything else?
- upstream read -> client write:
  - upstream read:
    - ber_get_next(), if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back among the ones we're
      waiting for)
    - when we get it, peek at the msgid, resolve the client connection, lock,
      check:
      - if unsolicited, handle as close (and mark the connection closing)
      - if the op is abandoned or does not exist, drop PDU and op, update
        counters
      - if the client is backlogged, suspend the upstream and register a
        callback to unsuspend it (on progress when writing to the client, or
        on abandon from the client (connection death, abandon proper, ...))
    - reconstruct the final PDU, write the BER to the client, and if it did
      not write fully, suspend the client
    - if a final response, decrement operation counts on upstream and client
    - check if we should read again (keep a counter of the number of
      responses to read off a connection in a single pass so that we don't
      starve anyone?)
  - client write ready (only checked for when suspended):
    - write the rest of the pending BER if any
    - on successful write, pick all pending ops that need a failure response,
      push them to the client (are there any controls that need to be present
      in the response even in the case of failure? what to do with them?)
    - on successfully flushing them, walk through the suspended upstreams,
      picking the pending PDU (unsuspending the upstream) and writing; if the
      PDU flushed successfully, pick the next upstream
    - if we successfully flushed all suspended upstreams, unsuspend the
      client (and disable the write callback)
- upstream close/error:
  - look up pending ops, try to write to clients, mark as suspended those
    clients that have ops needing responses (another queue associated with
    the client to speed this up?)
  - schedule opening a new connection
- client close/error:
  - same as unbind
- client inactive (no pending ops and nothing happened in x seconds):
  - might just send a notice of disconnection and close
- op timeout handling:
  - mark for abandon
  - send abandon
  - send timeLimitExceeded/adminLimitExceeded to the client

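A rough sketch of the client-read fairness loop described above, using
liblber's `ber_get_next()`; `Connection`, `CONN_MAX_PDUS_PER_CYCLE` and the
`handle_one_request()`/`conn_reregister_read()`/`close_connection()` helpers
are hypothetical stand-ins for the real machinery:

```c
#include <errno.h>
#include <lber.h>
#include <ldap.h>

/* Hypothetical cap on PDUs read from one connection per pass, so a busy
 * client cannot starve the others. */
#define CONN_MAX_PDUS_PER_CYCLE 10

/* Hypothetical connection state; the real one holds much more. */
typedef struct Connection {
    Sockbuf *c_sb;            /* liblber socket buffer */
    BerElement *c_currentber; /* partially read PDU, if any */
} Connection;

/* Hypothetical helpers standing in for the real dispatch machinery. */
extern void handle_one_request( Connection *c, BerElement *ber );
extern void conn_reregister_read( Connection *c );
extern void close_connection( Connection *c );

void
client_read_cb( Connection *c )
{
    int pdus = 0;

    while ( pdus < CONN_MAX_PDUS_PER_CYCLE ) {
        BerElement *ber = c->c_currentber;
        ber_len_t len;
        ber_tag_t tag;

        if ( ber == NULL )
            ber = ber_alloc_t( LBER_USE_DER );

        tag = ber_get_next( c->c_sb, &len, ber );
        if ( tag != LDAP_TAG_MESSAGE ) {
            if ( errno == EWOULDBLOCK || errno == EAGAIN ) {
                /* incomplete PDU: keep it and wait for more data */
                c->c_currentber = ber;
                conn_reregister_read( c );
                return;
            }
            ber_free( ber, 1 );
            close_connection( c );
            return;
        }

        c->c_currentber = NULL;
        handle_one_request( c, ber ); /* unbind/bind/abandon/exop/other */
        pdus++;
    }

    /* quota used up but more data may be buffered: re-queue ourselves so
     * other connections get a turn */
    conn_reregister_read( c );
}
```
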
Picking an upstream:
- while there is a level available:
  - pick a random ordering of upstreams based on weights (see the sketch
    below)
  - while there is an upstream in the level:
    - check the number of ops in-flight (this is where we lock the upstream
      map)
    - find the least busy connection (and check whether a new connection
      should be opened)
    - try to lock for socket write, if available (no BER queued) we have our
      upstream

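One way the weighted random ordering could work is a weighted shuffle:
repeatedly draw one server with probability proportional to its weight. A
sketch (the `Server` struct is hypothetical):

```c
#include <stdlib.h>

/* Hypothetical per-level server entry. */
typedef struct Server {
    int weight; /* relative selection weight, > 0 */
    /* ... connection list, op counters, ... */
} Server;

/* Shuffle servers[0..n) in place so that higher-weight servers are more
 * likely to come first: each draw picks proportionally to weight. */
void
weighted_shuffle( Server **servers, int n )
{
    int total = 0;
    for ( int i = 0; i < n; i++ )
        total += servers[i]->weight;

    for ( int i = 0; i < n; i++ ) {
        if ( total <= 0 )
            break; /* only zero-weight servers left */

        /* rand() is a placeholder; a real balancer wants a better RNG */
        int pick = rand() % total;
        int j;
        for ( j = i; j < n; j++ ) {
            pick -= servers[j]->weight;
            if ( pick < 0 )
                break;
        }
        total -= servers[j]->weight;

        /* move the chosen server to position i */
        Server *tmp = servers[i];
        servers[i] = servers[j];
        servers[j] = tmp;
    }
}
```
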
PDU processing:
- request (have an upstream selected):
  - get a new msgid from the upstream
  - create an Op structure (actually, since we need a freelist lock anyway,
    we can make it a cache for freed operation structures, avoiding some
    malloc traffic; to reset, we need slap_sl_mem_create( ,,, 1 ))
  - check proxyauthz is not present? or just let the upstream reject it if
    there are two?
  - add our own controls at the end:
    - construct proxyauthz from authzid
    - construct session tracking from remote IP, own name, authzid
  - send over
  - insert Op into client and upstream maps
- response/intermediate/entry:
  - look up Op in the upstream's map
  - write the old msgid, the rest of the response can go unchanged (see the
    sketch below)
  - if a final response, remove Op from all maps (client and upstream)

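A sketch of the bookkeeping the msgid rewrite implies: an Op ties the client's
msgid to a fresh msgid allocated on the upstream, and responses are translated
back through it. Structure and field names are hypothetical:

```c
#include <lber.h>
#include <pthread.h>

struct Connection;

/* Hypothetical Op: one per client request, linking both sides. */
typedef struct Op {
    struct Connection *o_client;
    ber_int_t o_client_msgid;   /* msgid as the client sent it */

    struct Connection *o_upstream;
    ber_int_t o_upstream_msgid; /* msgid we allocated on the upstream */
} Op;

/* Hypothetical connection; the msgid-keyed Op maps (AVL trees, slapd
 * style) are omitted. */
typedef struct Connection {
    pthread_mutex_t c_io_mutex; /* write mutex */
    ber_int_t c_next_msgid;     /* upstream side: next msgid to hand out */
} Connection;

/* Request path: allocate the upstream msgid while holding the upstream's
 * write mutex; the PDU is then re-emitted with o_upstream_msgid in place
 * of the client's msgid, everything after it copied through unchanged. */
ber_int_t
op_assign_upstream_msgid( Connection *upstream, Op *op )
{
    op->o_upstream_msgid = upstream->c_next_msgid++;
    return op->o_upstream_msgid;
}

/* Response path: look the Op up in the upstream's map by o_upstream_msgid,
 * forward the PDU with o_client_msgid written in its place; intermediate
 * and entry messages leave the Op in the maps. */
```
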
Managing upstreams:
- async connect up to min_connections (is there a point in having a connection
  count range if we can't use it when needed, since all of the below is
  async?)
- when connected, set up TLS (if requested)
- when done, send a bind
- go through the bind interaction (see the sketch after this list)
- when done, add it to the upstream's connection list
- (if a connection is suspended or connections are over 75% of the op limit,
  schedule setting up a new connection unless the connection limit has been
  hit)

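The connection setup above is effectively a small state machine driven from
the event loop; a sketch with hypothetical state names and helpers:

```c
/* Hypothetical states mirroring the setup steps above. */
typedef enum {
    UC_CONNECTING, /* async connect() in flight */
    UC_TLS_SETUP,  /* TLS handshake in progress (if configured) */
    UC_BINDING,    /* bind sent, waiting for the response */
    UC_READY,      /* on the upstream's connection list, usable */
    UC_CLOSING
} upstream_state;

struct Connection;
extern int tls_requested( struct Connection *c );      /* hypothetical */
extern void send_bind( struct Connection *c );         /* hypothetical */
extern void publish_connection( struct Connection *c );/* hypothetical */

/* Called from the event loop whenever the current step completes. */
void
upstream_setup_step( struct Connection *c, upstream_state *state )
{
    switch ( *state ) {
    case UC_CONNECTING:
        if ( tls_requested( c ) ) {
            *state = UC_TLS_SETUP; /* start the handshake */
            break;
        }
        /* FALLTHRU: no TLS requested, go straight to the bind */
    case UC_TLS_SETUP:
        send_bind( c );
        *state = UC_BINDING;
        break;
    case UC_BINDING:
        publish_connection( c ); /* add to the upstream's connection list */
        *state = UC_READY;
        break;
    default:
        break;
    }
}
```
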
Managing timeouts:
- two options:
  - maintain a separate locked priority queue giving a perfect ordering of
    when each operation is to time out; this would mean maintaining yet
    another place where operations can be found
    - the locking protocol for disposing of an operation would need to be
      adjusted and might become even more complicated; we might do the
      alternative initially and then attempt this if it helps performance
  - just do a sweep over all clients (that mutex is less contended) every so
    often; with many in-flight operations this might be a lot of wasted work
    (see the sketch after this list)
    - we still need to sweep over all clients to check whether they should be
      killed anyway

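A sketch of the sweep option, assuming each client keeps its pending ops in a
list with a start timestamp; all structures and helpers here are hypothetical:

```c
#include <time.h>
#include <pthread.h>

/* Hypothetical structures; the real maps would be AVL trees. */
typedef struct Op {
    time_t o_start; /* when the request was forwarded upstream */
    struct Op *o_next;
} Op;

typedef struct Client {
    pthread_mutex_t c_mutex;
    Op *c_ops;      /* pending operations */
    time_t c_last_activity;
    struct Client *c_next;
} Client;

/* Hypothetical: mark for abandon, send abandon, answer the client. */
extern void op_timed_out( Client *c, Op *op );
/* Hypothetical: send notice of disconnection and schedule the close. */
extern void client_expire( Client *c );

/* Run every few seconds from a housekeeping thread. */
void
timeout_sweep( Client *clients, time_t op_limit, time_t idle_limit )
{
    time_t now = time( NULL );
    Client *c;

    for ( c = clients; c != NULL; c = c->c_next ) {
        Op *op, *next;

        pthread_mutex_lock( &c->c_mutex );
        for ( op = c->c_ops; op != NULL; op = next ) {
            next = op->o_next; /* op_timed_out() may unlink op */
            if ( now - op->o_start > op_limit )
                op_timed_out( c, op );
        }
        /* the same pass doubles as the client inactivity check */
        if ( c->c_ops == NULL && now - c->c_last_activity > idle_limit )
            client_expire( c );
        pthread_mutex_unlock( &c->c_mutex );
    }
}
```
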
Dispatcher thread (2^n of them, fd x is handled by thread no. x % (2^n)):
- poll on all registered fds
- remove each fd that's ready from the registered list and schedule the work
- work threads can put their fd back in if they deem it necessary (= not
  suspended)
- this works as a poor man's edge-triggered polling; with enough workers,
  should we do proper edge-triggered I/O? What about non-Linux? (see the
  sketch below)

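A bare-bones sketch of one dispatcher's loop around poll(2); `schedule_work()`
and the flat fd array are illustrative only:

```c
#include <poll.h>

#define MAX_FDS 1024

/* Hypothetical hand-off to the worker pool. */
extern void schedule_work( int fd, short events );

/* fds registered with this dispatcher: fd x belongs to thread x % (2^n) */
static struct pollfd fds[MAX_FDS];
static int nfds;

static void
remove_fd( int i )
{
    /* compact the array; order does not matter to poll() */
    fds[i] = fds[--nfds];
}

void
dispatcher_loop( void )
{
    for ( ;; ) {
        int ready = poll( fds, nfds, -1 );
        if ( ready <= 0 )
            continue;

        for ( int i = 0; i < nfds; ) {
            if ( fds[i].revents ) {
                /* deregister before handing off: nobody polls this fd
                 * until a worker re-registers it (edge-trigger emulation) */
                int fd = fds[i].fd;
                short ev = fds[i].revents;
                remove_fd( i );
                schedule_work( fd, ev );
            } else {
                i++;
            }
        }
    }
}
```

A real dispatcher would also need a self-pipe or similar wakeup channel so
that a worker re-registering an fd can interrupt poll().
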
Listener thread:
- slapd has just one, which then reassigns the sockets to separate I/O threads

Threading:
- if using slap_sl_malloc, how much performance do we gain? To allocate a
  context per op, we should have a dedicated parent context so that when we
  free it, we can use that exclusively. The parent context's parent would be
  the main thread's context. This implies a lot of
  slap_sl_mem_setctx/slap_sl_mem_create( ,,, 0 ) and making sure an op does
  not allocate/free things from two threads at the same time (might need an
  Op mutex after all? Not such a huge cost if we routinely reuse Op
  structures)

Locking policy:
- read mutexes are unnecessary, we only have one thread receiving data from
  the connection - the one started from the dispatcher
- two reference counters on operation structures (an op is accessible from
  both the client and the upstream map; each counter is consistent while the
  thread holds a lock on the corresponding map); when decreasing a counter to
  zero, start the freeing procedure (see the sketch after this list)
- a place to mark disposal finished for each side, consistency enforced by
  holding the freelist lock when reading/manipulating it
- when an op is created, we already have a write lock on the upstream socket
  and map: start writing, insert into the upstream map with upstream refcount
  1, unlock, lock client, insert (client refcount 0), unlock, lock upstream,
  decrement refcount (triggers a test whether we need to drop it now), unlock
  upstream, done
- when an upstream processes a PDU, it locks its map, increments the counter
  (potentially removing the op if it's a response), unlocks, then takes the
  client's map lock, write mutex (in this order?) and the full client mutex
  (if a bind response)
- when the client side wants to work with a PDU (abandon, (un)bind), it locks
  its map, increases the refcount, unlocks, locks the upstream map and write
  mutex, sends or queues the abandon, unlocks the write mutex, and initiates
  the freeing procedure from the upstream side (or, having to remember we've
  already increased the client-side refcount: mark for deletion, release the
  upstream lock, lock the client, decref, either triggering deletion from the
  client side or marking for it)
- if we have an operation lock, we can simplify a bit (no need for the
  three-stage locking above)

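A minimal sketch of the two-counter disposal: each side finishes under the
freelist lock, and only the second side to finish frees the Op. Names are
hypothetical:

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct Op {
    int o_client_refcnt;   /* consistent under the client map mutex */
    int o_upstream_refcnt; /* consistent under the upstream map mutex */
    bool o_client_done;    /* disposal finished on the client side */
    bool o_upstream_done;  /* disposal finished on the upstream side */
} Op;

/* The freelist lock guards the *_done flags and the free/cache list. */
extern pthread_mutex_t freelist_mutex;
extern void op_cache_release( Op *op ); /* put back on the Op freelist */

/* Called with the respective map mutex held, after removing the op from
 * that map and decrementing the matching refcount to zero. */
void
op_side_finished( Op *op, bool client_side )
{
    bool free_it;

    pthread_mutex_lock( &freelist_mutex );
    if ( client_side )
        op->o_client_done = true;
    else
        op->o_upstream_done = true;
    /* only the second side to finish may free the op; the flag updates
     * serialise on freelist_mutex, so exactly one caller sees both set */
    free_it = op->o_client_done && op->o_upstream_done;
    pthread_mutex_unlock( &freelist_mutex );

    if ( free_it )
        op_cache_release( op );
}
```
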
Shutdown:
- stop accept() thread(s) - potentially add a channel to hand these listening
  sockets over for a zero-downtime restart
- if very gentle, mark connections as closing, start a timeout and:
  - when a new non-abandon PDU comes in from a client - return
    LDAP_UNAVAILABLE
  - when receiving a PDU from an upstream, send it over to the client; if no
    ops are pending, send an unsolicited response and close (RFC 4511
    suggests the unsolicited response is the last PDU coming from the
    upstream and libldap agrees, so we can't send it for a socket we want to
    shut down more gracefully)
- gentle (or very gentle timed out):
  - set a timeout
  - mark all ops as abandoned
  - send unbind to all upstreams
  - send unsolicited to all clients
- imminent (or gentle timed out):
  - async close all connections?
  - exit()

RootDSE:
- the default option is not to care; if a control/exop has special
  restrictions, it is the admin's job to flag it as such in the load
  balancer's config
- another is not to care about the search request but to check each search
  entry being passed back: check the DN and, if it's a rootDSE, filter the
  list of controls/exops/SASL mechs (EXTERNAL!) that are supported
- the last one is to check all search requests for the DN/scope and
  synthesise the response locally - probably not (we would need to configure
  the complete list of controls, exops, SASL mechs, and naming contexts in
  the balancer)

Potential red flags:
- we suspend upstreams; if we ever suspend clients, we need to be sure we
  can't create dependency cycles
  - is this an issue when only suspending the read side of each? Because even
    if we stop reading from everything, we should eventually flush data to
    those we can still talk to; as upstreams are flushed, we can start
    sending new requests from live clients (those that are suspended are so
    due to their own inability to accept data)
  - we might need to suspend a client if there is a reason to choose a
    particular upstream (multi-request operation - bind, VC, PR, TXN, ...)
    - a SASL bind, but that means there are no outstanding ops to receive; it
      holds that `!suspended(client) || !suspended(upstream)`, so they cannot
      participate in a cycle
    - VC - multiple binds at the same time - !!! more analysis needed
    - PR - we should only be able to have one per connection (that's a
      problem for later, maybe it even needs a dedicated upstream connection)
    - TXN - ??? probably the same situation as PR
  - or, if we have a queue for pending BERs on the server, we need not
    suspend clients; an upstream is only chosen if the queue is free or there
    is a reason to send it to that particular upstream (multi-stage bind/VC,
    PR, ...), but that still makes it possible for a client to exhaust all
    our memory by sending requests (VC or other ones bound to a slow
    upstream, or by not reading the responses at all)