TODO:
- [ ] keep a global op in-flight counter? (might need locking)
- [-] scheduling (who does what, more than one select thread? How does the
  proxy work get distributed between threads?)
- [ ] managing timeouts?
- [X] outline locking policy: seems like there might be a lock inversion
  looming in the design: when working with an op, we might need a lock on both
  client and upstream, but depending on where we started, we might want to
  lock one first, then the other
- [ ] how to deal with the balancer running out of fds? Especially when we hit
  the limit, then lose an upstream connection and accept() a client, we
  wouldn't be able to initiate a new one. A bit of a DoS... But probably not
  a concern for Ericsson
- [ ] non-Linux? No idea how anything other than poll works (moot if building
  a libevent/libuv-based load balancer since they take care of that, except
  edge-triggered I/O?)
- [-] rootDSE? Controls and exops might have different semantics and need
  binding to the same upstream connection.
- [ ] just piggyback on OpenLDAP as a module? Would still need some updates in
  the core and the module/subsystem would be a very invasive one. On the
  other hand, it allows exposing live configuration and monitoring over LDAP
  over the current slapd listeners without re-inventing the wheel.

Expecting to handle only LDAPv3.

Terms:
- server - a configured target
- upstream - a single connection to a server
- client - an incoming connection

To maintain fairness `G( requested => ( F( progressed | failed ) ) )` (every
request eventually progresses or fails), use queues and enforce timeouts.

Runtime organisation
------

- main thread with its own event base handling signals
- one thread (later possibly more) listening on the rendezvous sockets,
  handing the new sockets to worker threads
- n worker threads dealing with client and server I/O (dispatching actual work
  to the thread pool most likely)
- a thread pool to handle actual work (a sketch of this layout follows the
  list)

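A minimal sketch of how this layout might be wired up if the balancer is built
on libevent 2.1 (as floated in the TODO); `worker_main`, `N_WORKERS` and the
omitted listener hand-off are illustrative, not part of the design:

```c
#include <pthread.h>
#include <signal.h>
#include <event2/event.h>

#define N_WORKERS 4

/* Each worker owns an event base and runs its own event loop. */
static struct event_base *worker_bases[N_WORKERS];

static void *
worker_main( void *arg )
{
    struct event_base *base = arg;
    /* keep the loop alive even while it has no registered events */
    event_base_loop( base, EVLOOP_NO_EXIT_ON_EMPTY );
    return NULL;
}

static void
sigint_cb( evutil_socket_t sig, short what, void *arg )
{
    event_base_loopexit( arg, NULL );
}

int
main( void )
{
    /* main thread: its own event base, handling signals only */
    struct event_base *main_base = event_base_new();
    struct event *sigev = evsignal_new( main_base, SIGINT, sigint_cb,
            main_base );
    evsignal_add( sigev, NULL );

    /* n worker threads for client/upstream I/O */
    pthread_t workers[N_WORKERS];
    for ( int i = 0; i < N_WORKERS; i++ ) {
        worker_bases[i] = event_base_new();
        pthread_create( &workers[i], NULL, worker_main, worker_bases[i] );
    }

    /* A listener thread would accept() and hand each new fd to one of the
     * worker bases via event_new()/event_add() - omitted here. */

    event_base_dispatch( main_base );
    return 0;
}
```
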
Operational behaviour
------

- client read -> upstream write:
  - client read:
    - if TLS_SETUP, keep processing, set state back when finished and note
      that we're under TLS
    - ber_get_next(), if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back among the ones we're
      waiting for)
    - peek at op tag:
      - unbind:
        - with a single lock, mark all pending ops in upstreams abandoned,
          clear the client link (would it be fast enough if we removed them
          from the upstream map instead?)
        - locked per op:
          - remove op from upstream map
          - check the upstream is not write-suspended, if it is ...
          - try to write the abandon op to the upstream, suspend the upstream
            if not fully sent
          - remove op from client map (how, if we're in avl_apply? another
            pass?)
        - would be nice if we could wipe the complete client map then,
          otherwise we need to queue it to have it freed when all abandons
          get passed onto the upstream (just dropping them might put extra
          strain on upstreams, we will probably have a queue on each
          client/upstream anyway, not just a single Ber)
      - bind:
        - check the mechanism is not EXTERNAL (or implement it)
        - abandon existing ops (see unbind)
        - set state to BINDING, put DN into authzid
        - pick an upstream, create PDU and send
      - abandon:
        - find op, mark for abandon, send to the appropriate upstream
      - Exop:
        - check not BINDING (unless it's a cancel?)
        - check OID:
          - STARTTLS:
            - check we don't have TLS yet
            - abandon all
            - set state to TLS_SETUP
            - send the hello
          - VC(?):
            - similar to bind except for the abandons/state change
      - other:
        - check not BINDING
        - pick an upstream
        - create a PDU, send (marking the upstream suspended if not written
          in full)
    - check if we should read again (keep a counter of the number of requests
      to read off a connection in a single pass so that we maintain fairness;
      see the sketch after this list)
    - if we have read enough requests and can still read, re-queue ourselves
      (if we don't have true edge-triggered I/O, we can just register the fd
      again)
  - upstream write (only when suspended):
    - flush the current BER
    - there shouldn't be anything else?
- upstream read -> client write:
  - upstream read:
    - ber_get_next(), if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back among the ones we're
      waiting for)
    - when we get it, peek at the msgid, resolve the client connection, lock,
      check:
      - if unsolicited, handle as close (and mark the connection closing)
      - if the op is abandoned or does not exist, drop PDU and op, update
        counters
      - if the client is backlogged, suspend the upstream and register a
        callback to unsuspend it (on progress when writing to the client, or
        on abandon from the client (connection death, abandon proper, ...))
    - reconstruct the final PDU, write the BER to the client, and if it did
      not write fully, suspend the client
    - if a final response, decrement operation counts on upstream and client
    - check if we should read again (keep a counter of the number of
      responses to read off a connection in a single pass so that we don't
      starve anyone?)
  - client write ready (only checked for when suspended):
    - write the rest of the pending BER if any
    - on successful write, pick all pending ops that need a failure response,
      push them to the client (are there any controls that need to be present
      in the response even in the case of failure? what to do with them?)
    - on successfully flushing them, walk through the suspended upstreams,
      picking the pending PDU (unsuspending the upstream) and writing; if the
      PDU flushed successfully, pick the next upstream
    - if we successfully flushed all suspended upstreams, unsuspend the
      client (and disable the write callback)
- upstream close/error:
  - look up pending ops, try to write to clients, mark as suspended those
    clients that have ops needing responses (another queue associated with
    the client to speed this up?)
  - schedule opening a new connection
- client close/error:
  - same as unbind
- client inactive (no pending ops and nothing happened in x seconds):
  - might just send a notice of disconnection and close
- op timeout handling:
  - mark for abandon
  - send abandon
  - send timeLimitExceeded/adminLimitExceeded to the client

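A rough sketch of the client-read fairness loop described above, using
liblber's `ber_get_next()`; `Connection`, `CONN_MAX_PDUS_PER_CYCLE` and the
`handle_one_request()`/`conn_reregister_read()`/`close_connection()` helpers
are hypothetical stand-ins for the real machinery:

```c
#include <errno.h>
#include <lber.h>
#include <ldap.h>

/* Hypothetical cap on PDUs read from one connection per pass, so a busy
 * client cannot starve the others. */
#define CONN_MAX_PDUS_PER_CYCLE 10

/* Hypothetical connection state; the real one holds much more. */
typedef struct Connection {
    Sockbuf *c_sb;            /* liblber socket buffer */
    BerElement *c_currentber; /* partially read PDU, if any */
} Connection;

/* Hypothetical helpers standing in for the real dispatch machinery. */
extern void handle_one_request( Connection *c, BerElement *ber );
extern void conn_reregister_read( Connection *c );
extern void close_connection( Connection *c );

void
client_read_cb( Connection *c )
{
    int pdus = 0;

    while ( pdus < CONN_MAX_PDUS_PER_CYCLE ) {
        BerElement *ber = c->c_currentber;
        ber_len_t len;
        ber_tag_t tag;

        if ( ber == NULL )
            ber = ber_alloc_t( LBER_USE_DER );

        tag = ber_get_next( c->c_sb, &len, ber );
        if ( tag != LDAP_TAG_MESSAGE ) {
            if ( errno == EWOULDBLOCK || errno == EAGAIN ) {
                /* incomplete PDU: keep it and wait for more data */
                c->c_currentber = ber;
                conn_reregister_read( c );
                return;
            }
            ber_free( ber, 1 );
            close_connection( c );
            return;
        }

        c->c_currentber = NULL;
        handle_one_request( c, ber ); /* unbind/bind/abandon/exop/other */
        pdus++;
    }

    /* quota used up but more data may be buffered: re-queue ourselves so
     * other connections get a turn */
    conn_reregister_read( c );
}
```
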
Picking an upstream:
- while there is a level available:
  - pick a random ordering of upstreams based on weights (see the sketch
    below)
  - while there is an upstream in the level:
    - check the number of ops in-flight (this is where we lock the upstream
      map)
    - find the least busy connection (and check whether a new connection
      should be opened)
    - try to lock for socket write, if available (no BER queued) we have our
      upstream

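One way the weighted random ordering could work is a weighted shuffle:
repeatedly draw one server with probability proportional to its weight. A
sketch (the `Server` struct is hypothetical):

```c
#include <stdlib.h>

/* Hypothetical per-level server entry. */
typedef struct Server {
    int weight; /* relative selection weight, > 0 */
    /* ... connection list, op counters, ... */
} Server;

/* Shuffle servers[0..n) in place so that higher-weight servers are more
 * likely to come first: each draw picks proportionally to weight. */
void
weighted_shuffle( Server **servers, int n )
{
    int total = 0;
    for ( int i = 0; i < n; i++ )
        total += servers[i]->weight;

    for ( int i = 0; i < n; i++ ) {
        if ( total <= 0 )
            break; /* only zero-weight servers left */

        /* rand() is a placeholder; a real balancer wants a better RNG */
        int pick = rand() % total;
        int j;
        for ( j = i; j < n; j++ ) {
            pick -= servers[j]->weight;
            if ( pick < 0 )
                break;
        }
        total -= servers[j]->weight;

        /* move the chosen server to position i */
        Server *tmp = servers[i];
        servers[i] = servers[j];
        servers[j] = tmp;
    }
}
```
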
PDU processing:
- request (have an upstream selected):
  - get a new msgid from the upstream
  - create an Op structure (actually, since we need a freelist lock anyway,
    we can make it a cache for freed operation structures, avoiding some
    malloc traffic; to reset, we need slap_sl_mem_create( ,,, 1 ))
  - check proxyauthz is not present? or just let the upstream reject it if
    there are two?
  - add our own controls at the end:
    - construct proxyauthz from authzid
    - construct session tracking from remote IP, own name, authzid
  - send over
  - insert Op into client and upstream maps
- response/intermediate/entry:
  - look up Op in the upstream's map
  - write the old msgid, the rest of the response can go unchanged (see the
    sketch below)
  - if a final response, remove Op from all maps (client and upstream)

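A sketch of the bookkeeping the msgid rewrite implies: an Op ties the client's
msgid to a fresh msgid allocated on the upstream, and responses are translated
back through it. Structure and field names are hypothetical:

```c
#include <lber.h>
#include <pthread.h>

struct Connection;

/* Hypothetical Op: one per client request, linking both sides. */
typedef struct Op {
    struct Connection *o_client;
    ber_int_t o_client_msgid;   /* msgid as the client sent it */

    struct Connection *o_upstream;
    ber_int_t o_upstream_msgid; /* msgid we allocated on the upstream */
} Op;

/* Hypothetical connection; the msgid-keyed Op maps (AVL trees, slapd
 * style) are omitted. */
typedef struct Connection {
    pthread_mutex_t c_io_mutex; /* write mutex */
    ber_int_t c_next_msgid;     /* upstream side: next msgid to hand out */
} Connection;

/* Request path: allocate the upstream msgid while holding the upstream's
 * write mutex; the PDU is then re-emitted with o_upstream_msgid in place
 * of the client's msgid, everything after it copied through unchanged. */
ber_int_t
op_assign_upstream_msgid( Connection *upstream, Op *op )
{
    op->o_upstream_msgid = upstream->c_next_msgid++;
    return op->o_upstream_msgid;
}

/* Response path: look the Op up in the upstream's map by o_upstream_msgid,
 * forward the PDU with o_client_msgid written in its place; intermediate
 * and entry messages leave the Op in the maps. */
```
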
Managing upstreams:
- async connect up to min_connections (is there a point in having a connection
  count range if we can't use it when needed, since all of the below is
  async?)
- when connected, set up TLS (if requested)
- when done, send a bind
- go through the bind interaction (see the sketch after this list)
- when done, add it to the upstream's connection list
- (if a connection is suspended or connections are over 75% of the op limit,
  schedule setting up a new connection unless the connection limit has been
  hit)

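The connection setup above is effectively a small state machine driven from
the event loop; a sketch with hypothetical state names and helpers:

```c
/* Hypothetical states mirroring the setup steps above. */
typedef enum {
    UC_CONNECTING, /* async connect() in flight */
    UC_TLS_SETUP,  /* TLS handshake in progress (if configured) */
    UC_BINDING,    /* bind sent, waiting for the response */
    UC_READY,      /* on the upstream's connection list, usable */
    UC_CLOSING
} upstream_state;

struct Connection;
extern int tls_requested( struct Connection *c );      /* hypothetical */
extern void send_bind( struct Connection *c );         /* hypothetical */
extern void publish_connection( struct Connection *c );/* hypothetical */

/* Called from the event loop whenever the current step completes. */
void
upstream_setup_step( struct Connection *c, upstream_state *state )
{
    switch ( *state ) {
    case UC_CONNECTING:
        if ( tls_requested( c ) ) {
            *state = UC_TLS_SETUP; /* start the handshake */
            break;
        }
        /* FALLTHRU: no TLS requested, go straight to the bind */
    case UC_TLS_SETUP:
        send_bind( c );
        *state = UC_BINDING;
        break;
    case UC_BINDING:
        publish_connection( c ); /* add to the upstream's connection list */
        *state = UC_READY;
        break;
    default:
        break;
    }
}
```
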
Managing timeouts:
- two options:
  - maintain a separate locked priority queue giving a perfect ordering of
    when each operation is to time out; this would mean maintaining yet
    another place where operations can be found
    - the locking protocol for disposing of an operation would need to be
      adjusted and might become even more complicated; we might do the
      alternative initially and then attempt this if it helps performance
  - just do a sweep over all clients (that mutex is less contended) every so
    often; with many in-flight operations this might be a lot of wasted work
    (see the sketch after this list)
    - we still need to sweep over all clients to check whether they should be
      killed anyway

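A sketch of the sweep option, assuming each client keeps its pending ops in a
list with a start timestamp; all structures and helpers here are hypothetical:

```c
#include <time.h>
#include <pthread.h>

/* Hypothetical structures; the real maps would be AVL trees. */
typedef struct Op {
    time_t o_start; /* when the request was forwarded upstream */
    struct Op *o_next;
} Op;

typedef struct Client {
    pthread_mutex_t c_mutex;
    Op *c_ops;      /* pending operations */
    time_t c_last_activity;
    struct Client *c_next;
} Client;

/* Hypothetical: mark for abandon, send abandon, answer the client. */
extern void op_timed_out( Client *c, Op *op );
/* Hypothetical: send notice of disconnection and schedule the close. */
extern void client_expire( Client *c );

/* Run every few seconds from a housekeeping thread. */
void
timeout_sweep( Client *clients, time_t op_limit, time_t idle_limit )
{
    time_t now = time( NULL );
    Client *c;

    for ( c = clients; c != NULL; c = c->c_next ) {
        Op *op, *next;

        pthread_mutex_lock( &c->c_mutex );
        for ( op = c->c_ops; op != NULL; op = next ) {
            next = op->o_next; /* op_timed_out() may unlink op */
            if ( now - op->o_start > op_limit )
                op_timed_out( c, op );
        }
        /* the same pass doubles as the client inactivity check */
        if ( c->c_ops == NULL && now - c->c_last_activity > idle_limit )
            client_expire( c );
        pthread_mutex_unlock( &c->c_mutex );
    }
}
```
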
Dispatcher thread (2^n of them, fd x is handled by thread no. x % (2^n)):
- poll on all registered fds
- remove each fd that's ready from the registered list and schedule the work
- work threads can put their fd back in if they deem it necessary (= not
  suspended)
- this works as a poor man's edge-triggered polling; with enough workers,
  should we do proper edge-triggered I/O? What about non-Linux? (see the
  sketch below)

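A bare-bones sketch of one dispatcher's loop around poll(2); `schedule_work()`
and the flat fd array are illustrative only:

```c
#include <poll.h>

#define MAX_FDS 1024

/* Hypothetical hand-off to the worker pool. */
extern void schedule_work( int fd, short events );

/* fds registered with this dispatcher: fd x belongs to thread x % (2^n) */
static struct pollfd fds[MAX_FDS];
static int nfds;

static void
remove_fd( int i )
{
    /* compact the array; order does not matter to poll() */
    fds[i] = fds[--nfds];
}

void
dispatcher_loop( void )
{
    for ( ;; ) {
        int ready = poll( fds, nfds, -1 );
        if ( ready <= 0 )
            continue;

        for ( int i = 0; i < nfds; ) {
            if ( fds[i].revents ) {
                /* deregister before handing off: nobody polls this fd
                 * until a worker re-registers it (edge-trigger emulation) */
                int fd = fds[i].fd;
                short ev = fds[i].revents;
                remove_fd( i );
                schedule_work( fd, ev );
            } else {
                i++;
            }
        }
    }
}
```

A real dispatcher would also need a self-pipe or similar wakeup channel so
that a worker re-registering an fd can interrupt poll().
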
Listener thread:
- slapd has just one, which then reassigns the sockets to separate I/O threads

Threading:
- if using slap_sl_malloc, how much performance do we gain? To allocate a
  context per op, we should have a dedicated parent context so that when we
  free it, we can use that exclusively. The parent context's parent would be
  the main thread's context. This implies a lot of
  slap_sl_mem_setctx/slap_sl_mem_create( ,,, 0 ) and making sure an op does
  not allocate/free things from two threads at the same time (might need an
  Op mutex after all? Not such a huge cost if we routinely reuse Op
  structures)

Locking policy:
- read mutexes are unnecessary, we only have one thread receiving data from
  the connection - the one started from the dispatcher
- two reference counters on operation structures (an op is accessible from
  both the client and the upstream map; each counter is consistent while the
  thread holds a lock on the corresponding map); when decreasing a counter to
  zero, start the freeing procedure (see the sketch after this list)
- a place to mark disposal finished for each side, consistency enforced by
  holding the freelist lock when reading/manipulating it
- when an op is created, we already have a write lock on the upstream socket
  and map: start writing, insert into the upstream map with upstream refcount
  1, unlock, lock client, insert (client refcount 0), unlock, lock upstream,
  decrement refcount (triggers a test whether we need to drop it now), unlock
  upstream, done
- when an upstream processes a PDU, it locks its map, increments the counter
  (potentially removing the op if it's a response), unlocks, then takes the
  client's map lock, write mutex (in this order?) and the full client mutex
  (if a bind response)
- when the client side wants to work with a PDU (abandon, (un)bind), it locks
  its map, increases the refcount, unlocks, locks the upstream map and write
  mutex, sends or queues the abandon, unlocks the write mutex, and initiates
  the freeing procedure from the upstream side (or, having to remember we've
  already increased the client-side refcount: mark for deletion, release the
  upstream lock, lock the client, decref, either triggering deletion from the
  client side or marking for it)
- if we have an operation lock, we can simplify a bit (no need for the
  three-stage locking above)

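A minimal sketch of the two-counter disposal: each side finishes under the
freelist lock, and only the second side to finish frees the Op. Names are
hypothetical:

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct Op {
    int o_client_refcnt;   /* consistent under the client map mutex */
    int o_upstream_refcnt; /* consistent under the upstream map mutex */
    bool o_client_done;    /* disposal finished on the client side */
    bool o_upstream_done;  /* disposal finished on the upstream side */
} Op;

/* The freelist lock guards the *_done flags and the free/cache list. */
extern pthread_mutex_t freelist_mutex;
extern void op_cache_release( Op *op ); /* put back on the Op freelist */

/* Called with the respective map mutex held, after removing the op from
 * that map and decrementing the matching refcount to zero. */
void
op_side_finished( Op *op, bool client_side )
{
    bool free_it;

    pthread_mutex_lock( &freelist_mutex );
    if ( client_side )
        op->o_client_done = true;
    else
        op->o_upstream_done = true;
    /* only the second side to finish may free the op; the flag updates
     * serialise on freelist_mutex, so exactly one caller sees both set */
    free_it = op->o_client_done && op->o_upstream_done;
    pthread_mutex_unlock( &freelist_mutex );

    if ( free_it )
        op_cache_release( op );
}
```
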
Shutdown:
- stop accept() thread(s) - potentially add a channel to hand these listening
  sockets over for a zero-downtime restart
- if very gentle, mark connections as closing, start a timeout and:
  - when a new non-abandon PDU comes in from a client - return
    LDAP_UNAVAILABLE
  - when receiving a PDU from an upstream, send it over to the client; if no
    ops are pending, send an unsolicited response and close (RFC 4511
    suggests the unsolicited response is the last PDU coming from the
    upstream and libldap agrees, so we can't send it for a socket we want to
    shut down more gracefully)
- gentle (or very gentle timed out):
  - set a timeout
  - mark all ops as abandoned
  - send unbind to all upstreams
  - send unsolicited to all clients
- imminent (or gentle timed out):
  - async close all connections?
  - exit()

RootDSE:
- the default option is not to care; if a control/exop has special
  restrictions, it is the admin's job to flag it as such in the load
  balancer's config
- another is not to care about the search request but to check each search
  entry being passed back: check the DN and, if it's a rootDSE, filter the
  list of controls/exops/SASL mechs (EXTERNAL!) that are supported
- the last one is to check all search requests for the DN/scope and
  synthesise the response locally - probably not (we would need to configure
  the complete list of controls, exops, SASL mechs, and naming contexts in
  the balancer)

Potential red flags:
- we suspend upstreams; if we ever suspend clients, we need to be sure we
  can't create dependency cycles
  - is this an issue when only suspending the read side of each? Because even
    if we stop reading from everything, we should eventually flush data to
    those we can still talk to; as upstreams are flushed, we can start
    sending new requests from live clients (those that are suspended are so
    due to their own inability to accept data)
  - we might need to suspend a client if there is a reason to choose a
    particular upstream (multi-request operation - bind, VC, PR, TXN, ...)
    - a SASL bind, but that means there are no outstanding ops to receive; it
      holds that `!suspended(client) || !suspended(upstream)`, so they cannot
      participate in a cycle
    - VC - multiple binds at the same time - !!! more analysis needed
    - PR - we should only be able to have one per connection (that's a
      problem for later, maybe it even needs a dedicated upstream connection)
    - TXN - ??? probably the same situation as PR
  - or, if we have a queue for pending BERs on the server, we need not
    suspend clients; an upstream is only chosen if the queue is free or there
    is a reason to send it to that particular upstream (multi-stage bind/VC,
    PR, ...), but that still makes it possible for a client to exhaust all
    our memory by sending requests (VC or other ones bound to a slow
    upstream, or by not reading the responses at all)