unbound/doc/TODO
Wouter Wijngaards 0c7b289051 iana portlist updated
git-svn-id: file:///svn/unbound/trunk@1704 be551aaa-1e26-0410-a405-d3ace91eadb9
2009-07-07 10:37:56 +00:00


TODO items. These are interesting items for future work.
o understand synthesized DNAMEs, so those TTL=0 packets are cached properly.
o NSEC/NSEC3 aggressive negative caching, so that updates to NSEC/NSEC3
will result in proper negative responses.
o (option) where port 53 is used for send and receive, no other ports are used.
o (option) to not send replies to clients after a timeout of (say 5 secs) has
passed, but keep task active for later retries by client.
o (option) private TTL feature (always report TTL x in answers).
o (option) pretend-dnssec-unaware, and pretend-edns-unaware modes for workshops.
o delegpt use rbtree for ns-list, to avoid slowdown for very large NS sets.
o (option) reprime and refresh oft used data before timeout.
o (option) retain prime results in an overlaid roothints file.
o (option) store primed key data in an overlaid keyhints file (sort of like
the RFC 5011 timers draft).
o windows version, auto update feature, a query to check for the version.
o command the server in-band with TSIG: get-config, clear cache,
get stats, get memstats, get ..., reload, clear one zone from cache.
o NSID (RFC 5001) support.
o trust anchor update timers (RFC 5011) support.
o Treat YXDOMAIN from a DNAME properly, in iterator (not throwaway), validator.
o make timeout backoffs randomized (a couple percent random) to spread traffic.
o inspect the date on the executable, then warn the user in the log if it is
more than 1 year old.
o (option) proactively prime root, stubs and trust anchors, feature.
early failure, faster on first query, but more traffic.
o library add convenience functions for A, AAAA, PTR, getaddrinfo, libresolve.
o library add function to validate input from app that is signed.
o add dynamic-update requests (making a dynupd request) to libunbound api.
o SIG(0) and TSIG.
o support OPT record placement on recv anywhere in the additional section.
o add local-file: config with authority features.
o (option) to make local-data answers be secure for libunbound (default=no)
o (option) to make chroot: copy all needed files into jail (or make jail)
perhaps also print a reminder to link /dev/random and the sysloghack.
o overhaul outside-network servicedquery to merge with udpwait and tcpwait,
to make timers in servicedquery independent of udpwait queues.
o check into rebinding ports for efficiency, configure time test.
o EVP hardware crypto support.
o option to ignore all inception and expiration dates for rrsigs.
o cleaner code; return and func statements on newline.
o memcached module that sits before the validator module; checks for
memcached data (on the local LAN), stores recursion lookups. Provides one
cache for multiple resolver machines, coherent reply content in an anycast
setup.
o no openssl_add_all_algorithms, but only the ones necessary, less space.
o listen to NOTIFY messages for zones and flush the cache for that zone
if received. Useful when also having a stub to that auth server.
Needs proper protection, TSIG, in place.
o winevent - do not use more than 64 fds (by polling with select one by
one); win95/98 have a 100 fd limit in the kernel, so exceeding it ruins
w9x portability.
*** Features, for later
* DTLS, TLS; these look to need special port numbers, cert storage, and a
recent libssl.
* aggressive negative caching for NSEC, NSEC3.
* multiple queries per question, server exploration, server selection.
* support TSIG on queries, for validating resolver deployment.
* retry-mode, where a bogus result triggers a retry-mode query, where a list
of responses over a time interval is collected, and each is validated.
or try in TCP mode. Do not 'try all servers several times', since we must
not create packet storms with operator errors.
o on the windows version, implement the OS ancillary data capabilities for
interface-automatic: IP_PKTINFO, IPV6_PKTINFO for WSARecvMsg, WSASendMsg.
o local-zone directive with authority service, full authority server
is a non-goal.
o infra and lame cache: easier size config (in MB), show usage in graphs.
1.3.x:
- spoofed delegpt fixes - if DNSKEY prime fails
- set DNSKEY bogus and DNSKEY query msg bogus.
- make NS set bogus too - if not validated as secure.
- check where queries go - otherwise reduce TTL on NS.
- also make DS NSEC bogus. Also DS msg cache entry.
- mark bogus under stringent conditions
- if DS at parent and validly signed. Then DNSKEY must exist.
- Also for trust anchor points themselves. DNSKEY must exist.
- so if then DNSKEY keyprime fails
- then it is not simply a server that only answers qtype A.
- then parent is agreeing (somewhat) with the DS record
- but it could still be a lame domain, these exist
The objective is to keep tries for genuinely lame domains to a
minimum, while detecting forgeries quickly. Exponential backoff.
- for unbound we can check if we got something to verify while
building that chain of trust. If so - not lame, aggressive retry.
- but security-lame zones also exist and should not pose
too high a burden. Exponential backoff again.
(e.g. badly signed, or the DNSKEY reply is too large and fails).
- the delegation NS for the domain is bogus.
The referral retried, with exponential backoff.
This exponential backoff should go towards values which are close
to the TTLs that are used now (on lame delegations for example).
so that the extra traffic is manageable.
- for unbound, reset the TTL on the NS rrset. Let it timeout.
Set NS rrset bogus - no more queries to the domain are done.
Also set DNSKEY and DS (rrset, NSEC, msg) bogus and ttl like that.
(to the same absolute value, so a clean retry is done).
TTL of the NS is the timeout in seconds (rounded down).
Until the NS times out and referral is done again.
Make sure multiple validations for chains of trust do not result
in a flood of queries or backoff too quickly.
- bogus exponential backoff cache. hash(name,t,c), size(1M, 5%).
TTL of 24h. Backoff from 200msec to 24h.
x2 on bogus(18 tries), x8 backoff on lameness(6 tries),
when servfail for DNSKEY.
remove entry when validated as secure.
delegptspoofrecheck on lameness: when the harden-referral-path NS
query gets servfail, build the chain of trust down (check the DS,
then perform the DNSKEY query); if that DNSKEY query also fails with
servfail, perform the x8 lameness retry fallback.
Retry harder to get valid DNSSEC data.
Triggered by a trust anchor or by a signed DS record for a zone.
* If data is fetched and validation fails for it,
or a DNSKEY is fetched and validation of it into the chain of trust fails,
or a DS is fetched and validation of it into the chain of trust fails,
Then
blame(signer zone, IP origin of the data/DNSKEY/DS, x2)
* If data was not fetched (SERVFAIL, lame, ...), and the data
is under a signed DS then:
blame(thatDSname, IP origin of the data/DNSKEY/DS, x8)
x8 because the zone may be lame.
This means a chain of trust is built also for unfetched data, to
determine if a signed DS is present. If insecure, nothing is done.
* If DNSKEY was not fetched for chain of trust (SERVFAIL, lame, ...),
Then
blame(DNSKEYname, IP origin of the data/DNSKEY/DS, x8)
x8 because the zone may be lame.
* blame(zonename, guiltyIP, multiplier):
* Set the guiltyIP,zonename as DNSSEC-bogus-data=true in lameness cache.
Servers marked this way are avoided if possible, used only as a last resort.
The guilt TTL is 15 minutes or the backoff TTL if that is larger.
* If the key cache entry 'being-backed-off' is true then:
set this data element RRset&msg to the current backoff TTL.
and done.
* if no retry entry exists for the zone key, create one with 24h TTL, 10 ms.
else the backoff *= multiplier.
* If the backoff is less than a second, remove entries from cache and
restart query. Else set the TTL for the entries to that value.
* Entries to set or remove: DNSKEY RRset&msg, DS RRset&msg, NS RRset&msg,
in-zone glue (A and AAAA) RRset&msg, and key-cache-entry TTL.
Then set the data element RRset&msg to the backoff TTL.
If TTL>1sec set the key-cache-entry flag 'being-backed-off' to true.
When the entry times out, that flag is reset to zero again.
* Storage extra is:
IP address per RRset and message. A lot of memory really, since that is
132 bytes per RRset and per message. Store plain IP: 4/16 bytes, len byte.
Check if port number is necessary.
guilt flag and guilt TTL in lameness cache. Must be very big for forwarders.
being-backed-off flag for key cache, also backoff time value and its TTL.
* Load on authorities:
For lame servers: 7 tries per day (one per three hours on average).
Others get up to 23 tries per day (one per hour on average).
Unless the cache entry falls out of the cache due to memory. In that
case it can be tried more often, this is similar to the NS entry falling
out of the cache due to memory, in that case it also has to be retried.
* Performance analysis:
* domain is sold. Unbound sees invalid signature (expired) or the old
servers refuse the queries. Retry within the second, if parent has
new DS and NS available instantly works again (no downtime).
* domain is bogus signed. Parent gets 1 query per hour.
* domain partly bogus. Parent gets 1 query per hour.
* spoof attempt. Unbound tries a couple times. If not spoofed again,
it works, if spoofed every time unbound backs off and stops trying.
* parent has inconsistently signed DS records. Together with a subzone that
is badly managed. Unbound backs up to the root once per hour.
* parent has bad DS records, different sets on different servers, but they
are signed ok. If child is okay with one set, unbound may get lucky
at one attempt and it'll work, otherwise, the parent is tried once in a
while but the zone goes dark. Because the server that gave that bad DS
with good signature is not marked as problematic.
Perhaps also mark the IP origin of the DS as problematic when applying
that DS fails.
* domain is sold, but decommission is faster than the setup of the new
server. Unbound does exponential backoff; if the new setup is fast, it
will pick up the new data fast.
* key rollover failed. The zone has bad keys. Like it was bogus signed.
* one nameserver has bad data. Unbound goes back to the parent but also
marks that server as guilty, and picks data from another server right
after: a retry without a blackout for the user. If the nameserver stays
bad, then on every retry unbound unmarks it as guilty, can encounter it
again if queried, and then retries with backoff.
If more than 7 servers are bogus, the zone becomes bogus for a while.
* domain was sold, but unbound has old entries in the cache. These somehow
need (re)validation (were queried with +cd, now -cd). The entries are
bogus. This algorithm then starts to retry, but if there are many entries,
unbound produces blackouts before trying again, due to the backoff.
This would be solved by resetting the backoff after a successful retry;
however, resetting the backoff can lead to a loop, and it is unclear how
to define that reset condition.
Another option is to check if the IP address for the bad data is in
the delegation point for the zone. If it is not - try again instantly.
This is a loop if the NS has zero TTL on its address.
The cache is flushed when the zone is backed off to more than one second.
Flush is denoted by an age number, we use the rrset-special-id number,
this is a thread-specific number. At validation failure, if the data
RRset is older than this number, it is flushed and the query is restarted.
A thread stores its own id number when a backoff larger than a second
occurs and its id number has not been stored yet.
* unbound is configured to talk to upstream caches. These caches have
inconsistent bad data. If one is bad, it is marked bad for that zone.
If all are bad, there may not be any way for unbound to remove the
bad entries from the upstream caches. It simply fails.
Recommendation: make the upstream caches validate as well.
later
- selective verbosity; unbound-control trace example.com
- option to log only bogus domainname encountered, for demos
- cache fork-dump, pre-load
- for fwds, send queries to N servers in the fwd-list, use the first reply.
- document a highly scalable, highly available unbound setup one-pager.
- prefetch DNSKEY when DS in delegation seen (nonCD, underTA).
- use libevent if available on the system by default(?); default outgoing 256 to 1024.