diff --git a/docs/internals/data-structures.rst b/docs/internals/data-structures.rst index b355efa3c..785c095a5 100644 --- a/docs/internals/data-structures.rst +++ b/docs/internals/data-structures.rst @@ -327,7 +327,7 @@ The archive object itself further contains some metadata: When :ref:`borg_check` rebuilds the manifest (e.g. if it was corrupted) and finds more than one archive object with the same name, it adds a counter to the name in the manifest, but leaves the *name* field of the archives as it was. -* *items*, a list of chunk IDs containing item metadata (size: count * ~33B) +* *items*, a list of chunk IDs containing item metadata (size: count * ~34B) * *cmdline*, the command line which was used to create the archive * *hostname* * *username* @@ -339,8 +339,7 @@ The archive object itself further contains some metadata: .. _archive_limitation: -Note about archive limitations -++++++++++++++++++++++++++++++ +.. rubric:: Note about archive limitations The archive is currently stored as a single object in the repository and thus limited in size to MAX_OBJECT_SIZE (20MiB). @@ -435,18 +434,16 @@ The cache The **files cache** is stored in ``cache/files`` and is used at backup time to quickly determine whether a given file is unchanged and we have all its chunks. -The files cache is in memory a key -> value mapping (a Python *dict*) and contains: +In memory, the files cache is a key -> value mapping (a Python *dict*) and contains: -* key: - - - full, absolute file path id_hash +* key: id_hash of the encoded, absolute file path * value: - file inode number - file size - file mtime_ns - - list of file content chunk id hashes - age (0 [newest], 1, 2, 3, ..., BORG_FILES_CACHE_TTL - 1) + - list of chunk ids representing the file's contents To determine whether a file has not changed, cached values are looked up via the key in the mapping and compared to the current file attribute values. 
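The files cache lookup described in the hunk above can be sketched roughly as follows. This is a minimal illustration, not Borg's code: ``FilesCacheEntry``, ``path_key`` and ``lookup_unchanged`` are hypothetical names, plain SHA-256 stands in for ``id_hash``, and the sketch leaves out both the check that all chunks are still present and any age handling.

```python
import hashlib
from typing import NamedTuple, Optional


class FilesCacheEntry(NamedTuple):
    # Field set follows the "value" list in the docs; the names are hypothetical.
    inode: int
    size: int
    mtime_ns: int
    age: int
    chunk_ids: list


def path_key(path: str) -> bytes:
    # Assumption: plain SHA-256 of the encoded path stands in for id_hash.
    return hashlib.sha256(path.encode("utf-8", "surrogateescape")).digest()


def lookup_unchanged(cache: dict, path: str, inode: int, size: int,
                     mtime_ns: int) -> Optional[list]:
    """Return the cached chunk ids if the file looks unchanged, else None."""
    entry = cache.get(path_key(path))
    if entry is None:
        return None  # file not seen before
    if (entry.inode, entry.size, entry.mtime_ns) != (inode, size, mtime_ns):
        return None  # at least one attribute differs: treat the file as changed
    return entry.chunk_ids
```

A caller would pass the current ``stat`` values of the file and, on a ``None`` result, read and chunk the file again.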
@@ -572,8 +569,9 @@ HashIndex The chunks cache and the repository index are stored as hash tables, with only one slot per bucket, spreading hash collisions to the following buckets. As a consequence the hash is just a start position for a linear -search, and if the element is not in the table the index is linearly crossed -until an empty bucket is found. +search. If a key is looked up that is not in the table, then the hash table +is searched from the start position (the hash) until the first empty +bucket is reached. This particular mode of operation is open addressing with linear probing. @@ -582,14 +580,24 @@ emptied to 25%, its size is shrinked. Operations on it have a variable complexity between constant and linear with low factor, and memory overhead varies between 33% and 300%. -Further, if the number of empty slots becomes too low (recall that linear probing +If an element is deleted, and the slot behind the deleted element is not empty, +then the element will leave a tombstone, a bucket marked as deleted. Tombstones +are only removed by insertions using the tombstone's bucket, or by resizing +the table. They present the same load to the hash table as a real entry, +but do not count towards the regular load factor. + +Thus, if the number of empty slots becomes too low (recall that linear probing for an element not in the index stops at the first empty slot), the hash table -is rebuilt. The maximum *effective* load factor is 93%. +is rebuilt. The maximum *effective* load factor, i.e. including tombstones, is 93%. Data in a HashIndex is always stored in little-endian format, which increases efficiency for almost everyone, since basically no one uses big-endian processors any more. +HashIndex does not use a hashing function, because all keys (save manifest) are +outputs of a cryptographic hash or MAC and thus already have excellent distribution. +Thus, HashIndex simply uses the first 32 bits of the key as its "hash". 
+ The format is easy to read and write, because the buckets array has the same layout in memory and on disk. Only the header formats differ. diff --git a/docs/internals/object-graph.png b/docs/internals/object-graph.png index 6abfa4d34..8a153428c 100644 Binary files a/docs/internals/object-graph.png and b/docs/internals/object-graph.png differ diff --git a/docs/internals/object-graph.vsd b/docs/internals/object-graph.vsd index 3d74cbb8d..70989e1dd 100644 Binary files a/docs/internals/object-graph.vsd and b/docs/internals/object-graph.vsd differ
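The HashIndex behaviour described in the data-structures.rst hunks (open addressing with linear probing, tombstones, and the first 32 bits of the key used as the "hash") can be sketched as below. This is a toy model under stated assumptions, not the actual implementation: ``TinyIndex`` and its methods are hypothetical names, the table has a fixed size and never resizes or rebuilds, probing is assumed to wrap around at the end of the array, and the 32 bits are assumed to be read as little-endian.

```python
EMPTY = object()      # bucket that was never used (or was fully cleared)
TOMBSTONE = object()  # bucket marked as deleted


class TinyIndex:
    def __init__(self, num_buckets=8):
        self.buckets = [EMPTY] * num_buckets

    def _probe(self, key: bytes):
        # No hash function: the first 32 bits of the key pick the start bucket.
        n = len(self.buckets)
        start = int.from_bytes(key[:4], "little") % n
        for step in range(n):
            yield (start + step) % n

    def lookup(self, key):
        for i in self._probe(key):
            slot = self.buckets[i]
            if slot is EMPTY:
                return None  # first empty bucket ends the search
            if slot is not TOMBSTONE and slot[0] == key:
                return slot[1]
        return None

    def insert(self, key, value):
        free = None  # first reusable bucket seen (tombstone or empty)
        for i in self._probe(key):
            slot = self.buckets[i]
            if slot is EMPTY:
                if free is None:
                    free = i
                break
            if slot is TOMBSTONE:
                if free is None:
                    free = i
            elif slot[0] == key:
                self.buckets[i] = (key, value)  # update existing entry
                return
        if free is None:
            raise RuntimeError("table full; a real table would have been resized")
        self.buckets[free] = (key, value)

    def delete(self, key):
        n = len(self.buckets)
        for i in self._probe(key):
            slot = self.buckets[i]
            if slot is EMPTY:
                return False
            if slot is not TOMBSTONE and slot[0] == key:
                # Leave a tombstone only if the following bucket is in use.
                following = self.buckets[(i + 1) % n]
                self.buckets[i] = EMPTY if following is EMPTY else TOMBSTONE
                return True
        return False

    def effective_load(self):
        # Effective load counts tombstones as well as real entries.
        return sum(1 for s in self.buckets if s is not EMPTY) / len(self.buckets)
```

In this sketch a tombstone keeps later entries of the same probe sequence reachable, which is why it must count towards the effective load even though it holds no data.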