update docs about fixed chunker and chunker algo spec needed

Thomas Waldmann 2019-02-13 06:30:13 +01:00
parent ac0803fe0b
commit 7f46eb99aa
6 changed files with 68 additions and 19 deletions


@ -19,7 +19,8 @@ specified when the backup was performed.
Deduplication is performed globally across all data in the repository
(multiple backups and even multiple hosts), both on data and file
metadata, using :ref:`chunks` created by the chunker using the
Buzhash_ algorithm ("buzhash" chunker) or a simpler fixed blocksize
algorithm ("fixed" chunker).
To actually perform the repository-wide deduplication, a hash of each
chunk is checked against the :ref:`chunks cache <cache>`, which is a


@ -580,16 +580,43 @@ A chunk is stored as an object as well, of course.
Chunks
~~~~~~

Borg has these chunkers:

- "fixed": a simple, low CPU overhead, fixed blocksize chunker, optionally
  supporting a header block of a different size.
- "buzhash": variable, content-defined blocksize; uses a rolling hash
  computed by the Buzhash_ algorithm.
For some more general usage hints see also ``--chunker-params``.

"fixed" chunker
+++++++++++++++

The fixed chunker triggers (chunks) at evenly spaced offsets, e.g. every 4MiB,
producing chunks of the same block size (the last chunk is not required to be
full-size).
Optionally, it can cut the first "header" chunk with a different size (the
default is not to have a differently sized header chunk).

``borg create --chunker-params fixed,BLOCK_SIZE[,HEADER_SIZE]``

- BLOCK_SIZE: no default value; a multiple of the system page size (usually
  4096 bytes) is recommended. E.g. 4194304 would cut 4MiB sized chunks.
- HEADER_SIZE: optional, defaults to 0 (no header chunk).
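The fixed chunking scheme described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Borg's actual implementation; the function name and signature are made up:

```python
import io

def fixed_chunker(f, block_size, header_size=0):
    """Yield chunks from the file-like object f at fixed offsets.

    Optionally cut a first "header" chunk of a different size; the last
    chunk may be shorter than block_size.
    """
    if header_size:
        header = f.read(header_size)
        if header:
            yield header
    while True:
        block = f.read(block_size)
        if not block:
            break
        yield block

# e.g. a 10-byte input with block_size=4 and a 2-byte header
sizes = [len(c) for c in fixed_chunker(io.BytesIO(b"0123456789"), 4, header_size=2)]
# sizes == [2, 4, 4]
```

Note that the chunk boundaries depend only on the parameters, never on the data, which is what keeps the CPU overhead low.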
"buzhash" chunker
+++++++++++++++++
The buzhash chunker triggers (chunks) when the last HASH_MASK_BITS bits of
the hash are zero, producing chunks of 2^HASH_MASK_BITS Bytes on average.
Buzhash is **only** used for cutting the chunks at places defined by the
content; the buzhash value is **not** used as the deduplication criterion (we
use a cryptographically strong hash/MAC over the chunk contents for that, the
id_hash).
``borg create --chunker-params buzhash,CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE``
can be used to tune the chunker parameters, the default is:
- CHUNK_MIN_EXP = 19 (minimum chunk size = 2^19 B = 512 kiB)
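The content-defined cutting principle can be illustrated with a toy rolling-hash chunker. This is a stand-in only: real Buzhash uses a seeded byte-mapping table and cyclic shifts over a sliding window, and the real defaults are the 2^19 / 2^23 sizes listed above; the tiny parameters here are chosen just to make the demo fast:

```python
import os

def toy_cdc(data, min_exp=6, max_exp=8, mask_bits=5):
    """Toy content-defined chunker: cut when the low mask_bits bits of a
    crude rolling hash are zero, within 2**min_exp .. 2**max_exp size bounds.
    """
    mask = (1 << mask_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF   # NOT buzhash, just a simple mix
        size = i - start + 1
        if size >= (1 << max_exp) or (size >= (1 << min_exp) and h & mask == 0):
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

data = os.urandom(100_000)
chunks = toy_cdc(data)
assert b"".join(chunks) == data                    # chunking is lossless
assert all(len(c) <= 2 ** 8 for c in chunks)       # maximum size bound
assert all(len(c) >= 2 ** 6 for c in chunks[:-1])  # minimum size bound
```

Because cut points depend only on nearby content, inserting bytes early in a file shifts only the chunks around the edit; later chunks resynchronize, which is what makes content-defined chunking deduplication-friendly.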
@ -602,8 +629,6 @@ for the repository, and stored encrypted in the keyfile. This is to prevent
chunk size based fingerprinting attacks on your encrypted repo contents (to
guess what files you have based on a specific set of chunk sizes).
.. _cache:

The cache
~~~~~~~~~
@ -690,7 +715,8 @@ Indexes / Caches memory usage
Here is the estimated memory usage of Borg - it's complicated::

  chunk_size ~= 2 ^ HASH_MASK_BITS (for buzhash chunker, BLOCK_SIZE for fixed chunker)
  chunk_count ~= total_file_size / chunk_size

  repo_index_usage = chunk_count * 40
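Plugging numbers into the two lines above gives a quick feel for the scaling. The helper name is made up for illustration, and the full formula in these docs has additional terms beyond the repo index:

```python
def repo_index_usage(total_file_size, chunk_size):
    # chunk_count ~= total_file_size / chunk_size
    # repo_index_usage = chunk_count * 40 bytes
    return (total_file_size // chunk_size) * 40

# 1 TiB of data with the default buzhash HASH_MASK_BITS = 21 (2 MiB average chunks)
print(repo_index_usage(2 ** 40, 2 ** 21))   # 20971520 bytes = 20 MiB
```

Halving the average chunk size doubles the chunk count and therefore doubles this index usage.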
@ -732,11 +758,11 @@ For small hash tables, we start with a growth factor of 2, which comes down to
E.g. backing up a total count of 1 Mi (IEC binary prefix i.e. 2^20) files with a total size of 1TiB.
a) with ``create --chunker-params buzhash,10,23,16,4095`` (custom, like borg < 1.0 or attic):

   mem_usage = 2.8GiB

b) with ``create --chunker-params buzhash,19,23,21,4095`` (default):

   mem_usage = 0.31GiB


@ -376,6 +376,7 @@ The same archive with more information (``borg info --last 1 --json``)::
"archives": [
{
"chunker_params": [
"buzhash",
13,
23,
16,


@ -396,16 +396,27 @@ Stored chunk sizes
A borg repository does not hide the size of the chunks it stores (size
information is needed to operate the repository).
The chunks stored in the repo are the (compressed, encrypted and authenticated)
output of the chunker. The sizes of these stored chunks are influenced by the
compression, encryption and authentication.

buzhash chunker
+++++++++++++++

The buzhash chunker chunks according to the input data, the chunker's
parameters and the secret chunker seed (which all influence the chunk boundary
positions).
Small files below some specific threshold (default: 512kiB) result in only one
chunk (identical content / size as the original file), bigger files result in
multiple chunks.

fixed chunker
+++++++++++++

This chunker yields fixed-size chunks, with optional support for a differently
sized header chunk. The last chunk is not required to have the full block size
and its size is determined by the input file size.
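For the fixed chunker, the stored chunk-size pattern (before compression and encryption) follows directly from the file size and the chunker parameters. A sketch with an illustrative helper, not Borg code:

```python
def fixed_chunk_sizes(file_size, block_size, header_size=0):
    """Chunk sizes the fixed chunker would cut from a file_size-byte input."""
    sizes = []
    if header_size and file_size:
        sizes.append(min(header_size, file_size))
        file_size -= sizes[-1]
    full, rest = divmod(file_size, block_size)
    sizes += [block_size] * full
    if rest:
        sizes.append(rest)   # last chunk may be shorter than block_size
    return sizes

# a 9 MiB file with 4 MiB blocks: two full blocks plus a 1 MiB tail
print(fixed_chunk_sizes(9 * 2 ** 20, 4 * 2 ** 20))
```

Unlike the buzhash case, there is no secret seed involved here, so the pre-compression size pattern depends only on public parameters and the input size.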
Within our attack model, an attacker possessing a specific set of files which
they assume the victim also possesses (and backs up into the repository)


@ -36,10 +36,10 @@ Examples
# Make a big effort in fine granular deduplication (big chunk management
# overhead, needs a lot of RAM and disk space, see formula in internals
# docs - same parameters as borg < 1.0 or attic):
$ borg create --chunker-params buzhash,10,23,16,4095 /path/to/repo::small /smallstuff

# Backup a raw device (must not be active/in use/mounted at that time)
$ dd if=/dev/sdx bs=4M | borg create --chunker-params fixed,4194304 /path/to/repo::my-sdx -
# No compression (none)
$ borg create --compression none /path/to/repo::arch ~


@ -14,16 +14,26 @@ resource usage (RAM and disk space) as the amount of resources needed is
(also) determined by the total amount of chunks in the repository (see
:ref:`cache-memory-usage` for details).
``--chunker-params=buzhash,10,23,16,4095`` results in fine-grained deduplication
and creates a large number of chunks and thus uses a lot of resources to manage
them. This is good for relatively small data volumes and if the machine has a
good amount of free RAM and disk space.

``--chunker-params=buzhash,19,23,21,4095`` (default) results in coarse-grained
deduplication and creates a much smaller number of chunks and thus uses fewer
resources. This is good for relatively big data volumes and if the machine has
a relatively low amount of free RAM and disk space.

``--chunker-params=fixed,4194304`` results in fixed 4MiB sized block
deduplication and is more efficient than the previous example when used for
block devices (like disks, partitions, LVM LVs) or raw disk image files.

``--chunker-params=fixed,4096,512`` results in fixed 4kiB sized blocks,
but the first header block will only be 512B long. This might be useful to
dedup files with 1 header + N fixed-size data blocks. Be careful not to
produce too large a number of chunks (as happens when using a small block
size for huge files).
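A back-of-the-envelope comparison makes the caution concrete (100 GiB is an assumed example size, not from the docs):

```python
GiB = 2 ** 30

# fixed,4194304 on a 100 GiB image: a modest chunk count
print(100 * GiB // (4 * 2 ** 20))   # 25600 chunks

# fixed,4096 on the same 100 GiB image: a huge chunk count to manage
print(100 * GiB // 4096)            # 26214400 chunks
```

At roughly 40 bytes of repo index per chunk (see the memory usage formula in the internals docs), the second choice costs about a thousand times more index memory for the same data.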

If you have already made some archives in a repository and you then change
chunker params, this of course impacts deduplication as the chunks will be
cut differently.