mirror of
https://github.com/borgbackup/borg.git
synced 2026-04-15 21:59:58 -04:00
update docs about fixed chunker and chunker algo spec needed
parent ac0803fe0b
commit 7f46eb99aa
6 changed files with 68 additions and 19 deletions
@@ -19,7 +19,8 @@ specified when the backup was performed.
 Deduplication is performed globally across all data in the repository
 (multiple backups and even multiple hosts), both on data and file
 metadata, using :ref:`chunks` created by the chunker using the
-Buzhash_ algorithm.
+Buzhash_ algorithm ("buzhash" chunker) or a simpler fixed blocksize
+algorithm ("fixed" chunker).
 
 To actually perform the repository-wide deduplication, a hash of each
 chunk is checked against the :ref:`chunks cache <cache>`, which is a

@@ -580,16 +580,43 @@ A chunk is stored as an object as well, of course.
 Chunks
 ~~~~~~
 
-The Borg chunker uses a rolling hash computed by the Buzhash_ algorithm.
-It triggers (chunks) when the last HASH_MASK_BITS bits of the hash are zero,
-producing chunks of 2^HASH_MASK_BITS Bytes on average.
+Borg has these chunkers:
+
+- "fixed": a simple, low CPU overhead, fixed blocksize chunker, optionally
+  supporting a header block of different size.
+- "buzhash": variable, content-defined blocksize, uses a rolling hash
+  computed by the Buzhash_ algorithm.
+
+For some more general usage hints see also ``--chunker-params``.
+
+"fixed" chunker
++++++++++++++++
+
+The fixed chunker triggers (chunks) at evenly spaced offsets, e.g. every 4MiB,
+producing chunks of the same block size (the last chunk is not required to be
+full-size).
+
+Optionally, it can cut the first "header" chunk with a different size (the
+default is not to have a differently sized header chunk).
+
+``borg create --chunker-params fixed,BLOCK_SIZE[,HEADER_SIZE]``
+
+- BLOCK_SIZE: no default value, multiple of the system page size (usually 4096
+  bytes) recommended. E.g.: 4194304 would cut 4MiB sized chunks.
+- HEADER_SIZE: optional, defaults to 0 (no header chunk).
+
+"buzhash" chunker
++++++++++++++++++
+
+The buzhash chunker triggers (chunks) when the last HASH_MASK_BITS bits of
+the hash are zero, producing chunks of 2^HASH_MASK_BITS Bytes on average.
 
 Buzhash is **only** used for cutting the chunks at places defined by the
 content, the buzhash value is **not** used as the deduplication criterion (we
 use a cryptographically strong hash/MAC over the chunk contents for this, the
 id_hash).
 
-``borg create --chunker-params CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE``
+``borg create --chunker-params buzhash,CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE``
 can be used to tune the chunker parameters, the default is:
 
 - CHUNK_MIN_EXP = 19 (minimum chunk size = 2^19 B = 512 kiB)
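The behaviour that the hunk above documents for the "fixed" chunker (an optional header chunk, then equally sized blocks, with a possibly shorter last block) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ``fixed_chunks`` is a hypothetical helper, not Borg's implementation, and its parameter names simply mirror BLOCK_SIZE and HEADER_SIZE from the text.

```python
import io
from typing import BinaryIO, Iterator


def fixed_chunks(stream: BinaryIO, block_size: int, header_size: int = 0) -> Iterator[bytes]:
    """Yield an optional header chunk, then fixed-size chunks.

    Hypothetical helper for illustration only; not Borg's code.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if header_size > 0:
        # Cut the first "header" chunk with its own size.
        header = stream.read(header_size)
        if header:
            yield header
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # The last block may be shorter than block_size.
        yield block


# 10 bytes of input, 1-byte header, 4-byte blocks.
sizes = [len(c) for c in fixed_chunks(io.BytesIO(b"x" * 10), block_size=4, header_size=1)]
print(sizes)  # [1, 4, 4, 1]
```

Because the cut points depend only on offsets, inserting a single byte near the start of the input shifts every later block, which is why this approach suits block devices and raw images rather than files that grow or shrink in the middle.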
@@ -602,8 +629,6 @@ for the repository, and stored encrypted in the keyfile. This is to prevent
 chunk size based fingerprinting attacks on your encrypted repo contents (to
 guess what files you have based on a specific set of chunk sizes).
 
-For some more general usage hints see also ``--chunker-params``.
-
 .. _cache:
 
 The cache
@@ -690,7 +715,8 @@ Indexes / Caches memory usage
 
 Here is the estimated memory usage of Borg - it's complicated::
 
-  chunk_count ~= total_file_size / 2 ^ HASH_MASK_BITS
+  chunk_size ~= 2 ^ HASH_MASK_BITS (for buzhash chunker, BLOCK_SIZE for fixed chunker)
+  chunk_count ~= total_file_size / chunk_size
 
   repo_index_usage = chunk_count * 40
 
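As a rough worked example of the two formulas visible in the hunk above (``chunk_count ~= total_file_size / chunk_size`` and ``repo_index_usage = chunk_count * 40``), the following sketch compares a few chunker settings for 1 TiB of data. The function names are hypothetical, the result covers only the repository index term shown in this hunk (not the full memory estimate), and it assumes no deduplication at all.

```python
def estimate_chunk_count(total_file_size: int, chunk_size: int) -> int:
    """Approximate chunk count; rounds up so a partial last chunk is counted."""
    return -(-total_file_size // chunk_size)


def estimate_repo_index_usage(chunk_count: int, bytes_per_entry: int = 40) -> int:
    """Repo index size in bytes, using the factor 40 from the formula above."""
    return chunk_count * bytes_per_entry


TiB = 2 ** 40
settings = [
    ("buzhash, HASH_MASK_BITS=21", 2 ** 21),  # average chunk size, not exact
    ("fixed, BLOCK_SIZE=4194304", 4194304),
    ("fixed, BLOCK_SIZE=4096", 4096),
]
for label, chunk_size in settings:
    count = estimate_chunk_count(TiB, chunk_size)
    usage = estimate_repo_index_usage(count)
    print(f"{label}: {count} chunks, repo index ~{usage / 2**20:.0f} MiB")
```

Under these assumptions a 4 KiB block size produces 1024 times as many chunks as a 4 MiB block size for the same data, and the index grows by the same factor.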
@@ -732,11 +758,11 @@ For small hash tables, we start with a growth factor of 2, which comes down to
 
 E.g. backing up a total count of 1 Mi (IEC binary prefix i.e. 2^20) files with a total size of 1TiB.
 
-a) with ``create --chunker-params 10,23,16,4095`` (custom, like borg < 1.0 or attic):
+a) with ``create --chunker-params buzhash,10,23,16,4095`` (custom, like borg < 1.0 or attic):
 
   mem_usage = 2.8GiB
 
-b) with ``create --chunker-params 19,23,21,4095`` (default):
+b) with ``create --chunker-params buzhash,19,23,21,4095`` (default):
 
   mem_usage = 0.31GiB
 

@@ -376,6 +376,7 @@ The same archive with more information (``borg info --last 1 --json``)::
         "archives": [
             {
                 "chunker_params": [
+                    "buzhash",
                     13,
                     23,
                     16,

@@ -396,16 +396,27 @@ Stored chunk sizes
 A borg repository does not hide the size of the chunks it stores (size
 information is needed to operate the repository).
 
-The chunks stored are the (compressed and encrypted) output of the chunker,
-chunked according to the input data, the chunker's parameters and the secret
-chunker seed (which all influence the chunk boundary positions).
+The chunks stored in the repo are the (compressed, encrypted and authenticated)
+output of the chunker. The sizes of these stored chunks are influenced by the
+compression, encryption and authentication.
+
+buzhash chunker
++++++++++++++++
+
+The buzhash chunker chunks according to the input data, the chunker's
+parameters and the secret chunker seed (which all influence the chunk boundary
+positions).
 
 Small files below some specific threshold (default: 512kiB) result in only one
 chunk (identical content / size as the original file), bigger files result in
 multiple chunks.
 
-After chunking is done, compression, encryption and authentication are applied,
-which influence the sizes of the chunks stored into the repository.
+fixed chunker
++++++++++++++
+
+This chunker yields fixed sized chunks, with optional support of a differently
+sized header chunk. The last chunk is not required to have the full block size
+and is determined by the input file size.
 
 Within our attack model, an attacker possessing a specific set of files which
 he assumes that the victim also possesses (and backs up into the repository)

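To make the statement in the hunk above more concrete (boundaries depend on the input data, the chunker parameters and a secret seed, and inputs below the minimum size yield a single chunk), here is a toy content-defined chunker in Python. It is a sketch under loud assumptions: it uses a generic cyclic-polynomial rolling hash with a made-up table, mixes the seed in by XOR purely for illustration, and uses SHA-256 as a stand-in for the id_hash. None of this is Borg's actual Buzhash table, seeding scheme or code.

```python
import hashlib
import random


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def make_table(seed: int) -> list:
    """Made-up 256-entry table; the XOR with a 32-bit seed is an assumption."""
    rng = random.Random(0)
    return [rng.getrandbits(32) ^ seed for _ in range(256)]


def chunk_sizes(data: bytes, table: list, mask_bits: int, window: int,
                min_size: int, max_size: int) -> list:
    """Cut when the low mask_bits bits of the rolling hash are zero."""
    mask = (1 << mask_bits) - 1
    sizes, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = rotl32(h, 1) ^ table[byte]
        if i - start >= window:
            # Remove the byte that just left the window.
            h ^= rotl32(table[data[i - window]], window)
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            sizes.append(size)
            start, h = i + 1, 0
    if start < len(data):
        sizes.append(len(data) - start)  # last, possibly short, chunk
    return sizes


rng = random.Random(1)
data = bytes(rng.getrandbits(8) for _ in range(20000))
table = make_table(seed=0x12345678)
sizes = chunk_sizes(data, table, mask_bits=8, window=31, min_size=64, max_size=4096)
small = chunk_sizes(data[:50], table, mask_bits=8, window=31, min_size=64, max_size=4096)
print(small)  # [50]: input below min_size stays one chunk

# The rolling hash only picks cut points; a strong hash identifies the chunk.
chunk_id = hashlib.sha256(data[:sizes[0]]).hexdigest()
```

In this toy, calling ``make_table`` with a different seed generally moves the cut points for the same data, which is the property the text relies on to hinder chunk-size fingerprinting.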
@@ -36,10 +36,10 @@ Examples
     # Make a big effort in fine granular deduplication (big chunk management
     # overhead, needs a lot of RAM and disk space, see formula in internals
     # docs - same parameters as borg < 1.0 or attic):
-    $ borg create --chunker-params 10,23,16,4095 /path/to/repo::small /smallstuff
+    $ borg create --chunker-params buzhash,10,23,16,4095 /path/to/repo::small /smallstuff
 
     # Backup a raw device (must not be active/in use/mounted at that time)
-    $ dd if=/dev/sdx bs=10M | borg create /path/to/repo::my-sdx -
+    $ dd if=/dev/sdx bs=4M | borg create --chunker-params fixed,4194304 /path/to/repo::my-sdx -
 
     # No compression (none)
     $ borg create --compression none /path/to/repo::arch ~

@@ -14,16 +14,26 @@ resource usage (RAM and disk space) as the amount of resources needed is
 (also) determined by the total amount of chunks in the repository (see
 :ref:`cache-memory-usage` for details).
 
-``--chunker-params=10,23,16,4095`` results in a fine-grained deduplication
+``--chunker-params=buzhash,10,23,16,4095`` results in a fine-grained deduplication
 and creates a big amount of chunks and thus uses a lot of resources to manage
 them. This is good for relatively small data volumes and if the machine has a
 good amount of free RAM and disk space.
 
-``--chunker-params=19,23,21,4095`` (default) results in a coarse-grained
+``--chunker-params=buzhash,19,23,21,4095`` (default) results in a coarse-grained
 deduplication and creates a much smaller amount of chunks and thus uses less
 resources. This is good for relatively big data volumes and if the machine has
 a relatively low amount of free RAM and disk space.
 
+``--chunker-params=fixed,4194304`` results in fixed 4MiB sized block
+deduplication and is more efficient than the previous example when used
+for block devices (like disks, partitions, LVM LVs) or raw disk image files.
+
+``--chunker-params=fixed,4096,512`` results in fixed 4kiB sized blocks,
+but the first header block will only be 512B long. This might be useful to
+dedup files with 1 header + N fixed size data blocks. Be careful not to
+produce too many chunks (e.g. by using a small block size for huge
+files).
+
 If you already have made some archives in a repository and you then change
 chunker params, this of course impacts deduplication as the chunks will be
 cut differently.