Implement IntegrityCheckedFile
This is based on my much earlier work from October 2016, but it is
overall simplified and the terminology has changed (from "signing" to
hashing and integrity checking).
See #1688 for the full history.
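The general idea can be sketched like this (a minimal, hypothetical sketch - class layout, hash algorithm and digest handling are assumptions, not the actual borg API):

```python
import hashlib

class FileIntegrityError(Exception):
    """Raised when the stored digest does not match the file contents."""

class IntegrityCheckedFile:
    """Hash data while reading/writing and compare digests on close (sketch only)."""

    def __init__(self, path, write, expected_digest=None):
        self.path = path
        self.writing = write
        self.expected_digest = expected_digest
        self.hasher = hashlib.sha512()
        self.fd = open(path, 'wb' if write else 'rb')

    def write(self, data):
        self.hasher.update(data)
        self.fd.write(data)

    def read(self, n=-1):
        data = self.fd.read(n)
        self.hasher.update(data)
        return data

    def close(self):
        self.fd.close()
        if self.writing:
            self.digest = self.hasher.hexdigest()   # caller persists this digest somewhere trustworthy
        elif self.expected_digest is not None:
            if self.hasher.hexdigest() != self.expected_digest:
                raise FileIntegrityError('%s: file is corrupted' % self.path)
```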
Just getting data from the repo can already raise IntegrityErrors
in LoggedIO, so we need to catch them as well.
See also the code a few lines above, where this is done in the same way.
This fixes the problem raised by issue #2314 by requiring that each root
subtree be fully traversed.
The problem occurs when a patterns file excludes a parent directory P later
in the file, but earlier in the file a subdirectory S of P is included.
Because a tree is processed recursively with a depth-first search, P is
processed before S is. Previously, if P was excluded, then S would not even
be considered. Now, it is possible to recurse into P nonetheless, while not
adding P (as a directory entry) to the archive.
With this commit, a `-` in a patterns-file will allow an excluded directory
to be searched for matching descendants. If the old behavior is desired, it
can be achieved by using a `!` in place of the `-`.
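For illustration, a hypothetical patterns file (paths are made up) showing the difference:

```
# S is included before its parent directory P is excluded
R /data
+ /data/P/S
# with '-', borg may still recurse into the excluded P, so S is found and archived
# (P itself is not added to the archive):
- /data/P
# with '!' instead of '-', P would not be recursed into at all and S would be missed:
# ! /data/P
```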
The following is a list of specific changes made by this commit:
* renamed the InclExclPattern namedtuple -> CmdTuple (with fields 'val' and 'cmd'), since it is used more generally for commands and not only for representing patterns.
* represent commands as IECommand enum types (RootPath, PatternStyle, Include, Exclude, ExcludeNoRecurse)
* archiver: Archiver.build_matcher() paths arg renamed -> include_paths to prevent confusion as to whether the list of paths is to be included or excluded.
* helpers: PatternMatcher has a recurse_dir attribute that is used to communicate whether an excluded dir should be recursed into (used by Archiver._process(); see the sketch after this list)
* archiver: Archiver.build_matcher() now only returns a PatternMatcher instance, and not an include_patterns list -- this list is now created and housed within the PatternMatcher instance, and can be accessed from there.
* moved operation of finding unmatched patterns from Archiver to PatternMatcher.get_unmatched_include_patterns()
* added / modified some documentation of code
* renamed _PATTERN_STYLES -> _PATTERN_CLASSES since "style" is ambiguous and this helps clarify that the set contains classes and not instances.
* have PatternBase subclass instances store whether excluded dirs are to be recursed into. Because PatternBase objects are created for each +, -, ! command, it is necessary to differentiate - from ! within these objects.
* add test for '!' exclusion rule (which doesn't recurse)
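As referenced above, a hedged sketch (simplified, not the actual borg code) of how a _process()-style walker could use the matcher's recurse_dir attribute; the matcher interface is an assumption:

```python
import os
import stat

def process(matcher, path):
    """Hypothetical walker: matcher.match() is assumed to also set matcher.recurse_dir."""
    st = os.lstat(path)
    if matcher.match(path):
        print('A', path)                   # "archive" the matching item
        recurse = stat.S_ISDIR(st.st_mode)
    else:
        # excluded: do not add the directory itself, but keep walking if the
        # matching exclude allows recursion (the '-' behavior described above)
        recurse = stat.S_ISDIR(st.st_mode) and matcher.recurse_dir
    if recurse:
        for name in sorted(os.listdir(path)):
            process(matcher, os.path.join(path, name))
```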
Most code of the CM is just moved 1:1 from the regular file block.
Use the CM for regular files, FIFOs and devices, but not for:
- directories (cannot have hardlinks)
- symlinks (we cannot support hardlinked symlinks)
- nlink > 1 for dirs does not mean hardlinking
  (at least not everywhere; wondering how Apple does it)
- we cannot archive hardlinked symlinks due to item.source dual-use,
  see issue #2343.
likely nobody uses this anyway.
make_parent(path) helper to reduce code duplication.
Also use it for directories, although makedirs could do that as well.
bugfix: also create the parent dir for device files, if needed.
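A minimal sketch of what such a helper could look like (the actual implementation may differ):

```python
import os

def make_parent(path):
    """Create the parent directory of path, if it does not exist yet."""
    parent_dir = os.path.dirname(path)
    if parent_dir and not os.path.exists(parent_dir):
        os.makedirs(parent_dir)
```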
If a hardlink master was not in the to-be-extracted subset, the "x"
status was not displayed for it.
Also, the matcher was called twice for matching items.
On the wheezy32 test machine, a test working with corrupted data crashed
with a MemoryError when it tried to allocate a ~800 MB buffer.
MemoryError is now transformed to DecompressionError, so it gets handled
better.
Also, the bound for giving up is now much lower: 1GiB -> 128MiB.
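A hedged sketch of the transformation (the wrapper and its names are assumptions, not the exact borg code):

```python
class DecompressionError(Exception):
    """Raised when (possibly corrupted) data cannot be decompressed."""

def safe_decompress(decompress, data):
    # corrupted input can make the decompressor ask for huge buffers;
    # turn the resulting MemoryError into a DecompressionError so the
    # caller handles it like other "data is corrupted" errors
    try:
        return decompress(data)
    except MemoryError:
        raise DecompressionError('MemoryError while decompressing - corrupted input?')
```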
For a borg create run using a patterns file with 15,000 PathFullPattern excludes
that excluded almost all files in the input data set:
- before this optimization: ~60s
- after this optimization: ~1s
This is not really a pattern (in the sense of potentially having variable parts) - it just
does a full, precise match after the usual normalizations.
The main reason for adding it is to enable later optimizations, e.g. via a set membership check,
so that a lot of such PathFullPatterns can be "matched" in O(1) time.
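A minimal sketch of the idea (simplified; the real class may differ, e.g. in the normalization used):

```python
import os

class PathFullPattern:
    """Match a path only if it equals the (normalized) pattern exactly."""

    def __init__(self, pattern):
        self.path = os.path.normpath(pattern)

    def match(self, path):
        return os.path.normpath(path) == self.path

# since matching is plain equality, many such patterns can be collapsed into a
# set and tested with a single O(1) membership check:
exclude_paths = {PathFullPattern(p).path for p in ('/etc/passwd', '/tmp/foo')}
print(os.path.normpath('/tmp/foo') in exclude_paths)   # True: one hash lookup
```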
A symlink has a 'source' attribute, so it was confused with a hardlink
slave here. See also issue #2343.
Also, a symlink's filesystem size is defined as the length of its target path.
Before this changeset, async responses were:
- if not an error: ignored
- if an error: raised as response to the arbitrary/unrelated next command
Now, after sending async commands, the async_response command must be used
to process outstanding responses / exceptions.
We avoid piling up lots of outstanding data in cases of high latency, because we do NOT
first wait until ALL responses have arrived; we can begin processing responses right away.
Calls with wait=False will just return what we already have received.
Repeated calls with wait=True until None is returned will fetch all responses.
Async commands could now actually have non-exception, non-None results, but
this is not used yet. None responses are still dropped.
The motivation for this is to have a clear separation between a request
blowing up because it (itself) failed and failures unrelated to that request /
to that line in the source code.
also: fix processing for async repo obj deletes
exception_ignored is a special object that is "not None" (as None is used to signal
"finished with processing async results"), but is also not a potential async response result value.
Also:
added wait=True to chunk_decref() and add_chunk().
This makes async processing explicit - the default is synchronous, and you only
need to be careful and take extra steps for async processing if you explicitly
request it by calling with wait=False (usually for speed reasons).
To process async results, use async_response, see above.
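A hedged sketch of the resulting calling pattern (based on the behavior described above; object and method signatures are assumptions):

```python
def delete_chunks(repository, chunk_ids):
    """Delete many repo objects without waiting for each individual round trip."""
    for id in chunk_ids:
        repository.delete(id, wait=False)
        # opportunistically process whatever responses have already arrived
        repository.async_response(wait=False)
    # finally, wait for and process all outstanding responses / exceptions;
    # None signals that everything has been processed
    while repository.async_response(wait=True) is not None:
        pass
```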
The bug was compr_args.update(compr_spec), helpers.py:2168 - that mutated
the compression spec dict (and not just some local one, but the compr spec
dict parsed from the command line args).
So a change that was intended for just 1 chunk changed the desired
compression level at the archive scope.
I refactored the code to use a namedtuple (which is immutable, so such
effects cannot happen again).
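For illustration (field names are assumptions), the immutable spec makes per-chunk tweaks create a new object instead of mutating shared state:

```python
from collections import namedtuple

# hypothetical field names - the point is the immutability, not the exact layout
CompressionSpec = namedtuple('CompressionSpec', ('name', 'level'))

archive_spec = CompressionSpec(name='zlib', level=6)   # parsed once from the command line
chunk_spec = archive_spec._replace(level=0)            # per-chunk override -> new object

print(archive_spec.level)   # still 6: the archive-wide setting is untouched
```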
This reverts commit b7eaeee266.
We still need the bigint stuff for compatibility to borg 1.0 archives.
# Conflicts:
# src/borg/archive.py
# src/borg/archiver.py
# src/borg/helpers.py
# src/borg/key.py
The SaveFile code, while ensuring atomicity, did not allow for secure
erasure of the config file (containing the old encrypted key). Now it
creates a hardlink to the file, lets SaveFile do its thing, and writes
random data over the old file (via the hardlink). A secure erase is
needed because the config file can contain the old key after changing
one's password.
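A hedged sketch of that sequence (SaveFile location/signature and the helper name are assumptions; error handling omitted):

```python
import os
from borg.helpers import SaveFile   # module path / signature assumed

def rewrite_config_securely(config_path, new_config_bytes):
    old = config_path + '.old'
    os.link(config_path, old)                 # hardlink keeps the old inode reachable
    with SaveFile(config_path, binary=True) as fd:   # atomic replace of the config
        fd.write(new_config_bytes)
    size = os.stat(old).st_size
    with open(old, 'r+b') as fd:              # overwrite the old key material in place
        fd.write(os.urandom(size))
        fd.flush()
        os.fsync(fd.fileno())
    os.unlink(old)
```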
If there are too many deleted buckets (tombstones), hashtable performance goes down the drain.
In the worst case of 0 empty buckets and lots of tombstones, this results in full table scans for
new / unknown keys.
Thus, we make sure we always have a good amount of empty buckets.
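A minimal open-addressing sketch (in Python, unlike the real C hashindex) of why that happens: a lookup for a missing key can only stop at an empty bucket, so tombstones must be probed past:

```python
EMPTY, DELETED = object(), object()   # DELETED marks a tombstone

def lookup(table, key):
    """Linear probing; table entries are EMPTY, DELETED, or (key, value) tuples."""
    idx = hash(key) % len(table)
    for _ in range(len(table)):
        bucket = table[idx]
        if bucket is EMPTY:
            return None                        # stop early: key cannot be present
        if bucket is not DELETED and bucket[0] == key:
            return bucket[1]
        idx = (idx + 1) % len(table)           # probe past tombstones and collisions
    return None   # no empty bucket was hit: this was a full table scan
```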
* trigger bug in --verify-data, see #2221
* raise decompression errors as DecompressionError, fixes #2221
this is a subclass of IntegrityError, so borg check --verify-data works correctly if
the decompressor stumbles over corrupted data before the plaintext gets verified
(in an unencrypted repository; otherwise the MAC check would fail first).
* fixup: fix exception docstring, add placeholder, change wording
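A minimal sketch of the exception relationship this relies on (class bodies and the verification hook are simplified assumptions):

```python
class IntegrityError(Exception):
    """Base class for "this data is corrupted" errors (simplified)."""

class DecompressionError(IntegrityError):
    """Decompressor choked on (corrupted) data - still an integrity error."""

def verify_data(chunk_id, data, decompress, verify_plaintext):
    try:
        plaintext = decompress(data)
        verify_plaintext(chunk_id, plaintext)
    except IntegrityError as exc:
        # catches DecompressionError too, so corruption detected *before* the
        # plaintext check is reported the same way by borg check --verify-data
        print('%s: integrity error: %s' % (chunk_id, exc))
```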
Obviously this means that --log-json with remote repos requires 1.1
on the remote end, but if you don't have that, then random "Remote:"
lines would break stderr anyway.
The chunk accounting code tried to reflect repo space usage via the st_blocks of the files.
So, a specific chunk that was shared between multiple files [inodes] was only accounted to one specific file.
Thus, the overall "du" of everything in the fuse-mounted repo maybe correctly reflected the repo space usage,
but the decision which file gets the chunk (the space) was kind of arbitrary and not really useful.
OTOH, a simple fuse getattr() was rather expensive because of this, as it needed to iterate over the chunks list
to compute the st_blocks value. It also needed quite some memory for the accounting.
Thus, st_blocks is now just ceil(size / blocksize).
also: fixed a bug where st_blocks was a floating point value previously.
also: preparing for further optimization of size computation (see next cs)
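A tiny sketch of the new computation (the blocksize value is an assumption; integer ceiling division keeps the result an int):

```python
def st_blocks(item_size, blocksize=512):
    # ceil(size / blocksize) using integer division, so the result is an int
    return (item_size + blocksize - 1) // blocksize
```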
If an item has a chunk list, pre-compute the total size and store it in the "size" metadata entry.
This speeds up access to the item size (e.g. for regular files) and could also be used to verify the validity of the chunks list.
Note about hardlinks: size is only stored for hardlink masters (only they have their own chunk list).
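A hedged sketch of the idea (the chunk-list entry layout assumed here is (id, size, csize) tuples):

```python
def precompute_size(item):
    # only items with their own chunk list (e.g. hardlink masters) get a size entry
    if 'chunks' in item and 'size' not in item:
        item['size'] = sum(size for _id, size, _csize in item['chunks'])
    return item
```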