just getting data from the repo can already raise IntegrityErrors
in LoggedIO, so we need to catch them also.
see also the code a few lines above where this is done in the same way.
This fixes the problem raised by issue #2314 by requiring that each root
subtree be fully traversed.
The problem occurs when a patterns file excludes a parent directory P later
in the file, but earlier in the file a subdirectory S of P is included.
Because a tree is processed recursively with a depth-first search, P is
processed before S is. Previously, if P was excluded, then S would not even
be considered. Now, it is possible to recurse into P nonetheless, while not
adding P (as a directory entry) to the archive.
With this commit, a `-` in a patterns-file will allow an excluded directory
to be searched for matching descendants. If the old behavior is desired, it
can be achieved by using a `!` in place of the `-`.
The following is a list of specific changes made by this commit:
* renamed InclExclPattern named-tuple -> CmdTuple (with names 'val' and 'cmd'), since it is used more generally for commands, and not only for representing patterns.
* represent commands as IECommand enum types (RootPath, PatternStyle, Include, Exclude, ExcludeNoRecurse)
* archiver: Archiver.build_matcher() paths arg renamed -> include_paths to prevent confusion as to whether the list of paths is to be included or excluded.
* helpers: PatternMatcher has recurse_dir attribute that is used to communicate whether an excluded dir should be recursed (used by Archiver._process())
* archiver: Archiver.build_matcher() now only returns a PatternMatcher instance, and not an include_patterns list -- this list is now created and housed within the PatternMatcher instance, and can be accessed from there.
* moved operation of finding unmatched patterns from Archiver to PatternMatcher.get_unmatched_include_patterns()
* added / modified some documentation of code
* renamed _PATTERN_STYLES -> _PATTERN_CLASSES since "style" is ambiguous and this helps clarify that the set contains classes and not instances.
* have PatternBase subclass instances store whether excluded dirs are to be recursed. Because PatternBase objs are created corresponding to each +, -, ! command it is necessary to differentiate - from ! within these objects.
* add test for '!' exclusion rule (which doesn't recurse)
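
The difference between `-` and `!` described above can be sketched as follows. This is a simplified illustration, not borg's PatternMatcher: the Cmd enum, the prefix-based match() and the in-memory TREE are hypothetical stand-ins, and "first matching rule wins" is assumed.

```python
from enum import Enum


class Cmd(Enum):
    INCLUDE = "+"
    EXCLUDE = "-"             # excluded, but still searched for matching descendants
    EXCLUDE_NO_RECURSE = "!"  # excluded and not recursed into


def match(path, rules):
    """Return (included, recurse) for path; the first matching rule wins."""
    for cmd, prefix in rules:
        if path == prefix or path.startswith(prefix + "/"):
            if cmd is Cmd.INCLUDE:
                return True, True
            return False, cmd is Cmd.EXCLUDE
    return True, True  # nothing matched: include by default


def walk(tree, path, rules, out):
    """Depth-first traversal; an excluded dir is not added but may be recursed."""
    included, recurse = match(path, rules)
    if included:
        out.append(path)
    if recurse:
        for name, subtree in sorted(tree.items()):
            walk(subtree, path + "/" + name, rules, out)
    return out


# Children of a root directory "P": subdirectory S (with one file) and "other".
TREE = {"S": {"f": {}}, "other": {}}
```

With the rules `+ P/S` followed by `- P`, walk(TREE, "P", ...) still reaches P/S and P/S/f although P itself is not added. With `! P` in place of `- P`, nothing below P is visited.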
Most code of the CM is just moved 1:1 from the regular file block.
Use the CM for regular files, FIFOs and devices, but not for:
- directories (can not have hardlinks)
- symlinks (we can not support hardlinked symlinks)
- nlink > 1 for dirs does not mean hardlinking
(at least not everywhere, wondering how apple does it)
- we can not archive hardlinked symlinks due to item.source dual-use,
see issue #2343.
likely nobody uses this anyway.
make_parent(path) helper to reduce code duplication.
also use it for directories although makedirs can also do it.
bugfix: also create parent dir for device files, if needed.
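
A helper of this kind might look roughly like the sketch below. Only the name make_parent(path) comes from the message above; the body is an assumption about what such a helper would do.

```python
import os


def make_parent(path):
    """Create the parent directory of path, including missing ancestors."""
    parent_dir = os.path.dirname(path)
    if parent_dir:
        # exist_ok=True makes repeated calls for the same parent harmless.
        os.makedirs(parent_dir, exist_ok=True)
```

Calling it before creating a regular file, directory or device node means each of those code paths no longer needs its own "does the parent exist?" check.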
if a hardlink master is not in the to-be-extracted subset, the "x"
status was not displayed for it.
also, the matcher was called twice for matching items.
on the wheezy32 test machine, a test testing with corrupted data crashed
with a MemoryError when it tried to get a ~800MB large buffer.
MemoryError is now transformed to DecompressionError, so it gets handled
better.
Also, the bound for giving up is now much lower: 1GiB -> 128MiB.
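
A minimal sketch of the idea, assuming a hypothetical get_buffer() helper and a stand-in DecompressionError class (borg's real buffer code is not shown in this message):

```python
class DecompressionError(Exception):
    """Stand-in for borg's exception of the same name."""


MAX_BUFFER_SIZE = 128 * 1024 ** 2  # 128 MiB, the new bound named above


def get_buffer(size, limit=MAX_BUFFER_SIZE):
    """Return a zeroed buffer of size bytes, or raise DecompressionError."""
    if size > limit:
        # Give up early instead of attempting a huge allocation.
        raise DecompressionError("requested buffer too large: %d bytes" % size)
    try:
        return bytearray(size)
    except MemoryError:
        # Report allocation failure like any other decompression problem.
        raise DecompressionError("not enough memory for %d bytes" % size) from None
```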
For a borg create run using a patterns file with 15,000 PathFullPattern excludes
that excluded almost all files in the input data set:
- before this optimization: ~60s
- after this optimization: ~1s
not really a pattern (as in potentially having any variable parts) - it just does a full,
precise match, after the usual normalizations.
the reason for adding this is mainly for later optimizations, e.g. via set membership check,
so that a lot of such PathFullPatterns can be "matched" within O(1) time.
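
The set membership idea can be sketched like this. The FullPathExcludes class is hypothetical, and os.path.normpath stands in for "the usual normalizations", which are not spelled out here.

```python
import os.path


class FullPathExcludes:
    """Holds many full-path patterns and matches them with one set lookup."""

    def __init__(self, paths):
        # Normalize once at construction time.
        self._paths = {os.path.normpath(p) for p in paths}

    def match(self, path):
        # Average O(1), independent of how many full paths were registered.
        return os.path.normpath(path) in self._paths
```

Compared with testing each pattern in turn, the cost per file no longer grows with the number of excludes.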
a symlink has a 'source' attribute, so it was confused with a hardlink
slave here. see also issue #2343.
also, a symlink's fs size is defined as the length of the target path.
Before this changeset, async responses were:
- if not an error: ignored
- if an error: raised as response to the arbitrary/unrelated next command
Now, after sending async commands, the async_response command must be used
to process outstanding responses / exceptions.
This avoids piling up lots of responses in cases of high latency, because we do NOT
first wait until ALL responses have arrived; we can begin processing them as they come in.
Calls with wait=False will just return what we already have received.
Repeated calls with wait=True until None is returned will fetch all responses.
Async commands now actually could have non-exception non-None results, but
this is not used yet. None responses are still dropped.
The motivation for this is to have a clear separation between a request
blowing up because it (itself) failed and failures unrelated to that request /
to that line in the sourcecode.
also: fix processing for async repo obj deletes
exception_ignored is a special object that is "not None" (None is used to signal
"finished with processing async results") but is also not a potential async response result value.
Also:
added wait=True to chunk_decref() and add_chunk()
this makes async processing explicit - the default is synchronous and you only
need to be careful and do extra steps for async processing if you explicitly
request async by calling with wait=False (usually for speed reasons).
to process async results, use async_response, see above.
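
The calling convention can be illustrated with a toy, in-process stand-in. ToyRepo below is hypothetical and does no networking; in this toy, responses only "arrive" while async_response(wait=True) is running.

```python
from collections import deque


class ToyRepo:
    def __init__(self, objects):
        self.objects = dict(objects)
        self._pending = deque()   # async requests without a response yet
        self._received = deque()  # responses waiting to be processed

    def _execute(self, key):
        try:
            del self.objects[key]
            return None
        except KeyError as exc:
            return exc

    def delete(self, key, wait=True):
        if wait:  # synchronous default: errors surface at the call site
            outcome = self._execute(key)
            if isinstance(outcome, Exception):
                raise outcome
            return outcome
        self._pending.append(key)  # async: outcome comes via async_response()

    def async_response(self, wait=True):
        while wait and self._pending and not self._received:
            outcome = self._execute(self._pending.popleft())
            if outcome is not None:  # None responses are dropped
                self._received.append(outcome)
        if not self._received:
            return None  # nothing received (wait=False) or nothing outstanding
        outcome = self._received.popleft()
        if isinstance(outcome, Exception):
            raise outcome
        return outcome
```

A caller that issued requests with wait=False would then loop on async_response(wait=True) until it returns None, handling exceptions inside that loop rather than at some unrelated later command.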
the bug was compr_args.update(compr_spec), helpers.py:2168 - that mutated
the compression spec dict (and not just some local one, but the compr spec
dict parsed from the commandline args).
so a change that was intended just for 1 chunk changed the desired
compression level on the archive scope.
I refactored the stuff to use a namedtuple (which is immutable, so such
effects can not happen again).
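
The class of bug and the fix can be sketched as follows. CompressionSpec and its field names are hypothetical; the message does not say which fields the real namedtuple has.

```python
from collections import namedtuple

# Hypothetical field names, for illustration only.
CompressionSpec = namedtuple("CompressionSpec", "name level")


def args_for_chunk_buggy(compr_args, override):
    """Old pattern: update() mutates the dict shared with the caller."""
    compr_args.update(override)
    return compr_args


def spec_for_chunk(spec, **override):
    """New pattern: _replace() returns a new tuple, the original is untouched."""
    return spec._replace(**override)
```

With the dict version, a per-chunk override leaks into the archive-wide settings. With the namedtuple, the per-chunk spec is a separate object and attempts to assign to a field raise AttributeError.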
This reverts commit b7eaeee266.
We still need the bigint stuff for compatibility to borg 1.0 archives.
# Conflicts:
# src/borg/archive.py
# src/borg/archiver.py
# src/borg/helpers.py
# src/borg/key.py
The SaveFile code, while ensuring atomicity, did not allow for secure
erasure of the config file (containing the old encrypted key). Now it
creates a hardlink to the file, lets SaveFile do its thing, and writes
random data over the old file (via the hardlink). A secure erase is
needed because the config file can contain the old key after changing
one's password.
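
A rough sketch of the hardlink approach. The function names, the ".old" and ".tmp" suffixes and the write-then-rename stand-in for SaveFile are all assumptions made for illustration.

```python
import os


def secure_erase(path):
    """Overwrite the file behind path with random bytes, then remove that name."""
    with open(path, "r+b") as fd:
        length = os.fstat(fd.fileno()).st_size
        fd.write(os.urandom(length))
        fd.flush()
        os.fsync(fd.fileno())
    os.unlink(path)


def save_and_erase_old(path, new_data):
    old_link = path + ".old"     # hypothetical name for the hardlink
    os.link(path, old_link)      # second name for the old file's inode
    tmp = path + ".tmp"
    with open(tmp, "wb") as fd:  # stand-in for SaveFile: write, then rename
        fd.write(new_data)
        fd.flush()
        os.fsync(fd.fileno())
    os.replace(tmp, path)        # path now points at the new inode
    secure_erase(old_link)       # the old inode is still reachable via the link
```

Without the hardlink, the atomic rename would drop the last name of the old file and its data blocks could no longer be overwritten. Note that overwriting in place is best effort: copy-on-write filesystems and flash storage may keep old blocks around.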
if there are too many deleted buckets (tombstones), hashtable performance goes down the drain.
in the worst case of 0 empty buckets and lots of tombstones, this results in full table scans for
new / unknown keys.
thus we make sure we always have a good amount of empty buckets.
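
Why tombstones hurt can be shown with a toy linear-probing table. The functions below are illustrative only, and min_empty_ratio is a made-up value, not the threshold borg uses.

```python
EMPTY = "empty"
TOMBSTONE = "tombstone"


def probes_for_unknown_key(buckets, start):
    """Count the buckets inspected before a lookup can conclude 'not found'."""
    n = len(buckets)
    for probes in range(1, n + 1):
        # Only an empty bucket ends the probe sequence; tombstones do not.
        if buckets[(start + probes - 1) % n] == EMPTY:
            return probes
    return n  # no empty bucket at all: the whole table was scanned


def needs_rebuild(buckets, min_empty_ratio=0.25):
    """True if too few empty buckets remain (hypothetical threshold)."""
    return buckets.count(EMPTY) < min_empty_ratio * len(buckets)
```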
* trigger bug in --verify-data, see #2221
* raise decompression errors as DecompressionError, fixes #2221
this is a subclass of IntegrityError, so borg check --verify-data works correctly if
the decompressor stumbles over corrupted data before the plaintext gets verified
(in an unencrypted repository, otherwise the MAC check would fail first).
* fixup: fix exception docstring, add placeholder, change wording
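
The effect of making DecompressionError a subclass of IntegrityError can be sketched with zlib. The two exception classes are stand-ins for borg's, and verify_chunks() is a hypothetical, much simplified check loop.

```python
import zlib


class IntegrityError(Exception):
    """Stand-in for borg's IntegrityError."""


class DecompressionError(IntegrityError):
    """Raised when compressed data cannot be decompressed."""


def decompress(data):
    try:
        return zlib.decompress(data)
    except zlib.error as exc:
        raise DecompressionError(str(exc)) from None


def verify_chunks(chunks):
    """Return how many chunks fail; corrupt data must not abort the loop."""
    errors = 0
    for data in chunks:
        try:
            decompress(data)
        except IntegrityError:  # also catches DecompressionError
            errors += 1
    return errors
```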
Obviously this means that --log-json with remote repos requires 1.1
on the remote end, but if you don't have that, then random "Remote:"
lines would break stderr anyway.
the chunk accounting code tried to reflect repo space usage via the st_blocks of the files.
so, a specific chunk that was shared between multiple files [inodes] was only accounted for one specific file.
thus, the overall "du" of everything in the fuse mounted repo was maybe correctly reflecting the repo space usage,
but the decision which file has the chunk (the space) was kind of arbitrary and not really useful.
otoh, a simple fuse getattr() was rather expensive due to this as it needed to iterate over the chunks list
to compute the st_blocks value. also it needed quite some memory for the accounting.
thus, st_blocks is now just ceil(size / blocksize).
also: fixed bug that st_blocks was a floating point value previously.
also: preparing for further optimization of size computation (see next cs)
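
The new computation amounts to an integer ceiling division. In the sketch below the function name and the default block size of 512 bytes are assumptions.

```python
def st_blocks(size, blocksize=512):
    """Blocks needed to hold size bytes, as an int (ceiling division)."""
    # Pure integer arithmetic avoids the float that math.ceil(size / blocksize)
    # would go through, and needs no chunk list at all.
    return (size + blocksize - 1) // blocksize
```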
if an item has a chunk list, pre-compute the total size and store it into "size" metadata entry.
this speeds up access to item size (e.g. for regular files) and could also be used to verify the validity of the chunks list.
note about hardlinks: size is only stored for hardlink masters (only they have an own chunk list)
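
As a sketch, with items modeled as plain dicts and chunk list entries as (id, size) pairs (both simplifications of borg's real item format):

```python
def store_size(item):
    """If item has a chunk list, store the summed chunk sizes under 'size'."""
    chunks = item.get("chunks")
    if chunks is not None:
        item["size"] = sum(size for _chunk_id, size in chunks)
    return item


def size_is_consistent(item):
    """Check a stored size against the chunk list, when both are present."""
    chunks = item.get("chunks")
    if chunks is None or "size" not in item:
        return True  # nothing to compare, e.g. a hardlink slave
    return item["size"] == sum(size for _chunk_id, size in chunks)
```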
See #1452
This is 100 % accurate.
Also increases maximum data size by ~41 bytes. Not 100 % side-effect free;
if you manage to exactly land in that area then older Borg would not read
it. OTOH it gives us a nice round number there.
also: add some missing assertion messages
severity:
- no issue on little-endian platforms (== most, including x86/x64)
- harmless even on big-endian as long as refcount is below 0xfffbffff,
which is very likely always the case in practice anyway.
we do not trust the remote, so we are careful unpacking its responses.
the remote could return manipulated msgpack data that announces e.g.
a huge array or map or string. the local would then need to allocate huge
amounts of RAM in expectation of that data (no matter whether really
that much is coming or not).
by using limits in the Unpacker, a ValueError will be raised if unexpected
amounts of data shall get unpacked. memory DoS will be avoided.
hardcoded the encoding for reading it. while utf-8 is the default
encoding on many systems, it does not work everywhere.
and when it tries to decode with the ascii decoder, it fails.
Computer clocks are often not set very accurately, but borg
assumes manifest timestamps are never going back in time.
Ensure that this is actually the case.
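
One way to enforce this is sketched below. The helper name and the one-microsecond step are assumptions, not necessarily what the commit does.

```python
from datetime import datetime, timedelta


def next_manifest_timestamp(last_timestamp, now):
    """Return a timestamp strictly later than last_timestamp."""
    if last_timestamp is not None and now <= last_timestamp:
        # The clock went backwards or stood still: step past the previous value.
        return last_timestamp + timedelta(microseconds=1)
    return now
```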
# Conflicts:
# src/borg/helpers.py
Original-Commit: 6b8cf0a
hashindex_lookup would always hint at skipping whatever its probe
length had been with no regard for tombstones it had encountered. This
meant new keys would not overwrite first tombstones, but would always
land on empty buckets.
The regression was introduced in #1748
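
The intended behaviour can be sketched in Python with a toy linear-probing table. The real hashindex_lookup is C code; the names and bucket layout here are illustrative only.

```python
EMPTY = object()
DELETED = object()  # tombstone


def lookup(buckets, key):
    """Return (index, insert_at); index is -1 if key is not in the table."""
    n = len(buckets)
    start = hash(key) % n
    first_tombstone = -1
    for i in range(n):
        idx = (start + i) % n
        slot = buckets[idx]
        if slot is EMPTY:
            # Not found: prefer the first tombstone seen over this empty bucket.
            return -1, (first_tombstone if first_tombstone != -1 else idx)
        if slot is DELETED:
            if first_tombstone == -1:
                first_tombstone = idx
        elif slot == key:
            return idx, idx
    return -1, first_tombstone  # no empty bucket found


def insert(buckets, key):
    idx, insert_at = lookup(buckets, key)
    if idx != -1:
        return idx
    if insert_at == -1:
        raise ValueError("table is full")
    buckets[insert_at] = key
    return insert_at
```

Reusing the first tombstone keeps empty buckets available, which in turn keeps probe sequences for unknown keys short.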
Add --keep-exclude-tags option as alias to --keep-tag-files and
deprecate the latter. Also make tagging accept directories as tags,
allowing things like `--exclude-if-present .git`.
fixes #1999
CRC slice by 8 for generic CPUs outperforms zlib CRC32 on ppc
and x86 (ARM untested but expected to as well).
PCLMULQDQ derived from Intel's zlib patches outperforms every other
CRC implementation by a huge margin.
2**63 nanoseconds are 292 years, so this change is good until 2262.
See also https://en.wikipedia.org/wiki/Time_formatting_and_storage_bugs#Year_2262
I expect that we will have plenty of time to revert this commit in time
for 2262.
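
The arithmetic can be checked quickly. The sketch assumes a signed 64-bit nanosecond count relative to the Unix epoch and truncates to microseconds, which is the resolution of Python's datetime.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
NS_PER_YEAR = 365.2425 * 24 * 3600 * 10**9  # mean Gregorian year


def years_in_int64_ns():
    """How many years fit into 2**63 nanoseconds."""
    return 2**63 / NS_PER_YEAR


def last_representable():
    """Latest moment a signed 64-bit nanosecond timestamp can express."""
    return EPOCH + timedelta(microseconds=(2**63 - 1) // 1000)
```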
timespec := time_t + long, so it's probably only 64 bits on some platforms
anyway.
This is some 15 times faster than @contextmanager, because no instance
creation is involved and no generator has to be maintained. Overall
difference is low, but still nice for a very simple change.
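
The two styles compare roughly as follows. BackupError, io_guard and the OSError translation are hypothetical examples, not necessarily the context manager this commit changed.

```python
from contextlib import contextmanager


class BackupError(Exception):
    """Hypothetical error type used only in this sketch."""


@contextmanager
def io_guard_generator():
    # Every call builds a generator plus a helper object that wraps it.
    try:
        yield
    except OSError as exc:
        raise BackupError(str(exc)) from exc


class _IOGuard:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is not None and issubclass(exc_type, OSError):
            raise BackupError(str(exc_val)) from exc_val
        return False  # let any other exception propagate unchanged


# One shared, stateless instance: "with io_guard:" allocates nothing per use.
io_guard = _IOGuard()
```

Both forms behave the same from the caller's point of view; the class-based one only skips the per-use setup work.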
This makes a surprisingly large difference. Test case: ~70000 empty files
(i.e. little data shoveling, lots of metadata shoveling). Before: 9.1 seconds
+- 0.1 seconds. After: 8.4 seconds +- 0.1 seconds. That's a huge
win for changing a few lines.
I'd expect that this improves performance in almost all areas that touch
the items (list, delete, prune).