This change implements the functionality requested in issue #361:
extracting files with a given extension. It does so by permitting
patterns to be used instead of plain prefix paths. The pattern styles
supported are the same as for exclusions.
The “extract” command supports extracting all files underneath a given
set of prefix paths. The forthcoming support for extracting files using
a pattern (e.g. only files ending in “.zip”) requires the introduction
of path prefixes as a third pattern style, making it also available for
exclusions.
A function to parse pattern specifications was introduced in commit
2bafece. Since then it has had a hardcoded default style of “fm”, meaning
fnmatch. With the forthcoming support for extracting files using
patterns this default style must be more flexible.
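A minimal sketch of parsing a pattern specification with a configurable
default style follows. It is not the project's code: the function name,
the "fallback" parameter, the KNOWN_STYLES set and the "pp" style name
are assumptions made for illustration, as is raising ValueError for an
unknown prefix.

```python
import re

# Assumed set of style prefixes; the real code maps prefixes to pattern classes.
KNOWN_STYLES = {"fm", "re"}

# Two alphanumeric characters followed by a colon select the style.
_PREFIX = re.compile(r"^([A-Za-z0-9]{2}):(.*)$", re.DOTALL)


def parse_pattern(spec, fallback="fm"):
    """Return (style, pattern); use `fallback` when no style prefix is given."""
    m = _PREFIX.match(spec)
    if m is None:
        return fallback, spec
    style, pattern = m.group(1), m.group(2)
    if style not in KNOWN_STYLES:
        # Assumption: an unknown prefix is an error, not a literal pattern.
        raise ValueError("unknown pattern style: %s" % style)
    return style, pattern
```

A caller that wants plain path prefixes by default would pass a different
fallback, for example parse_pattern("home/user", fallback="pp"), where
"pp" is a made-up style name.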
The utility functions “adjust_patterns” and “exclude_path” produce and
use, respectively, a standard list object containing pattern objects.
With the forthcoming introduction of patterns for filtering files
to be extracted it's better to move the logic of these functions into
a single class.
The wrapper allows adding any number of patterns to an internal list
together with a value to be returned if a match function finds that
one of the patterns matches. A fallback value is returned otherwise.
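The wrapper described above could look roughly like the sketch below.
The class and method names are assumptions, and fnmatch globs stand in
for the project's pattern objects.

```python
import fnmatch


class PatternMatcher:
    """Hold (pattern, value) pairs; return the value of the first match."""

    def __init__(self, fallback=None):
        self._items = []
        self.fallback = fallback

    def add(self, patterns, value):
        # Every pattern added in one call shares the same return value.
        self._items.extend((pattern, value) for pattern in patterns)

    def match(self, path):
        for pattern, value in self._items:
            # Stand-in for calling a match method on a pattern object.
            if fnmatch.fnmatchcase(path, pattern):
                return value
        return self.fallback
```

With exclusions registered as False and a fallback of True, the return
value of match() directly answers whether a path should be processed.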
- Stop using “adjust_patterns” and “exclude_path” as they're utility
functions not relevant to testing pattern classes
- Cover a few more cases, especially with more than one path separator
and relative paths
- At least one dedicated test function for each pattern style as opposed
to a single, big test mixing styles
- Use positive instead of negative matching (i.e. the expected list of
resulting items is a list of items matching a pattern)
also: make check in Lock.close more precise, check for "is not None".
note: a lot of blocks were just indented to be under the "with" statement,
in one case a block had to be moved into a function.
there is no such method in the code.
we use "check" method to repair the repo, so maybe this was left over
from a time when repair was separate from check.
the problem was that the borg process removed its own shared lock when upgrading it to an exclusive lock.
this is fine if we get the exclusive lock, but if we don't, we must re-add our shared lock.
this fixes the KeyError in locking.py:217
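a toy model of the fix, not the code in locking.py: all names here (Roster, upgrade, LockFailed) are made up for illustration.
it only shows the order of operations: drop our shared lock, try to get the exclusive lock, and put the shared lock back if that fails.

```python
class LockFailed(Exception):
    """Raised when the exclusive lock cannot be acquired."""


class Roster:
    """In-memory stand-in for the lock roster (hypothetical)."""

    def __init__(self):
        self.shared = set()
        self.exclusive = None


def upgrade(roster, me):
    """Upgrade the shared lock held by `me` to an exclusive lock."""
    roster.shared.discard(me)
    if roster.shared or roster.exclusive is not None:
        # Someone else still holds a lock: re-add our shared lock
        # so a later release does not fail on a missing entry.
        roster.shared.add(me)
        raise LockFailed(me)
    roster.exclusive = me
```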
the items metadata stream is usually not that big (compared to the file content data) -
it is just file and dir names and other metadata.
if we use too coarse a granularity there (and a big minimum chunk size), we usually will get no deduplication.
The class names “IncludePattern” and “ExcludePattern” may have been
appropriate when they were the only styles. With the recent addition of
regular expression support and with at least one more style being added
in forthcoming changes these classes should be renamed to be more
descriptive. “ExcludeRegex” is also renamed to match the new names.
The unit tests for Unicode in path patterns contained a lot of
unnecessary duplication. One set of duplication was for Mac OS X (also
known as Darwin) as it normalizes Unicode in paths to NFD. Then each
test case was repeated for every type of pattern.
With this change the tests become parametrized using py.test. The
duplicated code has been removed.
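To illustrate why the platform matters for these tests, the following
sketch uses only the standard library to show that the same visible
name has different code points in NFC and NFD. The sample name is an
arbitrary choice.

```python
import unicodedata

# "b" followed by U+00E5 (a with ring above) in composed form.
name_nfc = "b\u00e5"

# NFD splits the composed character into "a" plus a combining ring.
name_nfd = unicodedata.normalize("NFD", name_nfc)

# The two strings render alike but do not compare equal.
same_text = name_nfc == name_nfd
lengths = (len(name_nfc), len(name_nfd))
```

A pattern written in one form therefore only matches a path stored in
the other form if both are normalized the same way before comparison.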
- rename BUCKET_(LOWER|UPPER)_LIMIT to HASH_(MIN|MAX)_LOAD
as this value is usually called the hash table's minimum/maximum load factor.
- remove MAX_BUCKET_SIZE (not used)
- regroup/reorder definitions
Patterns to exclude files can be loaded from a text file using the
“--exclude-from” option. Whitespace at the beginning or end of lines was
not stripped. Indented comments would be interpreted as a pattern and
a misplaced space at the end of a line--some text editors don't strip
them--could cause an exclusion pattern to not match as desired. With the
recent addition of regular expression support for exclusions the spaces
can be matched if necessary (“^\s” or “\s$”), though it's highly
unlikely that there are many paths deliberately starting or ending with
whitespace.
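A minimal sketch of the stripping behaviour, assuming that comment
lines start with “#” and that empty lines are skipped. The function
name is made up and this is not the project's parser.

```python
def load_excludes(lines):
    """Return the patterns found in the lines of an exclude file."""
    patterns = []
    for line in lines:
        # Strip leading and trailing whitespace, including the newline.
        line = line.strip()
        # Assumption: "#" starts a comment; empty lines are ignored.
        if not line or line.startswith("#"):
            continue
        patterns.append(line)
    return patterns
```

An indented comment such as “  # cache files” is now skipped, and
“*.tmp ” with a trailing space is read as “*.tmp”.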
The existing option to exclude files and directories, “--exclude”, is
implemented using fnmatch[1]. fnmatch matches the slash (“/”) with “*”
and thus makes it impossible to write patterns where a directory with
a given name should be excluded at a specific depth in the directory
hierarchy, but not anywhere else. Consider this structure:
home/
home/aaa
home/aaa/.thumbnails
home/user
home/user/img
home/user/img/.thumbnails
fnmatch incorrectly excludes “home/user/img/.thumbnails” with a pattern
of “home/*/.thumbnails” when the intention is to exclude “.thumbnails”
in all home directories while retaining directories with the same name
in all other locations.
With this change regular expressions are introduced as an additional
pattern syntax. The syntax is selected using a prefix on “--exclude”'s
value. “re:” is for regular expression and “fm:”, the default, selects
fnmatch. Selecting the syntax is necessary when regular expressions are
desired or when the desired fnmatch pattern starts with two alphanumeric
characters followed by a colon (e.g. “aa:something/*”). The exclusion
described above can be implemented as follows:
--exclude 're:^home/[^/]+/\.thumbnails$'
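The difference between the two styles can be reproduced with the Python
standard library alone. The sketch below applies the fnmatch pattern
and the regular expression from above to the example directory tree; it
does not use the project's pattern classes.

```python
import fnmatch
import re

paths = [
    "home/",
    "home/aaa",
    "home/aaa/.thumbnails",
    "home/user",
    "home/user/img",
    "home/user/img/.thumbnails",
]

glob = "home/*/.thumbnails"
regex = re.compile(r"^home/[^/]+/\.thumbnails$")

# fnmatch lets "*" match "/" and therefore also hits the nested directory.
by_fnmatch = [p for p in paths if fnmatch.fnmatchcase(p, glob)]

# "[^/]+" cannot cross a path separator, so only one level is matched.
by_regex = [p for p in paths if regex.search(p)]
```

Here by_fnmatch contains both “.thumbnails” directories, while by_regex
contains only “home/aaa/.thumbnails”.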
The “--exclude-from” option permits loading exclusions from a text file
where the same prefixes can now be used, e.g. “re:\.tmp$”.
The documentation has been extended and now not only describes the two
pattern styles, but also the file format supported by “--exclude-from”.
This change has been discussed in issue #43 and in change request #497.
[1] https://docs.python.org/3/library/fnmatch.html
Signed-off-by: Michael Hanselmann <public@hansmi.ch>
The two classes for applying inclusion and exclusion patterns contained
unnecessarily duplicated logic. The introduction of a shared base class
allows for easier reuse, especially considering that two more classes
are going to be added in forthcoming changes (regular expressions and
shell-style patterns).
prune and create now both require --verbose --stats to show stats.
it was implemented in this way (and not with print) so you can feed the stats data
into the logging system, too.
delete now says "Archive deleted" in verbose mode (for consistency,
it already said "Repository deleted" when deleting a repo).
also: add helpers.log_multi to comfortably and prettily output a block of log lines
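a sketch of what such a helper might look like. the signature and the behaviour (one log record per line) are assumptions,
this is not the code from helpers.

```python
import logging


def log_multi(*msgs, level=logging.INFO, logger=logging.getLogger(__name__)):
    """Log every line of the given messages as a separate record."""
    lines = []
    for msg in msgs:
        # A message may itself contain several lines.
        lines.extend(str(msg).splitlines())
    for line in lines:
        logger.log(level, line)
```

as the output goes through a logger, a stats block logged this way can be filtered or redirected like any other log message.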
The parsing code for exclude files (given via `--exclude-from`) was not
tested. Its core is factored out into a separate function to facilitate
easier testing. The observable behaviour is unchanged.
i checked it: copying the index.* and hints.* files in advance is not needed, open() and close() do not modify them.
also: fix unicode exception with encoded filename
because Repository.__init__ normally opens and locks the repo, and the upgrader just
inherited from (borg) Repository, it created a lock file there before the "backup copy"
was made.
No big problem, but a bit unclean.
Fixed it to not lock at the beginning, then make the copy, then lock.
For 0.29 we worked towards a "silent by default" behaviour, so interactive usage will include -v more frequently in future.
But I noticed that this conflicts with the progress display. This would be no problem if users willingly decide which one
of --verbose or --progress they want to see, but before this fix, the progress display was activated magically when
a tty was detected. So, to counteract this magic, users would need to use --no-progress.
That's backwards imho, so I removed the magic again and users have to give --progress when they want
to see a progress indicator. Or (alternatively) they give --verbose when they want to see the long file list.
From https://github.com/borgbackup/borg/pull/480 discussion:
Did you try 1024 (linux cache block size) or 4096 (internal sector size of bigger
hdds, also used in msgpack fallback.py as lower bound, see link)?
I've tested different values - 512 and 1024 are slightly better than 4096 in my case.
read_size = 1 ls -laR: 75.57 sec
read_size = 64 ls -laR: 27.81 sec
read_size = 512 ls -laR: 27.40 sec
read_size = 1024 ls -laR: 27.20 sec
read_size = 4096 ls -laR: 30.15 sec
read_size = 0 ls -laR: 442.96 sec (default)
OK, maybe we should go for 1024 then. That happens to be < MTU size, so in case someone works on NFS
(or other network FS) we will have fewer reads, fewer network packets, less latency.
Single-shot unpacker read buffer decreased from (default) 1MB to 512 bytes.
"ls -alR" on ~100k files backup mounted with fuse went from ~7min to 30 seconds.
as soon as one target segment is full, it is a good time to commit it and remove the source segments
that are already completely unused (because they were transferred into the target segment).
so, for compact_segments(save_space=True), the additional space needed should be about 1 segment size.
note: we can't just do that at the end of one source segment as this might create very small
target segments, which is not wanted.
removed --log-level due to overlap with how --verbose works now.
for consistency, added --info as alias to --verbose (as the effect is
setting INFO log level).
also added --debug which sets DEBUG log level.
note: there are no messages emitted at DEBUG level yet.
WARNING is the default (because we want mostly silent behaviour,
except if something serious happens), so we don't need --warning
as an option.
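a sketch of how these options could map to log levels with argparse. this is an illustration with a made-up parser,
not the project's argument handling; the "log_level" destination name is an assumption.

```python
import argparse
import logging


def build_parser():
    parser = argparse.ArgumentParser(prog="demo")
    # --info is an alias of --verbose, both select the INFO level.
    parser.add_argument("-v", "--verbose", "--info", dest="log_level",
                        action="store_const", const="info", default="warning")
    parser.add_argument("--debug", dest="log_level",
                        action="store_const", const="debug", default="warning")
    return parser


def level_from_args(argv):
    """Return the numeric logging level selected by the arguments."""
    args = build_parser().parse_args(argv)
    return getattr(logging, args.log_level.upper())
```

without any of these options the level stays at WARNING, so no --warning option is needed.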
this was also the loop contents of hashindex_merge, but we also need it callable from Cython/Python code.
this saves some cycles, esp. if the key is already present in the index.
The read_msgpack and write_msgpack functions were only used in one place
each. Since msgpack is read and written in lots of places, having
functions with these generic names is confusing. Also, the helpers
module is quite a mess, so reducing its size seems to be a good idea.
the problem here was that we do not just have changed and unchanged items,
but also a lot of items besides regular files which we just back up "as is" without
determining whether they are changed or not. thus, we can't support changed/unchanged
in a way users would expect them to work.
the A/M/U status only applies to the data content of regular files (compared to the index).
for all items, we ALWAYS save the metadata, there is no changed / not changed detection there.
thus, I replaced this with a --filter option where you can just specify which
status chars you want to see listed in the output.
E.g. --filter AM will only show regular files with A(dded) or M(odified) state, but nothing else.
Not giving --filter defaults to showing all items no matter what status they have.
Output is emitted via logger at info level, so it won't show up except if the logger is at that level.
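a sketch of the filtering rule described above. the function name and the (status, path) tuples are made up for illustration.

```python
def filter_items(items, status_filter=None):
    """Return the (status, path) pairs whose status char is in the filter."""
    if status_filter is None:
        # Not giving a filter shows all items, whatever their status.
        return list(items)
    return [(status, path) for status, path in items if status in status_filter]


items = [("A", "new.txt"), ("U", "old.txt"), ("M", "changed.txt")]
added_or_modified = filter_items(items, "AM")
```

with the filter "AM", the unchanged file is left out of the list.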
BUCKET_UPPER_LIMIT: 90% load degrades hash table performance severely,
so I lowered that to 75% (which is a usual value - java uses 75%, python uses 66%).
I chose the higher value of both because we also should not consume too much
memory, considering the RAM usage already is rather high.
MIN_BUCKETS: I can't explain why, but benchmarks showed that choosing 2^N as
table size severely degrades performance (by 3 orders of magnitude!). So a prime
start value improves this a lot, even if we later still use the grow-by-2x algorithm.
hashindex_resize: removed the hashindex_get() call as we already know that the values
come at key + key_size address.
hashindex_init: do not calloc X*Y elements of size 1, but rather X elements of size Y.
Makes the code simpler, not sure if it affects performance.
The tests needed fixing as the resulting hashtable blob is now of course different due
to the above changes, so its sha hash changed.
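The effect of the upper load limit can be sketched in Python as follows.
The real code is C; the function names here are made up, and the exact
comparison used there may differ.

```python
HASH_MAX_LOAD = 0.75  # value chosen in this change, previously 0.90


def needs_grow(num_entries, num_buckets, max_load=HASH_MAX_LOAD):
    """Return True once the table holds more entries than the load limit allows."""
    return num_entries > int(num_buckets * max_load)


def grown_size(num_buckets):
    # Grow-by-2x, as mentioned above.
    return num_buckets * 2
```

With 100 buckets the table is grown when the 76th entry is added,
instead of the 91st with the old limit.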
print_verbose is now simply logger.info() and is always displayed if
log level allows it. this affects only the `prune` and `mount`
commands which were the only users of the --verbose option. the
additional display is which archives are kept and pruned and a single
message when the filesystem is mounted.
files iteration in create and extract is now printed through a
separate function which will later be controlled through a topical
flag.