Code shared by read() and iter_objects() was moved into _read().
Compared to read()'s previous state, this improved:
- fixed size check to avoid read with negative size
- exception handler for struct unpack
- checking for short read
- more precise exception messages
it seems the file cache does *not* have the ATTIC magic header (nor
does it have one in borg), so we don't need to edit the file - we just
copy it like a regular file.
while i'm here, simplify the cache conversion loop: it's no use
splitting the copy and the edition since the latter is so fast, just
do everything in one loop, which makes it much easier to read.
convert is too generic for the Attic conversion: we may have other
converters, from other, more foreign systems that will require
different options and different upgrade mechanisms that convert could
never cover appropriately. we are more likely to use an approach
similar to "git fast-import" instead here, and have the conversion
tools be external tool that feed standard data into borg during
conversion.
upgrade seems like a more natural fit: Attic could be considered like
a pre-historic version of Borg that requires invasive changes for borg
to be able to use the repository. we may require such changes in the
future of borg as well: if we make backwards-incompatible changes to
the repository layout or data format, it is possible that we require
such changes to be performed on the repository before it is usable
again. instead of scattering those conversions all over the code, we
should simply have assertions that check the layout is correct and
point the user to upgrade if it is not.
upgrade should eventually automatically detect the repository format
or version and perform appropriate conversions. Attic is only the
first one. we still need to implement an adequate API for
auto-detection and upgrade, only the seeds of that are present for now.
of course, changes to the upgrade command should be thoroughly
documented in the release notes and an eventual upgrade manual.
we separate the conversion and the copy in order to be able to copy
arbitrary files from attic without converting them. this allows us to
copy the config file cleanly without attempting to rewrite its magic
number
this greatly simplifies the display of those objects, as the
__format__() parameter allows for arbitrary display of the internal
fields of both objects
this will allow us to display those summaries without having to pass a
label to the string representation. we can also print the objects
directly without formatting at all.
- issue #234: handle exception when config file is empty is really not a borg cache config
- there was a unused %s in the Exception string
- error msg was wrong when version check failed - this IS a borg cache, but not of expected version
the heuristics i used are the following:
1. if we are prompting the use, use print on stderr (input() may
produce some stuff on stdout, but it's outside the scope of this
patch). we do not want those prompts to end up on the standard
output in case we are piping stuff around
2. if the command is primarily producing output for the user on the
console (`list`, `info`, `help`), we simply print on the default
file descriptor.
3. everywhere else, we use the logging module with varying levels of
verbosity, as appropriate.
the logging level varies: most is logging.info(), in some place
logging.warning() or logging.error() are used when the condition is
clearly an error or warning. in other cases, we keep using print, but
force writing to sys.stderr, unless we interact with the user.
there were 77 calls to print before this commit, now there are 7, most
of which in the archiver module, which interacts directly with the
user. in one case there, we still use print() only because logging is
not setup properly yet during argument parsing.
it could be argued that commands like info or list should use print
directly, but we have converted them anyways, without ill effects on
the unit tests
unit tests still use print() in some places
this switches all informational output to stderr, which should help
with, if not fixjborg/attic#312 directly
it only checked for too big sizes, but not for too small ones.
that made it die with a ValueError and not raise the appropriate IntegrityError
that gets handled in check() and triggers the repair attempt for the segment.
not sure where the problem is:
it seems to announce it supports st_mtime_ns, but if one uses it and
has a file with ...123ns, i t gets restored as ...000ns.
Then I tried setting st_mtime_ns_round to -3, but it still failed with +1000ns difference.
Maybe rounding is incorrect and it should be truncating?
Issue with granularity could be in python, in netbsd (netbsd platform code), in ffs filesystem, ...
for vagrant testing on misc. platforms, we can't know the group /
we can't have the same group everywhere.
but the OS won't let us set setgid bit if the file does not have our group.
on netbsd, the created file somehow happens to have group "wheel",
but vagrant is not in group wheel.
they somehow pull in some floating point error code that led to a undefined
symbol FPE_... when using the borgbackup wheel on some non-ubuntu/debian linux
platform.
Using a decorator moves the duplicate code in the init methods into a
single decorator method, while still retaining the same runtime overhead
(zero for for the non-OSX path, one extra function call plus the call to
unicodedata.normalize for OSX). The pattern classes are much visually
cleaner, and duplicate code limited to two lines normalizing the pattern
on OSX.
Because the decoration happens at class init time (vs instance init time
for the previous approach), the OSX and non-OSX test cases can no longer
be called in the same run, so I also removed the OSX test case monkey
patching and uncommented the platform skipif decorator.
The OS X file system HFS+ stores path names as Unicode, and converts
them to a variant of Unicode NFD for storage. Because path names will
always be in this canonical form, it's not friendly to require users to
match this form exactly. Convert paths from the repository and patterns
from the command line to NFD before comparing them.
Unix (and Windows, I think) file systems don't convert path names into a
canonical form, so users will continue to have to exactly match the path
name they want, because there could be two paths with the same character
visually that are actually composed of different byte sequences.
Note: there is a failing archiver test on py33-only now.
It is somehow related to __del__ method usage in Cache
and/or locking code. Could not find out the exact reason
why it behaves like that.
the daemonize code changes the cwd, thus a relative repo path can't work.
borg mount repo mnt # did not work
borg mount --foreground repo mnt # did work
borg mount /abs/path/repo mnt # did work
sets the default repository to use, e.g. like:
export BORG_REPO=/mnt/backup/repo
borg init
borg create ::archive
borg list
borg mount :: /mnt
fusermount -u /mnt
borg delete ::archive
added a check that compares the size of the new chunk with the stored size of the
already existing chunk in storage that has the same id_hash value.
raise an exception if there is a size mismatch.
this could happen if:
- the stored size is somehow incorrect (corruption or software bug)
- we found a hash collision for the id_hash (for sha256, this is very unlikely)
the compression was quite cpu intensive and didn't work that great anyway.
now the disk space usage is a bit higher, but it is much faster and less hard on the cpu.
disk space needs grow linearly with the amount and size of the archives, this
is a problem esp. if one has many and/or big archives (but this problem existed
before also because compression was not as effective as I believed).
the tar archive always needed a complete rebuild (and thus: decompression
and recompression) because deleting outdated archive indexes was not
possible in the tar file.
now we just have a directory chunks.archive.d and keep archive index files
there for all archives we already know.
if an archive does not exist any more in the repo, we just delete its index file.
if an archive is unknown still, we fetch the infos and build a new index file.
when merging, we avoid growing the hash table from zero, but just start
with the first archive's index as basis for merging.
also remove the comment about how good xz compresses - while that was true for smaller index files,
it seems to be less effective with bigger ones. maybe just an issue with compression dict size.
This fixes a infrequent problem when (refcount * chunksize) overflowed a int32_t.
chunksize is always <= 8MiB and usually rather ~64KiB (with default chunker params).
Thus, this happened only for high refcounts and/or unusually big chunks.
e.g.:
- setting any security.* key is expected to fail with EACCES if one is not root.
- issue #162 on our issue tracker: user was root, but due to some specific scenario
involving docker and selinux, setting security.selinux key fails even when running as root
not sure if it is the best solution to silently ignore this, but some lines below this change
failure to do a chown is also silently ignored (happens e.g. when restoring a file not owned
by the current user as a non-root user).
if we use {} as default for item.get(), we do not need the "if" as iteration over an empty dict won't do anything.
also fixes too deep indentation the original code had.
the parser for the --chunker-params argument had a wrong parameter order.
fixed the order so it conforms to the help text and the docs.
also added some tests for it and a text for the ValueError exception.
currently, we only use sha256 hashes as key, so key length is always 32.
but instead of hardcoding 32 everywhere, using key_length is just better
readable and also more flexible for the future.
borg list --short just spills out the list of files / dirs - better for some tests
and also useful on the commandline for interactive use.
the tests previously needed fakeroot because in the test setup it always
made calls to mknod and chown, which require (fake)root.
now, the tests adapt to whether it detects (fake)root or not - to run the
the tests completely, you still need fakeroot, but it won't fail all the archiver
tests just due to failing test setup.
also, a test not working correctly due to fakeroot was found:
it should detect whether a read-only repo is usable, but it failed to do that
because with (fake)root, there is no "read only" (at least not via taking away
the w permission bits).
environment context manager: if a env var was not present before, it should not be present afterwards
teardown: cd out of the tmpdir before deleting it
I re-wrote lrucache (and it seems like no-one had looked at it much
before :). I was told my test function would have been simpler in
native py.test, so let's have a go converting it all.
We can avoid any reference to unittest, because lrucache doesn't write
files so it doesn't need any of our custom assertion helpers.
this was left over from times when we either used mock from stdlib
or pypi mock. but as we only use pypi mock now, the indirection is
not needed any more.
if one used --last (or since shortly: gave an archive name), verify_chunks (old method name) was
not called because it requires all archives having been checked.
the problem was that also the final manifest.write() and repository.commit() was done in that method,
so all other repair work did not get committed in that case.
I moved these calls that to a separate finish() method.
New null and lz4 compression.
Giving -C 0 now uses null compression, not zlib level 0 any more
(null has almost zero overhead while zlib-level0 still had to package everything into zlib frames).
Giving -C 10 uses new lz4 compression, super fast compression and even faster decompression.
See borg create --help (and --compression argument).
fix some issues, clean up, optimize:
CNULL: always return bytes
LZ4: deal with getting memoryviews
Compressor: give bytes to detect(), avoid memoryviews
for lz4, always use same COMPR_BUFFER, avoid memory management costs.
check --chunker-params CHUNK_MAX_EXP upper limit
This fix is maybe not perfect yet, but maybe better than nothing.
A comment by Ernest0x (see https://github.com/jborg/attic/issues/232 ):
@ThomasWaldmann your patch did the job.
attic check --repair did the repairing and attic delete deleted the archive.
Thanks.
That said, however, I am not sure if the best place to put the check is where
you put it in the patch. For example, the check operation uses a custom msgpack
unpacker class named "RobustUnpacker", which it does try to check for correct
format (see the comment: "Abort early if the data does not look like a
serialized dict"), but it seems it does not catch my case. The relevant code
in 'cache.py', on the other hand, uses msgpack's Unpacker class.
it was silently failing until recently. and it can't work the way it is on RemoteRepository.
it's still active (and now even really working) for the (local) Repository tests.
the old code blows up with an integer OverflowError when the cache file goes beyond 2GiB size.
the new code just reuses the Repository implementation as a local temporary key/value store.
still an issue: if the place where the temporary RepositoryCache is stored (usually /tmp) can't
cope with the cache size and runs full.
if you copy data from a fuse mount, the cache size is the copied deduplicated data size.
so, if you have lots of data to extract (more than your /tmp can hold), rather do not use fuse!
besides fuse mounts, this also affects attic check and cache sync (in these cases, only the
metadata size counts, but even that can go beyond 2GiB for some people).
also: add some benchmarking output showing singlethread, multithread and
multithread-with-gil-releasing-chunker performance.
this changeset maybe improves multithreading performance a little, about 3%
(but that might be close to the measurement accuracy).
always use archiver.print_error, so it goes to sys.stderr
always say "Error: ..." for errors
for rc != 0 always say "Exiting with failure status ..."
catch all exceptions subclassing Exception, so we can log them in same way and set exit_code=1
- use power-of-2 sizes / n bit hash mask so one can give them more easily
- chunker api: give seed first, so we can give *chunker_params after it
- fix some tests that aren't possible with 2^N
- make sparse file extraction zero detection flexible for variable chunk max size
regular files are most common, more than directories. fifos are rare.
was no big issue, the calls are cheap, but also no big issue to just fix the order.
they are rare, so it's pointless to check for them first.
seen the stat..S_ISSOCK in profiling results with high call count.
was no big issue, that call is cheap, but also no big issue to just fix the order.
Re-synchronize chunks cache with repository.
If present, uses a compressed tar archive of known backup archive
indices, so it only needs to fetch infos from repo and build a chunk
index once per backup archive.
If out of sync, the tar gets rebuilt from known + fetched chunk infos,
so it has complete and current information about all backup archives.
Finally, it builds the master chunks index by merging all indices from
the tar.
Note: compression (esp. xz) is very effective in keeping the tar
relatively small compared to the files it contains.
Use python >= 3.3 to get better compression with xz,
there's a fallback to bz2 or gz when xz is not supported.
if we have a OS file handle, we can directly read to the final destination - one memcpy less.
if we have a Python file object, we get a Python bytes object as read result (can't save the memcpy here).
a lot of speedup for:
"list <repo>", "delete <repo>" list, "prune" - esp. for slow connections to remote repositories.
the previous method used metadata from the archive itself, which is (in total) rather large.
so if you had many archives and a slow (remote) connection, it was very slow.
but there is a lot easier way: just use the archives list from the repository manifest - we already
have it anyway and it also has name, id and timestamp for all archives - and that's all we need.
I defined a ArchiveInfo namedtuple that has same element names as seen as attribute names
of the Archive object, so as long as name, id, ts is enough, it can be used in its place.
Making much better use of the CPU by dispatching all CPU intensive stuff
(hashing, crypto, compression) to N crypter threads (N == logical cpu count ==
4 for a dual-core CPU with hyperthreading).
I/O intensive stuff also runs in separate threads: the MainThread does the
filesystem traversal, the reader thread reads and chunks the files, the writer
thread writes to the repo. This way, we don't need to sit idle waiting for I/O,
but the I/O thread will block and another thread will get dispatched and use
the time. This applies for read as well as for write/fsync I/O wait time
(access time + data transfer).
There's one more thread, the "delayer". We need it to handle a race condition
related to the computation of the compressed size (which is only possible after
hashing/compression/encryption has finished). This "csize" makes all this code
quite more complicated than if we would not need it.
Although there is the GIL issue for Python code, we can still make good use of
multithreading as I/O operations and C code (that releases the GIL) can run in
parallel.
All threads are connected via Python Queues (which are intended for this and
thread safe). The Cache.chunks datastructure is also updated by threadsafe
code.
A little benchmark
------------------
Both is with compression (zlib level 6) and encryption on a haswell/ssd laptop:
Without multithreading code:
Command being timed: "borg create /extra/attic/borg::1 /home/tw/Desktop/"
User time (seconds): 13.78
System time (seconds): 0.40
Percent of CPU this job got: 83%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.98
With multithreading code:
Command being timed: "borg create /extra/attic/borg::1 /home/tw/Desktop/"
User time (seconds): 24.08
System time (seconds): 1.16
Percent of CPU this job got: 249%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.11
It's unclear to me why it uses much more "User time" (I'm not even sure that
measurement is correct). But the overall runtime "Elapsed" significantly
dropped and it makes better use of all cpu cores (not just 83% of one).