wrap msgpack to avoid future upstream api changes making troubles
or that we would have to globally spoil our code with extra params.
make sure the packing is always with use_bin_type=False,
thus generating "old" msgpack format (as borg always did) from
bytes objects.
make sure the unpacking is always with raw=True,
thus generating bytes objects.
note:
safe unicode encoding/decoding for some kinds of data types is done in Item
class (see item.pyx), so it is enough if we care for bytes objects on the
msgpack level.
also wrap exception handling, so borg code can catch msgpack specific
exceptions even if the upstream msgpack code raises way too generic
exceptions typed Exception, TypeError or ValueError.
We use own Exception classes for this, upstream classes are deprecated
reading the files cache can take considerable amount of time (a user
reported 1h 42min for a 700MB files cache for a repo with 8M files and
15TB total), so we must init the checkpoint timer after that or borg
will create the checkpoint too early.
creating a checkpoint means (among other stuff) saving the files cache,
which will also take a lot of time in such a case, one time too much.
doing this in a clean way required some refactoring:
- cache_mode is now given to Cache initializer and stored in instance
- the files cache is loaded early in _do_open (if needed)
the problem was that the upper layer code did not have enough information
about the file, whether it is known or not - and thus, could not decide
correctly whether status should be M)odified or A)dded.
now, file_known_and_unchanged method returns an additional "known"
boolean to fix this.
also: add comment about files cache loading in cache_mode='r'
now deals with:
- corrupted files cache (truncated or modified not by borg)
- inaccessible/unreadable files cache
- missing files cache
The latter fix is not sufficient, the cache transaction processing
would still stumble over expected, but missing files in the cache.
(cherry picked from commit 423ec4ba1e)
You can now control the files cache mode using this option:
--files-cache={ctime,mtime,size,inode,rechunk,disabled}*
(only some combinations are supported)
Previously, only these modes were supported:
- mtime,size,inode (default of borg < 1.1.0rc4)
- mtime,size (by using --ignore-inode)
- disabled (by using --no-files-cache)
Now, you additionally get:
- ctime alternatively to mtime (more safe), e.g.:
ctime,size,inode (this is the new default of borg >= 1.1.0rc4)
- rechunk (consider all files as changed, rechunk them)
Deprecated:
- --ignore-inodes (use modes without "inode")
- --no-files-cache (use "disabled" mode)
The tests needed some changes:
- previously, we use os.utime() to set a files mtime (atime) to specific
values, but that does not work for ctime.
- now use time.sleep() to create the "latest file" that usually does
not end up in the files cache (see FAQ)
chunk_incref was called when dealing with part files without giving the
known chunk size in the size_ parameter.
adjusted LocalCache.chunk_incref to have same signature.
This is a (relatively) simple state machine running in the
data callbacks invoked by the msgpack unpacking stack machine
(the same machine is used in msgpack-c and msgpack-python,
changes are minor and cosmetic, e.g. removal of msgpack_unpack_object,
removal of the C++ template thus porting to C and so on).
Compared to the previous solution this has multiple advantages
- msgpack-c dependency is removed
- this approach is faster and requires fewer and smaller
memory allocations
Testability of the two solutions does not differ in my
professional opinion(tm).
Two other changes were rolled up; _hashindex.c can be compiled
without Python.h again (handy for fuzzing and testing);
a "small" bug in the cache sync was fixed which allocated too
large archive indices, leading to excessive archive.chunks.d
disk usage (that actually gave me an idea).