Because it ended the loop only when .list() returned an
empty result, this always needed one call more than
necessary.
We can also detect that we are finished if .list()
returns fewer items than the limit we gave it.
Also: reduce code duplication by using repo_lister func.
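
A minimal sketch of the early-exit pagination described above. FakeRepository
and its marker-based .list() signature are hypothetical stand-ins, and this
repo_lister is only an illustration of the idea, not the actual borg function:

```python
from typing import Iterator, List, Optional


class FakeRepository:
    """Hypothetical in-memory stand-in for a repository with a paged .list()."""

    def __init__(self, ids: List[str]) -> None:
        self.ids = sorted(ids)
        self.calls = 0  # counts .list() invocations

    def list(self, limit: int, marker: Optional[str] = None) -> List[str]:
        self.calls += 1
        # Return up to `limit` ids that come after `marker`.
        start = 0 if marker is None else self.ids.index(marker) + 1
        return self.ids[start:start + limit]


def repo_lister(repository: FakeRepository, *, limit: int) -> Iterator[str]:
    """Yield all ids, stopping as soon as a short page signals the end."""
    marker = None
    while True:
        result = repository.list(limit=limit, marker=marker)
        yield from result
        if len(result) < limit:
            # A short (or empty) page means there is nothing left to fetch.
            break
        marker = result[-1]
```

With 5 ids and limit=2 this loop needs 3 calls; a loop that only stops on an
empty result would need a 4th call.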
borg compact now uses ChunkIndex (a specialized, memory-efficient data structure),
so it needs less memory now. Also, it saves that chunks index to cache/chunks in
the repository.
When the chunks index is needed, borg first tries to load it from cache/chunks.
If that fails, it falls back to building the chunks index via repository.list(),
which can be rather slow, and immediately caches the resulting ChunkIndex in the
repo.
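
The load-or-rebuild flow above could look roughly like this. A plain dict
stands in for the repository's key/value store and for ChunkIndex, and all
function names are hypothetical:

```python
from typing import Callable, Dict, Iterable

# Assumption: a dict plays the role of the memory-efficient ChunkIndex.
ChunkIndex = Dict[str, int]
CACHE_KEY = "cache/chunks"


def build_index_from_listing(chunk_ids: Iterable[str]) -> ChunkIndex:
    """Slow path: build the index from a full listing of chunk ids."""
    return {chunk_id: 0 for chunk_id in chunk_ids}


def get_chunk_index(
    store: Dict[str, ChunkIndex],
    list_chunk_ids: Callable[[], Iterable[str]],
) -> ChunkIndex:
    """Try the cached index first, else rebuild it and cache it right away."""
    index = store.get(CACHE_KEY)
    if index is None:
        index = build_index_from_listing(list_chunk_ids())
        store[CACHE_KEY] = index  # cache immediately for the next caller
    return index
```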
borg check --repair currently just deletes the chunks cache, because it might
have deleted some invalid chunks in the repo.
cache.close now saves the chunks index to cache/chunks in repo if it
was modified.
thus, borg create will update the cached chunks index with new chunks.
cache/chunks_hash can be used to validate cache/chunks (and also to validate /
invalidate locally cached copies of that).
we discard all files cache entries referring to files
with timestamps AFTER we started the backup.
so, even if we back up an inconsistent file
that was changed while we backed it up, we will
not have a files cache entry for it and will fully
read/chunk/hash it again in the next backup.
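
A small sketch of that filtering step. The entry layout, the use of the newer
of ctime/mtime, and the handling of timestamps exactly equal to the start time
are assumptions made for illustration:

```python
from typing import Dict, NamedTuple


class FileEntry(NamedTuple):
    """Hypothetical files cache entry, timestamps in nanoseconds."""
    ctime_ns: int
    mtime_ns: int


def discard_recent_entries(
    files: Dict[str, FileEntry], start_ns: int
) -> Dict[str, FileEntry]:
    """Keep only entries whose newest timestamp is not after the backup start."""
    return {
        path: entry
        for path, entry in files.items()
        if max(entry.ctime_ns, entry.mtime_ns) <= start_ns
    }
```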
- changes to locally stored files cache:
- store as files.<H(archive_name)>
- user can manually control suffix via env var
- if local files cache is not found, build from previous archive.
- enable rebuilding the files cache via loading the previous
archive's metadata from the repo (better than starting with
empty files cache and needing to read/chunk/hash all files).
previous archive == same archive name, latest timestamp in repo.
- remove AdHocCache (not needed any more, slow)
- remove BORG_CACHE_IMPL, we only have one
- remove cache lock (this was blocking parallel backups to same
repo from same machine/user).
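
To illustrate the files.<H(archive_name)> naming: the sketch below assumes
SHA-256 for H and uses a made-up env var name (EXAMPLE_FILES_CACHE_SUFFIX) for
the manual override, since neither is spelled out here:

```python
import hashlib
import os
from typing import Mapping, Optional

# Assumption: placeholder name, not necessarily the env var borg reads.
SUFFIX_ENV_VAR = "EXAMPLE_FILES_CACHE_SUFFIX"


def files_cache_name(
    archive_name: str, environ: Optional[Mapping[str, str]] = None
) -> str:
    """Return the per-series files cache file name."""
    environ = os.environ if environ is None else environ
    suffix = environ.get(SUFFIX_ENV_VAR)
    if not suffix:
        # Assumption: H is SHA-256 here; the real hash function may differ.
        suffix = hashlib.sha256(archive_name.encode("utf-8")).hexdigest()
    return "files." + suffix
```

Archives with the same name map to the same files cache, so each backup
series gets its own cache file.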
Cache entries now have ctime AND mtime.
Note: TTL and age still needed for discarding removed files.
But due to the separate files caches per series, the TTL
was lowered to 2 (from 20).
reuse_chunk is the complement of add_chunk for already existing chunks.
It doesn't do refcounting anymore.
.seen_chunk does not return the refcount anymore, but just whether the chunk exists.
If we add a new chunk, it immediately sets its refcount to MAX_VALUE, so
there is no difference anymore between previously existing chunks and new
chunks added. This makes the stats even more useless, but we have less complexity.
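
A simplified model of these semantics. The class, its method signatures and
the MAX_VALUE constant below are illustrative stand-ins, not the real cache
API:

```python
from typing import Dict, Iterable, List, Tuple

MAX_VALUE = 2**32 - 1  # assumption: placeholder for the "infinite" refcount


class SimpleChunks:
    """Toy chunks cache without real refcounting."""

    def __init__(self, existing_ids: Iterable[str] = ()) -> None:
        self.refcount: Dict[str, int] = {cid: MAX_VALUE for cid in existing_ids}
        self.uploaded: List[str] = []  # ids that had to be stored in the repo

    def seen_chunk(self, chunk_id: str) -> bool:
        # Only reports existence, not a refcount.
        return chunk_id in self.refcount

    def reuse_chunk(self, chunk_id: str, size: int) -> Tuple[str, int]:
        # Chunk already exists: nothing to upload, nothing to count.
        return chunk_id, size

    def add_chunk(self, chunk_id: str, data: bytes) -> Tuple[str, int]:
        if self.seen_chunk(chunk_id):
            return self.reuse_chunk(chunk_id, len(data))
        self.uploaded.append(chunk_id)
        # New chunks immediately look like pre-existing ones.
        self.refcount[chunk_id] = MAX_VALUE
        return chunk_id, len(data)
```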
When the AdhocCache(WithFiles) queries chunk IDs from the repo to build the chunks
index, it won't know their refcount and thus all chunks in the index have their
refcount at the MAX_VALUE (representing "infinite") and that would never decrease
nor could that ever reach zero and get the chunk deleted from the repo.
Only completely new chunks first written in the current borg run have a valid
refcount.
In some exception handlers, borg tried to clean up chunks that won't be used
by an item by decref'ing them. That is either:
- pointless due to refcount being at MAX_VALUE
- inefficient, because the user might retry the backup and would need to
transmit these chunks to the repo again.
We'll rely ONLY on borg compact to clean up any unused/orphan chunks.
borg1 needed this due to its transactional / rollback behaviour:
if there was uncommitted stuff in the repo, next repo opening automatically
rolled back to last commit. thus we needed checkpoint archives to reference
chunks and commit the repo.
borg2 does not do that anymore: unused chunks are only removed when the
user invokes borg compact.
thus, if a borg create gets interrupted, the user can just run borg create
again and it will find some chunks are already in the repo, making progress
even if borg create gets frequently interrupted.
Dummy returns all-zero stats from that call.
Problem was that these values can't be computed from the chunks cache
anymore. No correct refcounts, often no size information.
Also removed hashindex.ChunkIndex.summarize (previously used by the above mentioned
.stats() call) and .stats_against (unused) for same reason.
Note: this is the default cache implementation in borg 1.x,
it worked well, but there were some issues:
- if the local chunks cache got out of sync with the repository,
it needed an expensive rebuild from the information in all archives.
- to optimize that, a local chunks.archive.d cache was used to
speed that up, but at the price of quite significant space needs.
AdhocCacheWithFiles replaced this with a non-persistent chunks cache,
requesting all chunkids from the repository to initialize a simplified
non-persistent chunks index, that does not do real refcounting and also
initially does not have size information for pre-existing chunks.
We want to move away from precise refcounting, LocalCache needs to die.
Simplify the repository a lot:
No repository transactions, no log-like appending, no append-only, no segments,
just using a key/value store for the individual chunks.
No locking yet.
Also:
mypy: ignore missing import
there are no library stubs for borgstore yet, so mypy errors without that option.
pyproject.toml: install borgstore directly from github
There is no pypi release yet.
use pip install -e . rather than python setup.py develop
The latter is deprecated and had issues installing the "borgstore from github" dependency.
Also: support a "cli" env var value, that does not determine
the implementation from the env var, but rather from cli options (similar to how it was before adding BORG_CACHE_IMPL).
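
Roughly, the selection logic might look like the sketch below. The
implementation names ("adhoc", "new", "local"), the default of "cli" when the
variable is unset, and the prefer_adhoc flag are assumptions for illustration:

```python
from typing import Mapping


def select_cache_impl(environ: Mapping[str, str], prefer_adhoc: bool) -> str:
    """Pick a cache implementation name from env var or cli options."""
    # Assumption: an unset variable behaves like "cli".
    impl = environ.get("BORG_CACHE_IMPL", "cli")
    if impl == "cli":
        # Let the command line options decide, as before the env var existed.
        return "adhoc" if prefer_adhoc else "new"
    return impl
```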
- skip test_cache_chunks if there is no persistent chunks cache file
- init self.chunks for AdHocCache
- remove warning output from AdHocCache.__init__, it gets mixed with JSON output and breaks the JSON decoder.
Add new borg create option '--prefer-adhoc-cache' to prefer the
AdHocCache over the NewCache implementation.
Adjust a test to match the previous default behaviour (== use the
AdHocCache) with --no-cache-sync.
removed some code borg had for backwards compatibility with
old borg versions (that had timestamp only in the cache).
now the manifest timestamp is only checked against the manifest-timestamp
file in the security dir, simplifying the code.
removed some code borg had for backwards compatibility with
old borg versions (that had key_type only in the cache).
now the repo key_type is only checked against the key-type
file in the security dir, simplifying the code.
removed some code borg had for backwards compatibility with
old borg versions (that had previous_location only in the
cache).
now the repo location is only checked against the location
file in the security dir, simplifying the code and also
fixing a related test failure with NewCache.
also improved test_repository_move to test for aborting in
case the repo location changed unexpectedly.
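
A rough sketch of checking the repo location against a file in the security
dir only. The file name, the exception and the "record on first use" behaviour
are assumptions; real borg may handle a changed location differently:

```python
from pathlib import Path


class RepositoryMovedError(Exception):
    """Hypothetical error for an unexpectedly changed repo location."""


def assert_location_matches(security_dir: Path, current_location: str) -> None:
    """Compare the current location with the one remembered in security_dir."""
    location_file = security_dir / "location"
    if not location_file.exists():
        # Assumption: first access just records the location.
        location_file.write_text(current_location)
        return
    previous = location_file.read_text()
    if previous != current_location:
        raise RepositoryMovedError(
            f"repository was at {previous!r}, now at {current_location!r}"
        )
```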
incref: returns (id, size), so it needs the size if it can't
get it from the chunks index. also needed for updating stats.
decref: caller does not always have the chunk size (e.g. for
metadata chunks).
as we consider 0 to be an invalid size, we call with size == 1
in that case. thus, stats might be slightly off.
the files cache used to have only the chunk ids,
so it had to rely on the chunks index having the
size information - which is problematic with e.g.
the AdhocCache (has size==0 for all not new chunks) and blocked using the files cache there.
Try to rebuild cache if an exception is raised, fixes #5213
For now, we catch FileNotFoundError and FileIntegrityError.
Write cache config without manifest to prevent override of manifest_id.
This is needed in order to have an empty manifest_id.
This empty id triggers the re-syncing of the chunks cache by calling sync() inside LocalCache.__init__()
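
The catch-and-rebuild pattern, reduced to its core. FileIntegrityError is
defined locally as a stand-in for borg's own exception, and the load/rebuild
callables are hypothetical:

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class FileIntegrityError(Exception):
    """Local stand-in for the integrity error raised on a corrupt cache file."""


def open_cache(load: Callable[[], T], rebuild: Callable[[], T]) -> T:
    """Load the cache; if it is missing or corrupt, wipe and rebuild it."""
    try:
        return load()
    except (FileNotFoundError, FileIntegrityError):
        # Other exceptions are deliberately not caught here.
        return rebuild()
```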
Adapt and extend test_cache_chunks to new behaviour:
- a cache wipe is expected now.
- borg detects the corrupt cache and wipes/rebuilds the cache.
- check if the in-memory and on-disk caches are as expected (a rebuilt chunks cache).
writing: put type into repoobj metadata
reading: check wanted type against type we got
repoobj metadata is encrypted and authenticated.
repoobj data is also encrypted and authenticated (separately).
encryption and decryption of both metadata and data get the
same "chunk ID" as AAD, so both are "bound" to that (same) ID.
a repo-side attacker can neither see cleartext metadata/data,
nor successfully tamper with it (AEAD decryption would fail).
also, a repo-side attacker could not replace a repoobj A with a
differently typed repoobj B without borg noticing:
- the metadata/data is cryptographically bound to its ID.
authentication/decryption would fail on mismatch.
- the type check would fail.
thus, the problem (see CVEs in changelog) solved in borg 1 by the
manifest and archive TAMs is now already solved by the type check.
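
The type check alone can be sketched as follows. This deliberately leaves out
the encryption/authentication layer, and the dict layout, function names and
type strings are hypothetical:

```python
from typing import Any, Dict


class IntegrityError(Exception):
    """Hypothetical error for a repo object of an unexpected type."""


def format_repoobj(obj_type: str, data: bytes) -> Dict[str, Any]:
    """Writing: put the type into the object's metadata."""
    return {"meta": {"type": obj_type}, "data": data}


def parse_repoobj(obj: Dict[str, Any], wanted_type: str) -> bytes:
    """Reading: check the wanted type against the type we got."""
    got_type = obj["meta"]["type"]
    if got_type != wanted_type:
        raise IntegrityError(f"wanted {wanted_type!r}, got {got_type!r}")
    return obj["data"]
```

In this model, swapping an archive object for a file data object is caught by
the type comparison, even before looking at the content.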
rebuild_refcounts verifies and recreates the TAM.
Now it re-uses the salt, so that the archive ID does not change
just because of a new salt if the archive still has the same data.
while on macOS the new and old security dir location is the same path,
this is not the case on e.g. Linux, where it could move from
.config/borg/security to .local/share/borg/security.
See #5760.