[BUGFIX] tsdb: store a millisecond timestamp (not a WAL segment number) in `walExpiries`
when a series is evicted via `CompactStaleHead`/`CompactSelectedSeries`, so the series'
label record is correctly retained in the next WAL checkpoint and replays cleanly.
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
* chore: fix typos in comments
Fix three minor typos in source comments:
- scrape: mimicks -> mimics
- tsdb: descibes -> describes
- ui/codemirror-promql: theses -> these
Signed-off-by: RoySerbi <roy676564@gmail.com>
* ci: retrigger CI to clear known 32-bit flake
Empty commit to retrigger CI. The previous run failed only on
'Go tests for 32-bit x86' due to the known intermittent flake in
TestRemoteWrite_PerQueueMetricsAfterRelabeling (see #17356), which
is unrelated to this comment-only PR.
Signed-off-by: RoySerbi <roy676564@gmail.com>
---------
Signed-off-by: RoySerbi <roy676564@gmail.com>
Add TestAppendHistogramErrorDoesNotSetPendingCommit (V1) and
TestHeadAppenderV2_HistogramErrorDoesNotSetPendingCommit (V2),
each covering the integer and float histogram branches.
The integer V1 branch previously set s.pendingCommit on the error
path, which left the flag stuck on existing series whenever an
append was rejected (e.g. ErrOutOfOrderSample). Because the failed
sample is never added to the appender's batch, Commit/Rollback
never clears pendingCommit for that series, and head GC at
tsdb/head.go treats it as still in use.
The V1 integer subtest fails on main without the prior commit;
both subtests pass with it. The V2 paths already use err == nil
and the V2 test is a lock-in; inverting the V2 condition locally
confirms the test would catch a similar regression there.
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
AppendHistogram used err != nil when deciding to set pendingCommit for integer histograms, while the float histogram branch uses err == nil. Align the integer histogram path so that pendingCommit is set only after a successful appendableHistogram check, matching the appendableFloatHistogram branch.
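A reduced sketch of the intended gating follows. The `series` type and the `checkAppendable` and `stage` functions are hypothetical stand-ins, not the real tsdb code; only the `err == nil` condition mirrors the fix.

```go
package main

import (
	"errors"
	"fmt"
)

// errOutOfOrder is a stand-in for tsdb's ErrOutOfOrderSample.
var errOutOfOrder = errors.New("out of order sample")

// series is a hypothetical, heavily reduced stand-in for memSeries.
type series struct {
	lastT         int64
	pendingCommit bool
}

// checkAppendable is a hypothetical stand-in for appendableHistogram.
func checkAppendable(s *series, t int64) error {
	if t <= s.lastT {
		return errOutOfOrder
	}
	return nil
}

// stage marks the series as pending only when the check succeeded,
// so a rejected sample cannot leave the flag stuck.
func stage(s *series, t int64) error {
	err := checkAppendable(s, t)
	if err == nil {
		s.pendingCommit = true
	}
	return err
}

func main() {
	s := &series{lastT: 100}
	fmt.Println(stage(s, 50), s.pendingCommit)  // rejected, flag stays false
	fmt.Println(stage(s, 150), s.pendingCommit) // accepted, flag is set
}
```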
Signed-off-by: Weixie Cui <cuiweixie@gmail.com>
histogramSamplesV2 and floatHistogramSamplesV2 tracked the previous
sample's Ref and ST via a *RefHistogramSample pointer (prev). Taking the
address of a loop-local variable (prev = &rh) forced the compiler to
heap-allocate rh on every iteration; the first iteration also allocated
a separate sentinel struct. The pointed-to fields were only ever read as
two int64 scalars, so the pointer added zero semantic value.
Replace prev with two scalar variables (prevRef, prevST) and a boolean
sentinel. rh no longer has its address taken and stays on the stack.
This affects every caller of dec.HistogramSamples that produces V2
records (EnableSTStorage=true): WAL replay, the WAL watcher (remote
write tail), and checkpoint creation.
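The shape of the change can be sketched as follows. The `sample` type and the delta layout (first pair absolute, later pairs relative to the previous sample) are assumptions for illustration and not the actual V2 record format; the point is that the previous values live in scalars plus a boolean sentinel, so no loop variable has its address taken.

```go
package main

import "fmt"

// sample is a hypothetical stand-in for record.RefHistogramSample,
// reduced to the two fields tracked across iterations.
type sample struct {
	Ref uint64
	ST  int64
}

// decodeDeltas rebuilds absolute values from (refDelta, stDelta) pairs.
// Assumed layout: the first pair is absolute, the rest are deltas.
func decodeDeltas(deltas [][2]int64) []sample {
	out := make([]sample, 0, len(deltas))
	var (
		prevRef  uint64
		prevST   int64
		havePrev bool // replaces the nil check on the old prev pointer
	)
	for _, d := range deltas {
		var s sample
		if !havePrev {
			s = sample{Ref: uint64(d[0]), ST: d[1]}
			havePrev = true
		} else {
			s = sample{Ref: uint64(int64(prevRef) + d[0]), ST: prevST + d[1]}
		}
		// Copy the two scalars instead of keeping &s alive.
		prevRef, prevST = s.Ref, s.ST
		out = append(out, s)
	}
	return out
}

func main() {
	fmt.Println(decodeDeltas([][2]int64{{10, 1000}, {1, 5}, {-2, 5}}))
}
```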
Benchmarks (go test -count=6 -benchmem, benchstat):

BenchmarkDecodeHistogramSamples (tsdb/record)

               │    before     │              after              │
               │   allocs/op   │  allocs/op    vs base           │
buckets=0/v2   │ 2.001k ± 0%   │ 1.000k ± 0%   -50.02% (p=0.002) │
buckets=4/v2   │ 4.001k ± 0%   │ 3.000k ± 0%   -25.02% (p=0.002) │
buckets=16/v2  │ 4.001k ± 0%   │ 3.000k ± 0%   -25.02% (p=0.002) │

               │    before     │              after              │
               │     B/op      │    B/op       vs base           │
buckets=0/v2   │ 187.5Ki ± 0%  │ 156.2Ki ± 0%  -16.68% (p=0.002) │
buckets=4/v2   │ 250.0Ki ± 0%  │ 218.8Ki ± 0%  -12.51% (p=0.002) │
buckets=16/v2  │ 437.5Ki ± 0%  │ 406.2Ki ± 0%   -7.15% (p=0.002) │

BenchmarkLoadWLs end-to-end WAL replay (tsdb), stStorage=true only

                           │    before     │              after              │
                           │   allocs/op   │  allocs/op    vs base           │
histogramSeriesPct=1.000   │ 19.70M ± 0%   │ 14.90M ± 0%   -24.39% (p=0.002) │
histogramSeriesPct=0.500   │ 10.47M ± 0%   │  8.06M ± 0%   -23.00% (p=0.002) │

                           │    before     │              after              │
                           │     B/op      │    B/op       vs base           │
histogramSeriesPct=1.000   │ 1.539Gi ± 0%  │ 1.394Gi ± 0%   -9.42% (p=0.002) │
histogramSeriesPct=0.500   │ 1051.3Mi ± 0% │ 975.1Mi ± 0%   -7.25% (p=0.002) │

                           │    before     │              after              │
                           │    sec/op     │   sec/op      vs base           │
histogramSeriesPct=1.000   │ 824.9m ± 0%   │ 762.6m ± 1%    -7.55% (p=0.002) │
histogramSeriesPct=0.500   │ 488.6m ± 1%   │ 451.4m ± 1%    -7.61% (p=0.002) │

V1 paths and float-only shapes are unchanged (p >> 0.05 throughout).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Miguel Bernabeu Diaz <miguel.bernabeu@coralogix.com>
Add two benchmark components to measure the native histogram decode hot
path, which is shared by WAL replay, WAL watcher (remote write), and
checkpoint creation.
tsdb/record: BenchmarkDecodeHistogramSamples isolates the V1 and V2
histogram decoder paths across bucket counts (0, 4, 16), giving a
precise per-sample allocation signal for decoder changes.
tsdb: BenchmarkLoadWLs gains two new shapes:
- all-histogram (histogramSeriesPct=1.0, bucketsPerHistogram=8): mirrors
the existing "In between" float shape for direct comparison.
- mixed (histogramSeriesPct=0.5, bucketsPerHistogram=8): models a
deployment partway through migrating to native histograms.
Both shapes are parameterised over stStorage (V1 vs V2 encoding) via the
existing enableSTStorage loop, so benchstat can show the V1/V2 delta
without additional test infrastructure. The subtest names include
histogramSeriesPct and bucketsPerHistogram only when non-zero, leaving
existing float-only subtest names unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Miguel Bernabeu Diaz <miguel.bernabeu@coralogix.com>
Replace the catch-all default branch in encodeToSnapshotRecord and
decodeSeriesFromChunkSnapshot with an explicit EncFloatHistogram case and a
default that panics (encode) or returns an error (decode), making unknown
encodings immediately visible rather than silently mishandling them.
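A minimal sketch of the pattern, using hypothetical encoding constants rather than the real chunkenc values: every known encoding gets its own case, and the default reports the unknown value instead of treating it as a float histogram.

```go
package main

import "fmt"

// encoding and its constants are hypothetical stand-ins for chunkenc.Encoding.
type encoding uint8

const (
	encXOR encoding = iota + 1
	encHistogram
	encFloatHistogram
)

// decodeKind names the sample kind for an encoding and rejects
// anything it does not know about.
func decodeKind(e encoding) (string, error) {
	switch e {
	case encXOR:
		return "float", nil
	case encHistogram:
		return "histogram", nil
	case encFloatHistogram:
		return "float histogram", nil
	default:
		// Previously a catch-all branch would have handled this silently.
		return "", fmt.Errorf("unknown chunk encoding %d", e)
	}
}

func main() {
	fmt.Println(decodeKind(encFloatHistogram))
	fmt.Println(decodeKind(encoding(99)))
}
```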
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
EncXOR2 is a float encoding and must be treated like EncXOR in all
places that enumerate chunk types:
- compact.go: NumFloatSamples was not incremented for EncXOR2 chunks
during compaction, leading to under-reported block stats.
- head_wal.go: encodeToSnapshotRecord fell through to the default
(FloatHistogram) branch for EncXOR2 head chunks, which would corrupt
chunk snapshots; the decode path already handled EncXOR2 correctly.
- ooo_head.go: update stale comment to mention EncXOR2.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
The oversize-chunk trigger introduced in #18692 was implemented as a
closure defined inside the per-sample loop in
populateChunksFromIterable and invoked once at the `if` condition.
Replace it with a plain conditional and hoist
`len(currentChunk.Bytes())` out of the switch so the two encoding
cases don't repeat the same expression. The new shape preserves the
original `||` short-circuit: the size check is only evaluated when
neither the encoding nor the start-timestamp capability forces a new
chunk, which also keeps `currentChunk` non-nil at the point of read.
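The short-circuit property can be illustrated with a reduced sketch. The `chunk` type, the `forced` flag and `needNewChunk` are hypothetical and stand in for the real encoding and start-timestamp checks; the sketch assumes the caller sets `forced` whenever there is no current chunk yet.

```go
package main

import "fmt"

// chunk is a hypothetical stand-in for the current chunk.
type chunk struct{ data []byte }

// Bytes dereferences the receiver, so calling it on a nil *chunk panics.
func (c *chunk) Bytes() []byte { return c.data }

// needNewChunk uses a plain conditional instead of a per-iteration closure.
// The size check sits on the right of ||, so it is only evaluated when
// nothing else already forces a new chunk, and cur is non-nil at that point.
func needNewChunk(cur *chunk, forced bool, maxBytes int) bool {
	return forced || len(cur.Bytes()) >= maxBytes
}

func main() {
	fmt.Println(needNewChunk(nil, true, 8)) // size check skipped, no nil dereference
	fmt.Println(needNewChunk(&chunk{data: make([]byte, 4)}, false, 8))
	fmt.Println(needNewChunk(&chunk{data: make([]byte, 8)}, false, 8))
}
```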
`gcflags=-m=2` reports the closure body inlined and the symbol table
shows no separate `func1` symbol, yet benchstat shows a measurable
speedup. The most likely explanation: the closure body inlines, but
the `funcval` struct (capturing `currentChunk` and `currentValueType`)
is still stack-constructed each iteration — invisible to escape
analysis, but a real per-iteration cost in a hot loop.
Benchmark, `go test -count=6 -benchmem -bench=BenchmarkQuerierSelectWithOutOfOrder -benchtime=5s -run=^$ ./tsdb/`,
Intel Xeon Platinum 8280 @ 2.70 GHz (linux/amd64), 1M-series head,
query selectivity varies:
                           │     main     │               optimized               │
                           │    sec/op    │    sec/op      vs base                │
Head/1of1000000-16           301.5m ± 4%    257.0m ±  4%   -14.74% (p=0.002 n=6)
Head/10of1000000-16          305.6m ± 3%    260.4m ±  2%   -14.80% (p=0.002 n=6)
Head/100of1000000-16         303.9m ± 2%    259.7m ±  2%   -14.54% (p=0.002 n=6)
Head/1000of1000000-16        303.8m ± 2%    267.0m ±  2%   -12.13% (p=0.002 n=6)
Head/10000of1000000-16       318.1m ± 1%    278.9m ±  8%   -12.33% (p=0.002 n=6)
Head/100000of1000000-16      364.1m ± 7%    352.8m ±  4%         ~ (p=0.065 n=6)
Head/1000000of1000000-16     1.115  ± 2%    1.089  ± 26%         ~ (p=0.394 n=6)
geomean                      377.8m         337.3m         -10.71%
allocs/op and B/op unchanged. The two largest-selectivity cases trend
faster but are dominated by the per-sample append cost so the relative
delta is smaller and lost in variance.
`TestChunkQuerier_OverlappingInOrderAndOOOChunks` continues to
exercise the overflow path.
```release-notes
NONE
```
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
* fix(tsdb): chunk overflow on ooo query
Protect against, and fix, the overflow of chunks holding more than 2^16-1
samples when chunks are re-encoded, for example because in-order and OOO
samples overlap during compaction or a query.
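For illustration only, the sketch below shows why the limit matters and one way to respect it. `splitSampleCounts` is a hypothetical helper and is not how the fix is implemented; it only assumes that a chunk's sample count is stored in 16 bits.

```go
package main

import (
	"fmt"
	"math"
)

// maxSamplesPerChunk assumes a 16-bit sample counter, i.e. 2^16-1.
const maxSamplesPerChunk = math.MaxUint16

// splitSampleCounts is a hypothetical helper: it splits a run of samples
// into chunk sizes that each fit the 16-bit counter.
func splitSampleCounts(total int) []int {
	var sizes []int
	for total > 0 {
		n := total
		if n > maxSamplesPerChunk {
			n = maxSamplesPerChunk
		}
		sizes = append(sizes, n)
		total -= n
	}
	return sizes
}

func main() {
	n := 65536
	fmt.Println(uint16(n))            // the counter wraps around to 0
	fmt.Println(splitSampleCounts(n)) // two chunks instead of one overflowing chunk
}
```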
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
* tsdb: extract stripeSeries.refStripe helper
Extract the repeated ref-to-stripe-index calculation into a method on
stripeSeries, replacing five inline copies that used two different
casting styles (int and uint64). The helper computes with uint64
internally so it is correct on 32-bit architectures.
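A minimal sketch of such a helper, assuming the stripe count is a power of two and is held in a field named `size` (both assumptions here). Masking in uint64 before narrowing to int means the high bits of a large ref are discarded by the mask rather than by a 32-bit truncation.

```go
package main

import "fmt"

// stripeSeries is reduced to the one field the helper needs.
type stripeSeries struct {
	size int // number of stripes; assumed to be a power of two
}

// refStripe maps a series ref to a stripe index. The mask is applied
// in uint64, so the result is below size before it is converted to int.
func (s *stripeSeries) refStripe(ref uint64) int {
	return int(ref & uint64(s.size-1))
}

func main() {
	s := &stripeSeries{size: 16}
	fmt.Println(s.refStripe(17))
	fmt.Println(s.refStripe(1<<40 + 5))
}
```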
* tsdb: skip entire stripes in mmapHeadChunks via per-stripe ready count
Add a per-stripe mmapReady counter to stripeSeries that tracks how many
series in each stripe have headChunkCount >= 2 (i.e., are ready for
mmapping). mmapHeadChunks skips stripes where the counter is zero,
avoiding the RLock and map iteration entirely.
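The idea can be sketched as below. The `stripe` type, `setChunks` and `scanReady` are hypothetical simplifications: each stripe keeps an atomic count of series with at least two head chunks, and the scan only takes the read lock when that count is non-zero.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stripe is a hypothetical reduction of one stripeSeries stripe.
type stripe struct {
	mtx       sync.RWMutex
	series    map[uint64]int // series ref -> number of head chunks
	mmapReady atomic.Int64   // series in this stripe with >= 2 head chunks
}

func newStripe() *stripe { return &stripe{series: map[uint64]int{}} }

// setChunks updates a series' head chunk count and keeps mmapReady in sync
// when the series crosses the "ready" threshold in either direction.
func (s *stripe) setChunks(ref uint64, n int) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	old := s.series[ref]
	s.series[ref] = n
	switch {
	case old < 2 && n >= 2:
		s.mmapReady.Add(1)
	case old >= 2 && n < 2:
		s.mmapReady.Add(-1)
	}
}

// scanReady returns how many stripes were locked and how many ready
// series were found. Stripes with a zero counter are skipped entirely.
func scanReady(stripes []*stripe) (locked, ready int) {
	for _, s := range stripes {
		if s.mmapReady.Load() == 0 {
			continue // no RLock, no map iteration
		}
		s.mtx.RLock()
		for _, n := range s.series {
			if n >= 2 {
				ready++
			}
		}
		s.mtx.RUnlock()
		locked++
	}
	return locked, ready
}

func main() {
	stripes := []*stripe{newStripe(), newStripe(), newStripe()}
	stripes[0].setChunks(1, 1)
	stripes[1].setChunks(7, 2)
	fmt.Println(scanReady(stripes))
}
```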
---------
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Implement start time (ST) support for histograms and float histograms (and their custom-bucket variants) in the WAL. Adds new tests and benchmarks.
Part of https://github.com/prometheus/prometheus/issues/17790
```release-notes
[CHANGE] Adds Start Time value to all WAL Histogram samples in memory, and therefore may increase memory usage.
```
Signed-off-by: Owen Williams <owen.williams@grafana.com>
initTime previously set minTime first and maxTime second. Because
Head.initialized() keys only off minTime, a concurrent Head.Appender call
could observe initialized() == true while maxTime was still
math.MinInt64. h.appender() then computes appendableMinValidTime as
MaxTime() - chunkRange/2, which overflows int64 and wraps around to a large
positive number, so in-range samples are rejected with ErrOutOfBounds.
Set maxTime first, then minTime. The CAS-loser wait now spins on
minTime instead of maxTime, preserving the existing anti-deadlock
timeout. AppenderV2 shares the same gate, so this single change covers
both paths.
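The arithmetic behind the spurious rejection can be reproduced in isolation. `minValidTime` below is a hypothetical helper that only mirrors the `MaxTime() - chunkRange/2` expression quoted above; the two-hour chunk range is an example value.

```go
package main

import (
	"fmt"
	"math"
)

// minValidTime mirrors the MaxTime() - chunkRange/2 expression.
// Go's signed integer arithmetic wraps on overflow at run time.
func minValidTime(maxTime, chunkRange int64) int64 {
	return maxTime - chunkRange/2
}

func main() {
	chunkRange := int64(2 * 60 * 60 * 1000) // two hours in milliseconds

	// maxTime still at its initial value: the subtraction wraps around.
	wrapped := minValidTime(math.MinInt64, chunkRange)
	fmt.Println(wrapped > 0)

	// maxTime already set: the bound is a sensible timestamp.
	fmt.Println(minValidTime(10_000_000, chunkRange))
}
```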
The TestHead_InitAppenderRace_ErrOutOfBounds test added in #17963 is now
stable across 1000 iterations (and 100 iterations under -race).
Relates to #17941
Builds on #17963
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
I have seen some flakiness in these tests, including timeouts. An LLM suggested these fixes to make them more deterministic, and they look good to me.
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Add a CheckpointFromInMemorySeries option to agent.Options that enables a faster checkpoint implementation, which skips re-reading WAL segments and uses in-memory data instead.
* feat: impl agent-specific checkpoint dir
* feat: impl ActiveSeries interface
* feat: use new checkpoint impl
* feat: hide new checkpoint impl behind a feature flag
* feat: add benchmark
* feat: add benchstat case
* feat: use feature flag in bench
* feat: use same labels for persisted state and append
* feat: set WAL segment size
* feat: add checkpoint size metric and bump series size
* feat: wal replay test
* feat: expose new checkpoint opts in cmd flags
* feat: update cli doc
* add ActiveSeries and DeletedSeries doc
Signed-off-by: x1unix <9203548+x1unix@users.noreply.github.com>
Signed-off-by: Denys Sedchenko <9203548+x1unix@users.noreply.github.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
* test(tsdb): add OOO error coverage for ST zero sample appends
Add unit tests exercising the out-of-order error paths in
AppendSTZeroSample, AppendHistogramSTZeroSample (AppenderV1), and
the best-effort ST injection in AppenderV2.Append.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
* make format
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
* test(tsdb): add TestHeadAppenderV2_BestEffortSTZeroSample_OOO
The three OOO cases added to TestHeadAppenderV2_Append_EnableSTAsZeroSample
use a single appender so headChunks is nil at append time; the zero sample
enters the batch and is rejected silently in commitFloats, never reaching
the error-handling branch at line 374 of bestEffortAppendSTZeroSample.
Add a dedicated test that commits the first sample before appending the
second. This makes headChunks non-nil, so appendFloat/appendHistogram/
appendFloatHistogram returns ErrOutOfOrderSample at append time and the
branch at line 374 is actually executed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
---------
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
tsdb: cache collected head chunks on ChunkReader for O(1) lookup
The query path calls s.chunk() once per chunk meta via
ChunkOrIterableWithCopy. Each call walks the head chunks linked list
from the head to the target position. For a series with N head chunks
iterated oldest-first, total work is O(N²).
Cache the collected []*memChunk slice on headChunkReader, keyed by
series ref, head pointer, and mmapped chunks length. Collected once
per series under lock; reused on subsequent chunk lookups for the same
series. The backing array is reused across series (zero alloc after
first use).
Series with 0 or 1 head chunks skip the cache entirely to avoid
per-series overhead that dominates for typical workloads where most
series have a single head chunk.
The cache is gated behind an enableCache flag, toggled via an optional
chunkCacheToggler interface only when hints.Step > 0 (range queries).
Instant queries only need one chunk per series, so the cache overhead
is not recouped.
Also replace O(N²) linked-list traversals in appendSeriesChunks with
O(N) collectHeadChunks + slice iteration, and thread reusable
headChunksBuf through the index reader paths to avoid per-series
allocations.
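A reduced sketch of the collect-once approach follows. The `memChunk` type here carries only an id and a `prev` link, and returning the slice oldest-first is an assumption for illustration; the real slice order and cache key handling are not shown.

```go
package main

import "fmt"

// memChunk is a hypothetical reduction: prev points to the next older chunk.
type memChunk struct {
	id   int
	prev *memChunk
}

// collectHeadChunks walks the linked list once from the newest chunk and
// returns the chunks oldest-first. It reuses buf's backing array, so
// repeated calls allocate only when the slice has to grow.
func collectHeadChunks(head *memChunk, buf []*memChunk) []*memChunk {
	buf = buf[:0]
	for c := head; c != nil; c = c.prev {
		buf = append(buf, c)
	}
	// The walk yields newest-first; reverse in place.
	for i, j := 0, len(buf)-1; i < j; i, j = i+1, j-1 {
		buf[i], buf[j] = buf[j], buf[i]
	}
	return buf
}

func main() {
	var head *memChunk
	for i := 0; i < 4; i++ {
		head = &memChunk{id: i, prev: head}
	}
	var buf []*memChunk
	buf = collectHeadChunks(head, buf)
	// Each lookup is now an index into buf instead of a list walk.
	for _, c := range buf {
		fmt.Print(c.id, " ")
	}
	fmt.Println()
}
```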
---------
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
tsdb: Skip clean series during periodic head chunk mmap
The periodic mmapHeadChunks cycle previously acquired a per-series
lock on every series, even though typically >99% have nothing to
mmap. This was identified as a CPU bottleneck in Grafana Mimir.
Add a headChunkCount field (sync/atomic.Uint32) to memSeries that
tracks the number of head chunks. It is incremented in
cutNewHeadChunk and the histogram new-chunk paths, and reset by
mmapChunks and truncateChunksBefore. mmapHeadChunks uses a lock-free
Load to skip series with fewer than 2 head chunks, avoiding the
per-series lock for clean series.
sync/atomic.Uint32 (4 bytes) is used instead of go.uber.org/atomic
(8 bytes) to fit in existing struct padding without growing
memSeries. Chunk counts are bounded by the 3-byte field in
HeadChunkRef, so cannot overflow uint32.
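The lock-free skip can be sketched as follows. This `memSeries` is a hypothetical reduction, and storing 1 after mmapping (one open head chunk remains) is an assumption about what the reset does.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// memSeries is a hypothetical reduction of the real struct.
type memSeries struct {
	mtx            sync.Mutex
	headChunkCount atomic.Uint32 // 4 bytes, from sync/atomic
}

// mmapReadySeries returns how many series had to be locked. Series with
// fewer than two head chunks are skipped after a lock-free Load.
func mmapReadySeries(all []*memSeries) (locked int) {
	for _, s := range all {
		if s.headChunkCount.Load() < 2 {
			continue // clean series: no lock taken
		}
		s.mtx.Lock()
		// Assumption: after mmapping, only the open head chunk remains.
		s.headChunkCount.Store(1)
		s.mtx.Unlock()
		locked++
	}
	return locked
}

func main() {
	clean, dirty := &memSeries{}, &memSeries{}
	clean.headChunkCount.Store(1)
	dirty.headChunkCount.Store(3)
	fmt.Println(mmapReadySeries([]*memSeries{clean, dirty}))
}
```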
Also fix pre-existing comment inaccuracies in the touched code:
headChunks.next -> headChunks.prev, mmapHeadChunks() -> mmapChunks()
in the doc comment, and a grammar error.
---------
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
The test only writes ~80 samples, so the default 512MB chunk segment
pre-allocation during compaction is unnecessary. Use 1MB instead to
avoid large file allocations on constrained CI environments.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
* tsdb: fix prometheus_tsdb_head_chunks going negative after WAL replay
When truncateStaleSeries deletes a series (writing a full-range tombstone
to the WAL) and the same label set is immediately re-created, WAL replay
queues the following sequence on the same processor shard for the shared
memSeries pointer:
reset(mSeries, M mmappedChunks, walRef=old)
deleteSeriesByID(old)
reset(mSeries, N mmappedChunks, walRef=new)
deleteSeriesByID correctly subtracts M from the gauge but does not clear
series.mmappedChunks. The subsequent reset subtracts M again, driving
prometheus_tsdb_head_chunks negative when M > N.
Fix by setting series.mmappedChunks = nil in deleteSeriesByID after
accounting for those chunks.
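The double subtraction can be reproduced with a simplified model. The `head` and `series` types and the plain int gauge are hypothetical; the sketch assumes reset subtracts the old mmapped chunk count and adds the new one, as the sequence above implies.

```go
package main

import "fmt"

// series and head are hypothetical reductions of memSeries and Head.
type series struct{ mmappedChunks []int }

type head struct{ chunks int } // stands in for prometheus_tsdb_head_chunks

// reset replaces the series' mmapped chunks and adjusts the gauge.
func (h *head) reset(s *series, chunks []int) {
	h.chunks -= len(s.mmappedChunks)
	s.mmappedChunks = chunks
	h.chunks += len(chunks)
}

// deleteSeries subtracts the series' chunks; with clear set it also
// drops the slice so a later reset cannot subtract them again.
func (h *head) deleteSeries(s *series, clear bool) {
	h.chunks -= len(s.mmappedChunks)
	if clear {
		s.mmappedChunks = nil
	}
}

// replay runs reset(M=3), delete, reset(N=1) on the same series pointer.
func replay(clearOnDelete bool) int {
	h, s := &head{}, &series{}
	h.reset(s, make([]int, 3))
	h.deleteSeries(s, clearOnDelete)
	h.reset(s, make([]int, 1))
	return h.chunks
}

func main() {
	fmt.Println(replay(false)) // M is subtracted twice
	fmt.Println(replay(true))  // gauge matches the N chunks actually held
}
```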
Fixes #10884
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jorge Creixell <jcreixell@gmail.com>
* Simplify test
- Re-use appending helper
- Cleanup comments
Signed-off-by: Jorge Creixell <jcreixell@gmail.com>
* Improve comments in test
Signed-off-by: Jorge Creixell <jcreixell@gmail.com>
* Fix formatting
Signed-off-by: Jorge Creixell <jcreixell@gmail.com>
* Improve comment
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: Jorge Creixell <jcreixell@gmail.com>
---------
Signed-off-by: Jorge Creixell <jcreixell@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>