promql: Add metric and stats to track total samples read per query
[FEATURE] PromQL: Expose per-query `samplesRead` (and `samplesReadPerStep` with `stats=all` and the `promql-per-step-stats` feature flag) in the query stats response, and add the `prometheus_engine_query_samples_read_total` engine counter. `samplesRead` reflects storage I/O distinct from `totalQueryableSamples`, which counts samples loaded into the evaluator (and so over-counts when a sample is reused across multiple range-vector windows).
[BUGFIX] PromQL: A range query whose `end` was not aligned to `step` caused subqueries inside it to evaluate past the parent's last actual step, inflating `peakSamples` in the query stats and against the `query.max-samples` limit, and wasting storage I/O reading samples that were never used in the result.
[BUGFIX] PromQL: A range query containing an at-modifier-unsafe function over a range-vector with an `@` modifier (e.g. `predict_linear(metric[60s] @ T, X)`) silently under-counted `totalQueryableSamples` for steps after step 0.
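The unaligned-`end` fix above hinges on where a range query's last actual step falls. A minimal sketch of that arithmetic, assuming millisecond timestamps; the `lastStep` helper is hypothetical and is not an engine API:

```go
package main

import "fmt"

// lastStep returns the timestamp of the final step a range query actually
// evaluates. All arguments are in milliseconds; step must be positive and
// end must not precede start. Hypothetical helper for illustration only.
func lastStep(start, end, step int64) int64 {
	// Integer division drops the partial step between the last step and end.
	return start + ((end-start)/step)*step
}

func main() {
	// start=0s, end=100s, step=30s: steps run at 0s, 30s, 60s and 90s.
	fmt.Println(lastStep(0, 100_000, 30_000)) // 90000
}
```

Anything evaluated between the returned timestamp and `end` (here, between 90s and 100s) cannot contribute to the result, which is the window the fix stops subqueries from reading.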
- Rename chunkenc.Compatible to CompatibleValues and document that it
concerns sample-value encoding only; ST compatibility is the caller's
responsibility. Update callers and tests.
- Use audience-neutral wording in validateOpts errors (the tsdb package is
also consumed as a library) instead of Prometheus-only config-key names.
- Correct docs/feature_flags.md: with st-storage active and a resolved XOR
encoding, Prometheus fails configuration validation rather than warning.
- Drop the unrelated LICENSE-check change from scripts/sync_repo_files.sh.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
--enable-feature=xor2-encoding sets tsdb.Options.FloatChunkEncoding to
EncXOR2 (a concrete encoding value, not a feature-flag bool). ApplyConfig
uses this as the fallback when chunk_encoding.floats is absent. No global
mutable state is introduced in the config package.
Signed-off-by: Julien Pivotto <julien.pivotto@grafana.com>
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
Adding this as a meta label makes it possible to dynamically configure
this setting through discovery labels.
This helps in use cases such as Kubernetes, where it could be enabled with a
pod annotation.
Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
Add a per-query samples-read counter that distinguishes storage I/O
from the existing TotalSamples count. TotalSamples reports samples
loaded into the evaluator and so over-counts when a sample is reused
across multiple range-vector windows; samplesRead reflects what the
storage layer actually had to read.
For range-vector functions in range queries, samplesRead applies
delta semantics: step 0 counts the full window, later steps count
only the points not already present in the previous step's window.
For other query shapes samplesRead equals TotalSamples.
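The delta semantics described above can be sketched for a single series as follows. This is an illustration under stated assumptions, not the engine's code: `countSamples` is a hypothetical name, timestamps are plain integers, and each window is taken to be left-open and right-closed, `(t-rng, t]`.

```go
package main

import "fmt"

// countSamples walks a range query over one series' sample timestamps.
// total counts every sample in every step's window (the TotalSamples view);
// read counts only samples that were not already in the previous step's
// window (the samplesRead view). step must be positive.
func countSamples(ts []int64, start, end, step, rng int64) (total, read int) {
	first := true
	var prev int64 // timestamp of the previous step
	for t := start; t <= end; t += step {
		for _, s := range ts {
			if s > t-rng && s <= t {
				total++
				// Anything at or before prev was inside the previous window.
				if first || s > prev {
					read++
				}
			}
		}
		prev, first = t, false
	}
	return total, read
}

func main() {
	var ts []int64
	for s := int64(0); s <= 120; s += 10 {
		ts = append(ts, s) // one sample every 10 time units
	}
	total, read := countSamples(ts, 60, 120, 30, 60)
	fmt.Println(total, read) // 18 12
}
```

With a window of 60 and a step of 30, each of the three steps loads 6 samples (18 in total), but steps after the first only add 3 new ones each, giving 12 samples read.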
Expose the new counter:
- as `samplesRead` and `samplesReadPerStep` in /api/v1/query{,_range}
stats responses (the per-step variant only with stats=all and the
per-step stats feature flag enabled);
- as the `prometheus_engine_query_samples_read_total` Prometheus
counter, reporting the cumulative samples-read across all queries;
- in the OpenAPI schemas.
Also update the per-step stats feature flag and querying API
documentation, and add unit tests covering instant queries, range
queries, range-vector functions, and subqueries.
Signed-off-by: Dan Cech <dcech@grafana.com>
Signed-off-by: Kai Tanaka <275430420+quyentonndbs@users.noreply.github.com>
Co-authored-by: Kai Tanaka <275430420+quyentonndbs@users.noreply.github.com>
Implement least(a, b) and greatest(a, b) as scalar-returning PromQL
functions backed by math.Min and math.Max respectively.
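A minimal sketch of the two-argument semantics, assuming plain `float64` scalars; the wrappers below are illustrative and do not reproduce the engine's function signatures:

```go
package main

import (
	"fmt"
	"math"
)

// least returns the smaller of a and b, following math.Min.
func least(a, b float64) float64 { return math.Min(a, b) }

// greatest returns the larger of a and b, following math.Max.
func greatest(a, b float64) float64 { return math.Max(a, b) }

func main() {
	fmt.Println(least(3, 5), greatest(3, 5)) // 3 5
}
```

Because these wrap `math.Min` and `math.Max`, a NaN argument paired with a finite value yields NaN rather than the finite value.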
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
- Clear stCache state in scrape loop when append fails for existing series.
- Copy float histogram before storing in cache to avoid mutation.
- Add test for state mutation on OOO failure.
- Update docs to reflect behavior on failure.
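The first two items above can be illustrated with a small sketch. The types here are simplified stand-ins (the real histogram and cache types are not shown in this message), and `clone`, `store` and `drop` are hypothetical names:

```go
package main

import "fmt"

// floatHistogram is a simplified stand-in for a float histogram sample.
type floatHistogram struct {
	Sum     float64
	Buckets []float64
}

// clone returns a deep copy that shares no slice memory with h.
func (h *floatHistogram) clone() *floatHistogram {
	c := *h
	c.Buckets = append([]float64(nil), h.Buckets...)
	return &c
}

// stCache is a hypothetical per-series cache of the last stored histogram.
type stCache map[string]*floatHistogram

// store keeps a private copy of h, so later changes made by the caller
// cannot alter the cached state.
func (c stCache) store(series string, h *floatHistogram) {
	c[series] = h.clone()
}

// drop clears the cached state for a series, e.g. after a failed append.
func (c stCache) drop(series string) {
	delete(c, series)
}

func main() {
	cache := stCache{}
	h := &floatHistogram{Sum: 3, Buckets: []float64{1, 2}}
	cache.store("a", h)
	h.Buckets[0] = 99                  // caller mutates its own histogram
	fmt.Println(cache["a"].Buckets[0]) // 1
}
```

Storing the pointer directly instead of a copy would let the mutation on the caller's side show up in the cache.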
Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>
Rename the `min()` and `max()` duration expression functions to
`least()` and `greatest()` to avoid conflicts with the existing
PromQL aggregate functions `min` and `max`.
Update documentation and tests accordingly.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
Add a CheckpointFromInMemorySeries option to agent.Options that enables a faster checkpoint implementation, which skips re-reading segments and uses in-memory data instead.
* feat: impl agent-specific checkpoint dir
* feat: impl ActiveSeries interface
* feat: use new checkpoint impl
* feat: hide new checkpoint impl behind a feature flag
* feat: add benchmark
* feat: add benchstat case
* feat: use feature flag in bench
* feat: use same labels for persisted state and append
* feat: set WAL segment size
* feat: add checkpoint size metric and bump series size
* feat: wal replay test
* feat: expose new checkpoint opts in cmd flags
* feat: update cli doc
* add ActiveSeries and DeletedSeries doc
Signed-off-by: x1unix <9203548+x1unix@users.noreply.github.com>
Signed-off-by: Denys Sedchenko <9203548+x1unix@users.noreply.github.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
* test(cmd/prometheus): add TestFeatureFlagsDocumented and fix help text
Add TestFeatureFlagsDocumented to ensure the --enable-feature help text
and docs/feature_flags.md list the same set of flags.
The help text was out of sync with the documentation:
- Flags present in docs but missing from help text: `auto-reload-config`,
`metadata-wal-records`, `otlp-native-delta-ingestion`,
`promql-delayed-name-removal`, `type-and-unit-labels`. Added them.
- Flags present in help text but missing from docs: `auto-gomaxprocs`,
`expand-external-labels`. Removed them.
The help text is now sorted for better readability and kept in sync
with the documentation.
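The kind of comparison such a test performs can be sketched as a two-way set difference. This is an assumption about the general approach, not the actual test; `missingFrom` is a hypothetical helper, and the flag lists in `main` are a small subset taken from the message above:

```go
package main

import (
	"fmt"
	"sort"
)

// missingFrom returns the names present in want but absent from have, sorted.
func missingFrom(want, have []string) []string {
	seen := make(map[string]struct{}, len(have))
	for _, name := range have {
		seen[name] = struct{}{}
	}
	var missing []string
	for _, name := range want {
		if _, ok := seen[name]; !ok {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	docs := []string{"auto-reload-config", "metadata-wal-records"}
	help := []string{"auto-gomaxprocs", "metadata-wal-records"}
	fmt.Println(missingFrom(docs, help)) // [auto-reload-config]
	fmt.Println(missingFrom(help, docs)) // [auto-gomaxprocs]
}
```

Both differences being empty is the condition under which the two lists are in sync.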
Also, parsing an empty `--enable-feature` value now prints
`msg="Unknown option for --enable-feature" option=""` instead of nothing.
Signed-off-by: Ayoub Mrini <ayoubmrini424@gmail.com>
* main.go remove default for --enable-feature to avoid unwanted
Signed-off-by: Ayoub Mrini <ayoubmrini424@gmail.com>
---------
Signed-off-by: Ayoub Mrini <ayoubmrini424@gmail.com>
This adds a /api/v1/status/self_metrics endpoint that lets the frontend
fetch metrics about the server itself, making it easier to build frontend
pages that show the current server state. This is needed because the
server's own /metrics endpoint would be hard to parse from the frontend and
would also require CORS permissions on that endpoint (at least when the
frontend dashboard is served from a different origin).
Signed-off-by: Julius Volz <julius.volz@gmail.com>
The filter field was documented as targeting the Catalog API, but since
PR #17349 it has also been passed to the Health API. This broke existing
configs using Catalog-only fields like ServiceTags, which the Health API
rejects (it uses Service.Tags instead).
Introduce a separate health_filter field that is passed exclusively to
the Health API, while filter remains catalog-only. Update the docs to
explain the two-phase discovery (Catalog for service listing, Health for
instances) and the field name differences between the two APIs.
Fixes #18479
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
Building off config-specific Prometheus refresh metrics from an earlier
PR (https://github.com/prometheus/prometheus/pull/17138), this deletes
refresh metrics like `prometheus_sd_refresh_duration_seconds` and
`prometheus_sd_refresh_failures_total` when the underlying scrape job
configuration is removed on reload. This reduces unneeded cardinality
from scrape-job-specific metrics while still preserving metrics that
indicate the overall health of a service discovery engine.
For example,
`prometheus_sd_refresh_failures_total{config="linode-servers",mechanism="linode"} 1`
will no longer be exported by Prometheus when the `linode-servers`
scrape job for the Linode service provider is removed. The generic,
service-discovery-specific `prometheus_sd_linode_failures_total` metric
will persist, however.
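The cleanup on reload can be pictured with a standard-library-only sketch in which a map stands in for a labelled counter vector. The `refreshFailures` type and `prune` method are hypothetical and do not reflect the actual discovery manager code:

```go
package main

import (
	"fmt"
	"sort"
)

// refreshFailures maps a scrape job's config name to its failure count,
// standing in for a counter vector labelled by config.
type refreshFailures map[string]int

// prune deletes the series of every config that is no longer active and
// returns the removed config names, sorted.
func (m refreshFailures) prune(active map[string]bool) []string {
	var removed []string
	for name := range m {
		if !active[name] {
			delete(m, name) // deleting during range is safe in Go
			removed = append(removed, name)
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	m := refreshFailures{"linode-servers": 1, "node": 0}
	// After a reload only the "node" scrape job remains configured.
	fmt.Println(m.prune(map[string]bool{"node": true})) // [linode-servers]
}
```

Metrics that are not keyed by config name are untouched by such a pass, which matches the generic per-mechanism metric persisting.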
* fix: add targetsMtx lock for targets access
* test: validate refresh/discover metrics are gone
* ref: combine sdMetrics and refreshMetrics
Good idea from @bboreham to combine sdMetrics and refreshMetrics!
They're always passed around together and don't have much of a
reason not to be combined. mechanismMetrics makes it clear what kind of
metrics this is used for (service discovery mechanisms).
---------
Signed-off-by: Will Bollock <wbollock@linode.com>