Fixes #18078
`TestReshardPartialBatch` runs 100 iterations of `shards.stop()` / `shards.start(1)` with a 2-second deadline per iteration, treating any slower completion as a deadlock.
Under normal conditions one iteration takes ~10–50ms, so the 2s window is plenty. But the stack trace attached to #18078 shows the test timing out while 72 `runShard` goroutines from other parallel tests are active. They come from `TestReshard` (which spawns up to 60 shards) and `TestReshardRaceWithStop` (which drives 100 reshards back-to-back). Under that kind of scheduler pressure, a legitimate, non-deadlocked `stop()` can occasionally take longer than 2s, and the test misreports it as a deadlock.
`t.Parallel()` was added to `TestReshardPartialBatch` (and most of its neighbours) in bulk in fe1bb5337 as part of a general "parallelize everything" pass, and the flakiness dates from that commit. The other reshard tests are less timing-sensitive, but this one explicitly measures shutdown latency, so running it alongside them invalidates its premise.
Dropping `t.Parallel()` here restores the test's original isolation. It still detects a real deadlock (the iteration would never complete); it just stops firing false positives when the CPU is saturated by its siblings.
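For illustration, a minimal sketch of the kind of deadline check the test relies on. `stopWithDeadline` is a hypothetical helper, not the code in the test; it only shows why a fixed deadline cannot distinguish a deadlock from a healthy `stop()` that was starved of CPU.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// stopWithDeadline (hypothetical) runs stop in a goroutine and returns an
// error if it has not returned within d. A timeout here means "slow", which
// is only evidence of a deadlock when nothing else competes for the CPU.
func stopWithDeadline(stop func(), d time.Duration) error {
	done := make(chan struct{})
	go func() {
		stop()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-time.After(d):
		return errors.New("stop did not complete before the deadline")
	}
}

func main() {
	fast := func() { time.Sleep(5 * time.Millisecond) }
	// Healthy but delayed, standing in for a stop() under scheduler pressure.
	starved := func() { time.Sleep(200 * time.Millisecond) }

	fmt.Println("fast:", stopWithDeadline(fast, 100*time.Millisecond))
	fmt.Println("starved:", stopWithDeadline(starved, 100*time.Millisecond))
}
```

The second call is reported as a failure even though `starved` would have finished on its own, which is the false positive described above.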
```release-notes
NONE
```
Signed-off-by: texasich <101962694+texasich@users.noreply.github.com>
The previous commit changed CertificatePassword from string to config_util.Secret but did not add the corresponding import. The CI build for this PR alone passed only because GitHub builds the merge of the PR with upstream main, which already imports config_util (introduced in upstream commit 5ccebcdb3 for ClientSecret). Add the import so the PR's azuread.go is self-consistent.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Kaveesh Dubey (from Dev Box) <kadubey@microsoft.com>
The ClientSecret field in OAuthConfig was typed as plain string,
causing it to be exposed in plaintext via the /-/config HTTP endpoint.
Change it to config_util.Secret so Prometheus redacts it as <secret>.
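A rough sketch of how a redacting string type behaves when a config struct is serialized. The `Secret` type below is a stdlib-only stand-in written for this example, not the `config_util.Secret` implementation, and `oauthConfig` and `render` are hypothetical names; JSON is used here only because it is in the standard library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// Secret is a stand-in for a redacting string type (assumption: the real
// type behaves similarly when marshalled).
type Secret string

// MarshalJSON hides the value so that serialized configs never contain it.
func (s Secret) MarshalJSON() ([]byte, error) {
	if s == "" {
		return []byte(`""`), nil
	}
	return []byte(`"<secret>"`), nil
}

// oauthConfig is a hypothetical config struct with one sensitive field.
type oauthConfig struct {
	ClientID     string `json:"client_id"`
	ClientSecret Secret `json:"client_secret"`
}

// render serializes the config the way a config endpoint might.
func render(c oauthConfig) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // keep "<secret>" readable instead of \u003c...
	if err := enc.Encode(c); err != nil {
		return "", err
	}
	return strings.TrimSpace(buf.String()), nil
}

func main() {
	out, err := render(oauthConfig{ClientID: "my-app", ClientSecret: "hunter2"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

With a plain `string` field the same call would print the client secret verbatim.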
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
Apply the same io.LimitReader guard (decodeReadLimit = 32 MiB) to the
OTLP write decoder that remote read already uses, so that a gzip-encoded request
body cannot decompress to unbounded memory.
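A minimal sketch of bounding decompression with `io.LimitReader`. The function and helper names are hypothetical, and rejecting an oversized body with an error is an assumption of this sketch; the message above does not say whether the real decoder rejects or truncates.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

var errTooLarge = errors.New("decompressed body exceeds limit")

// decodeGzipLimited decompresses r but never buffers more than limit+1 bytes.
func decodeGzipLimited(r io.Reader, limit int64) ([]byte, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	// Read one byte past the limit so that an oversized body is detectable.
	out, err := io.ReadAll(io.LimitReader(zr, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(out)) > limit {
		return nil, errTooLarge
	}
	return out, nil
}

// compressedZeros (demo helper) returns a gzip stream of n zero bytes, which
// compresses very well and so mimics a small body that inflates a lot.
func compressedZeros(n int) io.Reader {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(make([]byte, n)); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return &buf
}

func main() {
	_, err := decodeGzipLimited(compressedZeros(1<<20), 1<<10)
	fmt.Println("1 MiB body, 1 KiB limit:", err)

	out, err := decodeGzipLimited(compressedZeros(512), 1<<10)
	fmt.Println("512 B body, 1 KiB limit:", len(out), err)
}
```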
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
See the detailed analysis: https://docs.google.com/document/d/1efVAMcEw7-R_KatHHcobcFBlNsre-DoThVHI8AO2SDQ/edit?tab=t.0
I ran extensive benchmarks using synthetic data as well as real WAL segments pulled from the prombench runs.
All benchmarks are here: https://github.com/prometheus/prometheus/compare/bwplotka/wal-reuse?expand=1
* optimization(tsdb/wlog): reuse Ref* buffers across WAL watchers' reads
Signed-off-by: bwplotka <bwplotka@gmail.com>
* optimization(tsdb/wlog): avoid expensive error wraps
Signed-off-by: bwplotka <bwplotka@gmail.com>
* optimization(tsdb/wlog): reuse array for filtering
Signed-off-by: bwplotka <bwplotka@gmail.com>
* fmt
Signed-off-by: bwplotka <bwplotka@gmail.com>
* lint fix
Signed-off-by: bwplotka <bwplotka@gmail.com>
* tsdb/record: add test for clear() on histograms
Signed-off-by: bwplotka <bwplotka@gmail.com>
* updated WriteTo with what's currently expected
Signed-off-by: bwplotka <bwplotka@gmail.com>
---------
Signed-off-by: bwplotka <bwplotka@gmail.com>
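As a generic illustration of the buffer-reuse and filter-in-place techniques named in the tsdb/wlog commits above: the sketch below uses a hypothetical `refSample` type and made-up function names, and is not the watcher code itself.

```go
package main

import "fmt"

// refSample is a simplified, hypothetical record type for this sketch.
type refSample struct {
	Ref uint64
	T   int64
	V   float64
}

// decodeSamples writes n samples into buf, reusing its backing array.
// Passing the previous result back in avoids a fresh allocation per read
// once the buffer has grown to the working size.
func decodeSamples(buf []refSample, n int) []refSample {
	buf = buf[:0]
	for i := 0; i < n; i++ {
		buf = append(buf, refSample{Ref: uint64(i), T: int64(i), V: float64(i)})
	}
	return buf
}

// keepFromRef filters in place: the result shares samples' backing array,
// so no second slice is allocated for the filtered output.
func keepFromRef(samples []refSample, minRef uint64) []refSample {
	kept := samples[:0]
	for _, s := range samples {
		if s.Ref >= minRef {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	buf := decodeSamples(nil, 8)
	fmt.Println("first read:", len(buf), cap(buf))

	buf = decodeSamples(buf, 4) // reuses the array allocated above
	fmt.Println("second read:", len(buf), cap(buf))

	kept := keepFromRef(buf, 2)
	fmt.Println("kept:", len(kept))
}
```

The trade-off is that callers must not retain the returned slice across reads, because the next read overwrites it.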
In addHistogramDataPoints, exemplars assigned to the +Inf bucket of one
data point were carried over into the _sum and _count Append calls of
the next data point via the shared appOpts. Clear appOpts.Exemplars at
the start of each loop iteration to restore the nil-exemplar semantics
that existed before the AppenderV2 migration.
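A simplified model of the carry-over bug, assuming the per-data-point order is `_sum`, `_count`, then buckets. All types and the `appendAll` function are hypothetical stand-ins for this sketch; each "append" just records how many exemplars would have been attached.

```go
package main

import "fmt"

type exemplar struct{ traceID string }

// appendOptions stands in for the shared options value reused across calls.
type appendOptions struct{ Exemplars []exemplar }

type dataPoint struct{ infExemplars []exemplar }

// appendAll records the exemplar count seen by each simulated Append call.
func appendAll(points []dataPoint, clearEachIteration bool) []int {
	var calls []int
	var opts appendOptions // shared across loop iterations
	for _, p := range points {
		if clearEachIteration {
			opts.Exemplars = nil // the fix: start every data point clean
		}
		calls = append(calls, len(opts.Exemplars)) // _sum
		calls = append(calls, len(opts.Exemplars)) // _count

		opts.Exemplars = p.infExemplars
		calls = append(calls, len(opts.Exemplars)) // +Inf bucket
	}
	return calls
}

func main() {
	pts := []dataPoint{
		{infExemplars: []exemplar{{traceID: "a"}}},
		{infExemplars: []exemplar{{traceID: "b"}}},
	}
	fmt.Println("without clearing:", appendAll(pts, false))
	fmt.Println("with clearing:   ", appendAll(pts, true))
}
```

Without clearing, the second data point's `_sum` and `_count` entries report one exemplar each, inherited from the first data point's +Inf bucket.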
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
The OTLP write handler and the PRW v2 histogram append path were missing
ErrTooOldSample from their error type checks, causing these errors to
fall through to the default case and return HTTP 500 Internal Server Error.
This triggered unnecessary retries in OTLP clients like the Python SDK.
The PRW v1 write handler (line 115) and the PRW v2 sample append path
(line 377) already correctly handle ErrTooOldSample as a 400, and this
change makes the remaining paths consistent.
Also adds ErrTooOldSample to the v1 sample/histogram log checks so
these errors are properly logged instead of silently returned.
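A minimal sketch of the error-to-status mapping described here. The sentinel errors and `statusFor` are hypothetical stand-ins defined locally; the real handlers check storage errors and have more cases.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Local stand-ins for storage sentinel errors (names are assumptions).
var (
	errOutOfOrderSample = errors.New("out of order sample")
	errTooOldSample     = errors.New("too old sample")
	errDisk             = errors.New("disk failure")
)

// statusFor maps an append error to an HTTP status. Errors caused by the
// client's data map to 400 so that clients do not retry them; anything
// unrecognised falls through to 500.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusNoContent
	case errors.Is(err, errOutOfOrderSample),
		errors.Is(err, errTooOldSample): // omitting this case yields a 500
		return http.StatusBadRequest
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	wrapped := fmt.Errorf("append failed: %w", errTooOldSample)
	fmt.Println(statusFor(wrapped))
	fmt.Println(statusFor(errDisk))
}
```

Using `errors.Is` rather than `==` keeps the mapping correct when the error has been wrapped on its way up.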
Fixes #16645
Signed-off-by: Varun Chawla <varun_6april@hotmail.com>
Initial implementation of https://github.com/prometheus/prometheus/issues/17790.
Only implements ST-per-sample for Counters. Tests and benchmarks updated.
Note: This increases the size of the RefSample object for all users, whether st-per-sample is turned on or not.
Signed-off-by: Owen Williams <owen.williams@grafana.com>
The createAttributes error was incorrectly returning nil instead of err,
causing errors to be silently discarded. This could lead to silent data
loss for sum metrics during OTLP ingestion.
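The class of bug is easy to reproduce in isolation. In the sketch below, `createAttributes` and `addSumPoints` are simplified, hypothetical versions written for this example; only the `return ..., err` line reflects the fix described above.

```go
package main

import (
	"errors"
	"fmt"
)

var errBadAttributes = errors.New("invalid attributes")

// createAttributes (simplified stand-in) fails for an empty metric name.
func createAttributes(name string) (map[string]string, error) {
	if name == "" {
		return nil, errBadAttributes
	}
	return map[string]string{"__name__": name}, nil
}

// addSumPoints converts each name and reports how many points it appended.
func addSumPoints(names []string) (int, error) {
	appended := 0
	for _, n := range names {
		if _, err := createAttributes(n); err != nil {
			// Returning nil here instead of err would make the caller
			// believe every point was ingested.
			return appended, err
		}
		appended++
	}
	return appended, nil
}

func main() {
	n, err := addSumPoints([]string{"requests_total", "", "errors_total"})
	fmt.Println(n, err)
}
```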
Fixes #17953
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
* improve readability of timeseries filtering by using the slices package
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* ensure that BenchmarkBuildTimeSeries doesn't account for the building of
the actual proto in the benchmark results; we only care about the
buildTimeSeries call
Signed-off-by: Callum Styan <callumstyan@gmail.com>
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
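A short sketch of what filtering with the `slices` package can look like for the change above. The `timeSeries` type and the timestamp predicate are assumptions made for this example; the actual filter condition in `buildTimeSeries` is not stated in the message. Requires Go 1.21 or later.

```go
package main

import (
	"fmt"
	"slices"
)

// timeSeries is a simplified, hypothetical series type for this sketch.
type timeSeries struct {
	Name      string
	Timestamp int64
}

// dropOlderThan removes series whose timestamp is below min. DeleteFunc
// filters in place and returns the shortened slice, replacing a hand-written
// index-juggling loop.
func dropOlderThan(series []timeSeries, min int64) []timeSeries {
	return slices.DeleteFunc(series, func(ts timeSeries) bool {
		return ts.Timestamp < min
	})
}

func main() {
	series := []timeSeries{
		{Name: "a", Timestamp: 10},
		{Name: "b", Timestamp: 50},
		{Name: "c", Timestamp: 5},
		{Name: "d", Timestamp: 70},
	}
	kept := dropOlderThan(series, 20)
	fmt.Println(len(kept), kept[0].Name, kept[1].Name)
}
```

For the benchmark half of the change, the usual way to exclude setup work from the measurement is to do it before calling `b.ResetTimer()` or between `b.StopTimer()` and `b.StartTimer()`.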
* otlptranslator: filter __name__ from OTLP attributes to prevent duplicates
OTLP metrics can have a __name__ attribute which, when combined with the
metric name passed via extras, creates duplicate __name__ labels.
This commit filters out any __name__ attribute from the OTLP metric attributes.
Also rename TestCreateAttributes to TestPrometheusConverter_createAttributes
for consistency, and add test cases for __name__, __type__, and __unit__ OTLP metric attributes.
---------
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
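To make the `__name__` filtering described above concrete, here is a small sketch. The `label` type and `buildLabels` function are hypothetical; the real `createAttributes` handles many more concerns (sanitization, promoted attributes, extras).

```go
package main

import "fmt"

// label is a simplified name/value pair for this sketch.
type label struct{ Name, Value string }

// buildLabels copies the OTLP attributes, skipping any __name__ attribute,
// then adds the metric name exactly once.
func buildLabels(metricName string, attrs []label) []label {
	out := make([]label, 0, len(attrs)+1)
	for _, a := range attrs {
		if a.Name == "__name__" {
			continue // would otherwise duplicate the label added below
		}
		out = append(out, a)
	}
	return append(out, label{Name: "__name__", Value: metricName})
}

func main() {
	attrs := []label{
		{Name: "__name__", Value: "from_attribute"},
		{Name: "method", Value: "GET"},
	}
	fmt.Println(buildLabels("http_requests_total", attrs))
}
```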
* otlptranslator: add label caching for OTLP-to-Prometheus conversion
Add per-request caching to reduce redundant computation and allocations
during OTLP metric conversion:
1. Per-request label sanitization cache: Cache sanitized label names
within a request to avoid repeated string allocations for commonly
repeated labels like __name__, job, instance.
2. Resource-level label caching: Precompute and cache job, instance,
promoted resource attributes, and external labels once per
ResourceMetrics boundary instead of for each datapoint.
3. Scope-level label caching: Precompute and cache scope metadata labels
(otel_scope_name, otel_scope_version, etc.) once per ScopeMetrics
boundary.
4. LabelNamer instance caching: Reuse the LabelNamer struct across
datapoints within the same resource context.
These optimizations significantly reduce allocations and improve latency
for OTLP ingestion workloads with many datapoints per resource/scope.
---------
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
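A rough sketch of the per-request label sanitization cache (item 1 in the label caching commit above). The `sanitize` rule here is deliberately simplified and hypothetical (every byte outside `[A-Za-z0-9_]` becomes `_`); it is not the otlptranslator's actual naming logic, and the `misses` counter exists only to make the cache effect visible.

```go
package main

import "fmt"

// sanitize is a simplified stand-in for label-name sanitization.
func sanitize(name string) string {
	b := []byte(name)
	for i, ch := range b {
		ok := ch == '_' ||
			(ch >= 'a' && ch <= 'z') ||
			(ch >= 'A' && ch <= 'Z') ||
			(ch >= '0' && ch <= '9')
		if !ok {
			b[i] = '_'
		}
	}
	return string(b)
}

// sanitizeCache memoizes sanitized names for the lifetime of one request.
type sanitizeCache struct {
	names  map[string]string
	misses int // number of times sanitize actually ran
}

func newSanitizeCache() *sanitizeCache {
	return &sanitizeCache{names: make(map[string]string)}
}

// get returns the cached result, computing and storing it on first use.
func (c *sanitizeCache) get(name string) string {
	if s, ok := c.names[name]; ok {
		return s
	}
	c.misses++
	s := sanitize(name)
	c.names[name] = s
	return s
}

func main() {
	cache := newSanitizeCache() // one cache per request, discarded afterwards
	for i := 0; i < 1000; i++ {
		cache.get("http.method")
		cache.get("service.name")
	}
	fmt.Println(cache.get("http.method"), cache.misses)
}
```

Scoping the cache to a single request keeps it bounded by the number of distinct label names in that request and avoids any cross-request synchronization.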