Commit graph

16503 commits

Author SHA1 Message Date
Ahmed Hassan
01be7bfb2e add NumFloatSamples to TSDB block stats
Signed-off-by: Ahmed Hassan <afayekhassan@gmail.com>
2025-07-07 13:48:18 -07:00
Lukasz Mierzwa
559fd44be6 Rename labels.go -> labels_slicelabels.go
labels.go is now holding slicelabels code, so let's rename it.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-07 12:37:42 +01:00
machine424
ffcba01c5a chore: do not hardcode required versions in README.md
add links to the sources of truth.

They are hard to keep up to date; the "go" one, for example,
is "wrong" (not really, as an old 1.22 binary could still
download/use newer toolchains...).

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-07 08:42:31 +01:00
Charles Korn
1e58d792a5
storage/remote: fix "http: read on closed response body" errors if chunkedSeriesSet.Next is called again after the series set is exhausted (#16838)
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-07 09:23:34 +02:00
Michael Hoffmann
44ee5e2ad6 promql: fix topk error on NaN argument for non-existing series
Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
2025-07-07 06:19:39 +00:00
RaphSku
938e5cb62b
docs: Added documentation for promtool configuration with http.config.file (#16522)
Includes an example.

Signed-off-by: RaphSku <rapsku.dev@gmail.com>
2025-07-07 00:00:51 +02:00
beorn7
c0a13223e7 promql: add tests to demonstrate extrapolation below zero
This shows how float counters cannot go below zero when extrapolating
for rate/increase, and how histograms do not have that protection yet,
leading to an overestimation of the rate/increase.

This also demonstrates edge cases where the count extrapolation does
not need to be limited, but an individual bucket still goes below
zero.

Signed-off-by: beorn7 <beorn@grafana.com>
2025-07-06 23:42:55 +02:00
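To illustrate what commit c0a13223e7 above describes for float counters, here is a minimal sketch of limiting extrapolation so that a counter never goes below zero. The function `extrapolatedIncrease` and its parameters are hypothetical, times are in seconds, and the sketch deliberately leaves out the other heuristics a real implementation applies; it is not the PromQL engine's code.

```go
package main

import "fmt"

// extrapolatedIncrease is a simplified, hypothetical model of increase()
// extrapolation for a float counter. All times are in seconds.
func extrapolatedIncrease(firstT, firstV, lastT, lastV, rangeStart, rangeEnd float64, limitToZero bool) float64 {
	delta := lastV - firstV
	sampledInterval := lastT - firstT
	durationToStart := firstT - rangeStart
	durationToEnd := rangeEnd - lastT

	// A counter starts at zero. Extrapolating further back than the point
	// where the line through the samples crosses zero would imply a
	// negative counter value, so the start extrapolation is capped there.
	if limitToZero && delta > 0 && firstV >= 0 {
		durationToZero := sampledInterval * (firstV / delta)
		if durationToZero < durationToStart {
			durationToStart = durationToZero
		}
	}
	return delta * (sampledInterval + durationToStart + durationToEnd) / sampledInterval
}

func main() {
	// Samples: value 1 at t=10s and value 11 at t=20s, range [0s, 20s].
	fmt.Println("limited:  ", extrapolatedIncrease(10, 1, 20, 11, 0, 20, true))
	fmt.Println("unlimited:", extrapolatedIncrease(10, 1, 20, 11, 0, 20, false))
}
```

With the limit, the increase comes out at about 11. Without it, the result is about 20, which would imply the counter was at -9 when the range started. That unlimited case is the kind of overestimation the commit message describes for histograms.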
Michael Hoffmann
21b1536b5a
storage: add projection fields to select hints (#16423)
This commit adds Projection metadata to SelectHints so that downstream
storage implementations can use it to save effort when answering
Select calls.

Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
2025-07-06 12:57:19 +02:00
Arve Knudsen
f561aa795d
OTLP receiver: Generate target_info samples between the earliest and latest samples per resource (#16737)
* OTLP receiver: Generate target_info samples between the earliest and latest samples per resource

Modify the OTLP receiver to generate target_info samples between the earliest
and latest samples per resource instead of only one for the latest timestamp.
The samples are spaced lookback delta/2 apart.

---------

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-07-04 14:38:16 +00:00
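The spacing rule from commit f561aa795d above can be sketched as follows. This is only an illustration: the function name `targetInfoTimestamps`, the millisecond units, and the choice to always emit a final sample at the latest timestamp are assumptions, not taken from the OTLP receiver code.

```go
package main

import "fmt"

// targetInfoTimestamps returns millisecond timestamps starting at earliestMs
// and spaced lookbackDeltaMs/2 apart. As an assumption of this sketch, the
// latest timestamp is always emitted as the final entry.
func targetInfoTimestamps(earliestMs, latestMs, lookbackDeltaMs int64) []int64 {
	interval := lookbackDeltaMs / 2
	if interval <= 0 || latestMs < earliestMs {
		// Degenerate input: fall back to a single sample at the latest timestamp.
		return []int64{latestMs}
	}
	var out []int64
	for ts := earliestMs; ts < latestMs; ts += interval {
		out = append(out, ts)
	}
	return append(out, latestMs)
}

func main() {
	// A resource with samples spanning 400s and a lookback delta of 5 minutes.
	fmt.Println(targetInfoTimestamps(0, 400_000, 300_000))
}
```

Spacing the samples at half the lookback delta means consecutive target_info samples are always closer together than the lookback window.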
Jon Kartago Lamida
819500bdbc
Add ByteSize method for Labels (#16717)
Add `ByteSize()` method to different labels implementations.
One use case is tracking the memory used by Labels.

Signed-off-by: Jon Kartago Lamida <me@lamida.net>
2025-07-04 15:09:01 +01:00
sujal shah
4408a6bcaf api: Create /status/tsdb/blocks endpoint.
This endpoint serves block data to the client.

Signed-off-by: sujal shah <sujalshah28092004@gmail.com>
2025-07-04 03:13:54 +05:30
machine424
c2d6e528e4
feat(discovery/kubernetes): allow attaching namespace metadata
to endpointslice, endpoints and pod roles

after injecting the labels for endpointslice, claude-4-sonnet
helped transpose the code and tests to endpoints and pod roles

fixes https://github.com/prometheus/prometheus/issues/9510
supersedes https://github.com/prometheus/prometheus/pull/13798

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Co-authored-by: Paul BARRIE <paul.barrie.calmels@gmail.com>
2025-07-03 19:41:08 +02:00
Arve Knudsen
5a5424cbc1
Consolidate around prometheus/common/model.ValidationScheme (#16806)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-07-03 15:37:46 +02:00
Bartlomiej Plotka
419d436a44
Merge pull request #16822 from prometheus/bump-otlptranslator
Bump otlptranslator to latest SHA
2025-07-03 12:40:31 +01:00
Matthias Loibl
61064cb774
Merge pull request #16819 from jscheffner/prometheus-dashboard-uid
mixin: add uid to prometheus overview dashboard
2025-07-03 11:16:05 +02:00
Julien
011c7fe87d
Merge pull request #16820 from prymitive/discoveryRace
discovery: fix a race in ApplyConfig while Prometheus is being stopped
2025-07-03 10:52:59 +02:00
dependabot[bot]
ce2e48f39e
build(deps): bump github.com/open-telemetry/opentelemetry-collector-contrib/processor/deltatocumulativeprocessor
Bumps [github.com/open-telemetry/opentelemetry-collector-contrib/processor/deltatocumulativeprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib) from 0.128.0 to 0.129.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-collector-contrib/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/CHANGELOG-API.md)
- [Commits](https://github.com/open-telemetry/opentelemetry-collector-contrib/compare/v0.128.0...v0.129.0)

---
updated-dependencies:
- dependency-name: github.com/open-telemetry/opentelemetry-collector-contrib/processor/deltatocumulativeprocessor
  dependency-version: 0.129.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-07-03 08:10:56 +00:00
github-actions[bot]
3c25eb2a0d
Merge pull request #16815 from prometheus/dependabot/go_modules/github.com/oklog/run-1.2.0
build(deps): bump github.com/oklog/run from 1.1.0 to 1.2.0
2025-07-03 10:09:10 +02:00
Ahmed Hassan
6d77b47d13 add numHistogramSamples to block stats
Signed-off-by: Ahmed Hassan <afayekhassan@gmail.com>
2025-07-02 19:52:04 -07:00
Arthur Silva Sens
0502f2d8fb
Bump otlptranslator to latest SHA
Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
2025-07-02 14:55:51 -03:00
Bryan Boreham
74aca682b7
Merge pull request #16807 from bboreham/test-sizeoflabels
[TESTS] Labels: Add a test for SizeOfLabels
2025-07-02 18:44:10 +01:00
Lukasz Mierzwa
b49d143595 Fix a race in discovery manager ApplyConfig & shutdown
If we call ApplyConfig() at the same time the manager is being stopped we might end up hanging forever.
This is because ApplyConfig() will try to cancel obsolete providers and wait until they are cancelled.
It's done by setting a done() function that calls Done() on a sync.WaitGroup:

```
if len(prov.newSubs) == 0 {
	wg.Add(1)
	prov.done = func() {
		wg.Done()
	}
}
```

then calling prov.cancel() and finally waiting until all providers have run their done() function,
by blocking on a wg.Wait() call.

For each provider there is a goroutine created by calling Manager.startProvider(*Provider):

```
func (m *Manager) startProvider(ctx context.Context, p *Provider) {
	m.logger.Debug("Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
	ctx, cancel := context.WithCancel(ctx)
	updates := make(chan []*targetgroup.Group)

	p.mu.Lock()
	p.cancel = cancel
	p.mu.Unlock()

	go p.d.Run(ctx, updates)
	go m.updater(ctx, p, updates)
}
```

It creates a context that can be cancelled and that cancel function becomes prov.cancel. This is what ApplyConfig will call.
If we look at the body of updater() method:

```
func (m *Manager) updater(ctx context.Context, p *Provider, updates chan []*targetgroup.Group) {
	// Ensure targets from this provider are cleaned up.
	defer m.cleaner(p)
	for {
		select {
		case <-ctx.Done():
			return
[...]
```

we can see that it will exit if that context is cancelled and that will trigger a call to Manager.cleaner().
That cleaner() is where done() is called.
So ApplyConfig() -> calls cancel() -> causes cleaner() to be executed -> calls done().

cancel() is also called from cancelDiscoverers() method that will be called by Manager.Run() when Manager is stopping:

```
func (m *Manager) Run() error {
	go m.sender()
	<-m.ctx.Done()
	m.cancelDiscoverers()
	return m.ctx.Err()
}
```

The problem is that if we call both ApplyConfig and stop the manager at the same time we might end up with:

- We call Manager.ApplyConfig()
- We stop the Manager
- Manager.cancelDiscoverers() is called
- Provider.cancel() is called for every Provider
- cancel() causes provider context to be cancelled which terminates updater() for given Provider
- cancelling context causes cleaner() method to be called for given Provider
- cleaner() calls done() and exits
- Provider is considered stopped at this point, there is no goroutine running that will call done() anymore
- ApplyConfig iterates providers and decides that one is obsolete and must be stopped
- It sets a custom done() function body with a WaitGroup.Done() call in it
- Then ApplyConfig waits until all Providers run done()
- But they are all stopped and no done() will be run
- We wait forever

This only happens if cancelDiscoverers() runs before ApplyConfig. If ApplyConfig runs first, done() will be called;
if cancelDiscoverers() is called first, it will stop updater() instances and so done() won't be called anymore.

Part of the problem is that there is no distinction between running and stopped providers. There is a Provider.IsStarted() method
that returns a bool based on the value of the cancel function, but ApplyConfig doesn't check it.
The second problem is that although there is a mutex on a Provider, it's not used much in the code, so two goroutines can try to read and/or write
provider.cancel and/or provider.done at the same time, making it all more likely to race.

The easiest way to fix it is to check if the provider is started inside ApplyConfig so we don't try to stop a provider that's already stopped.
For that we need to mark it as stopped after cancel() is called, by setting cancel to nil.
This also needs better lock usage to avoid different parts of the code trying to set cancel and done at the same time.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:03:10 +01:00
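The fix described in commit b49d143595 above (check whether a provider is started, mark it as stopped by setting cancel to nil, and guard both fields with the mutex) can be sketched roughly like this. The `provider` type, `newProvider`, `stop` and `stopObsolete` are hypothetical stand-ins written for this illustration; they are not the discovery manager's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a hypothetical, stripped-down stand-in for a discovery provider.
type provider struct {
	mu     sync.Mutex
	cancel func() // nil means the provider is stopped
	done   func() // run once by the cleanup goroutine, if set
}

// newProvider starts a goroutine that plays the role of updater(): it waits
// for cancellation and then runs the cleanup step, which calls done().
func newProvider() *provider {
	p := &provider{}
	stopCh := make(chan struct{})
	p.cancel = func() { close(stopCh) }
	go func() {
		<-stopCh
		p.mu.Lock()
		done := p.done
		p.done = nil
		p.mu.Unlock()
		if done != nil {
			done()
		}
	}()
	return p
}

// stop is the shutdown path: cancel the provider and mark it as stopped.
func (p *provider) stop() {
	p.mu.Lock()
	cancel := p.cancel
	p.cancel = nil
	p.mu.Unlock()
	if cancel != nil {
		cancel()
	}
}

// stopObsolete is the config-reload path. It only waits for providers that
// are still running, so a provider stopped earlier cannot block wg.Wait().
// It returns how many providers it had to wait for.
func stopObsolete(providers []*provider) int {
	var wg sync.WaitGroup
	waited := 0
	for _, p := range providers {
		p.mu.Lock()
		if p.cancel == nil {
			// Already stopped: no goroutine is left to call done().
			p.mu.Unlock()
			continue
		}
		wg.Add(1)
		p.done = wg.Done
		cancel := p.cancel
		p.cancel = nil
		p.mu.Unlock()
		cancel()
		waited++
	}
	wg.Wait()
	return waited
}

func main() {
	ps := []*provider{newProvider(), newProvider(), newProvider()}
	ps[0].stop() // simulate shutdown reaching one provider first
	fmt.Println("providers waited for:", stopObsolete(ps))
}
```

Because cancel is read and cleared under the lock, each provider is cancelled at most once, and the reload path never registers a done() callback on a provider whose cleanup has already run.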
Lukasz Mierzwa
357e652044 Add a test for a rare shutdown hang
Doing a config reload that needs to stop some providers while also sending SIGTERM to Prometheus at the same time can sometimes hang:

1: sync.WaitGroup.Wait [83 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
    sync         sema.go:110              runtime_SemacquireWaitGroup(*uint32(#166))
    sync         waitgroup.go:118         (*WaitGroup).Wait(*WaitGroup(#23))
    discovery    manager.go:276           (*Manager).ApplyConfig(#23, #167)
    main         main.go:964              main.func5(#120)
    main         main.go:1505             reloadConfig({#183, 0x1b}, 1, #40, #43, #50, {#31, 0xa, 0})
    main         main.go:1182             main.func22()
    run          group.go:38              (*Group).Run.func1(*Group(#26), #51)

Add a test for it.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:01:42 +01:00
wmTJc9IK0Q
c481aaf762
codemirror-promql: Preserve source files in npm package (#16804)
* Preserve source files in codemirror-promql package

This allows for sourcemaps to work when the package is imported via ESM-native CDNs such as esm.sh

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>

* Preserve source files in lezer-promql package

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>

---------

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>
2025-07-02 15:31:02 +02:00
jscheffner
1be2deec88 mixin: add uid to prometheus overview dashboard
Signed-off-by: jscheffner <jscheffner@users.noreply.github.com>
2025-07-02 15:02:50 +02:00
Julien
f62d0e0385
Merge pull request #16777 from roidelapluie/add-step-promql
Add step(), min() and max() in promql duration expressions
2025-07-02 14:27:45 +02:00
Julien
432f130a32 PromQL: min/max/step: Address review comments
Signed-off-by: Julien <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:36 +02:00
Julien Pivotto
984c8de0da PromQL: Fix printing +min()
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:17 +02:00
Julien Pivotto
3af0bdee68 PromQL: min/max/step: add more tests
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:17 +02:00
Julien Pivotto
ee7d5158a7 Add step(), min(a,b) and max(a,b) in promql duration expressions
step() is a new keyword introduced to represent the query step width in duration expressions.

min(a,b) and max(a,b) return the min and max from two duration expressions.

Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:17 +02:00
Bryan Boreham
4eafbcae93 lint
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-07-02 09:56:28 +01:00
Bryan Boreham
e7ac3f440d [TESTS] Labels: Add a test for SizeOfLabels
This requires a bit of repetition to cover all the different builds, but
it seems worth checking that the function does what is expected.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-07-02 09:31:27 +01:00
Bryan Boreham
507227781b [REFACTOR] Labels: Extract test case data from TestLabels_String
So we can use them in other tests.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-07-02 09:31:25 +01:00
Julius Volz
bfbae39931
Merge pull request #16716 from charleskorn/charleskorn/binops-docs
docs: clarify and expand binary operations documentation
2025-07-02 10:02:17 +02:00
dependabot[bot]
f7372ec7d7
build(deps): bump github.com/docker/docker
Bumps [github.com/docker/docker](https://github.com/docker/docker) from 28.2.2+incompatible to 28.3.0+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](https://github.com/docker/docker/compare/v28.2.2...v28.3.0)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-version: 28.3.0+incompatible
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-07-01 23:42:52 +00:00
dependabot[bot]
6bb7e088c5
build(deps): bump github.com/oklog/run from 1.1.0 to 1.2.0
Bumps [github.com/oklog/run](https://github.com/oklog/run) from 1.1.0 to 1.2.0.
- [Release notes](https://github.com/oklog/run/releases)
- [Commits](https://github.com/oklog/run/compare/v1.1.0...v1.2.0)

---
updated-dependencies:
- dependency-name: github.com/oklog/run
  dependency-version: 1.2.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-07-01 23:42:33 +00:00
dependabot[bot]
a92a5640c0
build(deps): bump google.golang.org/api from 0.238.0 to 0.239.0
Bumps [google.golang.org/api](https://github.com/googleapis/google-api-go-client) from 0.238.0 to 0.239.0.
- [Release notes](https://github.com/googleapis/google-api-go-client/releases)
- [Changelog](https://github.com/googleapis/google-api-go-client/blob/main/CHANGES.md)
- [Commits](https://github.com/googleapis/google-api-go-client/compare/v0.238.0...v0.239.0)

---
updated-dependencies:
- dependency-name: google.golang.org/api
  dependency-version: 0.239.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-07-01 23:42:25 +00:00
dependabot[bot]
12c0ef6e0c
build(deps): bump github.com/linode/linodego from 1.52.1 to 1.52.2
Bumps [github.com/linode/linodego](https://github.com/linode/linodego) from 1.52.1 to 1.52.2.
- [Release notes](https://github.com/linode/linodego/releases)
- [Commits](https://github.com/linode/linodego/compare/v1.52.1...v1.52.2)

---
updated-dependencies:
- dependency-name: github.com/linode/linodego
  dependency-version: 1.52.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-07-01 23:41:41 +00:00
dependabot[bot]
5ca501e648
build(deps): bump github/codeql-action from 3.28.16 to 3.29.2
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.28.16 to 3.29.2.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](28deaeda66...181d5eefc2)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 3.29.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-07-01 23:20:21 +00:00
Lukasz Mierzwa
bb690a23b9 Make sure we never call trackStaleness with nil cache entry
Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-01 14:22:01 +01:00
Lukasz Mierzwa
6687bf5653 Only add series to scrape cache if they were appended to TSDB
Scrape cache is used to emit StaleNaN markers after a series disappears so it should only hold entries for series that did end up in TSDB, which is not always the case due to sample_limit.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-01 14:22:01 +01:00
Lukasz Mierzwa
c75768739a Sort series by labels in requireEqual()
Tests that look at samples with StaleNaN values will fail because these samples are generated from map iteration and so the order can be unstable.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-01 14:22:01 +01:00
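The idea behind commit c75768739a above, sorting before comparing because map iteration order is not stable, can be shown with a small sketch. The `sample` type and `sortedByLabels` helper are hypothetical, and labels are represented as plain strings for brevity; the real requireEqual() helper may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a hypothetical test sample; labels are rendered as a string here.
type sample struct {
	labels string
	value  float64
}

// sortedByLabels returns a copy of in, ordered by label string, so that two
// slices built from map iteration can be compared element by element.
func sortedByLabels(in []sample) []sample {
	out := append([]sample(nil), in...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].labels < out[j].labels })
	return out
}

func main() {
	// Samples collected by ranging over a map arrive in an unspecified order.
	byLabels := map[string]float64{`{job="b"}`: 2, `{job="a"}`: 1, `{job="c"}`: 3}
	var got []sample
	for l, v := range byLabels {
		got = append(got, sample{labels: l, value: v})
	}
	fmt.Println(sortedByLabels(got))
}
```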
Lukasz Mierzwa
e2193f634f Add a test for StaleNaNs after hitting sample_limit
I was confused why there are no StaleNaN markers appended when a scrape hits sample_limit, but reading the code I see that's expected, so add a test for it.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-01 14:22:01 +01:00
Lukasz Mierzwa
0eedc046f4 Check ref value when appending staleness markers
Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-01 14:22:01 +01:00
Lukasz Mierzwa
872f03766c Pass last known ref ID when injecting staleness markers
Currently all staleness markers are appended for any sample that disappears from scrape cache, even if that sample was never appended to TSDB.
When staleness markers are appended they always use ref=0 as the SeriesRef, so the downstream appender doesn't know if the sample is for a known series or not.

This changes the scrape cache so the map used for staleness tracking stores the cache entry instead of only the label set. Having the cache entry means:
- we can ignore stale samples that didn't end up in TSDB (not in the scrape cache)
- we can append them to TSDB using correct ref value, so the appender knows if they are for known or unknown series

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-01 14:22:01 +01:00
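A rough sketch of the approach in commit 872f03766c above: staleness tracking keeps the whole cache entry, including the series ref, and only for series that were actually appended. All type and function names below (`cacheEntry`, `track`, `staleMarkers`, `isStaleNaN`) are hypothetical and keys are simplified to strings. The bit pattern used for the marker is the StaleNaN value Prometheus uses.

```go
package main

import (
	"fmt"
	"math"
)

// staleNaNBits is the NaN bit pattern used as a staleness marker.
const staleNaNBits uint64 = 0x7ff0000000000002

// cacheEntry is a hypothetical scrape cache entry holding the series ref.
type cacheEntry struct {
	ref    uint64
	labels string
}

// marker is what would be handed to the appender for a disappeared series.
type marker struct {
	ref    uint64
	labels string
	value  float64
}

// track records a series for staleness tracking only if it was appended.
func track(seen map[string]*cacheEntry, e *cacheEntry, appended bool) {
	if e == nil || !appended {
		return // e.g. dropped because of sample_limit
	}
	seen[e.labels] = e
}

// staleMarkers returns one marker per series present in prev but not in cur,
// carrying the last known ref instead of ref=0.
func staleMarkers(prev, cur map[string]*cacheEntry) []marker {
	var out []marker
	for key, e := range prev {
		if _, ok := cur[key]; ok || e == nil {
			continue
		}
		out = append(out, marker{ref: e.ref, labels: e.labels, value: math.Float64frombits(staleNaNBits)})
	}
	return out
}

// isStaleNaN reports whether v carries the staleness marker bit pattern.
func isStaleNaN(v float64) bool { return math.Float64bits(v) == staleNaNBits }

func main() {
	prev := map[string]*cacheEntry{}
	track(prev, &cacheEntry{ref: 7, labels: `{job="a"}`}, true)
	track(prev, &cacheEntry{ref: 9, labels: `{job="b"}`}, true)
	track(prev, &cacheEntry{ref: 0, labels: `{job="c"}`}, false) // never appended

	cur := map[string]*cacheEntry{}
	track(cur, &cacheEntry{ref: 7, labels: `{job="a"}`}, true)

	for _, m := range staleMarkers(prev, cur) {
		fmt.Println(m.ref, m.labels, isStaleNaN(m.value))
	}
}
```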
Lukasz Mierzwa
1f7a23cced Add tests for staleness markers appended to TSDB when sample_limit is set
Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-01 14:22:01 +01:00
Charles Korn
d19a9ab673
Remove other instances of "obvious"
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:13:46 +10:00
Charles Korn
1977452331
Address PR feedback: adjust docs to match current behaviour
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:10:20 +10:00
Charles Korn
665eb3d6cb
Address PR feedback: remove use of "obvious"
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:08:18 +10:00
Charles Korn
70df21a680
Address PR feedback: format Inf and NaN as monospace
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:07:07 +10:00