The azureClient struct held the armcompute and armnetwork SDK clients as
concrete fields while satisfying the client interface. Once such a client is
reachable through an interface, Go's linker conservatively retains every
exported method of the concrete type plus the entire (de)serializer graph those
operations drag in, even though discovery calls only a handful of them.
Wrap each SDK client in a small adapter that captures only the operations
discovery uses as method-value closures, and box the adapters instead of the raw
clients. The concrete clients then live only inside closure contexts, which
reflection cannot traverse, so dead-code elimination drops the unused
operations.
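A minimal sketch of the adapter shape described above, using only the standard library. The names bigClient, vmAdapter and discoveryClient are hypothetical stand-ins, not the armcompute/armnetwork types or the identifiers used in the change. The sketch shows the structure only; whether the linker actually drops the unused methods depends on the rest of the program.

```go
package main

import "fmt"

// bigClient is a hypothetical stand-in for an SDK client that exposes
// many exported methods, of which discovery needs only two.
type bigClient struct{ region string }

func (c *bigClient) ListVMs() []string             { return []string{"vm-1@" + c.region} }
func (c *bigClient) GetInterface(id string) string { return "nic:" + id }
func (c *bigClient) Unused() string                { return "never called by discovery" }

// vmAdapter holds function values only. The concrete *bigClient is
// captured inside the method values and is not a field of the adapter.
type vmAdapter struct {
	listVMs      func() []string
	getInterface func(id string) string
}

// newVMAdapter binds just the operations discovery uses.
func newVMAdapter(c *bigClient) *vmAdapter {
	return &vmAdapter{
		listVMs:      c.ListVMs,      // method value: closure over c
		getInterface: c.GetInterface, // method value: closure over c
	}
}

// discoveryClient is the narrow interface the adapter is boxed into.
type discoveryClient interface {
	VMs() []string
	Interface(id string) string
}

func (a *vmAdapter) VMs() []string              { return a.listVMs() }
func (a *vmAdapter) Interface(id string) string { return a.getInterface(id) }

func main() {
	// Only the adapter is converted to an interface, never *bigClient.
	var c discoveryClient = newVMAdapter(&bigClient{region: "westeurope"})
	fmt.Println(c.VMs(), c.Interface("42"))
}
```

The point of the design is that the value stored behind the interface has no field of the concrete client type, so nothing reachable through the interface names that type's method set.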
This drops the retained operations per client from ~60 down to the 2-3 actually
used (UsedInIface markers go from 244+66 to 0), shrinking both the prometheus
and promtool binaries by ~3.2 MB each. No functional or API change.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
The Kubernetes SD Discovery struct held the clientset as a
kubernetes.Interface field. Boxing the concrete *kubernetes.Clientset into
an interface marks it <UsedInIface>, so the Go linker conservatively retains
every API-group accessor and, transitively, every resource client and its
apply configurations, even though discovery only touches the core, apps,
batch, discovery and networking v1 groups.
Wrap the clientset in an adapter that captures only the used API-group
accessors as method-value closures and exposes them through a narrow
k8sClient interface. The concrete clientset now lives only inside closure
contexts, which reflection cannot traverse, so dead-code elimination drops
the unused groups. The fake clientset still satisfies the narrow interface,
so tests are unchanged.
This trims about 10 MB from each of the prometheus and promtool binaries.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
The Discovery struct held *compute.Service and *compute.InstancesService as
fields and is boxed into the discovery.Discoverer interface. Once reflection is
reachable in the program (it always is, via the YAML/config machinery), the Go
linker conservatively retains every exported method of any concrete type
reachable through an interface, including via struct fields. *compute.Service
exposes ~150 sub-services and their operations, so all of them — 994 list/get
operations and their serializers — were retained even though discovery only
calls Instances.List.
Wrap the single used operation in a closure over the concrete *compute.Service
so the service lives only in closure context, which reflection cannot traverse.
Returning a *compute.InstancesListCall would not help, since that type has an
s *Service back-reference that re-propagates the marker, so the closure
encapsulates the whole List/Filter/Pages chain and only exposes the
*compute.InstanceList data type discovery already uses.
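The following sketch illustrates why the closure wraps the whole call chain rather than returning the call builder. All types here (service, listCall, instanceList, discovery) are hypothetical stand-ins for the compute/v1 types, written so the example is self-contained; it is not the code from the change.

```go
package main

import "fmt"

// instanceList stands in for the plain data type discovery consumes.
type instanceList struct{ Names []string }

// service stands in for the large SDK root type.
type service struct{ project string }

// listCall stands in for a call builder that keeps a back-reference to
// the service, which is why the sketch never lets it escape the closure.
type listCall struct {
	s      *service
	zone   string
	filter string
}

func (s *service) List(zone string) *listCall { return &listCall{s: s, zone: zone} }

func (c *listCall) Filter(f string) *listCall { c.filter = f; return c }

// Pages calls fn once with a single fabricated page.
func (c *listCall) Pages(fn func(*instanceList) error) error {
	return fn(&instanceList{Names: []string{c.s.project + "/" + c.zone + "/" + c.filter}})
}

// discovery keeps only a function value. The service is captured by the
// closure, and the signature mentions nothing but strings and the data type.
type discovery struct {
	listInstances func(zone, filter string, fn func(*instanceList) error) error
}

func newDiscovery(s *service) *discovery {
	return &discovery{
		listInstances: func(zone, filter string, fn func(*instanceList) error) error {
			// The whole List/Filter/Pages chain stays inside the closure.
			return s.List(zone).Filter(filter).Pages(fn)
		},
	}
}

func main() {
	d := newDiscovery(&service{project: "demo"})
	_ = d.listInstances("zone-a", "status=RUNNING", func(l *instanceList) error {
		fmt.Println(l.Names)
		return nil
	})
}
```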
compute/v1 footprint drops from ~4.9 MB to ~33 KB, and the prometheus and
promtool binaries each shrink by ~13.5 MB. No functional or API change.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
The AWS service-discovery code boxed each concrete SDK client (*ec2.Client,
*rds.Client, *lightsail.Client, *elasticache.Client, *ecs.Client and
*kafka.Client) into an interface, either directly or as a field of a struct
that is itself boxed into the Discoverer interface.
Once reflection is reachable in the program -- it always is, via the
YAML/config machinery -- the Go linker conservatively retains every exported
method of any concrete type reachable through an interface, including a type
held as a field of an interface-boxed struct. Each SDK client exposes the
service's full API (e.g. *ec2.Client has ~470 operation methods), so all of
their operation serializers and the corresponding types' (de)serializer graphs
were kept, even though discovery only calls a handful of operations. EC2 alone
accounted for ~21 MB.
Wrap each client in a small adapter that captures only the operations
discovery uses as method-value closures. The concrete client then lives only
inside closure contexts, which reflection cannot traverse, so dead-code
elimination can drop the unused operations.
This reduces the binary sizes substantially:
prometheus 228.6 MB -> 162.6 MB (-66 MB, -29%)
promtool 205.1 MB -> 139.0 MB (-66 MB, -32%)
There is no functional or API change; the mocking interfaces used by the tests
are unchanged.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
Add t.Parallel() to the 13 top-level test functions in marathon_test.go.
Tests have no shared mutable state — each test creates its own registry,
metrics, and config — so parallelisation is safe and speeds up the suite.
Refs #15185
Signed-off-by: Ogulcan Aydogan <ogulcanaydogan@hotmail.com>
Each test creates its own httptest.Server and prometheus.Registry so
there is no shared global state between them. Adding t.Parallel() to
all 13 top-level test functions and the subtests in TestUnmarshalConfig
allows the Go test runner to overlap them, cutting wall-clock time.
Refs: #15185
Signed-off-by: Ogulcan Aydogan <ogulcanaydogan@hotmail.com>
Each test function creates its own mock AWS client and operates on
independent data stores with no shared global state between them.
Adding t.Parallel() to the 33 top-level test functions across the
six test files (aws, ec2, ecs, elasticache, msk, rds) allows the
Go test runner to overlap their execution, cutting wall-clock time.
TestLoadRegion is excluded because its subtests use t.Setenv, which
panics when a parallel ancestor is detected (Go 1.25+).
Refs: #15185
Signed-off-by: Ogulcan Aydogan <ogulcanaydogan@hotmail.com>
When health_filter is set without explicit services, the catalog needs
to be watched to enumerate services. Add watchedFilter to the condition
that triggers catalog watching.
Improve the filter test suite:
- Replace defer with t.Cleanup for stub servers.
- Rewrite TestFilterOption to assert that the catalog receives the filter
and the health endpoint does not.
- Rewrite TestHealthFilterOption to assert that health_filter is routed
correctly to the health endpoint only.
- Add TestBothFiltersOption to verify both filters are routed to their
respective endpoints when both are configured.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
The filter field was documented as targeting the Catalog API but since
PR #17349 it was also passed to the Health API. This broke existing
configs using Catalog-only fields like ServiceTags, which the Health API
rejects (it uses Service.Tags instead).
Introduce a separate health_filter field that is passed exclusively to
the Health API, while filter remains catalog-only. Update the docs to
explain the two-phase discovery (Catalog for service listing, Health for
instances) and the field name differences between the two APIs.
Fixes #18479
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
* Adding scrape on shutdown
Signed-off-by: avilevy <avilevy@google.com>
* scrape: replace skipOffsetting to make the test offset deterministic instead of skipping it entirely
Signed-off-by: avilevy <avilevy@google.com>
* renamed calculateScrapeOffset to getScrapeOffset
Signed-off-by: avilevy <avilevy@google.com>
* discovery: Add skipStartupWait to bypass initial discovery delay
In short-lived environments like agent mode or serverless, the
Prometheus process may only execute for a few seconds. Waiting for
the default 5-second `updatert` ticker before sending the first
target groups means the process could terminate before collecting
any metrics at all.
This commit adds a `skipStartupWait` option to the Discovery Manager
to bypass this initial delay. When enabled, the sender uses an
unthrottled startup loop that instantly forwards all triggers. This
ensures both the initial empty update from `ApplyConfig` and the
first real targets from discoverers are passed downstream immediately.
After the first ticker interval elapses, the sender cleanly breaks out
of the startup phase, resets the ticker, and resumes standard
operations.
Signed-off-by: avilevy <avilevy@google.com>
* scrape: Bypass initial reload delay for ScrapeOnShutdown
In short-lived environments like agent mode or serverless, the default
5-second `DiscoveryReloadInterval` can cause the process to terminate
before the scrape manager has a chance to process targets and collect
any metrics.
Because the discovery manager sends an initial empty update upon
configuration followed rapidly by the actual targets, simply waiting
for a single reload trigger is insufficient—the real targets would
still get trapped behind the ticker delay.
This commit introduces an unthrottled startup loop in the `reloader`
when `ScrapeOnShutdown` is enabled. It processes all incoming
`triggerReload` signals immediately during the first interval. Once
the initial tick fires, the `reloader` resets the ticker and falls
back into its standard throttled loop, ensuring short-lived processes
can discover and scrape targets instantly.
Signed-off-by: avilevy <avilevy@google.com>
* test(scrape): refactor time-based manager tests to use synctest
Addresses PR feedback to remove flaky, time-based sleeping in the scrape manager tests.
Add TestManager_InitialScrapeOffset and TestManager_ScrapeOnShutdown, which use the testing/synctest package to eliminate real-world time.Sleep delays and make the assertions deterministic.
- Replaced httptest.Server with net.Pipe and a custom startFakeHTTPServer helper to ensure all network I/O remains durably blocked inside the synctest bubble.
- Leveraged the skipOffsetting option to eliminate random scrape jitter, making the time-travel math exact and predictable.
- Using skipOffsetting also safely bypasses the global singleflight DNS lookup in setOffsetSeed, which previously caused cross-bubble panics in synctest.
- Extracted shared boilerplate into a setupSynctestManager helper to keep the test cases highly readable and data-driven.
Signed-off-by: avilevy <avilevy@google.com>
* Clarify use cases in InitialScrapeOffset comment
Signed-off-by: avilevy <avilevy@google.com>
* test(scrape): use httptest for mock server to respect context cancellation
- Replaced manual HTTP string formatting over `net.Pipe` with `httptest.NewUnstartedServer`.
- Implemented an in-memory `pipeListener` to allow the server to handle `net.Pipe` connections directly. This preserves `synctest` time isolation without opening real OS ports.
- Added explicit `r.Context().Done()` handling in the mock HTTP handler to properly simulate aborted requests and scrape timeouts.
- Validates that the request context remains active and is not prematurely cancelled during `ScrapeOnShutdown` scenarios.
- Renamed `skipOffsetting` to `skipJitterOffsetting`.
- Addressed other PR comments.
Signed-off-by: avilevy <avilevy@google.com>
* tmp
Signed-off-by: bwplotka <bwplotka@gmail.com>
* exp2
Signed-off-by: bwplotka <bwplotka@gmail.com>
* fix
Signed-off-by: bwplotka <bwplotka@gmail.com>
* scrape: fix scrapeOnShutdown context bug and refactor test helpers
The scrapeOnShutdown feature was failing during manager shutdown because
the scrape pool context was being cancelled before the final shutdown
scrapes could execute. Fix this by delaying context cancellation
in scrapePool.stop() until after all scrape loops have stopped.
In addition:
- Added test cases to verify scrapeOnShutdown works with InitialScrapeOffset.
- Refactored network test helper functions from manager_test.go to
helpers_test.go.
- Addressed other comments.
Signed-off-by: avilevy <avilevy@google.com>
* Update scrape/scrape.go
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: avilevy18 <105948922+avilevy18@users.noreply.github.com>
* feat(discovery): add SkipInitialWait to bypass initial startup delay
This adds a SkipInitialWait option to the discovery Manager, allowing consumers sensitive to startup latency to receive the first batch of discovered targets immediately instead of waiting for the updatert ticker.
To support this without breaking the immediate dropped target notifications introduced in #13147, ApplyConfig now uses a keep flag to only trigger immediate downstream syncs for obsolete or updated providers. This prevents sending premature empty target groups for brand-new providers on initial startup.
Additionally, the scrape manager's reloader loop is updated to process the initial triggerReload immediately, ensuring the end-to-end pipeline processes initial targets without artificial delays.
Signed-off-by: avilevy <avilevy@google.com>
* scrape: Add TestManagerReloader and refactor discovery triggerSync
Adds a new TestManagerReloader test suite using synctest to assert
behavior of target updates, discovery reload ticker intervals, and
ScrapeOnShutdown flags.
Updates setupSynctestManager to allow skipping initial config setup by
passing an interval of 0.
Also renames the 'keep' variable to 'triggerSync' in ApplyConfig inside
discovery/manager.go for clarity, and adds a descriptive comment.
Signed-off-by: avilevy <avilevy@google.com>
* feat(discovery,scrape): rename startup wait options and add DiscoveryReloadOnStartup
- discovery: Rename `SkipInitialWait` to `SkipStartupWait` for clarity.
- discovery: Pass `context.Context` to `flushUpdates` to handle cancellation and avoid leaks.
- scrape: Add `DiscoveryReloadOnStartup` to `Options` to decouple startup discovery from `ScrapeOnShutdown`.
- tests: Refactor `TestTargetSetTargetGroupsPresentOnStartup` and `TestManagerReloader` to use table-driven tests and `synctest` for better stability and coverage.
Signed-off-by: avilevy <avilevy@google.com>
* feat(discovery,scrape): importing changes proposed in 043d710
- Refactor sender to use exponential backoff
- Replaces `time.NewTicker` in `sender()` with an exponential backoff
to prevent panics on non-positive intervals and better throttle updates.
- Removes obsolete `skipStartupWait` logic.
- Refactors `setupSynctestManager` to use an explicit `initConfig` argument
Signed-off-by: avilevy <avilevy@google.com>
* fix: updating go mod
Signed-off-by: avilevy <avilevy@google.com>
* fixing merge
Signed-off-by: avilevy <avilevy@google.com>
* fixing issue: 2 variables but NewTestMetrics returns 1 value
Signed-off-by: avilevy <avilevy@google.com>
* Update discovery/manager.go
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: avilevy18 <105948922+avilevy18@users.noreply.github.com>
* Refactor setupSynctestManager initConfig into a separate function
Signed-off-by: avilevy <avilevy@google.com>
---------
Signed-off-by: avilevy <avilevy@google.com>
Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: avilevy18 <105948922+avilevy18@users.noreply.github.com>
Co-authored-by: bwplotka <bwplotka@gmail.com>
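The scrapePool.stop() fix described in this series (cancel the pool context only after every scrape loop has stopped) can be illustrated with a small stand-alone sketch. The loop and pool types below are hypothetical and far simpler than the real scrape package; they only show the ordering, assuming each loop performs one final scrape when it is told to stop.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// loop is a hypothetical stand-in for a scrape loop.
type loop struct {
	ctx     context.Context
	stopCh  chan struct{}
	results chan<- string
}

// run blocks until stopped, then performs one final "shutdown scrape",
// which only succeeds if the shared context is still live.
func (l *loop) run(wg *sync.WaitGroup) {
	defer wg.Done()
	<-l.stopCh
	if err := l.ctx.Err(); err != nil {
		l.results <- "final scrape aborted: " + err.Error()
		return
	}
	l.results <- "final scrape ok"
}

// pool owns the shared context and its loops.
type pool struct {
	cancel context.CancelFunc
	loops  []*loop
	wg     sync.WaitGroup
}

// newPool starts n loops. The results channel must have capacity >= n.
func newPool(n int, results chan<- string) *pool {
	ctx, cancel := context.WithCancel(context.Background())
	p := &pool{cancel: cancel}
	for i := 0; i < n; i++ {
		l := &loop{ctx: ctx, stopCh: make(chan struct{}), results: results}
		p.loops = append(p.loops, l)
		p.wg.Add(1)
		go l.run(&p.wg)
	}
	return p
}

// stop signals every loop, waits for the final scrapes to finish, and
// only then cancels the context. Cancelling first would abort them.
func (p *pool) stop() {
	for _, l := range p.loops {
		close(l.stopCh)
	}
	p.wg.Wait()
	p.cancel()
}

func main() {
	results := make(chan string, 2)
	p := newPool(2, results)
	p.stop()
	close(results)
	for r := range results {
		fmt.Println(r)
	}
}
```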
The package makes vulnerability scanners unhappy, and the functionality is available in the smaller moby/moby packages.
Signed-off-by: alex boten <223565+codeboten@users.noreply.github.com>
NewTestMetrics returns a single value but the test was
assigning it to two variables.
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Building off config-specific Prometheus refresh metrics from an earlier
PR (https://github.com/prometheus/prometheus/pull/17138), this deletes
refresh metrics like `prometheus_sd_refresh_duration_seconds` and
`prometheus_sd_refresh_failures_total` when the underlying scrape job
configuration is removed on reload. This reduces unneeded cardinality
from scrape job specific metrics while still preserving metrics that
indicate overall health of a service discovery engine.
For example,
`prometheus_sd_refresh_failures_total{config="linode-servers",mechanism="linode"} 1`
will no longer be exported by Prometheus when the `linode-servers`
scrape job for the Linode service provider is removed. The generic,
service discovery specific `prometheus_sd_linode_failures_total` metric
will persist however.
* fix: add targetsMtx lock for targets access
* test: validate refresh/discover metrics are gone
* ref: combine sdMetrics and refreshMetrics
Good idea from @bboreham to combine sdMetrics and refreshMetrics!
They're always passed around together and don't have much of a
reason not to be combined. mechanismMetrics makes it clear what kind of
metrics this is used for (service discovery mechanisms).
---------
Signed-off-by: Will Bollock <wbollock@linode.com>
PR #17601 extended makeNode with annotations and conditions parameters
but missed updating two call sites in pod_test.go.
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
Add Outscale VM service discovery using osc-sdk-go, including optional secret_key_file support, metrics, docs, and configuration examples. Document the default region (eu-west-2).
Signed-off-by: Aurelien Duboc <aurelienduboc96@gmail.com>
This adds a 'databases' role to digitalocean_sd_config to discover DigitalOcean
Managed Database clusters. It follows the multi-role design pattern by
introducing a 'role' parameter (default: 'droplets').
Includes:
- Support for Managed Databases API.
- Pagination handling for Databases API.
- Comprehensive meta labels for database targets.
- Updated documentation and tests.
Signed-off-by: Vladimir Skesov <skesov@gmail.com>
* discovery/vultr: upgrade govultr from v2 to v3
The govultr/v2 library is no longer actively maintained. Upgrade to
govultr/v3 (v3.28.1) which receives regular updates and security
patches.
The v3 library is API-compatible with v2 for the Instance.List
method used by the Vultr SD, with the only change being an
additional *http.Response return value.
Signed-off-by: Pierluigi Lenoci <pierluigi.lenoci@gmail.com>
* discovery/vultr: check HTTP response status code
Validate that the Vultr API returns a 2xx status code after listing
instances, as the *http.Response from govultr v3 is now available.
Signed-off-by: Pierluigi Lenoci <pierluigi.lenoci@gmail.com>
* discovery/vultr: fix linter error in error string capitalization
Error strings should not be capitalized per Go conventions (ST1005).
Signed-off-by: Pierluigi Lenoci <pierluigi.lenoci@gmail.com>
---------
Signed-off-by: Pierluigi Lenoci <pierluigi.lenoci@gmail.com>
When using ManagedIdentity authentication with system-assigned identity,
the client_id field is intentionally left empty. However, the current code
unconditionally sets options.ID = azidentity.ClientID(cfg.ClientID), which
passes an empty string instead of nil. The Azure SDK treats an empty
ClientID as a request for a user-assigned identity with an empty client ID,
rather than falling back to system-assigned identity.
Fix by only setting options.ID when cfg.ClientID is non-empty, matching the
pattern already used in storage/remote/azuread/azuread.go.
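The distinction between an empty ID and no ID at all can be sketched without the Azure SDK. The types managedIDKind, clientID and managedIdentityOptions below are hypothetical stand-ins that only mimic the shape of the azidentity option (an interface-typed ID field). The sketch shows that an empty string converted to the interface is a non-nil value, so the field must be left unset explicitly.

```go
package main

import "fmt"

// managedIDKind, clientID and managedIdentityOptions are hypothetical
// stand-ins that mirror the shape of the SDK option, not its real types.
type managedIDKind interface{ String() string }

type clientID string

func (c clientID) String() string { return string(c) }

type managedIdentityOptions struct {
	// ID is nil for a system-assigned identity.
	ID managedIDKind
}

// newOptions sets ID only when a client ID was configured. Assigning
// clientID("") unconditionally would store a non-nil interface value.
func newOptions(cfgClientID string) *managedIdentityOptions {
	opts := &managedIdentityOptions{}
	if cfgClientID != "" {
		opts.ID = clientID(cfgClientID)
	}
	return opts
}

func main() {
	fmt.Println(newOptions("").ID == nil)     // system-assigned: ID left nil
	fmt.Println(newOptions("abc").ID != nil)  // user-assigned: ID set
	var unconditional managedIDKind = clientID("")
	fmt.Println(unconditional == nil) // an empty clientID is still non-nil
}
```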
Fixes #16634
Signed-off-by: Ogulcan Aydogan <ogulcanaydogan@hotmail.com>
[hcloud.Server.Datacenter] is deprecated and will be removed after 1 July 2026. Use [hcloud.Server.Location] instead.
See https://docs.hetzner.cloud/changelog#2025-12-16-phasing-out-datacenters
Changes to Hetzner meta labels:
- `__meta_hetzner_datacenter`
- is deprecated for the role `robot` but kept for backward compatibility. Using `__meta_hetzner_robot_datacenter` is preferred.
- is deprecated for the role `hcloud` and will stop working after 1 July 2026.
- `__meta_hetzner_hcloud_datacenter_location` label
- is deprecated but kept for backward compatibility, the same data is available in the [`hcloud.Server.Location`](https://pkg.go.dev/github.com/hetznercloud/hcloud-go/v2/hcloud#Server) struct.
- using `__meta_hetzner_hcloud_location` is preferred.
- `__meta_hetzner_hcloud_datacenter_location_network_zone`
- is deprecated but kept for backward compatibility, the same data is available in the [`hcloud.Server.Location`](https://pkg.go.dev/github.com/hetznercloud/hcloud-go/v2/hcloud#Server) struct.
- using `__meta_hetzner_hcloud_location_network_zone` is preferred.
- `__meta_hetzner_hcloud_location`
- replacement label for `__meta_hetzner_hcloud_datacenter_location`
- `__meta_hetzner_hcloud_location_network_zone`
- replacement label for `__meta_hetzner_hcloud_datacenter_location_network_zone`
- `__meta_hetzner_robot_datacenter`
- replacement label for `__meta_hetzner_datacenter` with the role `robot`.
Signed-off-by: Jonas Lammler <jonas.lammler@hetzner-cloud.de>
PR #17269 replaced atomic os.Rename-based file writes with
os.WriteFile to fix a Windows flake. However, os.WriteFile is not
atomic (it truncates then writes), and fsnotify can fire between
the truncate and write, causing the watcher to read an empty file
and replace valid targets with empty ones.
Restore atomicity by writing to a temporary file and renaming.
On Windows, retry the rename with a short backoff to handle
transient "Access is denied" errors when the file watcher or
readFile holds an open handle to the destination.
Fixes #18237
Signed-off-by: Munem Hashmi <munem.hashmi@gmail.com>
Extended Kubernetes SD to support the following pod-based labels:
* `__meta_kubernetes_pod_deployment_name`
* `__meta_kubernetes_pod_cronjob_name`
* `__meta_kubernetes_pod_job_name`
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>