AWS SDK v1 is end of life soon, so migrate to the V2 SDK. The credential loading should work more consistently with other projects that use the SDK and load credentials from the appropriate locations including from environment variables. This affects the EC2 and Lightsail service discovery features.
Signed-off-by: Joe Adams <github@joeadams.io>
* close sync ch after sender() is stopped
* break if chan is closed
Signed-off-by: liyandi <littlepangdi@163.com>
Co-authored-by: liyandi <liyandi@xiaomi.com>
If we call ApplyConfig() at the same time the manager is being stopped we might end up hanging forever.
This is because ApplyConfig() will try to cancel obsolete providers and wait until they are cancelled.
It's done by setting a done() function that call Done() on a sync.WaitGroup:
```
if len(prov.newSubs) == 0 {
wg.Add(1)
prov.done = func() {
wg.Done()
}
}
```
then calling prov.cancel() and finally waiting until all providers run done() function
that by blocking it all on a wg.Wait() call.
For each provider there is a goroutine created by calling Manager.startProvider(*Provider):
```
func (m *Manager) startProvider(ctx context.Context, p *Provider) {
m.logger.Debug("Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
ctx, cancel := context.WithCancel(ctx)
updates := make(chan []*targetgroup.Group)
p.mu.Lock()
p.cancel = cancel
p.mu.Unlock()
go p.d.Run(ctx, updates)
go m.updater(ctx, p, updates)
}
```
It creates a context that can be cancelled and that cancel function becomes prov.cancel. This is what ApplyConfig will call.
If we look at the body of updater() method:
```
func (m *Manager) updater(ctx context.Context, p *Provider, updates chan []*targetgroup.Group) {
// Ensure targets from this provider are cleaned up.
defer m.cleaner(p)
for {
select {
case <-ctx.Done():
return
[...]
```
we can see that it will exit if that context is cancelled and that will trigger a call to Manager.cleaner().
That cleaner() is where done() is called.
So ApplyConfig() -> calls cancel() -> causes cleaner() to be executed -> calls done().
cancel() is also called from cancelDiscoverers() method that will be called by Manager.Run() when Manager is stopping:
```
func (m *Manager) Run() error {
go m.sender()
<-m.ctx.Done()
m.cancelDiscoverers()
return m.ctx.Err()
}
```
The problem is that if we call both ApplyConfig and stop the manager at the same time we might end up with:
- We call Manager.ApplyConfig()
- We stop the Manager
- Manager.cancelDiscoverers() is called
- Provider.cancel() is called for every Provider
- cancel() causes provider context to be cancelled which terminates updater() for given Provider
- cancelling context causes cleaner() method to be called for given Provider
- cleaner() calls done() and exits
- Provider is considered stopped at this point, there is no goroutine running that will call done() anymore
- ApplyConfig iterates providers and decides that one is obsolete is must be stopped
- It sets a custom done() function body with a WaitGroup.Done() call in it
- Then ApplyConfig waits until all Providers run done()
- But they are all stopped and no done() will be run
- We wait forever
This only happens if cancelDiscoverers() is run before ApplyConfig, if ApplyConfig runs first done() will be called,
if cancelDiscoverers() is called first it will stop updater() instances and so done() won't be called anymore.
Part of the problem is that there is no distinction between running and stopped providers. There is Provider.IsStarted() method
that returns a bool based on the value of cancel function but ApplyConfig doesn't check it.
Second problem is that although there is a mutex on a Provider it's used much in the code, so two goroutines can try to read and/or write
provider.cancel and/or provider.done at the same time, making it all more likely to race.
The easiest way to fix it is to check if the provider is started inside ApplyConfig so we don't try to stop a provider that's already stopped.
For that we need to mark it as stopped after cancel() is called, by setting cancel to nil.
This also needs better lock usage to avoid different parts of the code trying to set cancel and done at the same time.
Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
When doing a config reload that need to stop some providers while also sending SIGTERM to Prometheus at the same time can sometimes hang
1: sync.WaitGroup.Wait [83 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
sync sema.go:110 runtime_SemacquireWaitGroup(*uint32(#166))
sync waitgroup.go:118 (*WaitGroup).Wait(*WaitGroup(#23))
discovery manager.go:276 (*Manager).ApplyConfig(#23, #167)
main main.go:964 main.func5(#120)
main main.go:1505 reloadConfig({#183, 0x1b}, 1, #40, #43, #50, {#31, 0xa, 0})
main main.go:1182 main.func22()
run group.go:38 (*Group).Run.func1(*Group(#26), #51)
Add a test for it.
Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
* refactor(endpointslice): use cache.Indexer to index endpointslices by LabelServiceName so not have to iterate over all endpoint objects.
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
* check the type and error early and add 'TestEndpointSliceDiscoveryWithUnrelatedServiceUpdate' unit test to give a regression test
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
* make service indexer namespaced
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
* remove unneeded test func
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
* Apply suggestions from code review
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
---------
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
With golangci-lint v2, it now has "formatters" that can be configured.
Add `golangci-lint fmt` to the `make format` in Makefile.common.
* Enable goimports formatter.
Signed-off-by: SuperQ <superq@gmail.com>
Make sure the order of locks is always the same in all functions. In ApplyConfig() we have m.targetsMtx.Lock() after provider is locked, so replicate the same in allGroups().
Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
Manager.ApplyConfig() uses multiple locks:
- Provider.mu
- Manager.targetsMtx
Manager.cleaner() uses the same locks but in the opposite order:
- First it locks Manager.targetsMtx
- The it locks Provider.mu
I've seen a few strange cases of Prometheus hanging up on shutdown and never compliting that shutdown.
From a few traces I was given it appears that while Prometheus is still running only discovery.Manager and notifier.Manager are running running.
From that trace it also seems like they are stuck on a lock from two functions:
- cleaner waits on a RLock()
- ApplyConfig waits on a Lock()
I cannot reproduce it but I suspect this is a race between locks. Imagine this scenario:
- Manager.ApplyConfig() is called
- Manager.ApplyConfig locks Provider.mu.Lock()
- at the same time cleaner() is called on the same Provider instance and it calls Manager.targetsMtx.Lock()
- Manager.ApplyConfig() now calls Manager.targetsMtx.Lock() but that lock is already held by cleaner() function so ApplyConfig() hangs there
- at the same time cleaner() now wants to lock Provider.mu.Rlock() but that lock is already held by Manager.ApplyConfig()
- we end up with both functions locking each other out without any way to break that lock
Re-order lock calls to try to avoid this scenario.
I tried writing a test case for it but couldn't hit this issue.
Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
Allows to filter the servers when sending the listing request to the API. This feature is only available when using the `role=hcloud`.
See https://docs.hetzner.cloud/#label-selector for details on how to use the label selector.
Signed-off-by: Jonas Lammler <jonas.lammler@hetzner-cloud.de>
* discovery: a change to a service with the same name but from another namespace won't enqueue the endpointSlice
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
* Update discovery/kubernetes/endpointslice.go
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
* Update endpointslice.go
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
---------
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
* refactor: simplify error handling and remove redundant checks
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
* Add the comment for return of reloading blocks failure
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
* Add the comment for return of reloading blocks failure
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
---------
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
* discovery: add aws/ec2 unit tests
* discovery: initial skeleton for aws/ec2 unit tests
This is a - very likely - not too useful unit test for the AWS SD. It is
commited so other people can check the basic logic and the
implementation.
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: fix linter complains about ec2_test.go
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: add basic unit test for aws
This tests only the basic labelling, not including the VPC related
information.
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: fix linter complains about ec2_test.go
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: other linter fixes in aws/ec2_test.go
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: implement remaining tests for aws/ec2
The coverage is not 100% but I think it is a good starting point if
someone wants to improve that.
Currently it covers all the AWS API calls.
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: make linter happy in aws/ec2_test.go
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: make utility funtcions private
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discover: no global variable in the aws/ec2 test
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: common body for some tests in ec2
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: try to make golangci-lint happy
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: make every non-test function private
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: test for errors first in TestRefresh
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: move refresh tests into the function
This way people can find both the test cases and the execution of the
test at the same place.
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: fix copyright date
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: remove misleading comment
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: rename test for easier identification
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: use static values for the test cases
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discover: try to make the linter happy
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: drop redundant data from ec2 and use common ptr functions
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: use Error instead of Equal
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* discovery: merge refreshAZIDs tests into one
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
---------
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
* azure sd: separate refresh and refreshAzure
* azure sd: create a client with mocked servers for tests
* add test for refresh function
---------
Signed-off-by: mviswanathsai <mviswanath.sai.met21@itbhu.ac.in>
For: #14355
This commit updates Prometheus to adopt stdlib's log/slog package in
favor of go-kit/log. As part of converting to use slog, several other
related changes are required to get prometheus working, including:
- removed unused logging util func `RateLimit()`
- forward ported the util/logging/Deduper logging by implementing a small custom slog.Handler that does the deduping before chaining log calls to the underlying real slog.Logger
- move some of the json file logging functionality to use prom/common package functionality
- refactored some of the new json file logging for scraping
- changes to promql.QueryLogger interface to swap out logging methods for relevant slog sugar wrappers
- updated lots of tests that used/replicated custom logging functionality, attempting to keep the logical goal of the tests consistent after the transition
- added a healthy amount of `if logger == nil { $makeLogger }` type conditional checks amongst various functions where none were provided -- old code that used the go-kit/log.Logger interface had several places where there were nil references when trying to use functions like `With()` to add keyvals on the new *slog.Logger type
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
Sidecar containers are a newish feature in k8s. They're implemented
similar to init containers but actually stay running and allow you to
delay startup of your application pod until the sidecar started (like
init containers always do).
This adds the ports of the sidecar container to the list of discovered
endpoint(slice), allowing you to target those containers as well.
The implementation is a copy of that of Pod discovery
fixes: #14927
Signed-off-by: bas smit <bsmit@bol.com>
This went under the radar because the utils are never called directly.
We usually marshall/unmarshal Configs as embeded in a struct using UnmarshalYAMLWithInlineConfigs/MarshalYAMLWithInlineConfigs
which bypasses Configs' custom UnmarshalYAML/MarshalYAML
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Several things done here:
- Set `max-issues-per-linter` to 0 so that we actually see all linter
warnings and not just 50 per linter. (As we also set
`max-same-issues` to 0, I assume this was the intention from the
beginning.)
- Stop using the golangci-lint default excludes (by setting
`exclude-use-default: false`. Those are too generous and don't match
our style conventions. (I have re-added some of the excludes
explicitly in this commit. See below.)
- Re-add the `errcheck` exclusion we have used so far via the
defaults.
- Exclude the signature requirement `govet` has for `Seek` methods
because we use non-standard `Seek` methods a lot. (But we keep other
requirements, while the default excludes completely disabled the
check for common method segnatures.)
- Exclude warnings about missing doc comments on exported symbols. (We
used to be pretty adamant about doc comments, but stopped that at
some point in the past. By now, we have about 500 missing doc
comments. We may consider reintroducing this check, but that's
outside of the scope of this commit. The default excludes of
golangci-lint essentially ignore doc comments completely.)
- By stop using the default excludes, we now get warnings back on
malformed doc comments. That's the most impactful change in this
commit. It does not enforce doc comments (again), but _if_ there is
a doc comment, it has to have the recommended form. (Most of the
changes in this commit are fixing this form.)
- Improve wording/spelling of some comments in .golangci.yml, and
remove an outdated comment.
- Leave `package-comments` inactive, but add a TODO asking if we
should change that.
- Add a new sub-linter `comment-spacings` (and fix corresponding
comments), which avoids missing spaces after the leading `//`.
Signed-off-by: beorn7 <beorn@grafana.com>
This commit removes support for the following API versions:
* `discovery.k8s.io/v1beta1` API version of EndpointSlice (no longer
served as of v1.25).
* `networking.k8s.io/v1beta1` API version of Ingress (no longer served
as of v1.22).
Closes#12884
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* feat: add support for gathering flavor name in Openstack discovery
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
* Update instance.go
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Paulo Dias <44772900+paulojmdias@users.noreply.github.com>
* Update configuration.md
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Paulo Dias <44772900+paulojmdias@users.noreply.github.com>
* fix: fix linting
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
* fix: fix instance type
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
* Update docs/configuration/configuration.md
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Paulo Dias <44772900+paulojmdias@users.noreply.github.com>
---------
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
Signed-off-by: Paulo Dias <44772900+paulojmdias@users.noreply.github.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
[ENHANCEMENT] Docker SD: add MatchFirstNetwork for containers with multiple networks
Fixes docker sd service misssing in shared mode and deduplicate targets by network
* discovery: aws: expose Primary IPv6 addresses as label
Add __meta_ec2_primary_ipv6_addresses label. This label contains the
Primary IPv6 address for every ENI attached to the EC2 instance. It is
ordered by the DeviceIndex and the missing elements (interface without
Primary IPv6 address) are kept in the list.
---------
Signed-off-by: Arpad Kunszt <akunszt@hiya.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>