Commit graph

675 commits

Author SHA1 Message Date
Guangwen Feng
ad9449d1f1
Add unit test case for func HasAlertingRules in manager.go (#7502)
Signed-off-by: Guangwen Feng <fenggw-fnst@cn.fujitsu.com>
2020-07-06 10:35:16 +01:00
Guangwen Feng
487f1e07ff
Add unit test case for func State in alerting.go (#7476)
Signed-off-by: Guangwen Feng <fenggw-fnst@cn.fujitsu.com>
2020-06-29 13:16:52 +01:00
Frederic Branczyk
d17d88935c
rules: Use narrower interface for rule manager loading of for state (#7472)
To load ALERT_FOR_STATE only `storage.Queryable` interface is required,
so this patch uses this narrower interface for to perform this.

Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
2020-06-26 19:06:36 +01:00
Kemal Akkoyun
66dfb951c4
*: Consistent Error/Warning handling for SeriesSet iterator: Allowing Async Select (#7251)
* Add errors and Warnings to SeriesSet

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Change Querier interface and refactor accordingly

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Refactor promql/engine to propagate warnings at eval stage

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Address review issues

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Make sure all the series from all Selects are pre-advanced

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Address review issues

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Separate merge series sets

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Clean

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Refactor merge querier failure handling

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Refactored and simplified fanout with improvements from incoming chunk iterator PRs.

* Secondary logic is hidden, instead of weird failed series set logic we had.
* Fanout is well commented
* Fanout closing record all errors
* MergeQuerier improved API (clearer)
* deferredGenericMergeSeriesSet is not needed as we return no samples anyway for failed series sets (next = false).

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Fix formatting

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Fix CI issues

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Added final tests for error handling.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Addressed Brian's comments.

* Moved hints in populate to be allocated only when needed.
* Used sync.Once in secondary Querier to achieve all-or-nothing partial response logic.
* Select after first Next is done will panic.

NOTE: in lazySeriesSet in theory we could just panic, I think however we can
totally just return error, it will panic in expand anyway.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Utilize errWithWarnings

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Fix recently introduced expansion issue

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Add tests for secondary querier error handling

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Implement lazy merge

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Add name to test cases

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Reorganize

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Address review comments

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Address review comments

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Remove redundant warnings

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Fix rebase mistake

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-06-09 17:57:31 +01:00
Brian Brazil
5368066b58
Give a bit more slack for alertmanager send failures. (#7228)
Fixes #5277

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2020-05-09 05:37:46 +01:00
Julien Pivotto
fc3fb3265a
Merge pull request #7145 from prometheus/release-2.17
Backport release 2.17 into master
2020-04-20 14:08:12 +02:00
Chris Marchbanks
a7b449320d
Fix updating rule manager never finishing (#7138)
Rather than sending a value to the done channel on a group to indicate
whether or not to add stale markers to a closing rule group use an
explicit boolean. This allows more functions than just run() to read
from the done channel and fixes an issue where Eval() could consume the
channel during an update, causing run() to never return.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-04-18 14:32:18 +02:00
ZouYu
2b7437d60e
Fix some warnings: 'redundant type from array, slice, or map composite literal' (#7109)
Signed-off-by: ZouYu <zouy.fnst@cn.fujitsu.com>
2020-04-15 11:17:41 +01:00
Marek Slabicki
8224ddec23
Capitalizing first letter of all log lines (#7043)
Signed-off-by: Marek Slabicki <thaniri@gmail.com>
2020-04-11 09:22:18 +01:00
Ben Ye
00730bfee7
add rule_group label to rule evaluation metrics (#7094)
Signed-off-by: yeya24 <yb532204897@gmail.com>
2020-04-08 23:21:37 +02:00
Muhammad Falak R Wani
2d1a80aa82
rules: manager: clarify doc string for NewGroupMetrics (#7084)
* rules: manager: clarify doc string for NewGroupMetrics

Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com>
2020-04-07 14:06:01 +02:00
Brian Brazil
7646cbca32
Use .UTC everywhere we use time.Unix (#7066)
time.Unix attaches the local timezone, which can then
leak out (e.g. in the alert json). While this is harmless,
we should be consistent.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2020-03-29 17:35:39 +01:00
Bartlomiej Plotka
c4eefd1b3a storage: Removed SelectSorted method; Simplified interface; Added requirement for remote read to sort response.
This is technically BREAKING CHANGE, but it was like this from the beginning: I just notice that we rely in
Prometheus on remote read being sorted. This is because we use selected data from remote reads in MergeSeriesSet
which rely on sorting.

I found during work on https://github.com/prometheus/prometheus/pull/5882 that
we do so many repetitions because of this, for not good reason. I think
I found a good balance between convenience and readability with just one method.
Smaller the interface = better.

Also I don't know what TestSelectSorted was testing, but now it's testing sorting.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-03-20 21:14:43 +01:00
Björn Rabenstein
1da83305be
Merge pull request #7009 from prometheus/release-2.17
Merge release-2.17 into master
2020-03-19 13:46:28 +01:00
Julien Pivotto
8907ba6235 Make TSDB use storage errors
This fixes #6992, which was introduced by #6777. There was an
intermediate component which translated TSDB errors into storage errors,
but that component was deleted and this bug went unnoticed, until we
were watching at the Prombench results. Without this, scrape will fail
instead of dropping samples or using "Add" when the series have been
garbage collected.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-03-17 22:24:25 +01:00
Björn Rabenstein
d80b0810c1
Move crucial actions to defer (#6918)
With defer having less of a performance penalty, there is no reason
not to do those crucial operations via defer.

Context: With isolation in place, if we forget to Commit/Rollback, the
low watermark will get stuck forever.

The current code should not have any bugs, but moving to defer helps
to avoid future bugs.

This is also moving the `closeAppend` in the `Commit` implementation
itself to defer. If logging to the WAL fails, we would have missed the
`closeAppend`.

Signed-off-by: beorn7 <beorn@grafana.com>
2020-03-13 20:54:47 +01:00
Bartlomiej Plotka
fe802f29c9 storage: Removed SelectSorted method; Simplified interface; Added requirement for remote read to sort response.
This is technically BREAKING CHANGE, but it was like this from the beginning: I just notice that we rely in
Prometheus on remote read being sorted. This is because we use selected data from remote reads in MergeSeriesSet
which rely on sorting.

I found during work on https://github.com/prometheus/prometheus/pull/5882 that
we do so many repetitions because of this, for not good reason. I think
I found a good balance between convenience and readability with just one method.
Smaller the interface = better.

Also I don't know what TestSelectSorted was testing, but now it's testing sorting.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-03-13 13:06:25 +00:00
Julien Pivotto
2051ae2e6a Revert "Fix race condition in Rule manager.Update() function"
This reverts commit 8b11d2cfb6.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-03-02 08:01:21 +01:00
fuling
8b11d2cfb6 Fix race condition in Rule manager.Update() function
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
2020-03-01 12:00:51 +01:00
Tobias Guggenmos
4835bbf376
Merge branch 'master' into split_parser 2020-02-19 15:18:13 +01:00
Bartlomiej Plotka
34426766d8 Unify Iterator interfaces. All point to storage now.
This is part of https://github.com/prometheus/prometheus/pull/5882 that can be done to simplify things.
All todos I added will be fixed in follow up PRs.

* querier.Querier, querier.Appender, querier.SeriesSet, and querier.Series interfaces merged
with storage interface.go. All imports that.
* querier.SeriesIterator replaced by chunkenc.Iterator
* Added chunkenc.Iterator.Seek method and tests for xor implementation (?)
* Since we properly handle SelectParams for Select methods I adjusted min max
based on that. This should help in terms of performance for queries with functions like offset.
* added Seek to deletedIterator and test.
* storage/tsdb was removed as it was only a unnecessary glue with incompatible structs.

No logic was changed, only different source of abstractions, so no need for benchmarks.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-02-17 18:03:54 +00:00
Tobias Guggenmos
6c00f2ffcb Comment fixes
Signed-off-by: Tobias Guggenmos <tguggenm@redhat.com>
2020-02-17 16:09:23 +01:00
Tobias Guggenmos
20b1f596f6 Fix build errors in rest of prometheus
Signed-off-by: Tobias Guggenmos <tguggenm@redhat.com>
2020-02-17 16:09:23 +01:00
Julien Pivotto
135cc30063
rules: Make deleted rule series as stale after a reload (#6745)
* rules: Make deleted rule series as stale after a reload

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-02-12 16:22:18 +01:00
Julien Pivotto
0e912faf4f
rules: Cleanup unused function alert.Duration (#6734)
The function HoldDuration and Duration did the exact same thing.

Let's only keep HoldDuration() as Duration() is more confusing.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-02-01 07:31:37 +00:00
Julien Pivotto
f82d55e79f
Refactor and simplify rule_group_interval_seconds (#6711)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-01-29 11:26:08 +00:00
Julien Pivotto
9adad8ad30 Remove MaxConcurrent from the PromQL engine opts (#6712)
Since we use ActiveQueryTracker to check for concurrency in
d992c36b3a it does not make sense to keep
the MaxConcurrent value as an option of the PromQL engine.

This pull request removes it from the PromQL engine options, sets the
max concurrent metric to -1 if there is no active query tracker, and use
the value of the active query tracker otherwise.

It removes dead code and also will inform people who import the promql
package that we made that change, as it breaks the EngineOpts struct.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-01-28 20:38:49 +00:00
Julien Pivotto
56ebd5afde Delete prometheus_rule_group metrics when groups are removed (#6693)
* Delete prometheus_rule_group metrics when groups are removed

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-01-27 12:41:32 +00:00
Julien Pivotto
cf42888e4d Fix order of testutil.Equals (#6695)
Equals takes the expected value as first parameter, and the actual value
as second parameter.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-01-27 12:21:59 +00:00
Julien Pivotto
5f27ac3583 Refactor query log fields (#6694)
* Refactor query log fields

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-01-27 09:53:10 +00:00
Harkishen Singh
84e6459c4d Adds support for line-column numbers for invalid rules, promtool (#6533)
Signed-off-by: Harkishen Singh <harkishensingh@hotmail.com>
2020-01-15 18:07:54 +00:00
Julien Pivotto
0011bba19b Query Log: add origin of the rules (#6592)
* Query Log: add origin of the rules

We don't set rule name and rule kind because the added value would be
quite low, given we have now the file, the group and the query.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-01-10 08:28:17 +00:00
Julien Pivotto
e079c9ed45 manager: add full stops on comments
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2019-12-19 11:46:22 +01:00
alburthoffman
156dcb8cca avoid stopping rule groups if new rule groups are as same as old rule… (#6450)
* avoid stopping rule groups if new rule groups are as same as old rule groups

Signed-off-by: alburthoffman <alburthoffman@gmail.com>
2019-12-19 10:41:11 +00:00
Julien Pivotto
2d7c8069d0 Check that rules don't contain metrics with the same labelset (#6469)
Closes #5529

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2019-12-18 12:29:35 +00:00
Simon Pasquier
06066a3619
*: improve error messages when parsing bad rules (#5965)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-08-28 17:36:48 +02:00
Brian Brazil
e62f30d497
Correctly handle empty labels from alert templates. (#5845)
Fixes https://github.com/prometheus/common/issues/36

Move logic handling this into the labels package,
so all the cases are handled in one place and we're
less likely to have this come up again.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2019-08-13 11:19:17 +01:00
Chris Marchbanks
0685eb5395
Refactor testutil.NewStorage into a new package
This avoids a circular dependency between the testutil and storage
packages.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-08-08 19:43:04 -06:00
Brian Brazil
2184b79763
Mark deleted rule's series as stale on next evaluation. (#5759)
Fixes #5755

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2019-08-07 16:11:05 +01:00
AllenZMC
5c2c9a03e9 fix word 'consequentally' to 'consequently' (#5827)
Signed-off-by: czm <zhongming.chang@daocloud.io>
2019-08-03 14:56:58 +01:00
beorn7
dd81912554 Add objectives to Summaries
With the next release of client_golang, Summaries will not have
objectives by default. To not lose the objectives we have right now,
explicitly state the current default objectives.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-12 02:03:13 +02:00
pbhudiaBAE
43953b105b Sorting alerts by group name in /alerts (#5448)
* Working group name

Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com>

* Working categorised by group name

Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com>

* Changed group sorting in web

Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com>

* Fixed group sorting and comments

Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com>

* Fixed group sorting and comments with gofmt

Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com>

* Added file and group name

Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com>

* reverted back to full path to yml file

Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com>
2019-05-14 23:14:27 +02:00
Yao Zengzeng
5544cb252a fix some mistakes in comments (#5533)
Signed-off-by: YaoZengzeng <yaozengzeng@zju.edu.cn>
2019-05-05 10:48:42 +01:00
Simon Pasquier
45506841e6
*: enable all default linters (#5504)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-05-03 15:11:28 +02:00
Bjoern Rabenstein
76102d570c Add test for external labels in label template
Signed-off-by: Bjoern Rabenstein <bjoern@rabenste.in>
2019-04-21 00:08:39 +02:00
Bjoern Rabenstein
38d518c0fe Rework #5009 after comments
Signed-off-by: Bjoern Rabenstein <bjoern@rabenste.in>
2019-04-17 01:40:10 +02:00
Sylvain Rabot
335a34486e Add external labels to template expansion
This affects the expansion of templates in alert labels and
annotations and console templates.

Signed-off-by: Sylvain Rabot <sylvain@abstraction.fr>
2019-04-17 01:40:10 +02:00
Tariq Ibrahim
8fdfa8abea refine error handling in prometheus (#5388)
i) Uses the more idiomatic Wrap and Wrapf methods for creating nested errors.
ii) Fixes some incorrect usages of fmt.Errorf where the error messages don't have any formatting directives.
iii) Does away with the use of fmt package for errors in favour of pkg/errors

Signed-off-by: tariqibrahim <tariq181290@gmail.com>
2019-03-26 00:01:12 +01:00
James Ravn
e15d8c5802 reload: copy state on both name and labels (#5368)
* reload: copy state on both name and labels

Fix https://github.com/prometheus/prometheus/issues/5193

Using just name causes the linked issue - if new rules are inserted with
the same name (but different labels), the reordering will cause stale
markers to be inserted in the next eval for all shifted rules, despite
them not being stale.

Ideally we want to avoid stale markers for time series that still exist
in the new rules, with name and labels being the unique identifer.

This change adds labels to the internal map when copying the old rule
data to the new rule data. This prevents the problem of staling rules
that simply shifted order.

If labels change, it is a new time series and the old series will stale
regardless. So it should be safe to always match on name and labels when
copying state.

Signed-off-by: James Ravn <james@r-vn.org>
2019-03-15 15:23:36 +00:00
David Symonds
46361a7c85 rules: Fix sorting of result from (*Manager).RuleGroups (#5260)
The previous code was defective in that it never sorted groups within a
file due to doing a multi-key sort incorrectly.

Signed-off-by: David Symonds <dsymonds@gmail.com>
2019-02-23 09:51:44 +01:00
beorn7
2db1eeb4ec Fix prometheus_rule_group_last_evaluation_timestamp_seconds
It should be a unix timestamp, not the seconds in the minute.

Signed-off-by: beorn7 <beorn@soundcloud.com>
2019-02-06 11:02:49 +01:00
Ganesh Vernekar
787eb1e904 Set rule_group_last_duration_seconds to seconds (#5153)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2019-01-31 11:07:58 +01:00
Matt Layher
302148fd69 *: apply gofmt -s
Signed-off-by: Matt Layher <mdlayher@gmail.com>
2019-01-16 17:28:14 -05:00
Vishnunarayan K I
fd3ef6ba34 Add metric rule_group_rules_loaded to get the number of rules loaded (#5090)
Signed-off-by: Vishnunarayan K I <appukuttancr@gmail.com>
2019-01-13 14:28:07 +00:00
Simon Pasquier
f678e27eb6
*: use latest release of staticcheck (#5057)
* *: use latest release of staticcheck

It also fixes a couple of things in the code flagged by the additional
checks.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Use official release of staticcheck

Also run 'go list' before staticcheck to avoid failures when downloading packages.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-01-04 14:47:38 +01:00
Tom Wilkie
121603c417
Expose rules.NewGroupMetrics and rules.Metrics. (#5059)
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-01-03 12:07:06 +00:00
Tom Wilkie
6e08029b56
Move err to be the last return value from storage.Select. (#5054)
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-01-02 11:10:13 +00:00
Bartek Płotka
de213d4a5e rule manager: Moved metric registration to custom registerer which is already available. (#4961)
Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
2018-12-28 10:20:29 +00:00
AixesHunter
fb8479a677 Variable 'labels' collides with imported package name (#5012)
Signed-off-by: aixeshunter <aixeshunter@gmail.com>
2018-12-19 09:44:03 +00:00
mknapphrt
f0e9196dca Return warnings on a remote read fail (#4832)
Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>
2018-11-30 14:27:12 +00:00
Krasi Georgiev
0754e5334b
querier for RestoreForState not closed. (#4922)
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-11-28 15:25:17 +02:00
Ben Kochie
c6399296dc
Fix spelling/typos (#4921)
* Fix spelling/typos

Fix spelling/typos reported by codespell/misspell.
* UK -> US spelling changes.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-11-27 17:44:29 +01:00
Wei Guo
e329cbf673 Add metric prometheus_rule_group_last_evaluation for recording and alerting (#4852)
* add metric prometheus_rule_group_last_evaluation for recording and alerting

Signed-off-by: Wei Guo <me@imkira.com>

* fix issues from comments

Signed-off-by: Wei Guo <me@imkira.com>
2018-11-27 14:38:13 +08:00
Will Hegedus
193ebe7e34 Updates to /targets and /rules (scrape duration, last evaluation time) (#4722)
* Add evaluationTimestamp (Last Evaluation) column to display on /rules
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>

* Add lastScrapeDuration ("Scrape Duration") to display on /targets
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>

* Updates based on Julius' feedback

Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>

* Update to set timestamp to when eval started (after eval completes)

Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>

* Update /rules to display time since last evaluation

Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>

* Re-order Last Eval/Eval Time to be consistent with targets page

Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>
2018-10-12 18:26:59 +02:00
Callum Styan
9bca041285 WIP: keep track of samples per query, set a max # of samples (#4513)
* keep track of samples per query, set a max # of samples that can be in
memory at once

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2018-10-02 12:59:19 +01:00
Ganesh Vernekar
5790d23fd8 Unit testing for rules (#4350)
* Unit testing for rules
* Specifying order of group evaluation in unit tests

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-09-25 17:06:26 +01:00
Ganesh Vernekar
05726c5ea2 Test template expansion while loading groups (#4537)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-09-13 13:55:58 +01:00
Chris Marchbanks
63ed9d1b70 Send EndsAt along with alerts (#4550)
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2018-08-28 16:05:00 +01:00
Chris Marchbanks
87f1dad16d throttle resends of alerts to 1 minute by default (#4538)
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2018-08-27 17:41:42 +01:00
Goutham Veeramachaneni
f3b7c22827 rules: add comment about lock taking (#4525)
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2018-08-21 21:30:08 +02:00
Ganesh Vernekar
c663477688 Fixed TestUpdate in rules/manager_test.go (#4516)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-08-20 18:21:05 +05:30
Julius Volz
8fbe1b5133
Handle a bunch of unchecked errors (#4461)
There are many more (mostly finalizers like Close/Stop/etc.), but most of
the others seemed like one couldn't do much about them anyway.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-08-17 17:24:35 +02:00
Ganesh Vernekar
a0a9e7df91 Fix TestForStateRestore (#4476) (#4512)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-08-16 22:56:15 +05:30
Julien Pivotto
0b4d22b245 rules/manager: remove a no-longer-relevant comment (#4503)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2018-08-15 09:33:39 +01:00
Chris Marchbanks
11155c7028 Existing alert labels will update based on templates (#4500)
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2018-08-15 08:52:08 +01:00
Fabian Reinartz
b7e2f407de rules: Fix double-locking of mutex
Signed-off-by: Fabian Reinartz <freinartz@google.com>
2018-08-07 07:33:39 -04:00
Benji Visser
8bb6e0dd6e Show rule evaluation errors on rules page (#4457)
* adding information about the health and errors for Rules

adding Health() and LastError() to the Rule interface. This will allow
us to easily surface information about rules.

Signed-off-by: noqcks <benny@noqcks.io>

* updating rules.html with fields for Rule errors and health state

Signed-off-by: noqcks <benny@noqcks.io>

* fix code comment grammar & access Rule health/error info using a mutex

Signed-off-by: noqcks <benny@noqcks.io>

* s/Errors/Error/ in rules.html to remain consistent with targets.html

Signed-off-by: noqcks <benny@noqcks.io>

* adding periods to code comments in reporting/alerting

Signed-off-by: noqcks <benny@noqcks.io>

* putting health/error below mutex in struct field

Signed-off-by: noqcks <benny@noqcks.io>
2018-08-07 00:33:45 +02:00
Julius Volz
2b8fc062a8
rules: HTML-escape rule YAML marshal errors (#4464)
This was pointed out by `gosec`.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-08-05 14:01:51 +02:00
Julius Volz
90521a65f8
Remove error return value from NotifyFunc() (#4459)
It's always nil and we also forgot to check it.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-08-04 21:31:12 +02:00
Ganesh Vernekar
f1db699dff Persist alert 'for' state across restarts (#4061)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-08-02 11:18:24 +01:00
Max Leonard Inden
71fafad099
api/v1: Coninue work exposing rules and alerts
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-07-30 15:31:51 +02:00
mg03
31f8ca0dfb
api v1 alerts/rules json endpoint
Signed-off-by: mg03 <mgeng03@gmail.com>
2018-07-30 15:29:44 +02:00
Bryan Boreham
afdb66dfac Expose Group.CopyState() (#4304)
This makes the `rules` package more useful to projects that use
Prometheus as a library.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2018-07-18 15:14:38 +02:00
Julius Volz
9e3171f6e3 rules: Minor naming/comment cleanups (#4328)
Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-07-18 04:54:33 +01:00
Bryan Boreham
2bd510a63e Make TestUpdate() do some work (#4306)
Previously it would set no preconditions and check no postconditions,
as the `groups` member was empty.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2018-06-22 15:21:04 +01:00
Alin Sinpalean
9dc763cc03 Run rule evaluation with timestamps precisely evaluation_interval apart (#4201)
* Run rule evaluation with timestamps precisely evaluation_interval apart from one another.

Signed-off-by: Alin Sinpalean <alin.sinpalean@gmail.com>
2018-06-01 15:23:07 +01:00
Mario Trangoni
464e747f1e fix some comments typos (#4059) 2018-04-08 10:51:54 +01:00
Bryan Boreham
93494d8b7e Add an OpenTracing span for each rule (#4027)
* Add an OpenTracing span for each rule

So that tags and child spans can be traced back to the rule that they
refer to.
2018-03-30 21:29:19 +01:00
ferhat elmas
ec8e4d8a7c all: remove unnecessary type conversions (#3992)
excep promql due to not to create conflict with #3966.
2018-03-21 09:25:22 +00:00
Warren Fernandes
58e2a31db8 Cleans up test by removing unused function (#3969) 2018-03-15 08:59:19 +00:00
ferhat elmas
ffa673f7d8 General simplifications (#3887)
Another try as in #1516
2018-02-26 07:58:10 +00:00
Fabian Reinartz
7ccd4b39b8 *: implement query params
This adds a parameter to the storage selection interface which allows
query engine(s) to pass information about the operations surrounding a
data selection.
This can for example be used by remote storage backends to infer the
correct downsampling aggregates that need to be provided.
2018-02-13 12:17:22 +01:00
Simon Pasquier
81c0ab69e0 Don't reset FiredAt for inactive alerts
Otherwise AlertManager receives resolved alerts where StartsAt is zero which
fails the validation.
2018-01-22 17:17:33 +01:00
Brian Brazil
30b4439bbd Remove rule_type label from rule metrics.
This is not really needed now that we have rule groups
to distinguish rules.
2017-12-04 11:44:38 +00:00
Brian Brazil
b97f4cf48c Add metrics for rule group interval and last duration. 2017-12-04 11:44:38 +00:00
Brian Brazil
0a42a9fc8f Copy over rule group duration on reload.
This is currently getting lost, this will soon be in a metric and we
don't want it dropping to 0 on every reload.
2017-12-04 11:44:38 +00:00
Brian Brazil
aa370fa568 Clarify metric names around rule groups.
Make it clear they're about overall rule groups.
2017-12-04 11:44:38 +00:00
Fabian Reinartz
62461379b7 rules: decouple notifier packages
The dependency on the notifier packages caused a transitive dependency
on discovery and with that all client libraries our service discovery
uses.
2017-11-27 16:38:14 +01:00
Fabian Reinartz
4d964a0a0d rules: make glob expansion a concern of main 2017-11-24 08:22:57 +01:00
Fabian Reinartz
bd9f7460eb rules: remove config package dependency 2017-11-24 07:57:54 +01:00
Fabian Reinartz
2d0e3746ac rules: remove dependency on promql.Engine 2017-11-24 07:57:54 +01:00
Fabian Reinartz
2ec5965b75
Merge pull request #3508 from prometheus/uptsdb
update TSDB
2017-11-23 19:11:54 +01:00
Fabian Reinartz
83cd270ea4 *: adapt to storage interface changes 2017-11-23 19:05:04 +01:00
Goutham Veeramachaneni
a880c86375
Fix unexported method on exported interface.
Also move to model.Duration

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-11-23 19:13:57 +05:30
conorbroderick
55aaece116 Add rule evaluation time 2017-11-22 15:22:02 +00:00
Goutham Veeramachaneni
e1117715fe
rules: remove skipped iterations cuz no throttling
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-11-14 17:33:00 +05:30
Jorge Hernández
6cd0f63eb1 Use testutil in rules subpackage (#3278)
* Use testutil in rules subpackage

* Fix manager test

* Use testutil in rules subpackage

* Fix manager test

* Fix rebase

* Change to testutil for applyConfig tests
2017-11-11 11:29:47 +01:00
Krasi Georgiev
e86d82ad2d Fix regression of alert rules state loss on config reload. (#3382)
* incorrect map name for the group prevented copying state from existing alert rules on config reload

* applyConfig test

* few nits

* nits 2
2017-11-01 12:58:00 +01:00
Julius Volz
099df0c5f0 Migrate "golang.org/x/net/context" -> "context" (#3333)
In some places, where ctxhttp or gRPC are concerned, we still need to use the
old contexts.
2017-10-24 21:21:42 -07:00
Brian Brazil
cc5499fcad Only close after checking for err. 2017-10-09 19:44:03 +01:00
Brian Brazil
ee88f0d222 Ensure all values are used or _ 2017-10-09 19:44:03 +01:00
Fabian Reinartz
2d0b8e8b94 Merge branch 'master' into dev-2.0 2017-10-05 13:09:18 +02:00
Julius Volz
f7e8348a88 Re-add contexts to storage.Storage.Querier() (#3230)
* Re-add contexts to storage.Storage.Querier()

These are needed when replacing the storage by a multi-tenant
implementation where the tenant is stored in the context.

The 1.x query interfaces already had contexts, but they got lost in 2.x.

* Convert promql.Engine to use native contexts
2017-10-04 21:04:15 +02:00
beorn7
c2e9a151ab Make all rule links link to the "Console" tab rather than "Graph"
Clicking on a rule, either the name or the expression, opens the rule
result (or the corresponding expression, repsectively) in the
expression browser. This should by default happen in the console tab,
as, more often than not, displaying it in the graph tab runs into a
timeout.
2017-09-21 18:28:00 +02:00
Fabian Reinartz
d21f149745 *: migrate to go-kit/log 2017-09-08 22:01:51 +05:30
Goutham Veeramachaneni
e1fc9dc78d Move /rules to new format (#2901)
Fixes #2891

Signed-off-by: Goutham Veeramachaneni <goutham@boomerangcommerce.com>
2017-07-08 11:38:02 +02:00
Goutham Veeramachaneni
37e7b69f56
Merge remote-tracking branch 'upstream/dev-2.0' into rulegroups
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-19 16:34:55 +05:30
Goutham Veeramachaneni
c472316fb3
Check done before every rule evaluation.
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 16:57:22 +05:30
Goutham Veeramachaneni
6b70a4d850
Incorporate PR feedback
* Move fingerprint to Hash()
* Move away from tsdb.MultiError
* 0777 -> 0666 for files
* checkOverflow of extra fields

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 16:44:33 +05:30
Goutham Veeramachaneni
507790a357
Rework logging to use explicitly passed logger
Mostly cleaned up the global logger use. Still some uses in discovery
package.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 15:52:44 +05:30
Goutham Veeramachaneni
dc69645e92
Move back to go-yaml
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 10:46:21 +05:30
Goutham Veeramachaneni
5ff283a7b7
Reflect the grouping in the UI
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 16:09:14 +05:30
Goutham Veeramachaneni
8cca666cf2
Add file name to group.
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 15:18:39 +05:30
Goutham Veeramachaneni
e893c89333
Validate labels and annotations
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 15:07:54 +05:30
Goutham Veeramachaneni
a48a018368
Make sure groups are unique in a single file
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 12:19:21 +05:30
Goutham Veeramachaneni
cea1e99f78
Add update-rules command to promtool
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 11:38:54 +05:30
Goutham Veeramachaneni
e8f55669ea Move rules to new format
Signed-off-by: Goutham Veeramachaneni <goutham@boomerangcommerce.com>
2017-06-12 18:14:39 +05:30
Brian Brazil
dcea3e4773 Don't append a 0 when alert is no longer pending/firing
With staleness we no longer need this behaviour.
2017-05-24 13:52:45 +01:00
Brian Brazil
cc867dae60 Copy previous series and alert state more intelligently.
Usually rules don't more around, and if they do it's likely
that rules/alerts with the same name stay in the same order.

If rules/alerts with the same name are added/removed this
could cause a blip for one cycle, but this is unavoidable
without requiring rule and alert names to be unique - which we don't
want to do.
2017-05-24 13:52:45 +01:00
Brian Brazil
9bc68db7e6 Track staleness per rule rather than per group. 2017-05-24 13:52:45 +01:00
Brian Brazil
0451d6d31b Add unittest for rule staleness, and rules generally. 2017-05-24 13:52:45 +01:00
Brian Brazil
0400f3cfd2 Very basic staleness handling for rules. 2017-05-24 13:52:45 +01:00
Fabian Reinartz
06c2b76cd4 Merge branch 'master' into uptsdb 2017-05-16 16:48:37 +02:00
Alexey Palazhchenko
b0e1ea7c6c Simplify code, fix typos. (#2719) 2017-05-15 09:56:09 +01:00
Julius Volz
ac203ef0ee Add externalURL template function (#2716)
This allows users to e.g. add links back to the generating Prometheus
right in their alert templates.
2017-05-13 15:47:04 +02:00
Julius Volz
fe11c5933a Fix mutation of active alert elements by notifier (#2656)
This caused the external label application in the notifier to bleed back
into the rule manager's active alerting elements.
2017-04-26 10:29:42 -05:00
Fabian Reinartz
8ffc851147 Merge branch 'master' into dev-2.0 2017-04-04 15:17:56 +02:00
Tobias Schmidt
eaf33759fb Register forgotten prometheus_evaluator_iterations_total metric 2017-04-02 20:32:56 -03:00
Tobias Schmidt
aaaba57184 Export number of missed rule evaluations
In case the execution of all rules takes longer than the configured rule
evaluation interval, one or more iterations will be skipped. This needs
to be visible to the opterator.
2017-04-02 20:03:28 -03:00
Fabian Reinartz
5772f1a7ba retrieval/storage: adapt to new interface
This simplifies the interface to two add methods for
appends with labels or faster reference numbers.
2017-02-02 13:05:46 +01:00
Fabian Reinartz
ad9bc62e4c storage: extend appender and adapt it 2017-01-13 14:48:01 +01:00
Fabian Reinartz
e94b0899ee rules: fix tests, remove model types 2016-12-29 17:31:14 +01:00
Fabian Reinartz
f8fc1f5bb2 *: migrate ingestion to new batch Appender 2016-12-29 11:03:56 +01:00
Fabian Reinartz
fecf9532b9 *: fix misc compile errors 2016-12-25 11:42:57 +01:00
Fabian Reinartz
622ece6273 *: fix recording tests, migrate matcher types 2016-12-25 11:12:57 +01:00
Fabian Reinartz
5817cb5bde *: migrate from model.* to promql.* types 2016-12-25 00:37:46 +01:00
Fabian Reinartz
e68a3cf21f rules: update annotations on each iteration 2016-11-22 15:43:07 +01:00
Jonathan Lange
d78dd3593d Set evaluation interval on Group construction
Prevents having object in invalid state, and allows users of public API
to construct valid Groups.
2016-11-18 16:32:30 +00:00
Jonathan Lange
31fc357cd8 Make NewGroup and Group.Eval public
Allows callers to execute evaluate lists of rules without first writing
them to disk.
2016-11-18 16:25:58 +00:00
Jonathan Lange
2a2da40223 Make rule evaluation publicly available
Means that a third-party can parse rules and run them with their own
execution model.
2016-11-18 16:12:50 +00:00
Matt Bostock
926a5ab3dd rules/manager.go: Fix race between reload and stop
On one relatively large Prometheus instance (1.7M series), I noticed
that upgrades were frequently resulting in Prometheus undergoing crash
recovery on start-up.

On closer examination, I found that Prometheus was panicking on
shutdown.

It seems that our configuration management (or misconfiguration thereof)
is reloading Prometheus then immediately restarting it, which I suspect
is causing this race:

    Sep 21 15:12:42 host systemd[1]: Reloading prometheus monitoring system.
    Sep 21 15:12:42 host prometheus[18734]: time="2016-09-21T15:12:42Z" level=info msg="Loading configuration file /etc/prometheus/config.yaml" source="main.go:221"
    Sep 21 15:12:42 host systemd[1]: Reloaded prometheus monitoring system.
    Sep 21 15:12:44 host systemd[1]: Stopping prometheus monitoring system...
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=warning msg="Received SIGTERM, exiting gracefully..." source="main.go:203"
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="See you next time!" source="main.go:210"
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="Stopping target manager..." source="targetmanager.go:90"
    Sep 21 15:12:52 host prometheus[18734]: time="2016-09-21T15:12:52Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:548"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=warning msg="Error on ingesting out-of-order samples" numDropped=1 source="scrape.go:467"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping rule manager..." source="manager.go:366"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Rule manager stopped." source="manager.go:372"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping notification handler..." source="notifier.go:325"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping local storage..." source="storage.go:381"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping maintenance loop..." source="storage.go:383"
    Sep 21 15:13:01 host prometheus[18734]: panic: close of closed channel
    Sep 21 15:13:01 host prometheus[18734]: goroutine 7686074 [running]:
    Sep 21 15:13:01 host prometheus[18734]: panic(0xba57a0, 0xc60c42b500)
    Sep 21 15:13:01 host prometheus[18734]: /usr/local/go/src/runtime/panic.go:500 +0x1a1
    Sep 21 15:13:01 host prometheus[18734]: github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig.func1(0xc6645a9901, 0xc420271ef0, 0xc420338ed0, 0xc60c42b4f0, 0xc6645a9900)
    Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:412 +0x3c
    Sep 21 15:13:01 host prometheus[18734]: created by github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig
    Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:423 +0x56b
    Sep 21 15:13:03 host systemd[1]: prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
2016-09-21 22:03:02 +01:00
Julius Volz
c187308366 storage: Contextify storage interfaces.
This is based on https://github.com/prometheus/prometheus/pull/1997.

This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.
2016-09-19 16:29:07 +02:00
Julius Volz
ed5a0f0abe promql: Allow per-query contexts.
For Weaveworks' Frankenstein, we need to support multitenancy. In
Frankenstein, we initially solved this without modifying the promql
package at all: we constructed a new promql.Engine for every
query and injected a storage implementation into that engine which would
be primed to only collect data for a given user.

This is problematic to upstream, however. Prometheus assumes that there
is only one engine: the query concurrency gate is part of the engine,
and the engine contains one central cancellable context to shut down all
queries. Also, creating a new engine for every query seems like overkill.

Thus, we want to be able to pass per-query contexts into a single engine.

This change gets rid of the promql.Engine's built-in base context and
allows passing in a per-query context instead. Central cancellation of
all queries is still possible by deriving all passed-in contexts from
one central one, but this is now the responsibility of the caller. The
central query context is now created in main() and passed into the
relevant components (web handler / API, rule manager).

In a next step, the per-query context would have to be passed to the
storage implementation, so that the storage can implement multi-tenancy
or other features based on the contextual information.
2016-09-19 15:38:17 +02:00
beorn7
75bae065fd Revert "Modify tests to adjust to reverting the /graph changes"
This reverts commit f1ea5bf232.

Part two necessary for reverting the /graph revert.
2016-09-03 21:08:33 +02:00
beorn7
f1ea5bf232 Modify tests to adjust to reverting the /graph changes
These tests have been added after the /graph changes and therefore
already test the new syntax.

This commit has to be reverted together with the previous one to get
back to the old new state. *sigh*
2016-09-02 14:12:31 +02:00
Julius Volz
fe7b8b7fd1 Add missing license header to alerting_test.go 2016-08-13 00:11:52 +02:00
Julius Volz
da7206ec29 Fix rule HTML escaping issues
This was mentioned as part of https://github.com/prometheus/alertmanager/issues/452
2016-08-12 02:59:41 +02:00
Brian Brazil
6fc88d4b4d Remove __name__ from alerts sent to AM.
Fixes #1861
2016-08-01 23:32:41 +01:00
Dmitry Vorobev
273e457da4 web: return status code and error message for config resource 2016-07-15 10:15:24 +02:00
Brian Brazil
0509b0f2db Expand alert templates at eval time.
Fixes #1678 #1677
2016-07-12 17:13:55 +01:00
beorn7
064b57858e Consistently use the Seconds() method for conversion of durations
This also fixes one remaining case of recording integral numbers
of seconds only for a metric, i.e. this will probably fix #1796.
2016-07-07 15:24:35 +02:00
Fabian Reinartz
f7ed2ff706 Merge pull request #1644 from prometheus/beorn7/logging
Add missing logging of out-of-order samples
2016-05-20 05:52:00 -07:00
beorn7
b95c096a45 Fix style issues in rules/... 2016-05-19 16:59:53 +02:00
beorn7
45e5775f9b Add missing logging of out-of-order samples
So far, out-of-order samples during rule evaluation were not logged,
and neither scrape health samples. The latter are unlikely to cause
any errors. That's why I'm logging them always now. (It's alway highly
irregular should it happen.) For rules, I have used the same plumbing
as for samples, just with a different wording in the message to mark
them as a result of rule evaluation.
2016-05-19 16:22:53 +02:00
beorn7
4b574e8a61 Switch chunk encoding to type 2 where it was hardcoded type 1 before
The chunk encoding was hardcoded there because it mostly doesn't
matter what encoding is chosen in that test. Since type 1 is
battle-hardened enough, I'm switching to type 2 here so that we can
catch unexpected problems as a byproduct. My expectation is that the
chunk encoding doesn't matter anyway, as said, but then "unexpected
problems" contains the word "unexpected".
2016-03-20 23:32:20 +01:00
Fabian Reinartz
d89c254849 Make copying alerting state safer.
This considers static labels in the equality of alerts to
avoid falsely copying state from a different alert definition with
the same name across reloads.

To be safe, it also copies the state map rather than just its pointer
so that remaining collisions disappear after one evaluation interval.
2016-03-02 12:21:54 +01:00
Fabian Reinartz
bfa8aaa017 Rename notification to notifier 2016-03-01 12:39:08 +01:00
beorn7
663a1550d0 Fix the instrumentation fixes 2016-02-17 15:50:55 +01:00
Tobias Schmidt
f1f8317fa5 Fix detection of flapping alerts
Alerts in the resolve retention period must be transitioned to the
active state again when their condition is met.
2016-02-04 23:55:12 -05:00
Björn Rabenstein
9ea3897ea7 Merge pull request #1354 from prometheus/beorn7/storage
Rework the way to communicate backpressure (AKA suspended ingestion)
2016-02-01 15:10:13 +01:00
beorn7
ec08c9a391 Rework the way to communicate backpressure (AKA suspended ingestion)
This gives up on the idea to communicate throuh the Append() call (by
either not returning as it is now or returning an error as
suggested/explored elsewhere). Here I have added a Throttled() call,
which has the advantage that it can be called before a whole _batch_
of Append()'s. Scrapes will happen completely or not at all. Same for
rule group evaluations. That's a highly desired behavior (as discussed
elsewhere). The code is even simpler now as the whole ingestion buffer
could be removed.

Logging of throttled mode has been streamlined and will create at most
one message per minute.
2016-02-01 14:45:44 +01:00
beorn7
a7408bfb47 Unify duration parsing
It's actually happening in several places (and for flags, we use the
standard Go time.Duration...). This at least reduces all our
home-grown parsing to one place (in model).
2016-01-29 15:41:50 +01:00
Fabian Reinartz
a6935024e1 Remove old WITH clause in alert printing 2016-01-26 15:45:27 +01:00
Fabian Reinartz
b0adfea8d5 Fix swapped constants, improve instrumentation 2016-01-21 12:15:29 +01:00
Fabian Reinartz
a8c38c3ac5 Don't log rule evaluation failure on shutdown 2016-01-18 17:34:25 +01:00
Fabian Reinartz
6eee86dce8 Terminate rule groups during initial sleep
When an evaluation group runs initially, it waits a deterministic
amount of time. During that time it also has to accept
a termination singnal so shutdown doesn't hang during the first
evaluation iteration after a configuration reload.

Fixes #1307
2016-01-12 10:54:09 +01:00
Fabian Reinartz
26eb3ac2f8 Don't skip recording rule errors 2016-01-12 10:26:06 +01:00
Fabian Reinartz
37d80c4b25 Fix premature rule evaluation
This commit prevents rule evaluation from starting until after
the storage is ready.
2016-01-08 17:51:22 +01:00
Fabian Reinartz
0cf3c6a9ef Add comments, rename a method 2015-12-23 12:29:28 +01:00
Fabian Reinartz
bf6abac8f4 Send resolved notifications 2015-12-17 15:42:26 +01:00
Fabian Reinartz
f69e668fc4 Improve rules/ instrumentation
This commit adds a counter for the total number of rule evaluations
and standardizes the units to seconds.
2015-12-17 15:42:26 +01:00
Fabian Reinartz
62075aa037 Reduce noisy no-alertmanager warning 2015-12-17 15:42:26 +01:00
Fabian Reinartz
52e5224f5a Refactor rules/ package 2015-12-17 15:42:25 +01:00
Fabian Reinartz
e4fabe135a Set StartsAt to time of first firing state 2015-12-17 11:36:58 +01:00
Fabian Reinartz
7c90db22ed Use annotation based alerts in rules/
This commit breaks the previously used alert format.
2015-12-14 10:16:07 +01:00
Fabian Reinartz
e114ce0ff7 Refactor notification handler 2015-12-11 15:17:32 +01:00
Fabian Reinartz
e3b6ec9784 Switch to common/log 2015-10-03 10:21:43 +02:00
Fabian Reinartz
171f50706a Fix unkeyed field errors. 2015-09-18 17:00:08 +02:00
Brian Brazil
4d196fea6b Merge pull request #1032 from prometheus/scalar-metric
rules: Allow for setting labels on LHS on scalars
2015-08-26 16:56:16 +01:00
Brian Brazil
3bcdb2bbba rules: Allow for setting labels on LHS on scalars 2015-08-26 16:54:28 +01:00
Julius Volz
995d3b831d Fix most golint warnings.
This is with `golint -min_confidence=0.5`.

I left several lint warnings untouched because they were either
incorrect or I felt it was better not to change them at the moment.
2015-08-26 12:44:46 +02:00
Fabian Reinartz
d6b8da8d43 Switch promql types to common/model 2015-08-25 13:49:14 +02:00
Brian Brazil
fdf0d0642e Cast value to float, as that's what the console templates expect. 2015-08-24 16:59:08 +01:00
Fabian Reinartz
438e232c9b Fix grouping of import blocks 2015-08-22 09:42:45 +02:00
Fabian Reinartz
306e8468a0 Switch from client_golang/model to common/model 2015-08-21 13:33:38 +02:00
Brian Brazil
e6a67476c2 rules: Allow recorded rules expressions to be scalars.
This is useful if you want to build up a constant metric,
such as a set of alert thresholds that vary by label value.
2015-08-19 21:09:00 +01:00
Fabian Reinartz
7a67472fc1 Resolve relative paths on configuration loading
This moves the concern of resolving the files relative to the config
file into the configuration loading itself.
It also fixes #921 which did not load the cert and token files relatively.
2015-08-05 18:08:04 +02:00
Fabian Reinartz
feb8a03503 rules: load rule files relative to a base dir 2015-07-03 15:10:37 +02:00
Julius Volz
fcff35b43e Consolidate external reachability flags into one.
Besides fixing https://github.com/prometheus/prometheus/issues/805 by
making the entire externally reachable server URL configurable, this
adds tests for the "globalURL" template function and makes it easier to
test other such functions in the future.

This breaks the `web.Hostname` flag (and introduces `web.external-url`).
This flag is likely only used by few users, so I hope that's
justifiable.

Fixes https://github.com/prometheus/prometheus/issues/805
2015-07-03 13:39:10 +02:00
Fabian Reinartz
f06cf664e1 rules: cleanup alerting test 2015-06-30 18:22:24 +02:00
Fabian Reinartz
9bd4f6d017 rules: preserve alert state across reloads. 2015-06-30 11:32:07 +02:00
Fabian Reinartz
4625485b84 rules: move rules*.go contents to manager*.go 2015-06-30 11:32:07 +02:00
Fabian Reinartz
749ae450c5 promql: add runbook to alert statement.
This commit adds the RUNBOOK keyword to alert statements. The field
is optional and expected to be a link.
2015-06-25 13:00:52 +02:00
Julius Volz
d868264bb8 Improve UI of /alerts page.
Changes to the UI:
- "Active Since" timestamps are now human-readable.
- Alerting rules are now pretty-printed better.
- Labels are no longer just strings, but alert bubbles (like we do on
  the status page for base labels).
- Alert states and target health states are now capitalized in the
  presentation layer rather than at the source.
2015-06-23 18:48:45 +02:00
Fabian Reinartz
fe301d7946 promql: remove global flags 2015-06-15 19:01:06 +02:00
Fabian Reinartz
5e13880201 General cleanup of rules. 2015-06-06 21:40:52 +02:00
Fabian Reinartz
75c920c95e Remove DotGraph method from Rule interface 2015-06-06 21:35:59 +02:00
Fabian Reinartz
83d07516e8 Remove EvalRaw methods from Rule interface 2015-06-06 21:34:09 +02:00
Fabian Reinartz
280d11dca8 main: exit on invalid rule files on startup. 2015-06-02 18:44:41 +02:00
Fabian Reinartz
0de6edbdfc Move pkg/ to util/ 2015-06-01 21:12:32 +02:00
Fabian Reinartz
02717e6fde Remove generic set type 2015-06-01 21:12:32 +02:00
Fabian Reinartz
dbc0d30e3e Move string functionality to pkg/strutil 2015-06-01 21:12:32 +02:00
Fabian Reinartz
f45a5cab60 Move templates package to pkg/template 2015-06-01 21:12:31 +02:00
Fabian Reinartz
c44ac7bc26 Load rule files from entire directories 2015-06-01 21:12:31 +02:00
Julius Volz
d7c015c149 Convert pathPrefix to not have trailing slash. 2015-06-01 12:43:17 +02:00
Julius Volz
ff53d10849 Fix double slash in GeneratorURL sent to alertmanager.
Fixes https://github.com/prometheus/prometheus/issues/722
2015-05-23 19:16:57 +02:00
Julius Volz
267fd34156 Switch Prometheus to use github.com/prometheus/log.
This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
2015-05-20 18:19:32 +02:00
Fabian Reinartz
e2ed921505 Merge branch 'master' into fabxc/servdisc 2015-05-20 14:13:08 +02:00
Mitsuhiro Tanda
3e914a8cb1 fix graph links with path prefix 2015-05-19 02:45:05 +09:00
Fabian Reinartz
bb540fd9fd Implement config reloading on SIGHUP.
With this commit, sending SIGHUP to the Prometheus process will reload
and apply the configuration file. The different components attempt
to handle failing changes gracefully.
2015-05-13 16:49:46 +02:00
Fabian Reinartz
fe935179cd Stop routing rule statements through the engine. 2015-04-29 18:01:43 +02:00
Fabian Reinartz
8d7c479fed Merge pull request #658 from prometheus/fabxc/pql/rules-manager
Rename RuleManager to Manager, remove interface.
2015-04-29 16:54:21 +02:00
Fabian Reinartz
479891c9be Rename RuleManager to Manager, remove interface.
This commits renames the RuleManager to Manager as the package
name is 'rules' now. The unused layer of abstraction of the
RuleManager interface is removed.
2015-04-29 16:42:10 +02:00
Fabian Reinartz
25cdff3527 Remove name arg from Parse* functions, enhance parsing errors. 2015-04-29 16:38:41 +02:00
Fabian Reinartz
3ca11bcaf5 Switch Prometheus to promql package.
This commit removes all functionality from rules/ that is now handled in
promql/.
All parts of Prometheus are changed to use the promql/ package.
2015-04-28 16:19:23 +02:00
Ceesjan Luiten
0e18784c64 Make all paths absolute to support proxies 2015-04-02 20:36:47 +02:00
Brian Brazil
941f585164 Avoid +InfYs and similar, just display +Inf. 2015-03-28 18:51:41 +00:00
beorn7
a075900f9a Merge branch 'beorn7/persistence' into beorn7/ingestion-tweaks 2015-03-18 19:09:31 +01:00
Fabian Reinartz
624f27f4b6 Add ln, log2, log10 and exp functions to the query language. 2015-03-16 18:26:19 +01:00
Julius Volz
b2651027fc Fix special value handling in division and modulo.
This fixes https://github.com/prometheus/prometheus/issues/597
2015-03-16 14:23:40 +01:00
beorn7
be11cb2b07 Remove the sample ingestion channel.
The one central sample ingestion channel has caused a variety of
trouble. This commit removes it. Targets and rule evaluation call an
Append method directly now. To incorporate multiple storage backends
(like OpenTSDB), storage.Tee forks the Append into two different
appenders.

Note that the tsdb queue manager had its own queue anyway. It was a
queue after a queue... Much queue, so overhead...

Targets have their own little buffer (implemented as a channel) to
avoid stalling during an http scrape. But a new scrape will only be
started once the old one is fully ingested.

The contraption of three pipelined ingesters was removed. A Target is
an ingester itself now. Despite more logic in Target, things should be
less confusing now.

Also, remove lint and vet warnings in ast.go.
2015-03-15 14:08:22 +01:00
beorn7
13fcf1ddbc Implement double-delta encoded chunks. 2015-03-05 20:33:26 +01:00
beorn7
9e85ab0eef Apply the new signature/fingerprinting functions from client_golang.
This requires the new version of client_golang (vendoring will follow
in the next commit), which changes the fingerprinting for
clientmodel.Metric.
2015-03-03 18:34:01 +01:00
Fabian Reinartz
182de6b99f Fix unary +/- expressions.
Unary expressions cause parsing errors if they are done in the lexer
by tokenizing them into the number.
This fix moves unary expressions to the parser.
2015-03-03 13:30:08 +01:00
Fabian Reinartz
6f754073d5 Add OR operation and vector matching options.
This commits implements the OR operation between two vectors.
Vector matching using the ON clause is added to limit the set of
labels that define a match between two elements. Group modifiers
(GROUP_LEFT/GROUP_RIGHT) to request many-to-one matching are added.
2015-03-03 11:35:10 +01:00
Julius Volz
0ac931aed1 Also support parsing float formats like "2.". 2015-03-02 12:58:05 +01:00
Julius Volz
c2ab54e9a6 Support scientific notation and special float values.
This adds support for scientific notation in the expression language, as
well as for all possible literal forms of +Inf/-Inf/NaN.

TODO: Keep enough state in the parser/lexer to distinguish contexts in
which "Inf", "NaN", etc. should be parsed as a number vs. parsed as a
label name. Currently, foo{nan="bar"} would be a syntax error. However,
that is an existing bug for all our reserved words. E.g. foo{sum="bar"}
is a syntax error as well. This should be fixed separately.
2015-03-01 19:31:16 +01:00
beorn7
1a61bcae07 Fix plural of 'histogram'.
Actually, 'histogram' is Ancient Greek and 3rd declension... ;-)
2015-02-23 15:29:26 +01:00
beorn7
17443d288b Avoid copying of the COWMetric if we already have the metric available. 2015-02-22 01:04:52 +01:00
beorn7
9e7c3e3bcd Add the histogram_quantile function.
Since we are now getting really deep into floating point calculation,
the tests had to take into account the precision loss. Since the rule
tests are based on direct line matching in the output, implementing
the "almost equal" semantics was pretty cumbersome, but here we are.
2015-02-22 01:04:51 +01:00
Julius Volz
42601acfde Replace labelsToKey() with metric Fingerprint (fixes grouping bug). 2015-02-21 17:45:47 +01:00
Julius Volz
7fefccd929 Write() directly into hash and use model.SeparatorByte. 2015-02-21 17:19:13 +01:00
Julius Volz
645cf57bed Fix aggregation grouping key calculation. 2015-02-21 14:05:50 +01:00
Julius Volz
15b2b5aa66 Add tests for invalid uses of "offset". 2015-02-18 02:56:40 +01:00
Julius Volz
67e20acc6c Lower-case some package-internal names. 2015-02-18 02:45:54 +01:00
Julius Volz
72d7b325a1 Implement offset operator.
This allows changing the time offset for individual instant and range
vectors in a query.

For example, this returns the value of `foo` 5 minutes in the past
relative to the current query evaluation time:

    foo offset 5m

Note that the `offset` modifier always needs to follow the selector
immediately. I.e. the following would be correct:

    sum(foo offset 5m) // GOOD.

While the following would be *incorrect*:

    sum(foo) offset 5m // INVALID.

The same works for range vectors. This returns the 5-minutes-rate that
`foo` had a week ago:

    rate(foo[5m] offset 1w)

This change touches the following components:

* Lexer/parser: additions to correctly parse the new `offset`/`OFFSET`
  keyword.
* AST: vector and matrix nodes now have an additional `offset` field.
  This is used during their evaluation to adjust query and result times
  appropriately.
* Query analyzer: now works on separate sets of ranges and instants per
  offset. Isolating different offsets from each other completely in this
  way keeps the preloading code relatively simple.

No storage engine changes were needed by this change.

The rules tests have been changed to not probe the internal
implementation details of the query analyzer anymore (how many instants
and ranges have been preloaded). This would also become too cumbersome
to test with the new model, and measuring the result of the query should
be sufficient.

This fixes https://github.com/prometheus/prometheus/issues/529
This fixed https://github.com/prometheus/promdash/issues/201
2015-02-18 02:41:27 +01:00
Brian Brazil
60271d58bf Change the 2nd argument of round to toNearest.
This is more useful if you want get a multiple of 2 or 5, while
still working for .001.
2015-02-05 16:13:40 +00:00
Julius Volz
82613527f3 Remove unnecessary float64() conversion in round(). 2015-02-05 15:14:05 +01:00
Marko Mikulicic
8fdacbdf17 Add floor, ceil and round functions. Closes #402 2015-02-04 17:20:56 +01:00
Fabian Reinartz
fa1e90003b Query timeout added.
This is related to #454. Queries now timeout after a duration set by
the -query.timeout flag. The TotalEvalTimer is now started/stopped
inside any of the ast.Eval* functions.
2015-02-03 08:04:27 +01:00
Bjoern Rabenstein
26e22e6ad6 Fix rule manager shutdown. 2015-01-29 15:05:10 +01:00
Julius Volz
d4374a9265 More efficient JSON query result format.
This depends on https://github.com/prometheus/client_golang/pull/51.

For vectors, the result format looks like this:

```json
{
   "version": 1,
   "type" : "vector",
   "value" : [
      {
         "timestamp" : 1421765411.045,
         "value" : "65.475000",
         "metric" : {
            "quantile" : "0.5",
            "instance" : "http://localhost:9090/metrics",
            "job" : "prometheus",
            "__name__" : "http_request_duration_microseconds",
            "handler" : "/static/",
            "method" : "get",
            "code" : "304"
         }
      },
      {
         "timestamp" : 1421765411.045,
         "value" : "5826.339000",
         "metric" : {
            "quantile" : "0.9",
            "instance" : "http://localhost:9090/metrics",
            "job" : "prometheus",
            "__name__" : "http_request_duration_microseconds",
            "handler" : "prometheus",
            "method" : "get",
            "code" : "200"
         }
      },
      /* ... */
   ]
}
```

For matrices, it looks like this:

```json
{
   "version": 1,
   "type" : "matrix",
   "value" : [
      {
         "metric" : {
            "quantile" : "0.99",
            "instance" : "http://localhost:9090/metrics",
            "job" : "prometheus",
            "__name__" : "http_request_duration_microseconds",
            "handler" : "/static/",
            "method" : "get",
            "code" : "200"
         },
         "values" : [
            [
               1421765547.659,
               "29162.953000"
            ],
            [
               1421765548.659,
               "29162.953000"
            ],
            [
               1421765549.659,
               "29162.953000"
            ],
            /* ... */
         ]
      }
   ]
}
```
2015-01-26 13:06:22 +01:00
Brian Brazil
a31730e88b Make 2nd arg to delta optional. Add a deriv() function.
The 2nd isCounter argument to delta is ugly, make it optional as the first step
of deprecating it. This will makes delta only ever applied to gauges.

Add a deriv function to calculate the least squares
slope of a gauge. This is more useful for prediction than delta,
as it isn't as heavily influenced by outliers at the boundaries.
2015-01-23 14:50:27 +00:00
Bjoern Rabenstein
5859b74f1b Clean up license issues.
- Move CONTRIBUTORS.md to the more common AUTHORS.
- Added the required NOTICE file.
- Changed "Prometheus Team" to "The Prometheus Authors".
- Reverted the erroneous changes to the Apache License.
2015-01-21 20:07:45 +01:00
Bjoern Rabenstein
b09453af1d Adjust to new client_golang API. 2015-01-21 15:42:25 +01:00
Julius Volz
bb1e49383e Log rule evalation errors. 2015-01-08 17:50:55 +01:00
Julius Volz
d6b9e97655 Remove extraction.Result type, simplify code. 2015-01-08 16:34:01 +01:00
Julius Volz
9a4ca68a61 Add metrics for rule evaluation failures.
Fixes https://github.com/prometheus/prometheus/issues/417
2015-01-08 16:33:35 +01:00
Brian Brazil
ffa2e73803 Fix regression from 5e8d57bec1
0 is a false value, so shortcutting no longer works.
Update other places in the code that assumed graph was the default.
2014-12-27 00:28:36 +00:00
Julius Volz
cc27fb8aab Rename remaining all-caps constants in AST layer.
Change-Id: Ibe97e30981969056ffcdb89e63c1468ea1ffa140
2014-12-25 01:30:47 +01:00
Julius Volz
895523ad14 Include necessary Makefile.INCLUDE from rules/Makefile.
Change-Id: I077d018dbe4093cd40ddf38d66a996df222bf5e4
2014-12-25 01:13:59 +01:00
Julius Volz
2ade9d40cf Clarify why we need int constants for expression types.
Change-Id: I053fc5d32c118dbdb204dc8193337f981aff796e
2014-12-25 00:45:30 +01:00
Julius Volz
00a2a93a05 Add regression tests for metrics mutations in AST.
It turned out in the end, that only drop_common_metrics() produced any
erroneous output in the old system. The second expression in the test
("sum(testmetric) keeping_extra") already worked in the old code, but
why not keep it in...

The way to test ranged evaluations is a bit clumsy so far, so I want to
build a nicer test framework in the end, where all the test cases can be
specified as text files which specify desired inputs, outputs, query
step widths, etc.

Change-Id: I821859789e69b8232bededf670a1b76e9e8c8ca4
2014-12-12 20:34:55 +01:00
Julius Volz
c9618d11e8 Introduce copy-on-write for metrics in AST.
This depends on changes in:

https://github.com/prometheus/client_golang/tree/cow-metrics.

Change-Id: I80b94833a60ddf954c7cd92fd2cfbebd8dd46142
2014-12-12 20:34:55 +01:00
Bjoern Rabenstein
b1e4956142 Apply a giant code cleanup.
Essentially:

- Remove unused code.

- Make it 'go vet' clean. The only remaining warnings are in generated code.

- Make it 'golint' clean. The only remaining warnings are in gerenated code.

- Smoothed out same minor things.

Change-Id: I3fe5c1fbead27b0e7a9c247fee2f5a45bc2d42c6
2014-12-10 16:16:49 +01:00
Bjoern Rabenstein
fee88a7a77 Remove the remaining races, new and old.
Also, resolve a few other TODOs.

Change-Id: Icb39b5a5e8ca22ebcb48771cd8951c5d9e112691
2014-12-03 18:07:23 +01:00
Bjoern Rabenstein
7d11019aa2 Squash a few trivial TODOs.
- Delete unneeded file view_adapter.go.
- Assessed that we still need the fingerprints in nodes
  (to create iterators).
- Turned numMemChunkDescs into a metric.

Change-Id: I29be963c795a075ec00c095f76bf26405535609d
2014-11-27 18:26:06 +01:00
Julius Volz
6eecee55b7 Fix acronym caps in GeneratorURL.
Change-Id: Ib18c1f617dcde1039e848059545a6d8831d9bf66
2014-11-25 17:13:04 +01:00
Bjoern Rabenstein
0ae1d8889a Fix tests after merge.
Change-Id: Ia90da9a3e48ed780ec38c4a6a1fd9ea34e7f6a58
2014-11-25 17:13:04 +01:00
Julius Volz
b7bf11230a Add absent() function.
A common problem in Prometheus alerting is to detect when no timeseries
exist for a given metric name and label combination. Unfortunately,
Prometheus alert expressions need to be of vector type, and
"count(nonexistent_metric)" results in an empty vector, yielding no
output vector elements to base an alert on. The newly introduced
absent() function solves this issue:

  ALERT FooAbsent IF absent(foo{job="myjob"}) [...]

absent() has the following behavior:

- if the vector passed to it has any elements, it returns an empty
  vector.

- if the vector passed to it has no elements, it returns a 1-element
  vector with the value 1.

In the second case, absent() tries to be smart about deriving labels of
the 1-element output vector from the input vector:

  absent(nonexistent{job="myjob"})                   => {job="myjob"}
  absent(nonexistent{job="myjob",instance=~".*"})    => {job="myjob"}
  absent(sum(nonexistent{job="myjob"}))              => {}

That is, if the passed vector is a literal vector selector, it takes all
"=" label matchers as the basis for the output labels, but ignores all
non-equals or regex matchers. Also, if the passed vector results from a
non-selector expression, no labels can be derived.

Change-Id: I948505a1488d50265ab5692a3286bd7c8c70cd78
2014-11-25 17:13:04 +01:00
Julius Volz
3d47f94149 Drop metric names after transformations.
After many transformations, it doesn't make sense to keep the metric
names, since the result of the transformation is no longer that metric.
This drops the metric name after such transformations and makes the web
UI deal well with missing metric names.

This depends on the current branch on the following things:

- prometheus/client_golang needs to be at
  e237cf15c6
  in branch "julius/int-fingerprints" (to be merged with new storage)

- prometheus/promdash needs to be at
  dd7691c9c2

Change-Id: Ib3c8cad8d647d9854e8c653c424b8c235ccc231d
2014-11-25 17:13:04 +01:00
Bjoern Rabenstein
14bda4180c Changes after pair code review.
Change-Id: Ib72d40f8e9027818cfbbd32a7a7201eebda07455
2014-11-25 17:12:59 +01:00
Bjoern Rabenstein
006b5517e2 Simplify makefiles.
This removes the dependancy on C leveldb and snappy.
It also takes care of fewer dependencies as they would
anyway not work on any non-Debian, non-Brew system.

Change-Id: Ia70dce1ba8a816a003587927e0b3a3f8ad2fd28c
2014-11-25 17:10:39 +01:00
Bjoern Rabenstein
74c143c4c9 Improve scraper shutdown time.
- Stop target pools in parallel.
- Stop individual scrapers in goroutines, too.
- Timing tweaks.

Change-Id: I9dff1ee18616694f14b04408eaf1625d0f989696
2014-11-25 17:10:39 +01:00
Julius Volz
0712d738d1 Allow alternative "by"-clause position in grammar.
In addition to the existing by-clause syntax:

  sum(<expression>) by (<labels>) [keeping_extra]

...this allows the following new syntax:

  sum by (<labels>) [keeping_extra] (<expression>)

Both orderings may be used in a single expression. It is up to the users
to establish guidelines around their usage.

Change-Id: Iba10c9cc5fb6ac62edfcf246d281473e82467992
2014-11-25 17:09:04 +01:00
Julius Volz
0e48c18bbf Allow omitting the metric name in queries.
This allows the following expression syntaxes for selecting timeseries:

  foo                    (already valid before)
  foo{}                  (already valid before)
  {job="prometheus"}     (new, select all timeseries for job "prometheus")

Omitting both the metric name *and* any label matchers ("" or "{}") will
still yield a syntax error.

To get all timeseries, you could do:

    {__name__=~".*"}
       or, without relying on knowledge about __metric__:
    {job=~".*"}

Change-Id: Ifee000b9ac0184ef6ced18411069c7f2699a2dda
2014-11-25 17:09:04 +01:00
Bjoern Rabenstein
096fa0f8b2 Squash a number of TODOs.
- Staleness delta is no a proper function parameter and not replicated
  from package ast.

- Named type 'chunks' replaced by explicit '[]chunk' to avoid confusion.

- For the same reason, replaced 'chunkDescs' by '[]*chunkDescs'.

- Verified that math.Modf is not a speed enhancement over conversion
  (actually 5x slower).

- Renamed firstTimeField, lastTimeField into chunkFirstTime and
  chunkLastTime.

- Verified unpin() is sufficiently goroutine-safe.

- Decided not to update archivedFingerprintToTimeRange upon series
  truncation and added a rationale why.

Change-Id: I863b8d785e5ad9f71eb63e229845eacf1bed8534
2014-11-25 17:09:04 +01:00
Bjoern Rabenstein
b3ed9aa7a2 Clean up start-up and shut-down.
Change-Id: Idff4bbb0a15a9f879bfbb3da5b1025179cab5e2c
2014-11-25 17:08:45 +01:00
Bjoern Rabenstein
38fc24d0ed Fix targetpool_test.go and other tests.
Change-Id: I91a4dd1d39e01f174e1aaae653ce1ed7aecaa624
2014-11-25 17:08:26 +01:00
Julius Volz
7f5d3c2c29 Fix and improve the fp locker.
Benchmark:
$ go test -bench 'Fingerprint' -test.run 'Fingerprint' -test.cpu=1,2,4

OLD
BenchmarkFingerprintLockerParallel        500000              3618 ns/op
BenchmarkFingerprintLockerParallel-2      100000             12257 ns/op
BenchmarkFingerprintLockerParallel-4      500000             10164 ns/op
BenchmarkFingerprintLockerSerial        10000000               283 ns/op
BenchmarkFingerprintLockerSerial-2      10000000               284 ns/op
BenchmarkFingerprintLockerSerial-4      10000000               288 ns/op

NEW
BenchmarkFingerprintLockerParallel       1000000              1018 ns/op
BenchmarkFingerprintLockerParallel-2     1000000              1164 ns/op
BenchmarkFingerprintLockerParallel-4     2000000               910 ns/op
BenchmarkFingerprintLockerSerial        50000000                56.0 ns/op
BenchmarkFingerprintLockerSerial-2      50000000                47.9 ns/op
BenchmarkFingerprintLockerSerial-4      50000000                54.5 ns/op

Change-Id: I3c65a43822840e7e64c3c3cfe759e1de51272581
2014-11-25 17:07:45 +01:00
Julius Volz
358f97791d Minor cleanups.
Change-Id: Ia8685d8439a421fe2143d9ec7120d5bb5ab88d78
2014-11-25 17:07:44 +01:00
Bjoern Rabenstein
f5f9f3514a Major code cleanup.
- Make it go-vet and golint clean.
- Add comments, TODOs, etc.

Change-Id: If1392d96f3d5b4cdde597b10c8dff1769fcfabe2
2014-11-25 17:02:53 +01:00
Julius Volz
e7ed39c9a6 Initial experimental snapshot of next-gen storage.
Change-Id: Ifb8709960dbedd1d9f5efd88cdd359ee9fa9d26d
2014-11-25 17:02:00 +01:00
Julius Volz
85497e3f38 Add function to drop common labels in a vector.
This fixes https://github.com/prometheus/prometheus/issues/384.

Change-Id: I2973c4baeb8a4618ec3875fb11c6fcf5d111784b
2014-11-25 17:02:00 +01:00
Julius Volz
3fdb74e571 Add more topk() / bottomk() tests.
Test what happens if k > number of input elements.

Change-Id: Ie724b850939e297ebf085f0a5a3522e9cfcc6534
2014-11-25 17:02:00 +01:00
Julius Volz
c582ae73c2 Implement topk() and bottomk() functions.
To achieve O(log n * k) runtime, this uses a heap to track the current
bottom-k or top-k elements while iterating over the full set of
available elements.

It would be possible to reuse more code between topk and bottomk, but I
decided for some more duplication for the sake of clarity.

This fixes https://github.com/prometheus/prometheus/issues/399

Change-Id: I7487ddaadbe7acb22ca2cf2283ba6e7915f2b336
2014-11-25 17:02:00 +01:00
Bjoern Rabenstein
1909686789 Make metrics exported by the Prometheus server itself more consistent.
- Always spell out the time unit (e.g. milliseconds instead of ms).

- Remove "_total" from the names of metrics that are not counters.

- Make use of the "Namespace" and "Subsystem" fields in the options.

- Removed the "capacity" facet from all metrics about channels/queues.
  These are all fixed via command line flags and will never change
  during the runtime of a process. Also, they should not be part of
  the same metric family. I have added separate metrics for the
  capacity of queues as convenience. (They will never change and are
  only set once.)

- I left "metric_disk_latency_microseconds" unchanged, although that
  metric measures the latency of the storage device, even if it is not
  a spinning disk. "SSD" is read by many as "solid state disk", so
  it's not too far off. (It should be "solid state drive", of course,
  but "metric_drive_latency_microseconds" is probably confusing.)

- Brian suggested to not mix "failure" and "success" outcome in the
  same metric family (distinguished by labels). For now, I left it as
  it is. We are touching some bigger issue here, especially as other
  parts in the Prometheus ecosystem are following the same
  principle. We still need to come to terms here and then change
  things consistently everywhere.

Change-Id: If799458b450d18f78500f05990301c12525197d3
2014-11-25 17:02:00 +01:00
Julius Volz
00b9489f1c Fix time() behavior.
time() should return the timestamp for which the query is executed, not
the actual current time.

Change-Id: I430a45cabad7785cd58f95b1028a71dff4c87710
2014-11-25 17:02:00 +01:00
Julius Volz
c5984f1818 Add abs() and over-time aggregation functions.
This implements aggregation functions over time as request in
https://github.com/prometheus/prometheus/issues/383.

Change-Id: Ifd69b850de8cfdf6e7a6c0e042056fa4c672410e
2014-11-25 17:02:00 +01:00
Brian Brazil
f525ca5d9e Let consoles get graph links from experssions.
Rename ConsoleLinkFromExpression, as we now have consoles.

Change-Id: I7ed2c9c83863adb390b51121dd9736845f7bcdfc
2014-11-25 17:01:59 +01:00
Bjoern Rabenstein
8956faeccb Migrate to new client_golang.
This change will only be submitted when the new client_golang has been
moved to the new version.

Change-Id: Ifceb59333072a08286a8ac910709a8ba2e3a1581
2014-11-25 17:01:59 +01:00
Brian Brazil
960ede66dc Use html/template for console templates and add template libary support.
Add a function to bypass the new auto-escaping.
Add a function to workaround go's templates only allowing passing in one argument.

Change-Id: Id7aa3f95e7c227692dc22108388b1d9b1e2eec99
2014-11-25 17:01:59 +01:00
Brian Brazil
e041c0cd46 Add console and alert templates with access to all data.
Move rulemanager to it's own package to break cicrular dependency.
Make NewTestTieredStorage available to tests, remove duplication.

Change-Id: I33b321245a44aa727bfc3614a7c9ae5005b34e03
2014-05-30 16:24:56 +01:00
Bjoern Rabenstein
ca6a4fccef Weed out our homegrown test.Tester.
The Go stdlib has testing.TB now, which fulfills the exact same
purpose.

Change-Id: I0db9c73400e208ca376b932a02b7e3402234b87c
2014-05-21 19:27:24 +02:00
Julius Volz
6297a405f2 Do not indent API JSON responses.
In one example response, this reduced the uncompressed size by 25% and
the gzipped size by 11%.

Change-Id: Ie80d44253124b9f8601b8ef9fc978e92dacff523
2014-04-22 15:16:37 +02:00
Julius Volz
01f652cb4c Separate storage implementation from interfaces.
This was initially motivated by wanting to distribute the rule checker
tool under `tools/rule_checker`. However, this was not possible without
also distributing the LevelDB dynamic libraries because the tool
transitively depended on Levigo:

rule checker -> query layer -> tiered storage layer -> leveldb

This change separates external storage interfaces from the
implementation (tiered storage, leveldb storage, memory storage) by
putting them into separate packages:

- storage/metric: public, implementation-agnostic interfaces
- storage/metric/tiered: tiered storage implementation, including memory
                         and LevelDB storage.

I initially also considered splitting up the implementation into
separate packages for tiered storage, memory storage, and LevelDB
storage, but these are currently so intertwined that it would be another
major project in itself.

The query layers and most other parts of Prometheus now have notion of
the storage implementation anymore and just use whatever implementation
they get passed in via interfaces.

The rule_checker is now a static binary :)

Change-Id: I793bbf631a8648ca31790e7e772ecf9c2b92f7a0
2014-04-16 13:30:19 +02:00
Julius Volz
d411a7d810 Allow reversing vector and scalar arguments in binops.
This allows putting a scalar as the first argument of a binary operator
in which the second argument is a vector:

  <scalar> <binop> <vector>

For example,

  1 / http_requests_total

...will output a vector in which every sample value is 1 divided by the
respective input vector element.

This even works for filter binary operators now:

  1 == http_requests_total

Returns a vector with all values set to 1 for every element in
http_requests_total whose initial value was 1.

Note: For filter binary operators, the resulting values are always taken
from the left-hand-side of the operation, no matter whether the scalar
or the vector argument is the left-hand-side. That is,

  1 != http_requests_total

...will set all result vector sample values to 1, although these are
exactly the sample elements that were != 1 in the input vector.

If you want to just filter elements without changing their sample
values, you still need to do:

  http_requests_total != 1

The new filter form is a bit exotic, and so probably won't be used
often. But it was easier to implement it than disallow it completely or
change its behavior.

Change-Id: Idd083f2bd3a1219ba1560cf4ace42f5b82e797a5
2014-04-08 17:16:18 +02:00
Julius Volz
c7c0b33d0b Add regex-matching support for labels.
There are four label-matching ops for selecting timeseries now:

- Equal: =
- NotEqual: !=
- RegexMatch: =~
- RegexNoMatch: !~

Instead of looking up labels by a simple clientmodel.LabelSet (basically
an equals op for every key/value pair in the set), timeseries
fingerprint selection is now done via a list of metric.LabelMatchers.

Change-Id: I510a83f761198e80946146770ebb64e4abc3bb96
2014-04-01 14:24:53 +02:00
Bjoern Rabenstein
0a65b691cc Disallow ":" in identifiers, but still allow it in metric names.
Change-Id: Iace925ab1b71a360bd63357e87f68e727f7afbcb
2014-03-21 13:44:37 +01:00
Julius Volz
86fc13a52e Convert metric.Values to slice of values.
The initial impetus for this was that it made unmarshalling sample
values much faster.

Other relevant benchmark changes in ns/op:

Benchmark                                 old        new   speedup
==================================================================
BenchmarkMarshal                       179170     127996     1.4x
BenchmarkUnmarshal                     404984     132186     3.1x

BenchmarkMemoryGetValueAtTime           57801      50050     1.2x
BenchmarkMemoryGetBoundaryValues        64496      53194     1.2x
BenchmarkMemoryGetRangeValues           66585      54065     1.2x

BenchmarkStreamAdd                       45.0       75.3     0.6x
BenchmarkAppendSample1                   1157       1587     0.7x
BenchmarkAppendSample10                  4090       4284     0.95x
BenchmarkAppendSample100                45660      44066     1.0x
BenchmarkAppendSample1000              579084     582380     1.0x
BenchmarkMemoryAppendRepeatingValues 22796594   22005502     1.0x

Overall, this gives us good speedups in the areas where they matter
most: decoding values from disk and accessing the memory storage (which
is also used for views).

Some of the smaller append examples take minimally longer, but the cost
seems to get amortized over larger appends, so I'm not worried about
these. Also, we're currently not bottlenecked on the write path and have
plenty of other optimizations available in that area if it becomes
necessary.

Memory allocations during appends don't change measurably at all.

Change-Id: I7dc7394edea09506976765551f35b138518db9e8
2014-03-11 18:23:37 +01:00