Service instrumentation, monitoring, and alerting with Prometheus
Julius Volz, Björn “Beorn” Rabenstein
Production Engineers, SoundCloud Ltd.
Velocity New York, 2015-10-12
Velocity Amsterdam, 2015-10-28
Architecture
Resources
Project homepage: http://prometheus.io
These slides: https://goo.gl/qTs1BI
Instructions and examples: https://github.com/juliusv/prometheus_workshop
If you didn’t download the files from the pre-work, go to http://10.10.32.101
If I had to tell you only four things...
1. Multi-dimensional data model (like OpenTSDB).
2. Operational simplicity (unlike OpenTSDB).
3. Scalable data collection (yes, it's pull, not push).
4. Powerful query language (the same for exploring, graphing, alerting).
Operational simplicity
$ go build
$ ./prometheus
Hands on! Work through the following sections in the instructions:
➔ Getting Prometheus (hopefully already done...)
➔ Configuring Prometheus to monitor itself
➔ Starting Prometheus
➔ Using the expression browser
Architecture
Multi-dimensional data model
api_http_requests_total{method="GET", endpoint="/api/tracks", status="200"} 2034834
(like OpenTSDB)
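To make the data model concrete, here is a minimal sketch in plain Go (standard library only, not the Prometheus client library). It assumes a hypothetical helper `seriesKey` and a hypothetical in-memory map; the point is only that every distinct combination of metric name and label values is its own time series.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey is a hypothetical helper: it renders a metric name and its labels
// in the notation used on the slide. Labels are sorted so that the same label
// set always yields the same key.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(parts, ", ") + "}"
}

func main() {
	// Hypothetical store: one counter value per unique label set.
	samples := map[string]float64{}
	get := map[string]string{"method": "GET", "endpoint": "/api/tracks", "status": "200"}
	post := map[string]string{"method": "POST", "endpoint": "/api/tracks", "status": "200"}

	samples[seriesKey("api_http_requests_total", get)]++
	samples[seriesKey("api_http_requests_total", get)]++
	samples[seriesKey("api_http_requests_total", post)]++

	// Two series result: one for GET (value 2) and one for POST (value 1).
	for k, v := range samples {
		fmt.Println(k, v)
	}
}
```

Changing any single label value (here `method`) produces a separate series, which is what lets the query language later filter and aggregate along those dimensions.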
Powerful query language
topk(3, sum(rate(bazooka_instance_cpu_time_seconds_total[5m])) by (app, proc))
sort_desc(sum(bazooka_instance_memory_limit_bytes - bazooka_instance_memory_usage_bytes) by (app, proc))
Scalable data collection
Thousands of targets.
Hundreds of thousands of samples per second.
Millions of time series.
On a single monitoring server.
Running many servers is easy, too…
Pull, not push.
Expression browser
Built-in graphing
Hands on! Work through the following sections in the instructions:
➔ Start the node exporter
➔ Configure Prometheus to monitor node exporter
➔ Use the node exporter to export the contents of a text file
➔ Configuring targets with service discovery
Architecture
Example: Request Duration
http_request_duration_seconds_total
http_requests_total
http_request_duration_seconds_total / http_requests_total
http_request_duration_seconds
http_request_duration_seconds_sum
http_request_duration_seconds_count
http_request_duration_seconds_sum / http_request_duration_seconds_count
Request Duration Average ...and how to aggregate it.
http_request_duration_seconds_sum / http_request_duration_seconds_count
sum(http_request_duration_seconds_sum) / sum(http_request_duration_seconds_count)
sum(http_request_duration_seconds_sum) by (job) / sum(http_request_duration_seconds_count) by (job)
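The reason for summing numerator and denominator separately can be seen in a small worked example. The sketch below is plain Go with made-up numbers; `instance`, `aggregateAverage`, and `averageOfAverages` are hypothetical names. It contrasts the aggregation used in the expressions above with the naive average of per-instance averages.

```go
package main

import "fmt"

// instance holds the two counters one target exposes.
type instance struct {
	sum   float64 // http_request_duration_seconds_sum
	count float64 // http_request_duration_seconds_count
}

// aggregateAverage mirrors sum(..._sum) / sum(..._count): every request
// carries the same weight, regardless of which instance served it.
// This sketch returns 0 when there are no requests at all.
func aggregateAverage(instances []instance) float64 {
	var sum, count float64
	for _, in := range instances {
		sum += in.sum
		count += in.count
	}
	if count == 0 {
		return 0
	}
	return sum / count
}

// averageOfAverages is the naive alternative: it weights every instance
// equally, so a nearly idle instance can dominate the result.
func averageOfAverages(instances []instance) float64 {
	if len(instances) == 0 {
		return 0
	}
	var total float64
	for _, in := range instances {
		total += in.sum / in.count
	}
	return total / float64(len(instances))
}

func main() {
	// Made-up data: a busy fast instance and a nearly idle slow one.
	instances := []instance{
		{sum: 10, count: 100}, // average 0.1s over 100 requests
		{sum: 9, count: 10},   // average 0.9s over 10 requests
	}
	fmt.Printf("sum/sum:             %.4f s\n", aggregateAverage(instances))
	fmt.Printf("average of averages: %.4f s\n", averageOfAverages(instances))
}
```

With these numbers, 19 seconds were spent on 110 requests, so the true average is about 0.17s, while the average of averages reports 0.5s.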
Request Duration Average How to specify the time range.
rate(http_request_duration_seconds_sum[10m]) / rate(http_request_duration_seconds_count[10m])
sum(rate(http_request_duration_seconds_sum[10m])) by (job) / sum(rate(http_request_duration_seconds_count[10m])) by (job)
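The effect of dividing two rates can be sketched with two samples per counter. This is a simplified illustration in plain Go: `sample`, `simpleRate`, and `windowAverage` are hypothetical names, and unlike the real `rate()` function the sketch ignores counter resets and any extrapolation to the window boundaries.

```go
package main

import "fmt"

// sample is one scraped counter value; t is a timestamp in seconds.
type sample struct {
	t, v float64
}

// simpleRate returns the per-second increase between two samples.
// Simplification: no handling of counter resets or extrapolation.
func simpleRate(first, last sample) float64 {
	return (last.v - first.v) / (last.t - first.t)
}

// windowAverage divides the rate of the _sum counter by the rate of the
// _count counter, yielding the average duration of only those requests
// that happened inside the window.
func windowAverage(sumFirst, sumLast, countFirst, countLast sample) float64 {
	return simpleRate(sumFirst, sumLast) / simpleRate(countFirst, countLast)
}

func main() {
	// Made-up counter values at the start and end of a 10m (600s) window.
	sumFirst, sumLast := sample{0, 1000}, sample{600, 1060}     // seconds spent
	countFirst, countLast := sample{0, 5000}, sample{600, 5300} // requests served

	fmt.Printf("rate(sum):   %.2f s/s\n", simpleRate(sumFirst, sumLast))
	fmt.Printf("rate(count): %.2f req/s\n", simpleRate(countFirst, countLast))
	fmt.Printf("average:     %.2f s\n", windowAverage(sumFirst, sumLast, countFirst, countLast))
}
```

Here 60 seconds were spent on 300 requests during the window, so the windowed average is 0.2s, independent of whatever the counters had accumulated before the window started.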
Prometheus Summary Ruby, Go, legacy Java client only...
temps := prometheus.NewSummary(prometheus.SummaryOpts{
    Name:       "http_request_duration_seconds",
    Help:       "Summary for the duration of all HTTP requests.",
    Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01},
})
temps.Observe(0.083)
temps.Observe(0.119)

Resulting time series:
http_request_duration_seconds{quantile="0.5"}
http_request_duration_seconds{quantile="0.9"}
http_request_duration_seconds_count
http_request_duration_seconds_sum
Hands on! Work through the whole chapter The expression language. (End before Instrument code: Go.)
Prometheus Histogram Let's do the bucketing ourselves.
temps := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "Histogram for the duration of all HTTP requests.",
    Buckets: []float64{0.02, 0.05, 0.1},
})
temps.Observe(0.153)

Resulting time series:
http_request_duration_seconds_bucket{le="0.02"}
http_request_duration_seconds_bucket{le="0.05"}
http_request_duration_seconds_bucket{le="0.1"}
http_request_duration_seconds_bucket{le="+Inf"}
http_request_duration_seconds_count
http_request_duration_seconds_sum
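How an observation lands in the `le` buckets can be sketched without the client library. The type `histogram` and its methods below are hypothetical and simplified: this version stores cumulative counts directly, which is an assumption of the sketch and not a statement about how the real library keeps its state internally.

```go
package main

import "fmt"

// histogram is a simplified stand-in for a Prometheus histogram.
type histogram struct {
	upperBounds []float64 // sorted bucket upper bounds, excluding +Inf
	buckets     []uint64  // cumulative counts; the last entry is the +Inf bucket
	sum         float64
	count       uint64
}

func newHistogram(upperBounds []float64) *histogram {
	return &histogram{
		upperBounds: upperBounds,
		buckets:     make([]uint64, len(upperBounds)+1),
	}
}

// observe increments every bucket whose upper bound is >= v ("le" means
// less than or equal), plus the +Inf bucket, the sum, and the count.
func (h *histogram) observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.buckets[i]++
		}
	}
	h.buckets[len(h.upperBounds)]++
	h.sum += v
	h.count++
}

func main() {
	h := newHistogram([]float64{0.02, 0.05, 0.1})
	h.observe(0.153) // larger than every finite bound: only +Inf is incremented
	h.observe(0.03)  // counted in le="0.05", le="0.1", and le="+Inf"

	for i, ub := range h.upperBounds {
		fmt.Printf("le=%q %d\n", fmt.Sprint(ub), h.buckets[i])
	}
	fmt.Printf("le=%q %d\n", "+Inf", h.buckets[len(h.upperBounds)])
	fmt.Printf("count %d\n", h.count)
}
```

Because the buckets are cumulative, the `+Inf` bucket always equals the total count, which the later SLA and quantile expressions rely on.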
Bucketing utilities

temps := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration",
    Help:    "Histogram for the duration of all HTTP requests.",
    Buckets: prometheus.LinearBuckets(20, 5, 5),
})

temps := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration",
    Help:    "Histogram for the duration of all HTTP requests.",
    Buckets: prometheus.ExponentialBuckets(10, 1.5, 10),
})
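To show which upper bounds these two helpers yield, here is a stand-alone re-implementation for illustration only. The lowercase functions `linearBuckets` and `exponentialBuckets` are hypothetical stand-ins written for this sketch; they assume the arguments mean (start, width, count) and (start, factor, count), with `start` being the upper bound of the lowest bucket.

```go
package main

import "fmt"

// linearBuckets returns count upper bounds, beginning at start and
// increasing by width each step.
func linearBuckets(start, width float64, count int) []float64 {
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start
		start += width
	}
	return bounds
}

// exponentialBuckets returns count upper bounds, beginning at start and
// multiplying by factor each step.
func exponentialBuckets(start, factor float64, count int) []float64 {
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start
		start *= factor
	}
	return bounds
}

func main() {
	fmt.Println(linearBuckets(20, 5, 5))         // 20, 25, 30, 35, 40
	fmt.Println(exponentialBuckets(10, 1.5, 10)) // 10, 15, 22.5, 33.75, ...
}
```

Linear buckets give uniform resolution over a narrow range, while exponential buckets cover several orders of magnitude with the same number of series.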
Am I within SLA? “Serve 95% of requests within 300ms.”
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job) / sum(rate(http_request_duration_seconds_count[5m])) by (job)
Apdex score Target request duration 300ms, tolerable request duration 1.2s.
( sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job) + sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m])) by (job) ) / 2 / sum(rate(http_request_duration_seconds_count[5m])) by (job)
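The arithmetic behind the SLA check and the Apdex expression can be worked through with made-up numbers. In this plain Go sketch the function names `slaFraction`, `apdex`, and `classicApdex` are hypothetical. The inputs are cumulative bucket counts, so the `le="1.2"` count already includes the requests from `le="0.3"`, which is why adding both and halving matches the usual Apdex definition of satisfied plus half of tolerating.

```go
package main

import "fmt"

// slaFraction is the share of requests served within the target duration:
// bucket{le="0.3"} / count.
func slaFraction(le03, total float64) float64 {
	return le03 / total
}

// apdex follows the slide's expression on cumulative bucket counts:
// (bucket{le="0.3"} + bucket{le="1.2"}) / 2 / count.
func apdex(le03, le12, total float64) float64 {
	return (le03 + le12) / 2 / total
}

// classicApdex is the textbook form, (satisfied + tolerating/2) / total,
// with tolerating = requests slower than 0.3s but within 1.2s.
func classicApdex(le03, le12, total float64) float64 {
	satisfied := le03
	tolerating := le12 - le03
	return (satisfied + tolerating/2) / total
}

func main() {
	// Made-up counts over some window: 1000 requests in total,
	// 900 within 0.3s and 980 within 1.2s.
	le03, le12, total := 900.0, 980.0, 1000.0

	fmt.Printf("within 300ms: %.1f%%\n", 100*slaFraction(le03, total))
	fmt.Printf("apdex:        %.2f\n", apdex(le03, le12, total))
	fmt.Printf("classic form: %.2f\n", classicApdex(le03, le12, total))
}
```

With these numbers only 90% of requests finish within 300ms, so the 95% SLA is missed, and both Apdex forms evaluate to 0.94. Note that this only works if 0.3 and 1.2 are actual bucket boundaries of the histogram.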
Finally aggregatable quantiles... Plus: pick φ-quantile and time window at evaluation time.
histogram_quantile(0.9, http_request_duration_seconds_bucket)
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[5m])) by (le,job))
Integrations
Official exporters: Node/system metrics exporter, JMX exporter, MySQL server exporter, SNMP exporter, Graphite exporter, Collectd exporter, HAProxy exporter, StatsD bridge, AWS CloudWatch exporter, Hystrix metrics publisher, Mesos task exporter, Consul exporter
3rd party exporters and probers: Bind exporter, CouchDB exporter, Django exporter, Google's mtail log data extractor, HTTP(s)/TCP/ICMP blackbox prober, Memcached exporter, Meteor JS web framework exporter, Minecraft exporter module, MongoDB exporter, Munin exporter, New Relic exporter, RabbitMQ exporter, Redis exporter, RethinkDB exporter, Rsyslog exporter, scollector exporter, SMTP/Maildir MDA blackbox prober
Direct instrumentation: cAdvisor, Kubernetes, Kubernetes-Mesos, Etcd, gokit, go-metrics instrumentation library, RobustIRC
Client libraries
Official: Go, Java (JVM), Ruby, Python
Unofficial: .NET / C#, Node.js, Haskell, Bash (more to come...)
Hands on!
➔ Now instrument your code. Pick the Go chapter or the Python chapter, whatever you prefer.
➔ Point Prometheus to your instrumented code.
➔ Use the expression browser to explore.
PromDash
Hands on! Work through the following chapters in the instructions:
➔ Dashboard Building: Console Templates
➔ Dashboard Building: PromDash
Architecture
Alertmanager
Hands on! Work through the Alerting chapter in the instructions.
Architecture
Hands on! Work through the Pushing Metrics chapter in the instructions.
Architecture