Prometheus monitoring series best practices

Prometheus was the second project to graduate from the CNCF, after Kubernetes. I am very fond of this open-source monitoring platform, which detects and predicts problems from metric data. The official description alone is not enough, though. The official website lists the following features, but some of its brief summaries may not mean much to you, so I have added my own notes.
[Screenshot from the official website]

Prometheus in plain language

  • High-dimensional data model
    • Time series are identified by a metric name and key-value pairs (labels), so you can tell apart the metrics of multiple (isolated) environments.
    • Every metric can carry arbitrary multi-dimensional labels, and you can add your own, for example which team owns the monitoring of a service.
    • The data model is flexible: metric names do not have to be encoded as dot-separated strings.
    • Data can be aggregated, sliced, and diced.
    • Sample values are double-precision floats, and label values may contain any Unicode characters.
      If this does not mean much to you yet, it will once you start using it.
  • Powerful PromQL queries
    • Supports query expressions, and values can be compared directly in PromQL.
    • PromQL's built-in functions can compute how a metric changes, such as its average value, growth rate, and so on.
  • Excellent visualization
    • Personally I don't find the built-in visualization outstanding at all, haha. Use it together with Grafana instead; visualization is their specialty, after all~
  • Efficient storage
    • You can set how many days of metric data to retain as needed, or persist the data long term, for example through a remote storage adapter.
  • Easy to use
    • Simple deployment
    • Dynamic service discovery
    • Hot reloading of the configuration
    • Configuration file format checking
  • Accurate alerting
    • Alerting is handled not by Prometheus itself but by Alertmanager.
    • You can set silence periods, group alerts, and match alerts to decide who should receive the notification email.
    • Many notification channels are supported, such as the commonly used Slack, WeCom (enterprise WeChat), DingTalk, and email, plus several that are popular outside China; you can also build your own.
  • Client libraries for many languages
    • Common programming languages are supported.
  • Rich exporter ecosystem
    • Solid support for monitoring common middleware, databases, hosts, and so on.
    • Also covers things that are sometimes overlooked, such as certificate validity and domain name expiry.
    • For example, there are jmx, snmp, vmi, and other exporters; search for "prometheus exporter" on github.com to find them.
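
To make the data model and PromQL points above a bit more concrete, here is a minimal sketch written as a Prometheus rule file. The group and rule names are made up for illustration. `up` is generated by Prometheus for every scraped target and `process_cpu_seconds_total` is exposed by the standard Go client, and the `job="qa-prometheus"` selector assumes a scrape job with that name, as in the prometheus.yml further down.

```yaml
groups:
  - name: promql_examples                       # hypothetical group name
    rules:
      # Select series by label (job="qa-prometheus"), then compute the
      # per-second growth rate of CPU time over the last 5 minutes.
      - record: job:process_cpu_seconds:rate5m  # hypothetical rule name
        expr: sum by (job) (rate(process_cpu_seconds_total{job="qa-prometheus"}[5m]))
      # Average value: the fraction of the last hour each target was up.
      - record: instance:up:avg1h               # hypothetical rule name
        expr: avg_over_time(up[1h])
```

The two `expr` values can also be pasted into the expression browser on port 9090 to try them interactively.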

In short, there is very little that Prometheus cannot monitor, and compared with Zabbix it is simpler and more user-friendly. I won't go through the Prometheus metric types here; there is already plenty of material online, so I won't repeat it. You can take a look at https://yunlzheng.gitbook.io/prometheus-book/introduction. What I mostly want to focus on is how to do things in practice.

Deploying Prometheus: the essentials

Service               Version      Function                                                     Configuration file
Prometheus            v2.12.0      Data processing                                              prometheus.yml
influxdb              v1.7         Persistent storage of monitoring metrics                     influxdb.conf
remotestorageadapter  latest       Adapter for sending data to remote storage                   -
alertmanager          v0.19.0      Alert management                                             config.yml
pushgateway           v0.10.0      Receives metrics pushed by clients (push mode)               -
grafana               v6.0.0       Data visualization platform                                  grafana.ini
cadvisor              v0.32.0      Analyzes metrics and performance data of running containers  -
Docker                v18.03.0-ce  Container runtime                                            -
docker-compose        v1.11.2      Container orchestration                                      -

You can try all of this out directly with the manifest below, which is managed by docker-compose. That is also good news for anyone who does not have a k8s environment. docker-compose-monitor-platform.yml:

version: '3.4'
services:
  influxdb:
    image: influxdb:1.7
    command: -config /etc/influxdb/influxdb.conf
    container_name: influxdb
    ports:
      - "8086:8086"
    restart: always
    volumes:
      - /data/influxdb:/var/lib/influxdb
    environment:
      - INFLUXDB_DB=prometheus
      - INFLUXDB_ADMIN_ENABLED=true
      - INFLUXDB_ADMIN_USER=admin
      - INFLUXDB_ADMIN_PASSWORD=admin
      - INFLUXDB_USER=prom
      - INFLUXDB_USER_PASSWORD=prom
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 300M
        reservations:
          cpus: '0.25'
          memory: 200M
  remotestorageadapter:
    image: gavind/prometheus-remote-storage-adapter:1.0
    container_name: prometheus-remote-storage-adapter
    ports:
      - 9201:9201
    environment:
      - INFLUXDB_PW=prom
    restart: always
    command: ['-influxdb-url=http://192.168.0.112:8086', '-influxdb.database=prometheus', '-influxdb.retention-policy=autogen','-influxdb.username=prom']
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    restart: always
    volumes:
      - /opt/alertmanager/config.yml:/etc/alertmanager/config.yml
    command: ['--config.file=/etc/alertmanager/config.yml']
  prometheus:
    image: prom/prometheus:v2.12.0
    container_name: prometheus
    restart: always
    volumes:
      - /opt/prometheus/conf/:/etc/prometheus/
    ports:
      - "9090:9090"
    command: ['--web.external-url=http://192.168.0.112:9090', '--config.file=/etc/prometheus/prometheus.yml', '--storage.tsdb.path=/prometheus/data', '--web.enable-lifecycle', '--web.enable-admin-api', '--web.console.templates=/prometheus/consoletest', '--web.page-title=Prometheus monitoring platform']
  pushgateway:
    container_name: pushgateway
    image: prom/pushgateway:v1.0.0
    restart: always
    ports:
      - "9091:9091"
    command: ['--persistence.file=/pushgateway/data','--persistence.interval=5m','--web.external-url=http://192.168.0.112:9091','--web.enable-admin-api','--log.format=json','--log.level=info','--web.enable-lifecycle']
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 300M
        reservations:
          cpus: '0.25'
          memory: 200M
  grafana:
    container_name: grafana
    image: grafana/grafana:6.4.0
    restart: always
    ports:
      - "3000:3000"
    volumes:
      - /data/grafana/grafana.ini:/etc/grafana/grafana.ini
      - /data/grafana:/var/lib/grafana
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 300M
        reservations:
          cpus: '0.25'
          memory: 200M
#    user: "104"
  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    restart: always
    ports:
      - 8080:8080
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

A few points to note:

  1. You need to create the host directories referenced in docker-compose-monitor-platform.yml beforehand.
  2. Sample configuration files are easy to obtain, for example with docker cp from a container, from the official website, or from GitHub.
  3. The main configuration files, for Prometheus and Alertmanager, are listed below.

The Prometheus configuration file, which you can customize as needed:

prometheus.yml

global:
  scrape_interval:     2m # Scrape metrics every 2 minutes (the default is every 1 minute). The scrape frequency affects storage usage and server load
  evaluation_interval: 15s # Evaluate alerting rules every 15 seconds (the default is every 1 minute)
  external_labels:
      monitor: 'Prometheus Monitoring platform'
rule_files:
  - "prom.rules"

alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets: ['192.168.0.112:9093']
scrape_configs:
  - job_name: 'qa-prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'
    static_configs:
      - targets: ['192.168.0.112:9090']
  - job_name: pushgateway
    # honor_labels is a scrape-job setting, not a target label; it keeps the job/instance labels of pushed metrics
    honor_labels: true
    static_configs:
      - targets: ['192.168.0.112:9091']
        labels:
          instances: pushgateway
          instanceserver: 192.168.0.112
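
The rule_files entry above points at prom.rules, whose contents are not shown in this article. As a rough sketch, assuming the file sits next to prometheus.yml in the mounted /opt/prometheus/conf/ directory, it could contain a single alert like the one below. The group name, alert name, severity label, and 5-minute threshold are illustrative choices, not values from the original setup; the env label is set to yourenv only so that it would match the Slack route in the Alertmanager configuration.

```yaml
groups:
  - name: example_alerts              # hypothetical group name
    rules:
      - alert: InstanceDown           # hypothetical alert name
        expr: up == 0                 # target could not be scraped
        for: 5m                       # must stay down for 5 minutes before firing
        labels:
          severity: critical          # illustrative label
          env: yourenv                # matches the 'slack' route's env matcher
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 5 minutes."
```

Since Prometheus 2.0, rule files are written in YAML regardless of the file extension, so the prom.rules name works as is.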

config.yml

global:
  resolve_timeout: 1m # How long Alertmanager waits without receiving an alert again before marking it as resolved. This affects how quickly recovery notifications arrive. The default is 5 minutes
  smtp_smarthost: smtp.163.net:465 # SMTP server, including the port
  smtp_from: xxx # Sender email address
  smtp_auth_username: xxx # Username for authenticating with the sender's mailbox
  smtp_auth_password: xxx # Password (client authorization code) for authenticating with the sender's mailbox
  smtp_require_tls: false # Whether TLS is required
  slack_api_url: 'xxx'

templates:
  - '/etc/alertmanager/template/*.tmpl'
# The root route that every alert enters first; it sets the alert dispatch policy
route: # Defines the routing rules for alerts and which receiver Alertmanager sends matched alerts to. With detailed settings here, alerts can be filtered by label and delivered to the responsible developer
  # Labels used to regroup incoming alerts. For example, if many incoming alerts carry labels such as cluster=A and alertname=LatencyHigh, they are aggregated into one group and sent as a batch
  group_by: ['alertname','cluster']
  # When a new alert group is created, wait at least group_wait before sending the first notification, so that alerts of the same group arriving in that window are sent together
  group_wait: 10s
  # After the first notification has been sent, wait group_interval before notifying about new alerts added to the group
  group_interval: 5m
  # If a notification has already been sent successfully, wait repeat_interval before sending it again
  repeat_interval: 4h
  # Default receiver: an alert that is not matched by any sub-route is sent here
  receiver: default
  # All of the properties above are inherited by the sub-routes and can be overridden on each of them
  routes:
  - receiver: 'default'
    group_wait: 10s
    continue: true
  - receiver: 'slack'
    group_wait: 10s
    match:
      env: yourenv
    continue: true
inhibit_rules:
- source_match:
   env: yourenv
  target_match:
   env: yourenv
  equal: ['alertname', 'cluster']
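receivers:
- name: 'default'
  email_configs:
  - to: 'xxx' # Recipient email address
    send_resolved: true
# The 'slack' receiver is referenced by the route above, so it must be defined here as well
- name: 'slack'
  slack_configs:
  - channel: 'xxx' # Slack channel to post to
    send_resolved: true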

At this point the Prometheus monitoring platform is basically deployed. Next, we need to decide which services to monitor and hook them up to Prometheus according to what we want to monitor.
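
As a possible first step, two services from the compose file are not yet wired into the prometheus.yml shown earlier: the remote storage adapter and cadvisor. The fragment below is only a sketch. The /write and /read paths are those of the upstream remote_storage_adapter example and are assumed to be unchanged in the image used here, and the job name cadvisor is an arbitrary choice.

```yaml
# Fragment to merge into prometheus.yml (assumed layout, not the original file)
remote_write:
  - url: "http://192.168.0.112:9201/write"   # adapter forwards samples to InfluxDB
remote_read:
  - url: "http://192.168.0.112:9201/read"    # lets Prometheus read older data back

scrape_configs:
  # Append this job to the existing scrape_configs list
  - job_name: cadvisor                        # arbitrary job name
    static_configs:
      - targets: ['192.168.0.112:8080']       # cadvisor port from the compose file
```

Because Prometheus is started with --web.enable-lifecycle, the edited configuration can be reloaded without a restart by sending an HTTP POST to /-/reload on port 9090.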

Tags: Linux InfluxDB Docker Database github

Posted on Sat, 21 Mar 2020 10:40:50 -0700 by sarabjit