
prometheus + grafana + alertmanager Installation and Configuration Guide


1. Component overview:

  • prometheus:
    • The server-side daemon. It pulls the metrics (monitoring data) collected by the exporters on each endpoint and stores them in its built-in TSDB (time-series database). Data is kept for 15 days by default; the retention period can be changed with a startup flag.
    • The Prometheus project provides many official exporters.
    • Listens on port 9090 by default, serving a web query UI and an HTTP query API.
    • Alerting rules must be configured manually; alerts that fire are sent to alertmanager, which delivers them through its configured notification channels.
  • grafana:
    • The graph pages built into prometheus are very basic, so grafana is used for visualization instead.
    • grafana is dedicated visualization software that supports many data sources; prometheus is just one of them.
    • It has built-in alerting, and alert rules can be configured directly on a panel. However, alert queries do not support template variables (the special dashboard variables used for convenient display), so every metric on every host must be configured separately, which makes this feature of limited practical use.
    • Default listening port: 3000
  • node_exporter:
    • The agent side; one of the many official Prometheus exporters, installed on each monitored host.
    • Collects host and OS information: cpu, mem, disk, network, filesystem, and many other basic metrics, very comprehensively, and exposes the collected metrics over HTTP for the prometheus server to scrape.
    • Default listening port: 9100
  • cadvisor:
    • The agent side for Docker hosts; collects runtime data for the host and its Docker containers.
    • Runs as a container itself and listens on port 8080 (the published port can be changed, and mapping it to a different host port is recommended).
    • Provides a basic graph UI as well as a metrics endpoint for scraping.
  • alertmanager:
    • Receives alerts sent by prometheus, groups them by configurable rules, and controls delivery (alert frequency, inhibition rules, routing to different notification backends, silences, and so on).
    • Supports multiple notification backends, e.g. email, webhook, wechat (WeChat Work), and some commercial alerting platforms.
    • Default listening port: 9093
  • blackbox_exporter:
    • One of the official Prometheus exporters; probes targets over http, dns, tcp, and icmp.
    • Can run on the prometheus server node or on a separate node.
    • Default listening port: 9115
  • nginx:
    • prometheus and alertmanager have no built-in authentication, so nginx fronts external access and provides basic auth and https.
    • Each of the components above needs to expose its own port, so in the docker-compose deployment all containers are placed on the same network and the externally mapped entry ports are configured centrally in nginx, which simplifies management.
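The shared-network layout described above can be sketched as a docker-compose fragment. This is a hypothetical minimal sketch, not the actual deployment file; the network name `monitor`, the image tags, and the published port are assumptions:

```yaml
version: "3"
services:
  nginx:
    image: nginx:alpine
    ports:
      - "443:443"          # the only port published to the outside
    networks: [monitor]
  prometheus:
    image: prom/prometheus
    networks: [monitor]    # reachable as http://prometheus:9090 inside the network
  alertmanager:
    image: prom/alertmanager
    networks: [monitor]    # reachable as http://alertmanager:9093 inside the network
  grafana:
    image: grafana/grafana
    networks: [monitor]
networks:
  monitor: {}
```

nginx then proxies to the service names on the internal network, so only nginx needs host-published ports.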

2.prometheus-server

2.1 Official links:

  • Official documentation: https://prometheus.io/docs/introduction/overview/
  • GitHub project (downloads): https://github.com/prometheus/prometheus

2.2 Installing prometheus server

2.2.1 Download and install on Linux (CentOS 7)

  • Create a system user to run the prometheus server process, with home directory /var/lib/prometheus as the data storage directory:
~]# useradd -r -m -d /var/lib/prometheus prometheus
  • Download and install prometheus server, using 2.14.0 as the example:
 wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
 tar -xf prometheus-2.14.0.linux-amd64.tar.gz -C /usr/local/
 cd /usr/local
 ln -sv prometheus-2.14.0.linux-amd64 prometheus
  • Create a unit file so that systemd manages prometheus:
 vim /usr/lib/systemd/system/prometheus.service
 [Unit]
 Description=The Prometheus 2 monitoring system and time series database.
 Documentation=https://prometheus.io
 After=network.target
 [Service]
 EnvironmentFile=-/etc/sysconfig/prometheus
 User=prometheus
 ExecStart=/usr/local/prometheus/prometheus \
 --storage.tsdb.path=/var/lib/prometheus \
 --config.file=/usr/local/prometheus/prometheus.yml \
 --web.listen-address=0.0.0.0:9090 \
 --web.external-url= $PROM_EXTRA_ARGS
 Restart=on-failure
 StartLimitInterval=1
 RestartSec=3
 [Install]
 WantedBy=multi-user.target
  • Other runtime flags: ./prometheus --help
  • Start the service:
systemctl daemon-reload
systemctl start prometheus.service
  • Remember to open the firewall port:
iptables -I INPUT -p tcp --dport 9090 -s NETWORK/MASK -j ACCEPT
  • Access from a browser:
http://IP:PORT
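The unit file above reads optional extra flags from /etc/sysconfig/prometheus through $PROM_EXTRA_ARGS (the leading `-` in `EnvironmentFile=-...` makes the file optional). A minimal sketch of that file; the flag chosen here is only an illustration:

```shell
# /etc/sysconfig/prometheus — sourced by the unit file.
# Everything in PROM_EXTRA_ARGS is appended to the ExecStart command line.
PROM_EXTRA_ARGS="--storage.tsdb.retention=30d"
```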

2.2.2 Docker install:

  • image: prom/prometheus
  • Run command:
$ docker run --name prometheus -d -v ./prometheus:/etc/prometheus/ -v ./db/:/prometheus -p 9090:9090 prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.listen-address="0.0.0.0:9090" --storage.tsdb.path=/prometheus --web.console.libraries=/usr/share/prometheus/console_libraries --web.console.templates=/usr/share/prometheus/consoles --storage.tsdb.retention=30d

2.3 Configuring prometheus:

2.3.1 Startup flags

  • Commonly used startup flags:
--config.file=/etc/prometheus/prometheus.yml  # main configuration file
--web.listen-address="0.0.0.0:9090"           # listen address and port
--storage.tsdb.path=/prometheus               # database directory
--web.console.libraries=/usr/share/prometheus/console_libraries
--web.console.templates=/usr/share/prometheus/consoles  # console libraries and templates
--storage.tsdb.retention=60d                  # data retention period, default 15d

2.3.2 Configuration file:

  • The main Prometheus configuration file is prometheus.yml.

    It consists of the global, rule_files, scrape_configs, alerting, remote_write, and remote_read sections:

 - global: global configuration section;

 - rule_files: paths of the alerting-rule files;

 - scrape_configs:
    The collection of scrape configurations; defines the sets of monitored targets and the parameters describing how to scrape their metrics.
    Typically each scrape configuration corresponds to a single job,
    and its targets can either be listed statically (static_configs) or discovered automatically through one of the service-discovery mechanisms Prometheus supports:
  - job_name: 'nodes'
    static_configs:    # static targets; each host:port in targets is scraped at /metrics
      - targets: ['localhost:9100']
      - targets: ['172.20.94.1:9100']
  - job_name: 'docker_host'
    file_sd_configs:   # file-based service discovery; host:port entries in the files (yml or json) become scrape targets
      - files:
          - ./sd_files/docker_host.yml
        refresh_interval: 30s
  • alerting (alertmanagers):

The set of Alertmanager instances Prometheus sends alerts to, plus the parameters describing how to talk to them;

each Alertmanager can be listed statically (static_configs) or discovered automatically through one of the service-discovery mechanisms Prometheus supports.

  • remote_write:
Configures the "remote write" mechanism. Define this section when Prometheus needs to persist data to an external storage system (for example InfluxDB);
Prometheus then sends sample data over HTTP to the adaptor at the given URL.
  • remote_read:
Configures the "remote read" mechanism. Prometheus hands incoming queries to the adapter at the given URL;
the adapter translates the query into a request against the remote storage service and converts the response into a format Prometheus can use.
  • Monitoring and alerting rule files: *.yml
    • define the monitoring/alerting rules
    • take effect only when listed under rule_files: in the main configuration
 rule_files:
   - "test_rules.yml"  # path of the alerting-rule file
  • Service-discovery definition files: both yaml and json formats are supported
    • also have to be referenced in the main configuration
    file_sd_configs:
    - files:
        - ./sd_files/http.yml
      refresh_interval: 30s
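For the json flavor, an equivalent service-discovery file might look like the following sketch (the host addresses and the `env` label are illustrative, not from the actual deployment):

```json
[
  {
    "targets": ["10.10.11.179:9100", "10.10.11.178:9100"],
    "labels": { "env": "dev" }
  }
]
```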
    

2.3.3 A simple configuration example:

  • prometheus.yml example
global:
  scrape_interval: 15s      # scrape metrics every 15 seconds
  evaluation_interval: 15s  # evaluate alerting rules every 15 seconds
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]  # where alerts are pushed; normally the alertmanager address
rule_files:
  - "test_rules.yml"  # path of the alerting-rule file
scrape_configs:
  - job_name: 'node'   # a job name of your choosing
    static_configs:    # static configuration; scrape targets given directly as ip:port
    - targets: ['localhost:9100']
  - job_name: 'CDG-MS'
    honor_labels: true
    metrics_path: '/prometheus'
    static_configs:
    - targets: ['localhost:8089']
    relabel_configs:
    - target_label: env
      replacement: dev
  - job_name: 'eureka'
    file_sd_configs:   # file-based service discovery
    - files:
      - "/app/enmonster/basic/prometheus/prometheus-2.2.1.linux-amd64/eureka.json" # json and yml are both supported
      refresh_interval: 30s  # re-read the files every 30s; edits need no manual reload
    relabel_configs:
    - source_labels: [__job_name__]
      regex: (.*)
      target_label: job
      replacement: ${1}
    - target_label: env
      replacement: dev
  • Alerting-rule file example:
    [root@host40 monitor-bak]# cat prometheus/rules/docker_monitor.yml
    groups:
    - name: "container monitor"
      rules:
      - alert: "Container down: env1"
        expr: time() - container_last_seen{name="env1"} > 60
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Container down: {{$labels.instance}} name={{$labels.name}}"

  • File-based service-discovery definition files: *.yml
    [root@host40 monitor]# cat prometheus/sd_files/virtual_lan.yml
    - targets: ['10.10.11.179:9100']
    - targets: ['10.10.11.178:9100']

    [root@host40 monitor]# cat prometheus/sd_files/tcp.yml
    - targets: ['10.10.11.178:8001']
      labels:
        server_name: http_download
    - targets: ['10.10.11.178:3307']
      labels:
        server_name: xiaojing_db
    - targets: ['10.10.11.178:3001']
      labels:
        server_name: test_web

2.3.5 Other configuration

  • Much of the prometheus configuration is coupled to other components, so it is covered together with the component it belongs to.

2.4 prometheus web-gui

  • Web UI address: http://ip:port e.g. http://10.10.11.40:9090/
  • alerts: view the alerting rules
  • graph: query the collected metrics and draw simple graphs
  • status: prometheus runtime configuration and information about the monitored hosts
  • explore the web-gui for the details

3.node_exporter

3.1 Overview

  • node_exporter is installed on each monitored node; it collects host metrics and serves them over HTTP for prometheus to scrape.
  • Project and documentation: https://github.com/prometheus/node_exporter
  • The Prometheus project provides many exporters of different kinds; the list is at: https://prometheus.io/docs/instrumenting/exporters/

3.2 Installing node_exporter

3.2.1 Download and install on Linux (CentOS 7):

  • Download and unpack
    wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
    tar xf node_exporter-0.18.1.linux-amd64.tar.gz -C /usr/local/
    cd /usr/local
    ln -sv node_exporter-0.18.1.linux-amd64/ node_exporter

  • Create the user:
    useradd -r -m -d /var/lib/prometheus prometheus

  • Configure the unit file:
    vim /usr/lib/systemd/system/node_exporter.service
    [Unit]
    Description=Prometheus exporter for machine metrics, written in Go with pluggable metric collectors.
    Documentation=https://github.com/prometheus/node_exporter
    After=network.target
    [Service]
    EnvironmentFile=-/etc/sysconfig/node_exporter
    User=prometheus
    ExecStart=/usr/local/node_exporter/node_exporter \
    $NODE_EXPORTER_OPTS
    Restart=on-failure
    StartLimitInterval=1
    RestartSec=3
    [Install]
    WantedBy=multi-user.target

  • Start the service:
    systemctl daemon-reload
    systemctl start node_exporter.service

  • Verify manually that metrics are being served:
    curl http://localhost:9100/metrics

  • Open the firewall:
    iptables -I INPUT -p tcp --dport 9100 -s NET/MASK -j ACCEPT

3.2.2 Docker install

  • image: quay.io/prometheus/node-exporter, prom/node-exporter
  • Run command:
    docker run -d --net="host" --pid="host" -v "/:/host:ro,rslave" --name monitor-node-exporter --restart always quay.io/prometheus/node-exporter --path.rootfs=/host --web.listen-address=:9100

  • On some older Docker versions this fails with: Error response from daemon: linux mounts: Could not find source mount of /. Workaround: change -v "/:/host:ro,rslave" to -v "/:/host:ro"

3.3 Configuring node_exporter

  • Enabling and disabling collectors:
    ./node_exporter --help  # lists all supported collectors; enable or disable them as needed

    e.g. --no-collector.cpu stops collecting CPU metrics (--collector.<name> enables a collector, --no-collector.<name> disables it)

  • Textfile Collector:
    The startup flag --collector.textfile.directory="DIR" enables the textfile collector.
    It collects the metrics found in every *.prom file in that directory; the metrics must be in the prom exposition format.

    Examples:

    echo my_batch_job_completion_time $(date +%s) > /path/to/directory/my_batch_job.prom.$$
    mv /path/to/directory/my_batch_job.prom.$$ /path/to/directory/my_batch_job.prom
    echo 'role{role="application_server"} 1' > /path/to/directory/role.prom.$$
    mv /path/to/directory/role.prom.$$ /path/to/directory/role.prom
    rpc_duration_seconds{quantile="0.5"} 4773
    http_request_duration_seconds_bucket{le="0.5"} 129389


    In other words, when node_exporter's built-in collectors are not enough, a script can gather the extra metrics and write them to a file, and node_exporter then serves them to prometheus,
    which can save you from running a pushgateway.

  • The prom format and the query language are introduced later.
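The write-then-rename pattern shown above can be wrapped in a small script. A minimal sketch; the directory and metric name are hypothetical, so point TEXTFILE_DIR at whatever you pass to --collector.textfile.directory:

```shell
#!/bin/sh
# Write a custom gauge for the textfile collector atomically:
# write to a temp file first, then mv (rename) it into place, so
# node_exporter never scrapes a half-written file.
TEXTFILE_DIR="${TEXTFILE_DIR:-/tmp/textfile_demo}"
mkdir -p "$TEXTFILE_DIR"
printf 'demo_backup_last_run_seconds %s\n' "$(date +%s)" > "$TEXTFILE_DIR/backup.prom.$$"
mv "$TEXTFILE_DIR/backup.prom.$$" "$TEXTFILE_DIR/backup.prom"
```

Run it from cron after each batch job; node_exporter picks up the new value on its next scrape.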

3.4 Configuring prometheus to scrape node_exporter

  • Example: prometheus.yml

    scrape_configs:
    # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
    - job_name: 'prometheus'
      # metrics_path defaults to '/metrics'
      # scheme defaults to 'http'.
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'nodes'
      static_configs:
      - targets: ['localhost:9100']
      - targets: ['172.20.94.1:9100']
    - job_name: 'node_real_lan'
      file_sd_configs:
      - files:
        - ./sd_files/real_lan.yml
        refresh_interval: 30s
      params:  # optional
        collect[]:
        - cpu
        - meminfo
        - diskstats
        - netdev
        - netstat
        - filefd
        - filesystem
        - xfs

4.cadvisor

4.1 Official links:

  • https://github.com/google/cadvisor
  • image: gcr.io/google_containers/cadvisor[:v0.36.0] # requires access to Google registries
  • image: google/cadvisor:v0.33.0 # Docker Hub image; versions lag behind Google's

4.2 docker run

sudo docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=9080:8080 \
  --detach=true \
  --name=cadvisor \
  --privileged \
  --device=/dev/kmsg \
  google/cadvisor:v0.33.0

4.3 Web UI with basic single-host graphs

  • http://ip:port

4.4 Configuring prometheus to scrape it

  • Configuration example:

    - job_name: 'docker'
      static_configs:
      - targets: ['localhost:9080']

5.grafana

5.1 Official links

  • grafana downloads: https://grafana.com/grafana/download
  • grafana dashboard downloads: https://grafana.com/grafana/dashboards

5.2 Installing grafana

5.2.1 Linux (CentOS 7) install

  • Download and install
    wget https://dl.grafana.com/oss/release/grafana-7.2.2-1.x86_64.rpm
    sudo yum install grafana-7.2.2-1.x86_64.rpm

  • Prepare the service file:
    [Unit]
    Description=Grafana instance
    Documentation=http://docs.grafana.org
    Wants=network-online.target
    After=network-online.target
    After=postgresql.service mariadb.service mysqld.service

    [Service]
    EnvironmentFile=/etc/sysconfig/grafana-server
    User=grafana
    Group=grafana
    Type=notify
    Restart=on-failure
    WorkingDirectory=/usr/share/grafana
    RuntimeDirectory=grafana
    RuntimeDirectoryMode=0750
    ExecStart=/usr/sbin/grafana-server \
    --config=${CONF_FILE} \
    --pidfile=${PID_FILE_DIR}/grafana-server.pid \
    --packaging=rpm \
    cfg:default.paths.logs=${LOG_DIR} \
    cfg:default.paths.data=${DATA_DIR} \
    cfg:default.paths.plugins=${PLUGINS_DIR} \
    cfg:default.paths.provisioning=${PROVISIONING_CFG_DIR}

    LimitNOFILE=10000
    TimeoutStopSec=20

    [Install]
    WantedBy=multi-user.target

  • Start grafana
    systemctl enable grafana-server.service
    systemctl restart grafana-server.service


    Listens on port 3000 by default

  • Open the firewall:
    iptables -I INPUT -p tcp --dport 3000 -s NET/MASK -j ACCEPT
    

5.2.2 Docker install

  • image: grafana/grafana
    docker run -d --name=grafana -p 3000:3000 grafana/grafana:7.2.2

5.3 Basic grafana workflow

  • Open the web UI:
    http://ip:port


    On first login you are asked to set a username and password.
    Version 7.2 forces a password reset after the first login; the initial username and password are both admin.

  • Workflow:
    • add a data source
    • add a dashboard and configure graph panels; ready-made dashboard templates for common services can be downloaded from https://grafana.com/grafana/dashboards
    • import a template: by json, by URL, or by template ID
    • view the dashboard
  • Commonly used template IDs:
    • node-exporter: cn/8919, en/11074
    • k8s: 13105
    • docker: 12831
    • alertmanager: 9578
    • blackbox_exporter: 9965
  • Resetting the admin password:
    Check the Grafana configuration file to find the path of grafana.db.
    Configuration file path: /etc/grafana/grafana.ini
    [paths]
    ;data = /var/lib/grafana
    [database]
    # For "sqlite3" only, path relative to data_path setting
    ;path = grafana.db
    From the configuration, the full path of grafana.db is:
    /var/lib/grafana/grafana.db

    Reset the admin password with sqlite3:
    sqlite3 /var/lib/grafana/grafana.db
    sqlite> update user set password =
    '59acf18b94d7eb0694c61e60ce44c110c7a683ac6a8f09580d626f90f4a242000746579358d77dd9e570e83fa24faa88a8a6',
    salt = 'F3FAxVm33R' where login = 'admin';
    .exit

    Then log in as admin / admin

5.4 Configuring grafana alerting:

  • Configure the SMTP server and sender mailbox in grafana-server:
    vim /etc/grafana/grafana.ini
    [smtp]
    enabled = true
    host = smtp.126.com:465
    user = USER@126.com
    password = PASS
    skip_verify = false
    from_address = USER@126.com
    from_name = Grafana Alert

  • Add a Notification Channel in the grafana UI:
    Alerting -> Notification Channel
    You can "send test" before saving

  • Open a dashboard and add alert rules
  • As of grafana 7.2.2, template variables cannot be used in alert queries, which makes grafana alerting of limited practical use. In production, alertmanager is recommended instead.

6.prometheus and PromQL:

6.1 PromQL overview

  • PromQL is the query language prometheus uses against its database; it turns the metrics collected by the exporters into visualizable graph data and into alerting rules.
  • It operates on a multi-dimensional data model, with time-series data identified by metric name and key/value label pairs.
  • A flexible query language that can exploit this dimensionality.
  • No reliance on distributed storage; single server nodes are autonomous.
  • Multiple modes of graphing and dashboarding support.

6.2 Components that use PromQL:

  • prometheus server
  • client libraries for instrumenting application code
  • push gateway
  • exporters
  • alertmanager

6.3 About metrics

6.3.1 Metric types

  • gauges: a single numerical value, e.g.:
    • node_boot_time_seconds

    node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"} 1574040030

  • counters: cumulative counts.
  • histograms: the distribution of observations, e.g. maximum, minimum, median, percentiles.
  • summaries: quantiles computed over sampled observations.

6.3.2 Labels

  • node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"}

    In the example above, instance and job are labels:

    • job: the job_name defined in prometheus.yml
    • instance: host:port
  • Labels can also be defined in the configuration files, e.g.:
    - targets: ['10.10.11.178:3001']
      labels:
        server_name: test_web


    The added label can then be used when querying:

    metric{server_name=...}

6.4 PromQL expressions

  • PromQL expressions are the basic statements grafana uses to draw graphs and prometheus uses to define alerting rules, so being able to read and write PromQL matters a great deal.

6.4.1 An example first:

  • CPU usage percentage:
    (1-((sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance))/(sum(increase(node_cpu_seconds_total[1m])) by (instance)))) * 100


    The metrics involved:

    node_cpu_seconds_total               # total CPU time used
    node_cpu_seconds_total{mode="idle"}  # idle CPU time; other mode values: user, system, steal, softirq, irq, nice, iowait, idle


    The functions used:

    increase(metric[1m])  # the increase over the last 1 minute.
    sum()
    sum() by (TAG)        # TAG is a label; here instance identifies the machine. Sum per host, otherwise multiple hosts collapse into one line.
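The arithmetic behind the CPU-usage formula can be sanity-checked with fixed numbers. A hypothetical sample: over one minute a single-core host accumulates 60 s of total CPU time, 54 s of it idle:

```shell
idle_increase=54   # like increase(node_cpu_seconds_total{mode="idle"}[1m])
total_increase=60  # like increase(node_cpu_seconds_total[1m]) summed over all modes
# (1 - idle/total) * 100 — the same shape as the PromQL expression above
usage=$(awk -v i="$idle_increase" -v t="$total_increase" 'BEGIN { printf "%.0f", (1 - i/t) * 100 }')
echo "cpu usage: ${usage}%"
```

With 6 of 60 seconds non-idle, this prints a usage of 10%.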

6.4.2 Label matchers

  • Matching operators:

    =   # equal: select labels that are exactly equal to the provided string.
    !=  # not equal: select labels that are not equal to the provided string.
    =~  # regex match: select labels that regex-match the provided string.
    !~  # regex mismatch: select labels that do not regex-match the provided string.

  • Examples:

    node_cpu_seconds_total{mode="idle"}  # mode is a label that comes with the metric.
    api_http_requests_total{method="POST", handler="/messages"}

    http_requests_total{environment=~"staging|testing|development",method!="GET"}

  • Note: a query must specify a metric name or at least one label matcher that does not match the empty string:

    {job=~".*"}              # Bad!
    {job=~".+"}              # Good!
    {job=~".*",method="get"} # Good!

6.4.3 Operations

  • Time ranges:

    s - seconds
    m - minutes
    h - hours
    d - days
    w - weeks
    y - years

  • Operators:

    +  (addition)
    -  (subtraction)
    *  (multiplication)
    /  (division)
    %  (modulo)
    ^  (power/exponentiation)
    == (equal)
    != (not-equal)
    >  (greater-than)
    <  (less-than)
    >= (greater-or-equal)
    <= (less-or-equal)

  • Example alerting rules built from such expressions:

    - alert: "CPU 使用率超过40%"
      expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by(instance)* 100) > 40
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "{{$labels.instance}}:CPU 使用过高"
        description: "{{$labels.instance}}:CPU 使用率超过 40%"
        value: "{{$value}}"
    - alert: "CPU 使用率超过90%"
      expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by(instance)* 100) > 90
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.instance}}:CPU 使用率90%"
        description: "{{$labels.instance}}:CPU 使用率超过90%,持续时间超过5mins"
        value: "{{$value}}"

  • If you use Chinese in configuration files, make sure they are UTF-8 encoded, otherwise errors are reported.

7.6 Configuring alertmanager

  • Full documentation: https://prometheus.io/docs/alerting/latest/configuration/
  • Main configuration file: alertmanager.yml
  • Template files: *.tmpl
  • Only the parts needed here are covered; see the official documentation for the complete configuration.

7.6.1 alertmanager.yml

  • The main configuration file holds:
    • global: sender-mailbox settings,
    • templates: the mail template files (if unset, alertmanager's default template is used),
    • route/routes: the alert routing rules, e.g. which label matches are sent to which backend,
    • receivers: the notification backends: email, wechat, webhook, etc.

  • An example first:

    vim alertmanager.yml
    global:
      smtp_smarthost: 'xxx'
      smtp_from: 'xxx'
      smtp_auth_username: 'xxx'
      smtp_auth_password: 'xxx'
      smtp_require_tls: false
    templates:
      - '/alertmanager/template/*.tmpl'
    route:
      receiver: 'default-receiver'
      group_wait: 1s       # wait before a group's first notification
      group_interval: 1s   # interval between notifications for a group
      repeat_interval: 1s  # interval before re-sending an already-sent alert
      group_by: [cluster, alertname]
      routes:
      - receiver: test
        group_wait: 1s
        match_re:
          severity: test
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: 'xx@xx.xx'
        html: '{{ template "xx.html" . }}'
        headers: { Subject: " {{ .CommonAnnotations.summary }}" }
    - name: 'test'
      email_configs:
      - to: 'xxx@xx.xx'
        html: '{{ template "xx.html" . }}'
        headers: { Subject: " 第二路由匹配测试" }

    vim test.tmpl
    {{ define "xx.html" }}
    {{ range $i, $alert := .Alerts }}
    报警项: {{ index $alert.Labels "alertname" }} 实例: {{ index $alert.Labels "instance" }} 报警阀值: {{ index $alert.Annotations "value" }} 开始时间: {{ $alert.StartsAt }}
    {{ end }}
    {{ end }}

  • Details:

    global:

    resolve_timeout: # time after which an alert is declared resolved when it is no longer firing

      - plus the mail-related settings shown in the example


    route: # the root route every alert enters; defines the dispatch policy

    group_by: ['LABEL_NAME','alertname', 'cluster','job','instance',...]

    The labels listed here regroup incoming alerts: for example, alerts carrying labels such as cluster=A
    and alertname=LatencyHigh would all be aggregated into one group.


    group_wait: 30s

    After a new alert group is created, wait at least group_wait before the first notification; this leaves enough time for further alerts of the same group to arrive and be sent together.


    group_interval: 5m

    After the first notification for a group, wait group_interval before sending a notification for new alerts in that group.


    repeat_interval: 5m

    If a notification has already been sent successfully, wait repeat_interval before re-sending it.


    match:
      label_name: NAME

    Equality matching; matching alerts are sent to the route's receiver.


    match_re:
      label_name: <regex>, ...

    Regular-expression matching; matching alerts are sent to the route's receiver.


    receiver: receiver_name

    Sends alerts matched by match / match_re to a notification backend (email, webhook, pagerduty, wechat, ...).
    There must be a default receiver, otherwise: err="root route must specify a default receiver"


    routes:
    - ...

    Defines additional child routes.


    templates:
    [ - <filepath> ... ]

    Template files, e.g. the mail-notification page template.

      receivers:
      - <receiver> ...      # a list

      - name: receiver_name # the name referenced by route.receiver

        email_configs:      # email notifications

        - to: <tmpl_string>
          send_resolved: <boolean> | default = false  # whether to notify when the alert recovers


    Configures the mailbox that receives alerts; a per-receiver sender mailbox can also be set. See the official documentation:
    https://prometheus.io/docs/alerting/latest/configuration/#email_config

      - name: ...
        wechat_configs:
        - send_resolved: <boolean> | default = false
          api_secret: <secret> | default = global.wechat_api_secret
          api_url: <string> | default = global.wechat_api_url
          corp_id: <string> | default = global.wechat_api_corp_id
          message: <tmpl_string> | default = '{{ template "wechat.default.message" . }}'
          agent_id: <string> | default = '{{ template "wechat.default.agent_id" . }}'
          to_user: <string> | default = '{{ template "wechat.default.to_user" . }}'
          to_party: <string> | default = '{{ template "wechat.default.to_party" . }}'
          to_tag: <string> | default = '{{ template "wechat.default.to_tag" . }}'
        # Notes
            to_user: WeChat Work user ID
            to_party: ID of the group to send to

            corp_id: the unique ID of the WeChat Work account; visible under "My Company"
            agent_id: the application's ID; open the custom application under application management
            api_secret: the application's secret

            Register WeChat Work at https://work.weixin.qq.com
            Official WeChat API documentation: https://work.weixin.qq.com/api/doc#90002/90151/90854


    WeChat Work alert configuration

      inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']


    Inhibition settings

7.6.2 Configuring WeChat Work alerts

    • Register a company at https://work.weixin.qq.com. An unverified company (member limit 200) is enough; bind your personal WeChat to use the web console.

      Official WeChat API documentation: https://work.weixin.qq.com/api/doc#90002/90151/90854

    • After registering, bind your personal WeChat and scan the QR code to enter the management console.
    • Create a new application to send the alerts; the procedure is straightforward.
    • Parameters to note:
      • corp_id: the unique ID of the WeChat Work account; visible under "My Company"
      • agent_id: the application's ID; open the custom application under application management
      • api_secret: the application's secret
      • to_user: WeChat Work user ID
      • to_party: ID of the group to send to; in the contact list, click the dots next to the group name to see it
    • Configuration example:
     receivers:
     - name: 'default'
       email_configs:
       - to: 'XXX'
         send_resolved: true

       wechat_configs:
       - send_resolved: true
         corp_id: 'XXX'
         api_secret: 'XXX'
         agent_id: 1000002
         to_user: XXX
         to_party: 2
         message: '{{ template "wechat.html" . }}'

    • template:
      • alertmanager's default WeChat template is ugly and verbose, hence the custom template; the default email template is acceptable.
      • Example 1:
      cat wechat.tmpl
      {{ define "wechat.html" }}
      {{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
      [@警报~]
      实例: {{ .Labels.instance }}
      信息: {{ .Annotations.summary }}
      详情: {{ .Annotations.description }}
      值: {{ .Annotations.value }}
      时间: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{ end }}{{ end -}}
      {{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
      [@恢复~]
      实例: {{ .Labels.instance }}
      信息: {{ .Annotations.summary }}
      时间: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      恢复: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{ end }}{{ end -}}
      {{- end }}


    7.6.3 Time zones in alert templates:

    • Reference: https://blog.csdn.net/knight_zhou/article/details/106323719
    • Prometheus mail-alert templates use UTC time by default. Note that Go's reference layout is 2006-01-02 15:04:05; any other year in the layout string formats incorrectly.
      Trigger time: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
      After the fix: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
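The 28800e9 offset is 8 hours expressed in nanoseconds: alertmanager timestamps are UTC, and adding 28800 s shifts them to CST (UTC+8). A sketch of the same arithmetic with GNU date and an arbitrary example timestamp:

```shell
ts=1600000000   # example Unix timestamp
utc=$(date -u -d "@${ts}" '+%Y-%m-%d %H:%M:%S')
cst=$(date -u -d "@$((ts + 28800))" '+%Y-%m-%d %H:%M:%S')  # +8h, like .Add 28800e9
echo "UTC: ${utc}"
echo "CST: ${cst}"
```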
      

    7.7 Commonly used prometheus alerting rules:

    • A very useful page with many ready-made rules: https://awesome-prometheus-alerts.grep.to/rules

    7.7.1 Container alert: container down

    vim rules/docker_monitor.yml
    groups:
    - name: "container monitor"
      rules:
      - alert: "Container down: env1"
        expr: time() - container_last_seen{name="env1"} > 60
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Container down: {{$labels.instance}} name={{$labels.name}}"


    Note:

    This metric only detects a container going down; it cannot reliably detect recovery. Even if the container never restarts successfully, a resolve notification arrives after a while.
    

    7.7.2 Alerting rules for CPU, IO, disk usage, memory, TCP sessions, and network traffic:

    groups:
    - name: 主机状态-监控告警
      rules:
      - alert: 主机状态
        expr: up == 0
        for: 1m
        labels:
          status: 非常严重
        annotations:
          summary: "{{$labels.instance}}:服务器宕机"
          description: "{{$labels.instance}}:服务器延时超过5分钟"

      - alert: CPU使用情况
        expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 60
        for: 1m
        labels:
          status: 一般告警
        annotations:
          summary: "{{$labels.mountpoint}} CPU使用率过高!"
          description: "{{$labels.mountpoint }} CPU使用大于60%(目前使用:{{$value}}%)"

      - alert: cpu使用率过高告警  # the join adds a nodename label to the result
        expr: (100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100)) * on(instance) group_left(nodename) (node_uname_info) > 85
        for: 5m
        labels:
          region: 成都
        annotations:
          summary: "{{$labels.instance}}({{$labels.nodename}})CPU使用率过高!"
          description: '服务器{{$labels.instance}}({{$labels.nodename}})CPU使用率超过85%(目前使用:{{$value}}%)'

      - alert: 系统负载过高
        expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * on(instance) group_left(nodename) (node_uname_info) > 1.1
        for: 3m
        labels:
          region: 成都
        annotations:
          summary: "{{$labels.instance}}({{$labels.nodename}})系统负载过高!"
          description: '{{$labels.instance}}({{$labels.nodename}})当前负载超标率 {{printf "%.2f" $value}}'

      - alert: 内存不足告警
        expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) * on(instance) group_left(nodename) (node_uname_info) > 80
        for: 3m
        labels:
          region: 成都
        annotations:
          summary: "{{$labels.instance}}({{$labels.nodename}})内存使用率过高!"
          description: '服务器{{$labels.instance}}({{$labels.nodename}})内存使用率超过80%(目前使用:{{$value}}%)'

      - alert: IO操作耗时
        expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 60
        for: 1m
        labels:
          status: 严重告警
        annotations:
          summary: "{{$labels.mountpoint}} 磁盘IO使用率过高!"
          description: "{{$labels.mountpoint }} 磁盘IO大于60%(目前使用:{{$value}})"

      - alert: 网络流入
        expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 1m
        labels:
          status: 严重告警
        annotations:
          summary: "{{$labels.mountpoint}} 流入网络带宽过高!"
          description: "{{$labels.mountpoint }}流入网络带宽持续2分钟高于100M. RX带宽使用率{{$value}}"

      - alert: 网络流出
        expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 1m
        labels:
          status: 严重告警
        annotations:
          summary: "{{$labels.mountpoint}} 流出网络带宽过高!"
          description: "{{$labels.mountpoint }}流出网络带宽持续2分钟高于100M. TX带宽使用率{{$value}}"

      - alert: network in
        expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
        for: 1m
        labels:
          name: network
          severity: Critical
        annotations:
          summary: "{{$labels.mountpoint}} 流入网络带宽过高"
          description: "{{$labels.mountpoint }}流入网络异常,高于100M"
          value: "{{ $value }}"

      - alert: network out
        expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
        for: 1m
        labels:
          name: network
          severity: Critical
        annotations:
          summary: "{{$labels.mountpoint}} 发送网络带宽过高"
          description: "{{$labels.mountpoint }}发送网络异常,高于100M"
          value: "{{ $value }}"

      - alert: TCP会话
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 1m
        labels:
          status: 严重告警
        annotations:
          summary: "{{$labels.mountpoint}} TCP_ESTABLISHED过高!"
          description: "{{$labels.mountpoint }} TCP_ESTABLISHED大于1000(目前使用:{{$value}})"

      - alert: 磁盘容量
        expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 80
        for: 1m
        labels:
          status: 严重告警
        annotations:
          summary: "{{$labels.mountpoint}} 磁盘分区使用率过高!"
          description: "{{$labels.mountpoint }} 磁盘分区使用大于80%(目前使用:{{$value}}%)"

      - alert: 硬盘空间不足告警  # the join adds a nodename label to the result
        expr: (100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100)) * on(instance) group_left(nodename) (node_uname_info) > 80
        for: 3m
        labels:
          region: 成都
        annotations:
          summary: "{{$labels.instance}}({{$labels.nodename}})硬盘使用率过高!"
          description: '服务器{{$labels.instance}}({{$labels.nodename}})硬盘使用率超过80%(目前使用:{{$value}}%)'

      - alert: volume full in four days  # disk predicted to be full in 4 days
        expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
        for: 5m
        labels:
          name: disk
          severity: Critical
        annotations:
          summary: "{{$labels.mountpoint}} 预计主机可用磁盘空间4天后将写满"
          description: "{{$labels.mountpoint }}"
          value: "{{ $value }}%"

      - alert: disk write rate
        expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
        for: 1m
        labels:
          name: disk
          severity: Critical
        annotations:
          summary: "disk write rate (instance {{ $labels.instance }})"
          description: "磁盘写入速率大于50MB/s"
          value: "{{ $value }}%"

      - alert: disk read latency
        expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
        for: 1m
        labels:
          name: disk
          severity: Critical
        annotations:
          summary: "unusual disk read latency (instance {{ $labels.instance }})"
          description: "磁盘读取延迟大于100毫秒"
          value: "{{ $value }}%"

      - alert: disk write latency
        expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1
        for: 1m
        labels:
          name: disk
          severity: Critical
        annotations:
          summary: "unusual disk write latency (instance {{ $labels.instance }})"
          description: "磁盘写入延迟大于100毫秒"
          value: "{{ $value }}%"

    7.8 alertmanager management API

    GET /-/healthy
    GET /-/ready
    POST /-/reload

    • Examples:
    curl -u monitor:fosafer.com 127.0.0.1:9093/-/healthy
        OK
    curl -XPOST -u monitor:fosafer.com 127.0.0.1:9093/-/reload
    [root@host40 monitor]# curl -XPOST -u monitor:fosafer.com 127.0.0.1:9093/-/reload
    failed to reload config: yaml: unmarshal errors:
      line 26: field receiver already set in type config.plain


    Equivalent to: docker exec -it monitor-alertmanager kill -1 1, except that the API reports reload failures

    8.blackbox_exporter

    8.1 Overview

    • blackbox_exporter is one of the official Prometheus exporters; it probes targets over http, dns, tcp, and icmp.
    • Official address: https://github.com/prometheus/blackbox_exporter
    • Use cases:
      HTTP tests
        define request header fields
        check HTTP status / HTTP response headers / HTTP body content
      TCP tests
        check the port status of business components
        define and probe application-layer protocols
      ICMP tests
        host liveness probing
      POST tests
        interface connectivity
      SSL certificate expiry time

    8.2 blackbox_exporter安装

    8.2.1 linux(centos7) 二进制下载安装blackbox_exporter

    • 下载并解压
      wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.18.0/
      blackbox_exporter-0.18.0.linux-amd64.tar.gz
      tar -xf blackbox_exporter-0.18.0.linux-amd64.tar.gz -C /usr/local/
      cd /usr/local 
      ln -sv blackbox_exporter-0.18.0.linux-amd64 blackbox_exporter
      cd blackbox_exporter
      ./blackbox_exporter --version
      
    • 添加systemd服务unit:
      vim /lib/systemd/system/blackbox_exporter.service
      [Unit]
      Description=blackbox_exporter
      After=network.target
      [Service]
      User=root
      Type=simple
      ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
      Restart=on-failure
      [Install]
      WantedBy=multi-user.target
      
      systemctl daemon-reload
      systemctl enable blackbox_exporter
      systemctl start blackbox_exporter
      
    • Default listening port: 9115 

    8.2.2 Installing blackbox_exporter with docker

    • image: prom/blackbox-exporter:master
    • docker run:
      docker run --rm -d -p 9115:9115 --name blackbox_exporter -v `pwd`:/config prom/blackbox-exporter:master --config.file=/config/blackbox.yml
      

    8.3 Configuring blackbox_exporter

    • Default configuration file: blackbox.yml
    • The default configuration already covers most needs. If you need to customize it later, see the official documentation and the example configuration in the project:
      • https://github.com/prometheus/blackbox_exporter/blob/master/example.yml
      cat blackbox.yml
      modules:
        http_2xx:
          prober: http
        http_post_2xx:
          prober: http
          http:
            method: POST
        tcp_connect:
          prober: tcp
        pop3s_banner:
          prober: tcp
          tcp:
            query_response:
            - expect: "^+OK"
            tls: true
            tls_config:
              insecure_skip_verify: false
        ssh_banner:
          prober: tcp
          tcp:
            query_response:
            - expect: "^SSH-2.0-"
        irc_banner:
          prober: tcp
          tcp:
            query_response:
            - send: "NICK prober"
            - send: "USER prober prober prober :prober"
            - expect: "PING :([^ ]+)"
              send: "PONG ${1}"
            - expect: "^:[^ ]+ 001"
        icmp:
          prober: icmp

    8.4 Configuring prometheus:

    • Official reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config 
    • Reference article: https://blog.csdn.net/qq_25934401/article/details/84325356
    • Meaning of the labels involved:
      labels:
      job: the job_name
      __address__: host:port of the scrape target
      instance: defaults to __address__ unless relabeled
      __scheme__: the scheme (http or https)
      __metrics_path__: the metrics path
      __param_<name>: the first occurrence of parameter <name> in the URL
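The relabel behaviour these notes describe can be traced with a small simulation. This is a deliberately simplified sketch — only the default `replace` action, no regex matching — not Prometheus's actual implementation:

```python
def relabel(labels, relabel_configs):
    # Simplified sketch of relabel_configs: only the default "replace"
    # action, no regex; enough to trace the blackbox rewrites below.
    labels = dict(labels)
    for cfg in relabel_configs:
        joined = ";".join(labels.get(s, "") for s in cfg.get("source_labels", []))
        labels[cfg["target_label"]] = cfg.get("replacement", joined)
    return labels

# The three relabel steps used by every blackbox job in section 8.4:
configs = [
    {"source_labels": ["__address__"], "target_label": "__param_target"},
    {"source_labels": ["__param_target"], "target_label": "instance"},
    {"target_label": "__address__", "replacement": "127.0.0.1:9115"},
]
out = relabel({"__address__": "http://prometheus.io"}, configs)
# __param_target -> what blackbox probes, instance -> the label seen in alerts,
# __address__    -> where Prometheus actually scrapes (the exporter itself).
```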

    8.4.1 http/https probe example:

    scrape_configs:
      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx]  # Look for a HTTP 200 response.
        static_configs:
          - targets:
            - http://prometheus.io    # Target to probe with http.
            - https://prometheus.io   # Target to probe with https.
            - http://example.com:8080 # Target to probe with http on port 8080.
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.
    

    8.4.2 tcp probe example:

    - job_name: "blackbox_telnet_port"
      scrape_interval: 5s
      metrics_path: /probe
      params:
        module: [tcp_connect]
      static_configs:
        - targets: [ '1x3.x1.xx.xx4:443' ]
          labels:
            group: 'xxx IDC datacenter IP monitoring'
        - targets: ['10.xx.xx.xxx:443']
          labels:
            group: 'Process status of nginx(main) server'
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: 10.xxx.xx.xx:9115
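For intuition, what the `tcp_connect` module boils down to can be sketched in a few lines. `tcp_probe` is a hypothetical helper mirroring the 0/1 convention of `probe_success`, not part of blackbox_exporter:

```python
import socket

def tcp_probe(target, timeout=3.0):
    # Return 1 when a TCP connection to "host:port" succeeds within the
    # timeout, else 0 -- the same 0/1 convention as probe_success.
    host, _, port = target.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return 1
    except OSError:
        return 0

# tcp_probe("10.10.11.178:3307") -> 1 while the service is reachable
```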
    

    8.4.3 icmp probe example:

    - job_name: 'blackbox00_ping_idc_ip'
      scrape_interval: 10s
      metrics_path: /probe
      params:
        module: [icmp]  # ping
      static_configs:
        - targets: [ '1x.xx.xx.xx' ]
          labels:
            group: 'xx nginx virtual IP'
      relabel_configs:
        - source_labels: [__address__]
          regex: (.*)(:80)?
          target_label: __param_target
          replacement: ${1}
        - source_labels: [__param_target]
          regex: (.*)
          target_label: ping
          replacement: ${1}
        - source_labels: []
          regex: .*
          target_label: __address__
          replacement: 1x.xxx.xx.xx:9115
    

    8.4.4 POST probe example:

    - job_name: 'blackbox_http_2xx_post'
      scrape_interval: 10s
      metrics_path: /probe
      params:
        module: [http_post_2xx_query]
      static_configs:
        - targets:
          - https://xx.xxx.com/api/xx/xx/fund/query.action
          labels:
            group: 'Interface monitoring'
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: 1x.xx.xx.xx:9115  # The blackbox exporter's real hostname:port.
    

    8.4.5 SSL certificate expiry monitoring:

    cat << 'EOF' > prometheus.yml
    rule_files:
      - ssl_expiry.rules
    scrape_configs:
      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx]  # Look for a HTTP 200 response.
        static_configs:
          - targets:
            - example.com  # Target to probe
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115  # Blackbox exporter.
    EOF 
    cat << 'EOF' > ssl_expiry.rules 
    groups: 
      - name: ssl_expiry.rules 
        rules: 
        - alert: SSLCertExpiringSoon 
          expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30 
          for: 10m
    EOF
    

    8.5 Inspecting a probe:

    • For example (quote the URL so the shell does not interpret the & characters):
      curl 'http://172.16.10.65:9115/probe?target=prometheus.io&module=http_2xx&debug=true'
      

    8.6 Adding alerts:

    • Whether an icmp, tcp, http, or post probe succeeds can be read from the probe_success metric:
      probe_success == 0 ## connectivity down
      probe_success == 1 ## connectivity normal
      
    • Alerting checks the same metric: when it equals 0, the alert fires:
      [sss@prometheus01 prometheus]$ cat rules/blackbox-alert.rules
      groups:
      - name: blackbox_network_stats
        rules:
        - alert: blackbox_network_stats
          expr: probe_success == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Instance {{ $labels.instance }} is down"
            description: "This requires immediate action!"
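The interplay of `expr` and `for: 1m` can be illustrated with a toy evaluator. `alert_firing` is a hypothetical sketch of the hold-duration logic, not Prometheus code:

```python
def alert_firing(samples, hold_seconds=60):
    # samples: (timestamp, probe_success) pairs in time order. The alert
    # fires only once the value has stayed 0 for hold_seconds straight,
    # which is what "for: 1m" adds on top of "expr: probe_success == 0".
    down_since = None
    firing = False
    for ts, value in samples:
        if value == 0:
            if down_since is None:
                down_since = ts
            if ts - down_since >= hold_seconds:
                firing = True
        else:
            down_since = None
            firing = False
    return firing

# A 30-second blip does not page anyone; a full minute of failures does:
blip = [(0, 1), (15, 0), (30, 0), (45, 1), (60, 1)]
outage = [(0, 1), (15, 0), (30, 0), (45, 0), (60, 0), (75, 0)]
```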

    9. Deploying the complete prometheus monitoring system with docker-compose

    • Deployment host: 10.10.11.40

    9.1 Deployed components:

     prometheus
     alertmanager
     grafana
     nginx
     node_exporter
     cadvisor
     blackbox_exporter
    
    • image:
     prom/prometheus
     prom/alertmanager
     quay.io/prometheus/node-exporter, prom/node-exporter
     gcr.io/google_containers/cadvisor[:v0.36.0]  # requires access to google
     google/cadvisor:v0.33.0 # docker hub image, older version than the gcr.io one
     grafana/grafana
     nginx
    
    • After pulling the images, re-tag them and push them to the local harbor registry:
      image: 10.10.11.40:80/base/nginx:1.19.3
      image: 10.10.11.40:80/base/prometheus:2.22.0
      image: 10.10.11.40:80/base/grafana:7.2.2
      image: 10.10.11.40:80/base/alertmanager:0.21.0
      image: 10.10.11.40:80/base/node_exporter:1.0.1
      image: 10.10.11.40:80/base/cadvisor:v0.33.0
      image: 10.10.11.40:80/base/blackbox-exporter:0.18.0
      
      

    9.2 Deployment layout

    • Directory structure overview:
      mkdir /home/deploy/monitor
      cd /home/deploy/monitor
      
      [root@host40 monitor]# tree
      .
      ├── alertmanager
      │   ├── alertmanager.yml
      │   ├── db
      │   │   ├── nflog
      │   │   └── silences
      │   └── templates
      │    └── wechat.tmpl
      ├── blackbox_exporter
      │   └── blackbox.yml
      ├── docker-compose.yml
      ├── grafana
      │   └── db
      │    ├── grafana.db
      │    ├── plugins
          ...
      ├── nginx
      │   ├── auth
      │   └── nginx.conf
      ├── node-exporter
      │   └── textfiles
      ├── node_exporter_install_docker.sh
      ├── prometheus
      │   ├── db
      │   ├── prometheus.yml
      │   ├── rules
      │   │   ├── docker_monitor.yml
      │   │   ├── system_monitor.yml
      │   │   └── tcp_monitor.yml
      │   └── sd_files
      │    ├── docker_host.yml
      │    ├── http.yml
      │    ├── icmp.yml
      │    ├── real_lan.yml
      │    ├── real_wan.yml
      │    ├── sedFDm5Rw
      │    ├── tcp.yml
      │    ├── virtual_lan.yml
      │    └── virtual_wan.yml
      └── sd_controler.sh
      
    • File required for nginx basic auth:
      [root@host40 monitor-bak]# ls nginx/auth/ -a
      .  ..  .htpasswd
      
    • Permissions for some of the mounted directories:
      The db directories of prometheus, grafana, and alertmanager need mode 777.
      Individually mounted config files (alertmanager.yml, prometheus.yml, nginx.conf) need mode 666.
      For better security, put the config files in a dedicated directory, mount that directory, and point the startup parameters in command at the config file.
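The permission prep above can be scripted. A minimal sketch, assuming the directory layout from section 9.2 (`prep_mounts` is a hypothetical helper, not part of the deployment):

```python
import os

def prep_mounts(base):
    # 777 for the db directories so the container users can write to them.
    for d in ("prometheus/db", "grafana/db", "alertmanager/db"):
        path = os.path.join(base, d)
        os.makedirs(path, exist_ok=True)
        os.chmod(path, 0o777)
    # 666 for the single-file config mounts.
    for f in ("prometheus/prometheus.yml", "alertmanager/alertmanager.yml",
              "nginx/nginx.conf"):
        path = os.path.join(base, f)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "a").close()  # create the file if it does not exist yet
        os.chmod(path, 0o666)
```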
      

    9.3 docker-compose.yml

    [root@host40 monitor-bak]# cat docker-compose.yml 
    version: "3"
    services:

      nginx:
        image: 10.10.11.40:80/base/nginx:1.19.3
        hostname: nginx
        container_name: monitor-nginx
        restart: always
        privileged: false
        ports:
          - 3001:3000
          - 9090:9090
          - 9093:9093
        volumes:
          - ./nginx/nginx.conf:/etc/nginx/nginx.conf
          - ./nginx/auth:/etc/nginx/basic_auth
        networks:
          monitor:
            aliases:
              - nginx
        logging:
          driver: json-file
          options:
            max-file: '5'
            max-size: 50m

      prometheus:
        image: 10.10.11.40:80/base/prometheus:2.22.0
        container_name: monitor-prometheus
        hostname: prometheus
        restart: always
        privileged: true
        volumes:
          - ./prometheus/db/:/prometheus/
          - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
          - ./prometheus/rules/:/etc/prometheus/rules/
          - ./prometheus/sd_files/:/etc/prometheus/sd_files/
        command: 
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--web.console.libraries=/usr/share/prometheus/console_libraries'
          - '--web.console.templates=/usr/share/prometheus/consoles'
          - '--storage.tsdb.retention=60d'
        networks:
          monitor:
            aliases:
              - prometheus
        logging:
          driver: json-file
          options:
            max-file: '5'
            max-size: 50m

      grafana:
        image: 10.10.11.40:80/base/grafana:7.2.2
        container_name: monitor-grafana
        hostname: grafana
        restart: always
        privileged: true
        volumes:
          - ./grafana/db/:/var/lib/grafana 
        networks:
          monitor:
            aliases:
              - grafana
        logging:
          driver: json-file
          options:
            max-file: '5'
            max-size: 50m

      alertmanager:
        image: 10.10.11.40:80/base/alertmanager:0.21.0
        container_name: monitor-alertmanager
        hostname: alertmanager
        restart: always
        privileged: true
        volumes:
          - ./alertmanager/db/:/alertmanager
          - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
          - ./alertmanager/templates/:/etc/alertmanager/templates
        networks:
          monitor:
            aliases:
              - alertmanager
        logging:
          driver: json-file
          options:
            max-file: '5'
            max-size: 50m

      node-exporter:
        image: 10.10.11.40:80/base/node_exporter:1.0.1
        container_name: monitor-node-exporter
        hostname: host40
        restart: always
        privileged: true
        volumes:
          - /:/host:ro,rslave
          - ./node-exporter/textfiles/:/textfiles
        network_mode: "host"
        command: 
          - '--path.rootfs=/host'
          - '--web.listen-address=:9100'
          - '--collector.textfile.directory=/textfiles' 
        logging:
          driver: json-file
          options:
            max-file: '5'
            max-size: 50m

      cadvisor:
        image: 10.10.11.40:80/base/cadvisor:v0.33.0
        container_name: monitor-cadvisor
        hostname: cadvisor
        restart: always
        privileged: true
        volumes:
          - /:/rootfs:ro
          - /var/run:/var/run:ro
          - /sys:/sys:ro
          - /var/lib/docker/:/var/lib/docker:ro
          - /dev/disk/:/dev/disk:ro
        ports:
          - 9080:8080
        networks: 
          monitor:
        logging:
          driver: json-file
          options:
            max-file: '5'
            max-size: 50m

      blackbox_exporter:
        image: 10.10.11.40:80/base/blackbox-exporter:0.18.0
        container_name: monitor-blackbox
        hostname: blackbox-exporter
        restart: always
        privileged: true
        volumes:
          - ./blackbox_exporter/:/etc/blackbox_exporter
        networks:
          monitor:
            aliases:
              - blackbox
        command:
          - '--config.file=/etc/blackbox_exporter/blackbox.yml'
        logging:
          driver: json-file
          options:
            max-file: '5'
            max-size: 50m

    networks:
      monitor:
        ipam:
          config:
            - subnet: 192.168.17.0/24
    

    9.4 nginx

    • Since prometheus and alertmanager have no authentication of their own, nginx sits in front to handle routing and basic auth, proxying the backend listening ports in one place for easier management. 
    • Default ports of each program:
      prometheus: 9090
      grafana: 3000
      alertmanager: 9093
      node_exporter: 9100
      cadvisor: 8080 (agent side)
    
    • Generate the basic-auth file used with the stock nginx image:
      echo monitor:`openssl passwd -crypt 123456` > .htpasswd
    
    • A config file mounted as a single file does not update inside the container when it is replaced on the host (alternatively, mount the containing directory instead of the file itself):
      chmod 666 nginx.conf   
      
    • Reload the configuration inside the nginx container:
      docker exec -it monitor-nginx nginx -s reload
    • nginx.conf
      [root@host40 monitor-bak]# cat nginx/nginx.conf
      user nginx;
      worker_processes auto;
      error_log /var/log/nginx/error.log;
      pid /run/nginx.pid;
      include /usr/share/nginx/modules/*.conf;
      events {
          worker_connections 10240;
      }
      http {
          log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for"';
          access_log /var/log/nginx/access.log main;
          sendfile on;
          tcp_nopush on;
          tcp_nodelay on;
          keepalive_timeout 65;
          types_hash_max_size 2048;
          include /etc/nginx/mime.types;
          default_type application/octet-stream;

          proxy_connect_timeout 500ms;
          proxy_send_timeout 1000ms;
          proxy_read_timeout 3000ms;
          proxy_buffers 64 8k;
          proxy_busy_buffers_size 128k;
          proxy_temp_file_write_size 64k;
          proxy_redirect off;
          proxy_next_upstream error invalid_header timeout http_502 http_504;
          proxy_http_version 1.1;
          proxy_set_header X-Real-IP $remote_addr;
          proxy_set_header X-Real-Port $remote_port;
          proxy_set_header Host $http_host;
          proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
          client_max_body_size 10m;
          client_body_buffer_size 512k;
          client_body_timeout 180;
          client_header_timeout 10;
          send_timeout 240;
          gzip on;
          gzip_min_length 1k;
          gzip_buffers 4 16k;
          gzip_comp_level 2;
          gzip_types application/javascript application/x-javascript text/css text/javascript image/jpeg image/gif image/png;
          gzip_vary off;
          gzip_disable "MSIE [1-6].";

          server {
              listen 3000;
              server_name _;

              location / {
                  proxy_pass http://grafana:3000;
              }
          }

          server {
              listen 9090;
              server_name _;

              location / {
                  auth_basic "auth for monitor";
                  auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
                  proxy_pass http://prometheus:9090;
              }
          }

          server {
              listen 9093;
              server_name _;

              location / {
                  auth_basic "auth for monitor";
                  auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
                  proxy_pass http://alertmanager:9093;
              }
          }
      }

      9.5 prometheus

      • Note: the db directory must be writable; give it mode 777

      9.5.1 Main configuration file: prometheus.yml

      [root@host40 monitor-bak]# cat prometheus/prometheus.yml 
      # my global config
      global:
        scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
        evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
        # scrape_timeout is set to the global default (10s).
      # Alertmanager configuration
      alerting:
        alertmanagers:
          - static_configs:
              - targets: ["alertmanager:9093"]
      # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
      rule_files:
        - "rules/*.yml"
      # A scrape configuration containing exactly one endpoint to scrape:
      # Here it's Prometheus itself.
      scrape_configs:
        # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
        - job_name: 'prometheus'
          static_configs:
            - targets: ['localhost:9090']

        - job_name: 'alertmanager'
          static_configs:
            - targets: ['alertmanager:9093']

        - job_name: 'node_real_lan'
          file_sd_configs:
            - files: 
                - ./sd_files/real_lan.yml
              refresh_interval: 30s

        - job_name: 'node_virtual_lan'
          file_sd_configs:
            - files:
                - ./sd_files/virtual_lan.yml
              refresh_interval: 30s

        - job_name: 'node_real_wan'
          file_sd_configs:
            - files:
                - ./sd_files/real_wan.yml
              refresh_interval: 30s

        - job_name: 'node_virtual_wan'
          file_sd_configs:
            - files:
                - ./sd_files/virtual_wan.yml
              refresh_interval: 30s

        - job_name: 'docker_host'
          file_sd_configs:
            - files:
                - ./sd_files/docker_host.yml
              refresh_interval: 30s

        - job_name: 'tcp'
          metrics_path: /probe
          params:
            module: [tcp_connect]
          file_sd_configs:
            - files:
                - ./sd_files/tcp.yml
              refresh_interval: 30s
          relabel_configs:
            - source_labels: [__address__]
              target_label: __param_target
            - source_labels: [__param_target]
              target_label: instance
            - target_label: __address__
              replacement: blackbox:9115 

        - job_name: 'http'
          metrics_path: /probe
          params:
            module: [http_2xx]
          file_sd_configs:
            - files:
                - ./sd_files/http.yml
              refresh_interval: 30s
          relabel_configs:
            - source_labels: [__address__]
              target_label: __param_target
            - source_labels: [__param_target]
              target_label: instance
            - target_label: __address__
              replacement: blackbox:9115 

        - job_name: 'icmp'
          metrics_path: /probe
          params:
            module: [icmp]
          file_sd_configs:
            - files:
                - ./sd_files/icmp.yml
              refresh_interval: 30s
          relabel_configs:
            - source_labels: [__address__]
              target_label: __param_target
            - source_labels: [__param_target]
              target_label: instance
            - target_label: __address__
              replacement: blackbox:9115 
      

      9.5.2 All nodes use file-based service discovery:

      • Write the targets that need monitoring into the sd file of the corresponding job. Example:
      ls prometheus/sd_files/
      docker_host.yml  http.yml  icmp.yml  real_lan.yml  real_wan.yml  sedFDm5Rw  tcp.yml  virtual_lan.yml  virtual_wan.yml
      
       cat prometheus/sd_files/docker_host.yml
       - targets: ['10.10.11.178:9080']
       - targets: ['10.10.11.99:9080']
       - targets: ['10.10.11.40:9080']
       - targets: ['10.10.11.35:9080']
       - targets: ['10.10.11.45:9080']
       - targets: ['10.10.11.46:9080']
       - targets: ['10.10.11.48:9080']
       - targets: ['10.10.11.47:9080']
       - targets: ['10.10.11.65:9081']
       - targets: ['10.10.11.61:9080']
       - targets: ['10.10.11.66:9080']
       - targets: ['10.10.11.68:9080']
       - targets: ['10.10.11.98:9080']
       - targets: ['10.10.11.75:9080']
       - targets: ['10.10.11.97:9080']
       - targets: ['10.10.11.179:9080']
      
       cat prometheus/sd_files/tcp.yml
       - targets: ['10.10.11.178:8001']
         labels:
           server_name: http_download
       - targets: ['10.10.11.178:3307']
         labels:
           server_name: xiaojing_db
       - targets: ['10.10.11.178:3001']
         labels:
           server_name: test_web
      

      9.5.3 rules files:

      • docker rules:
       cat prometheus/rules/docker_monitor.yml 
       groups:
       - name: "container monitor"
         rules:
         - alert: "Container down: env1"
           expr: time() - container_last_seen{name="env1"} > 60
           for: 30s
           labels:
             severity: critical
           annotations:
             summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
      
      • tcp rules:
       cat prometheus/rules/tcp_monitor.yml 
       groups:
       - name: blackbox_network_stats
         rules:
         - alert: blackbox_network_stats
           expr: probe_success == 0
           for: 1m
           labels:
             severity: critical
           annotations:
             summary: "Instance {{ $labels.instance }} ,server-name: {{ $labels.server_name }} is down"
             description: "Connection is down..."
      
      • system rules: # cpu, mem, disk, network, filesystem...
      
      

      cat prometheus/rules/system_monitor.yml
      groups:
      - name: "system info"
        rules:
        - alert: "Server down"
          expr: up == 0
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "{{$labels.instance}}: server down"
            description: "{{$labels.instance}}: server unreachable for more than 3 minutes"
        - alert: "System load too high"
          expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))* on(instance) group_left(nodename) (node_uname_info) > 1.1
          for: 3m
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: system load too high"
            description: "{{$labels.instance}}: system load too high."
            value: "{{$value}}"
        - alert: "CPU usage over 90%"
          expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 90
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "{{$labels.instance}}: CPU usage 90%"
            description: "{{$labels.instance}}: CPU usage over 90%."
            value: "{{$value}}"
        - alert: "Memory usage over 80%"
          expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)* on(instance) group_left(nodename) (node_uname_info) > 80
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "{{$labels.instance}}: memory usage 80%"
            description: "{{$labels.instance}}: memory usage over 80%"
            value: "{{$value}}"

        - alert: "IO time over 60%"
          expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) > 85
          for: 3m
          labels:
            severity: longtime
          annotations:
            summary: "{{$labels.instance}}: disk partition usage over 85%"
            description: "{{$labels.instance}}: disk partition usage over 85%"
            value: "{{$value}}" 
        - alert: "Disk will be full within 4 days"
          expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
          for: 3m
          labels:
            severity: longtime
          annotations:
            summary: "{{$labels.instance}}: a disk partition is expected to fill up within 4 days"
            description: "{{$labels.instance}}: a disk partition is expected to fill up within 4 days"
            value: "{{$value}}"
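PromQL's `predict_linear()` used in the last rule is a least-squares extrapolation over the range window. A simplified sketch (it extrapolates from the last sample, whereas real Prometheus extrapolates from the evaluation timestamp):

```python
def predict_linear(samples, seconds_ahead):
    # Least-squares fit over (timestamp, value) samples, extrapolated
    # seconds_ahead past the last sample; the alert fires if the
    # predicted free bytes go below zero.
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept

# Free bytes dropping 10 GB/hour: far below zero four days out -> alert.
samples = [(0, 100e9), (3600, 90e9), (7200, 80e9)]
```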

        9.6 alertmanager:

        • Note: the db directory must be writable.
        • Main configuration file:

        cat alertmanager/alertmanager.yml
        global:
          resolve_timeout: 5m
          smtp_smarthost: 'smtphz.qiye.163.com:25'
          smtp_from: 'XXX@fosafer.com'
          smtp_auth_username: 'XXX@fosafer.com'
          smtp_auth_password: 'XXX'
          smtp_hello: 'qiye.163.com'
          smtp_require_tls: true
        route:
          group_by: ['instance']
          group_wait: 30s
          receiver: default
          routes:
          - group_interval: 3m
            repeat_interval: 10m
            match:
              severity: warning
            receiver: 'default'
          - group_interval: 3m
            repeat_interval: 30m
            match:
              severity: critical
            receiver: 'default' 
          - group_interval: 5m
            repeat_interval: 24h
            match:
              severity: longtime
            receiver: 'default'
        templates:
        - ./templates/*.tmpl
        receivers:
        - name: 'default'
          email_configs:
          - to: 'xiangkaihua@fosafer.com'
            send_resolved: true
          wechat_configs:
          - send_resolved: true
            corp_id: 'XXX'
            api_secret: 'XXX'
            agent_id: 1000002
            to_user: XXX
            to_party: 2
            message: '{{ template "wechat.html" . }}'
        - name: 'critical'
          email_configs: 
          - to: '342382676@qq.com'
            send_resolved: true
          - to: 'xiangkaihua@fosafer.com'
            send_resolved: true 

      • Alert template file:
        cat alertmanager/templates/wechat.tmpl 
        {{ define "wechat.html" }}
        {{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
        [@Alert~]
        Instance: {{ .Labels.instance }}
        Summary: {{ .Annotations.summary }}
        Details: {{ .Annotations.description }}
        Value: {{ .Annotations.value }}
        Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        {{ end }}{{ end -}}
        {{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
        [@Resolved~]
        Instance: {{ .Labels.instance }}
        Summary: {{ .Annotations.summary }}
        Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        Resolved: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        {{ end }}{{ end -}}
        {{- end }}     
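The `.Add 28800e9` in the template adds 28800e9 nanoseconds (8 hours) to the UTC timestamps Alertmanager stores, converting them to CST (UTC+8) before formatting. The same conversion in Python:

```python
from datetime import datetime, timedelta, timezone

def beijing_time(starts_at):
    # 28800e9 ns = 28800 s = 8 h: shift the stored UTC time to UTC+8 (CST),
    # matching the template's "2006-01-02 15:04:05" layout.
    return (starts_at + timedelta(seconds=28800)).strftime("%Y-%m-%d %H:%M:%S")

# Example:
ts = datetime(2020, 11, 1, 4, 30, 0, tzinfo=timezone.utc)
print(beijing_time(ts))  # 2020-11-01 12:30:00
```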
        

      9.7 grafana

      • Only the volume needs to be mounted; the config file needs no changes. The db directory stays small and preserves the settings and dashboards.

      10. Client deployment

      10.1 Monitored host without docker: install node_exporter standalone

      • Install script:
        http://10.10.11.178:8001/node_exporter_install.sh
        

      10.2 Monitored host running docker: install node_exporter and cadvisor via docker

      • Install script:
        http://10.10.11.178:8001/node_exporter_install_docker.sh
        
      • Required images: docker hosts that have not added the 10.10.11.40:80 registry can download the saved images, load them first, then install:
        http://10.10.11.178:8001/monitor-client.tgz
        
        

      11. prometheus usage and maintenance

      11.1 Adding and removing monitored nodes via script

      • All jobs use file-based service discovery, so a target only needs to be written into the corresponding sd_file; no configuration reload is required. 
      • On top of that, a text-processing script serves as a front end for the sd_files, adding and removing targets from the command line instead of editing the files by hand.
      • Script name: sd_controler.sh
      • Usage: run ./sd_controler.sh with no arguments to see the usage text.
      • Full script:

      [root@host40 monitor]# cat sd_controler.sh
      #!/bin/bash
      # version: 1.0
      # Description: add | del | show instance from|to prometheus file_sd_files.
      # rl | vl | dk | rw | vw | tcp | http | icmp : short for job name, each one means a sd_file.
      # tcp | http | icmp ( because with ports for service ) add with label (server_name by default) to easy read in alert emails.
      # each time can only add|del for one instance.
      # Notes: adds, deletes, and shows entries (e.g. IP:PORT pairs) in prometheus file-based service discovery.
      # rl | vl | dk | rw | vw | tcp | http | icmp abbreviate the prometheus job names; each maps to one job and operates on that job's sd_file.
      # For tcp | http | icmp, since a port number alone rarely tells you which service just died, adding a target requires
      # a server_name label, so whoever receives the alert email knows at once which service went down.
      # Only one record can be added or deleted per invocation; for batch changes, edit the file with vim or wrap this script in a for loop.

      ### vars
      SD_DIR=./prometheus/sd_files
      DOCKER_SD=$SD_DIR/docker_host.yml
      RL_HOST_SD=$SD_DIR/real_lan.yml
      VL_HOST_SD=$SD_DIR/virtual_lan.yml
      RW_HOST_SD=$SD_DIR/real_wan.yml
      VW_HOST_SD=$SD_DIR/virtual_wan.yml

      TCP_SD=$SD_DIR/tcp.yml
      HTTP_SD=$SD_DIR/http.yml
      ICMP_SD=$SD_DIR/icmp.yml

      SDFILE=

      ### funcs
      usage(){
      echo -e "Usage: $0 [ IP:PORT | FQDN ] [ server-name ]"
      echo -e " example: \n\t node add:\t $0 rl add | del 10.10.10.10:9100\n\t tcp,http,icmp add:\t $0 tcp add 10.10.10.10:3306 web-mysql\n\t del:\t $0 http del www.baidu.com\n\t show:\t $0 rl | vl | dk | rw | vw | tcp | http | icmp show."
      exit
      }

      add(){
      # $1: SDFILE, $2: IP:PORT
      grep -q $2 $1 || echo -e "- targets: ['$2']" >> $1
      }

      del(){
      # $1: SDFILE, $2: IP:PORT
      sed -i '/'$2'/d' $1
      }

      add_with_label(){
      # $1: SDFILE, $2: [IP:[PORT]|FQDN] $3: SERVER-NAME
      LABEL_01="server_name"
      if ! grep -q "'$2'" $1;then
      echo -e "- targets: ['$2']" >> $1
      echo -e "  labels:" >> $1
      echo -e "    ${LABEL_01}: $3" >> $1
      fi
      }

      del_with_label(){
      # $1: SDFILE, $2: [IP:[PORT]|FQDN]
      NUM=`cat -n $SDFILE |grep "'$2'"|awk '{print $1}'`
      let ENDNUM=NUM+2
      sed -i $NUM,${ENDNUM}d $1
      }

      action(){
      if [ "$1" == "add" ];then
      add $SDFILE $2
      elif [ "$1" == "del" ];then
      del $SDFILE $2
      elif [ "$1" == "show" ];then
      cat $SDFILE
      fi
      }

      action_with_label(){
      if [ "$1" == "add" ];then
      add_with_label $SDFILE $2 $3
      elif [ "$1" == "del" ];then
      del_with_label $SDFILE $2 $3
      elif [ "$1" == "show" ];then
      cat $SDFILE
      fi
      }

      ### main code
      [ "$2" == "" ] || [[ ! "$2" =~ ^(add|del|show)$ ]] && usage

      curl --version &>/dev/null || { echo -e "no curl found. " && exit 15; }

      if [[ $1 =~ ^(rl|vl|rw|vw|dk)$ ]] && [ "$2" == "add" ];then
      [ "$3" == "" ] && usage

      if [ "$4" != "-f" ];then
      COOD=`curl -IL -o /dev/null --retry 3 --connect-timeout 3 -s -w "%{http_code}" http://$3/metrics`
      [ "$COOD" != "200" ] && echo -e "http://$3/metrics is not reachable. check it again, or use -f to ignore the check." && exit 11
      fi
      fi

      if [[ $1 =~ ^(tcp|http|icmp)$ ]] && [ "$2" == "add" ];then
      [ "$4" == "" ] && echo -e "A server-name is required when adding tcp, http, or icmp targets." && usage
      fi

      case $1 in
      rl)
      SDFILE=$RL_HOST_SD
      action $2 $3 && echo $2 OK
      ;;
      vl)
      SDFILE=$VL_HOST_SD
      action $2 $3 && echo $2 OK
      ;;
      dk)
      SDFILE=$DOCKER_SD
      action $2 $3 && echo $2 OK
      ;;
      rw)
      SDFILE=$RW_HOST_SD
      action $2 $3 && echo $2 OK
      ;;
      vw)
      SDFILE=$VW_HOST_SD
      action $2 $3 && echo $2 OK
      ;;
      tcp)
      SDFILE=$TCP_SD
      action_with_label $2 $3 $4 && echo $2 OK
      ;;
      http)
      SDFILE=$HTTP_SD
      action_with_label $2 $3 $4 && echo $2 OK
      ;;
      icmp)
      SDFILE=$ICMP_SD
      action_with_label $2 $3 $4 && echo $2 OK
      ;;
      *)
      usage
      ;;
      esac
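The script's add/del logic can also be expressed in Python. `add_target`/`del_target` below are illustrative equivalents of `add_with_label()`/`del_with_label()`, not a drop-in replacement:

```python
def add_target(sd_file, target, server_name=None):
    # Append a file_sd entry, with an optional server_name label,
    # skipping duplicates (like add()/add_with_label() above).
    try:
        with open(sd_file) as f:
            if "'%s'" % target in f.read():
                return False
    except FileNotFoundError:
        pass
    with open(sd_file, "a") as f:
        f.write("- targets: ['%s']\n" % target)
        if server_name:
            f.write("  labels:\n    server_name: %s\n" % server_name)
    return True

def del_target(sd_file, target):
    # Drop the target entry plus any indented label lines that follow it
    # (like del()/del_with_label() above).
    with open(sd_file) as f:
        lines = f.readlines()
    out, skip = [], False
    for line in lines:
        if "'%s'" % target in line:
            skip = True
            continue
        if skip and line.startswith(" "):
            continue
        skip = False
        out.append(line)
    with open(sd_file, "w") as f:
        f.writelines(out)
```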


Author: 3182235786a
