Custom Monitoring with Prometheus

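Background:

The goal was to monitor the CPU and memory usage of Tomcat. I originally planned to use Zabbix auto-discovery, but that meant writing a template, writing scripts, and pushing the discovery scripts out with an automation tool, and I was also worried that performance would not be great. So I decided to switch to a different monitoring tool and ended up choosing Prometheus.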

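Implementation:

1. The first step is to install Prometheus. To keep the setup portable and simple and to avoid polluting the host environment, I install everything that can run in Docker with Docker (the same goes for the other tools below). Installing with Docker is also convenient and low-maintenance. Here is the docker-compose file for Prometheus. If you still need to install docker-compose, see the docker-compose installation guide.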


version: '2'
services:
  prometheus:
    image: prom/prometheus:v2.0.0
    ports:
      - "9090:9090"
    volumes:
      - /data/compose_data/prometheus/prometheus:/prometheus   # Prometheus data directory
      - /data/compose_data/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml   # Prometheus configuration file
      - /data/compose_data/prometheus/first_rules.yml:/etc/prometheus/first_rules.yml   # alerting rules file
    command: --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --web.console.libraries=/usr/share/prometheus/console_libraries --web.console.templates=/usr/share/prometheus/consoles --web.external-url=http://{ip_or_domain}:9090    # overrides the startup arguments; web.external-url is set to the Prometheus address so that links in alert emails open the Prometheus web UI directly

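Note that my configuration files were already written, so I mount them directly through volumes. If you are creating the containers for the first time, start a container without the mounts first, copy the default configuration files out of it, edit them, and then mount them. The contents of the configuration files are listed below.
prometheus.yml: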

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - altermanager:9093   # address of Alertmanager; installing Alertmanager is covered later
rule_files:
  - "first_rules.yml"   # alerting rules file
  # - "second_rules.yml"
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']   # the built-in default job: Prometheus scraping its own metrics
#  - job_name: 'localhost'
#   static_configs:
#      - targets: ['192.168.98.73:9101']   # monitors the state of a machine; requires [node_exporter](https://github.com/prometheus/node_exporter) running on that node, see its page if you need host monitoring
#       labels:
#         instance: localhost
  - job_name: "uat-apps-status"      # name of the custom monitoring job
    static_configs:
      - targets: ['192.168.98.73:9091']   # points at Pushgateway; every machine pushes its data there, hence this setup
        labels:
          instance: uat    # an added label; labels can be customized
    scrape_interval: 60s
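
The alert rule in first_rules.yml matches on `exported_instance` and `exported_job`, which are not written anywhere in this configuration. They appear because the pushed samples already carry `instance` and `job` labels, and with the default `honor_labels: false` Prometheus keeps the target's labels and renames the conflicting scraped ones. The Python sketch below only imitates that renaming to make it visible; `merge_labels` is a hypothetical helper written for illustration, not part of Prometheus.

```python
def merge_labels(scraped, target, honor_labels=False):
    """Combine the labels of a scraped sample with the labels of its target.

    Imitates the conflict handling described above: on a name clash the
    scraped label is kept under an "exported_" prefix unless honor_labels
    is set, in which case the scraped value wins.
    """
    merged = dict(scraped)
    for name, value in target.items():
        if name in merged:
            if honor_labels:
                continue  # keep the value that came with the sample
            merged["exported_" + name] = merged[name]
        merged[name] = value
    return merged


# Labels as pushed by the script, and labels attached by the scrape job.
pushed = {"id": "tomcat_1018", "host": "test", "type": "mem",
          "instance": "uat", "job": "uat-app-status"}
target = {"instance": "uat", "job": "uat-apps-status"}

result = merge_labels(pushed, target)
print(result["job"], result["exported_job"], result["exported_instance"])
```

Setting `honor_labels: true` on the Pushgateway job is the usual way to avoid the `exported_` prefix, at the cost of trusting whatever labels the pushers send.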

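Note that Prometheus could instead pull the data from a metrics endpoint on each node. That would require running a web server on every monitored node, so I switched to pushing to Pushgateway.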
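
For comparison, the pull variant would look roughly like the Python 3 sketch below: a tiny web server on each node that answers `/metrics` in the text format. This is not what I deployed; `collect_metrics`, the metric name `tomcat_usage` and port 9101 are made-up placeholders.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_metrics():
    """Hypothetical collector; replace with real CPU/memory readings."""
    return {"tomcat_1018": {"cpu": 2.4, "mem": 3.2}}


def render_metrics(status, host):
    """Render the readings as one sample per line in the text format."""
    lines = []
    for tomcat_id, values in sorted(status.items()):
        for kind in ("cpu", "mem"):
            lines.append('tomcat_usage{id="%s",host="%s",type="%s"} %s'
                         % (tomcat_id, host, kind, values[kind]))
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_metrics(), "uat").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9101):
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` on a node and adding `node_ip:9101` as a target would be enough for a test, which also shows the cost: one more listening process per machine.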

first_rules.yml:

groups:
- name: example   # name of the rule group
  rules:

  # Alert for any instance that is unreachable for >1 minute.
  - alert: InstanceDown     # checks the state of each job: if its metrics cannot be scraped for 1 minute, the alert is sent to Alertmanager
    expr: up == 0
    for: 1m    # how long the condition must hold
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."


  - alert: "it's has problem"  # name of the alert
    expr: 'test_tomcat{exported_instance="uat",exported_job="uat-app-status",host="test",instance="uat",job="uat-apps-status"} -  test_tomcat{exported_instance="uat",exported_job="uat-app-status",host="test",instance="uat",job="uat-apps-status"} offset 1w > 5'   # compares the current value with the value one week ago; if the difference is greater than 5 and stays so for the "for" duration below, the alert is sent to Alertmanager
    for: 1m  # how long the condition must hold
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.type }} trending up"
      description: "Host: {{ $labels.host }} tomcat_id: {{ $labels.id }} type: {{ $labels.type }} is more than 5 above the value one week ago; current difference: {{ $value }}"    # custom alert text

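These are basic custom alerts. You can also use the template feature to build more detailed alert pages (see the template usage documentation) and adapt the configuration to your own environment.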
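
To make the effect of `for` concrete, here is a small Python sketch of how an alert moves from inactive to pending to firing as the rule is evaluated. It is a simplified model written for this post, not Prometheus code; `alert_states` and the sample values are invented.

```python
def alert_states(samples, threshold, hold_seconds):
    """Return the alert state after each evaluation.

    samples is a list of (timestamp_seconds, value) pairs in time order.
    The alert is pending while the condition holds and becomes firing
    once it has held for at least hold_seconds.
    """
    states = []
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            held = ts - active_since
            states.append("firing" if held >= hold_seconds else "pending")
        else:
            active_since = None  # condition broke, start over
            states.append("inactive")
    return states


# Difference to last week's value, evaluated every 15 seconds.
samples = [(0, 3), (15, 6), (30, 7), (45, 8), (60, 9), (75, 9), (90, 4)]
states = alert_states(samples, threshold=5, hold_seconds=60)
print(states)
```

With `for: 1m` and a 15s evaluation interval the alert in this example needs five consecutive evaluations above the threshold before it fires, and a single evaluation below it resets the timer.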

2. Next, install the Alertmanager and Pushgateway used in the configuration above, again with Docker.
The docker-compose file for Alertmanager:

version: '2'
services:
  altermanager:
    image: prom/alertmanager:master
    volumes:
      - /data/compose_data/prometheus_altermanager/conf/config.yml:/etc/alertmanager/config.yml  # Alertmanager configuration file
      - /data/compose_data/prometheus_altermanager/data:/alertmanager  # Alertmanager data directory; must match -storage.path below
    ports:
      - "9093:9093"
    command: -config.file=/etc/alertmanager/config.yml -storage.path=/alertmanager -web.external-url=http://{ip_or_domain}:9093   # overrides the startup arguments; web.external-url makes the links in alert emails open the Alertmanager web page directly

Again, the configuration file needs to be exported from the container, edited, and placed in the mounted directory.
config.yml:

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp.exmail.qq.com:465'  # the SMTP server
  smtp_from: 'test@qq.com'  # the From address, usually the mailbox user name
  smtp_auth_username: 'test@qq.com'  # mailbox user name
  smtp_auth_password: '******'    # mailbox password
  smtp_require_tls: false   # with true, mail was not delivered and I saw no error; setting it to false made it work
  # The auth token for Hipchat.
  #hipchat_auth_token: '1234556789'
  # Alternative host for Hipchat.
  #hipchat_api_url: 'https://hipchat.foobar.org/'    # these are other notification channels
route:
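  group_by: ['host','id','type']    # alerts are grouped by these labels
  group_wait: 30s   # how long to wait before sending the first notification for a new group
  group_interval: 30s    # how long to wait before notifying about new alerts in an existing group
  repeat_interval: 1h     # how long to wait before repeating a notification
  receiver: 'test-mails'    # send to the receiver with this name
receivers:
- name: 'test-mails'
  email_configs:
  - to: "test@qq.com"  # recipient address; for several recipients write test1@qq.com,test2@qq.com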

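The above is only a simple alerting setup. You can also silence by group, route by category, route by severity, and so on; see the default configuration file inside the Docker container for how to configure these.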
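
What `group_by: ['host','id','type']` means in practice can be sketched in a few lines of Python: alerts whose values for those labels are identical end up in one notification. This is only an illustration of the idea with invented alerts; `group_alerts` is a hypothetical helper, not Alertmanager code.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # A missing label counts as an empty value.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "trend", "host": "uat", "id": "tomcat_1018", "type": "mem"}},
    {"labels": {"alertname": "limit", "host": "uat", "id": "tomcat_1018", "type": "mem"}},
    {"labels": {"alertname": "trend", "host": "uat", "id": "tomcat_1018", "type": "cpu"}},
]

groups = group_alerts(alerts, ["host", "id", "type"])
for key, members in groups.items():
    print(key, [a["labels"]["alertname"] for a in members])
```

Here the two `mem` alerts for the same Tomcat share one group while the `cpu` alert forms its own, so the receiver would get two emails instead of three.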
3. Next is the docker-compose file for Pushgateway:

version: '2'
services:
  pushgateway:
    image: prom/pushgateway:master
    ports:
      - "9091:9091"

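This one is simple; Pushgateway is only used as a relay for the monitoring data.
4. Once everything is configured, start the containers in order: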

docker-compose -f /data/compose/prometheus_pushgateway/docker-compose.yml up -d
docker-compose -f /data/compose/prometheus_altermanager/docker-compose.yml up -d
docker-compose -f /data/compose/prometheus/docker-compose.yml up -d
If a container fails to start, run docker-compose -f <file> logs to see the error details and fix them.

5. After startup, open the corresponding web pages to check.
Prometheus web page:

Visit http://{prometheus_ip}:9090

premetheus.png

Where:

Alerts shows the current state of the alerts.
Status -> Rules shows the configured alerting rules.
Status -> Targets shows the configured jobs and their state.
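
The same target state can be read without the browser by sending the `up` query to the HTTP API. The Python 3 sketch below assumes the instant-query endpoint `/api/v1/query` and its JSON vector result; the base URL and the sample payload are placeholders, and `down_targets` is a helper made up for this post.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, expr):
    """Build the instant-query URL for a PromQL expression."""
    return "%s/api/v1/query?%s" % (base_url.rstrip("/"), urlencode({"query": expr}))


def fetch(base_url, expr="up"):
    """Run the query against a live Prometheus and return the decoded JSON."""
    with urlopen(query_url(base_url, expr), timeout=5) as response:
        return json.load(response)


def down_targets(payload):
    """Return (job, instance) for every series of `up` whose value is 0."""
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]
        if float(value) == 0:
            labels = series["metric"]
            down.append((labels.get("job"), labels.get("instance")))
    return down


# A hand-written response in the shape the API returns for `up`.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
             "value": [1515636811.0, "1"]},
            {"metric": {"__name__": "up", "job": "uat-apps-status", "instance": "uat"},
             "value": [1515636811.0, "0"]},
        ],
    },
}
print(down_targets(sample))
```

Against a running server, `down_targets(fetch("http://{prometheus_ip}:9090"))` would list the jobs that the Targets page shows as down.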

Alertmanager page:

Visit: http://altermanager:9093

altermanager.png

Here you can set silences based on labels, view the current alerts, and so on. The links in the alert emails lead here.

Pushgateway page:

Visit: http://pushgateway:9091

pushgateway.png

Here you can see the jobs known to Pushgateway, the keys of each job, and when data was last received. You can also delete a job's data so that it is regenerated.
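
Deleting a job's data can also be scripted, since Pushgateway accepts an HTTP DELETE on the same `/metrics/job/<job>` path that is used for pushing. The Python 3 sketch below assumes the group is keyed by the job name only; the gateway address is the one from prometheus.yml and the function names are my own.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def delete_request(gateway, job):
    """Build a DELETE request for the group that is keyed only by job."""
    url = "http://%s/metrics/job/%s" % (gateway, quote(job, safe=""))
    return Request(url, method="DELETE")


def delete_group(gateway, job):
    """Send the request to a live Pushgateway and return the HTTP status."""
    with urlopen(delete_request(gateway, job), timeout=5) as response:
        return response.status


req = delete_request("192.168.98.73:9091", "uat-app-status")
print(req.get_method(), req.full_url)
```

Only `delete_request` is exercised here; `delete_group` needs a reachable Pushgateway and removes every metric in that group.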

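6. With all of the above done, the next step is to write a script that collects the data and pushes it to Pushgateway. Here is an example script; adapt it to monitor whatever data you need.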

import requests,time
from get_application_status import get_app_status
# get_app_status is a module I wrote myself; it runs commands to get Tomcat's CPU and memory usage and returns a dict.

def _submit_wrapper(url, job_name, value):
    headers = {'X-Requested-With': 'Python requests', 'Content-type': 'text/xml'}
    requests.post('http://%s/metrics/job/%s' % (url, job_name),
                      data='%s\n' % (value), headers=headers)


def push_metrics(job_name,hostid,instance,url):
    all_app_status = get_app_status()
    tomcat_status = all_app_status.tomcat_status()
    metrice_name = ""
    for tomcat in tomcat_status:
        metrice_name += '%s_tomcat{id="%s",host="%s",type="mem",instance="%s",job="%s"} %s\n'%(hostid,tomcat,hostid,instance,job_name,tomcat_status[tomcat]['mem'])
        metrice_name += '%s_tomcat{id="%s",host="%s",type="cpu",instance="%s",job="%s"} %s\n' % (hostid,tomcat,hostid,instance,job_name,tomcat_status[tomcat]['cpu'])
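# The key part is assembling the collected values into one string that follows the metrics text format; look at the metrics output of any target to see the format. For example: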
# uat_tomcat{id="tomcat_1018",host="uat",type="mem",instance="uat",job="uat-app-status"} 3.2
# uat_tomcat{id="tomcat_1018",host="uat",type="cpu",instance="uat",job="uat-app-status"} 2.4
    _submit_wrapper(url=url,job_name=job_name,value=metrice_name) 




if __name__ == "__main__":
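    job_name = "{{ job_name }}"  # I run this script by pushing it out in bulk with ansible, so these are jinja variables; if you don't need that, set the values directly.
    hostid = "{{ hostid }}".replace("-","_")  # hostid becomes part of the metric name, and metric names cannot contain "-", so it is replaced with "_"
    instance = "{{ instance }}"
    url = "{{ url }}"  # the address of Pushgateway (host:9091)
    while True:
        push_metrics(job_name=job_name,hostid=hostid,instance=instance,url=url)
        time.sleep(60)   # an endless loop keeps collecting data; a scheduled task (cron) would also work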
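
The string building in `push_metrics` is the part most likely to go wrong, so here is a separate Python 3 sketch of it that sanitizes the metric name and escapes label values. It is a variation on the script above, not a replacement: the helper names are invented, and unlike the original it leaves "-" untouched inside label values, where it is allowed.

```python
import re


def metric_name(raw):
    """Replace characters that are not allowed in a metric name with "_"."""
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", raw)
    # A metric name must not start with a digit.
    return "_" + name if name[:1].isdigit() else name


def escape_label_value(value):
    """Escape backslash, double quote and newline as the text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample line: name{label="value",...} value."""
    pairs = ",".join('%s="%s"' % (key, escape_label_value(val))
                     for key, val in labels.items())
    return "%s{%s} %s\n" % (metric_name(name), pairs, value)


line = render_sample("uat-web-01_tomcat",
                     {"id": "tomcat_1018", "host": "uat-web-01", "type": "mem"},
                     3.2)
print(line, end="")
```

The result is one line per sample, and several such lines can be concatenated and posted in a single request exactly as `_submit_wrapper` does.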

7. Run the script above to push to the gateway, and Prometheus can then scrape the data. Here is roughly the yml I use to deploy it with ansible.
I use ansible roles:
tasks/main.yml:

---
# tasks file for premethous_client

- name: ensure scripts dir exists
  file: path=/root/scripts state=directory

- name: copy service_promethous
  copy: src=service_promethous.sh dest=/root/scripts/service_promethous.sh  # a simple script I wrote to start and stop the push script

- name: copy get_application_status
  template: src=get_application_status.py dest=/root/scripts/get_application_status.py   # the collector module; whether you need it depends on your own script, so it is not described here

- name: copy single_gateway
  template: src=single_gateway.py dest=/root/scripts/single_gateway.py  # the script that pushes to the gateway; it runs as an endless loop, though a scheduled task would also work
  notify:
    - restart prometheus

- name: start prometheus
  shell: sh /root/scripts/service_promethous.sh start
  ignore_errors: True

files/service_promethous.sh:

#!/bin/bash

start(){
  ps aux |grep single_gateway.py | grep python | grep -v grep > /dev/null
  if [ $? -eq 0 ];then
    echo "already start"
    exit 0
  else
    nohup python /root/scripts/single_gateway.py > /dev/null 2>&1 &
  fi
}

stop(){
  num=`ps aux |grep single_gateway.py | grep -v grep | awk  '{print $2}'`
  if [ -n "$num" ];then   # quoted so the test also works when several PIDs match
    kill $num
    echo "stop success..."
  else
    echo "no starting..."
  fi
}

status(){
  ps aux |grep single_gateway.py | grep python | grep -v grep > /dev/null
  if [ $? -eq 0 ];then
    echo "starting...."
  else
    echo "stoping......"
  fi
}

case "$1" in
        start)
                start
                ;;
        stop)
                stop
                ;;
        restart)
                stop
                sleep 2
                start
                ;;
        status)
                status
                ;;
        *)
                echo "only support start|stop|restart|status"
esac

# A quick-and-dirty start/stop script; treat it as a reference only.

handlers/main.yml:

---
# handlers file for premethous_client
#
- name: restart prometheus
  shell: sh /root/scripts/service_promethous.sh restart  # handler that restarts the script when its files change

premethous_client.yml:

- hosts: uat_env
  gather_facts: True
  roles:
    - premethous_client
  vars:
    job_name: uat-app-status
    hostid: "{{ ansible_hostname }}"
    instance: uat
    url: altermanager:9091   # host:port of Pushgateway (9091 is the Pushgateway port)
  tags:
    - uat-premethous

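This is the playbook that runs the role. It mainly defines the variables that are substituted into the role's templates, so that different machines push values under different keys to Pushgateway.
8. Once ansible is configured, run: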

ansible-playbook premethous_client.yml

and the specified machines start pushing data.

Note: ansible is only used for convenience; it is not required.

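9. At this point monitoring and alerting with Prometheus is complete. As an extension, here is how to use Grafana together with Prometheus for dashboards.
As usual, it is deployed with docker-compose: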

version: '2'
services:
  grafana:
    image: grafana/grafana:4.6.3
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=123456   # admin password
    volumes:
      - /data/compose_data/grafana:/var/lib/grafana  # data directory
    command: cfg:default.smtp.enabled=true cfg:default.smtp.host=smtp.exmail.qq.com:25 cfg:default.smtp.user=test@qq.com cfg:default.smtp.password=***** cfg:default.smtp.from_address=test@qq.com cfg:default.server.root_url=http://{grafana_ip_or_domain}:3000
# overrides Grafana's startup configuration to enable sending email

Start Grafana:

docker-compose -f /data/compose/grafana/docker-compose.yml up -d

Visit http://grafana:3000 and log in with your user name and password.

grafana.png

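There are a few steps here, such as adding a data source and adding users, which I won't cover; they are easy to complete by following the prompts. I will focus on dashboard templates: go to Dashboards, click New, click the gear icon, click Templating, and create a new variable.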

templating.png

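Note that to filter a single value out of the returned data, the part you want must be wrapped in () so that the desired appid is extracted from all the key values; try wrapping different parts in parentheses to see the effect. You can then create another variable for the corresponding host. Configure this to suit your own environment. The overall result of my configuration looks like this: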
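
The role of the parentheses is easier to see outside Grafana. The Python 3 sketch below applies a regular expression with and without a capture group to a few series strings; it assumes, as I observed, that the variable takes the captured part when a group is present and the whole match otherwise. The series strings and the `extract` helper are invented for the example, and Grafana's own regex engine may differ in details.

```python
import re


def extract(pattern, items):
    """Return the distinct values a templating regex would produce."""
    values = []
    for item in items:
        match = re.search(pattern, item)
        if not match:
            continue  # series that do not match are dropped
        # Use the first capture group if the pattern has one.
        value = match.group(1) if match.groups() else match.group(0)
        if value not in values:
            values.append(value)
    return values


series = [
    'uat_tomcat{host="uat",id="tomcat_1018",type="mem"}',
    'uat_tomcat{host="uat",id="tomcat_1018",type="cpu"}',
    'uat_tomcat{host="uat",id="tomcat_1019",type="mem"}',
]

with_group = extract(r'id="([^"]+)"', series)
without_group = extract(r'id="[^"]+"', series)
print(with_group)
print(without_group)
```

Only the version with parentheses yields bare ids that can be dropped into a query as the appid variable.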

status.png

Note: when creating a graph panel, reference a templating variable with the [[ ]] syntax, for example [[appid]] for appid. In Legend format, wrap a label name in {{ }} to use its value.
A screenshot of the rough configuration:

config.png

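Conclusion:

This only uses the basic features of custom monitoring with Prometheus. Prometheus also supports several other metric types, a rich set of arithmetic expressions, alert aggregation and inhibition, service discovery, and more. These are left for you to explore; real business needs are what drive the technology forward.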

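One open problem is monitoring the pushing side: if the script crashes and stops, the data is no longer updated. I use Zabbix to monitor the process on each machine; there are surely better approaches worth trying.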
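
One approach that stays inside this setup, which I have not deployed myself, is to look at the `push_time_seconds` metric that Pushgateway exposes for each group and flag groups that have not been pushed recently. The Python 3 sketch below parses that metric from the gateway's own metrics text; the sample text, the 180 second limit and the `stale_jobs` helper are assumptions for illustration, and the simple regular expression would not cope with "}" inside a label value.

```python
import re
import time

# Matches lines such as: push_time_seconds{instance="",job="x"} 1.5e+09
PUSH_TIME = re.compile(r'^push_time_seconds\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
JOB = re.compile(r'job="([^"]*)"')


def stale_jobs(metrics_text, max_age, now=None):
    """Return (job, age_seconds) for groups last pushed more than max_age ago."""
    now = time.time() if now is None else now
    stale = []
    for line in metrics_text.splitlines():
        match = PUSH_TIME.match(line)
        if not match:
            continue  # comments and other metrics
        job = JOB.search(match.group("labels"))
        age = now - float(match.group("value"))
        if age > max_age:
            stale.append((job.group(1) if job else "", age))
    return stale


sample = """# TYPE push_time_seconds gauge
push_time_seconds{instance="",job="uat-app-status"} 1.5156368e+09
push_time_seconds{instance="",job="other"} 1.515637e+09
"""
stale = stale_jobs(sample, max_age=180, now=1515637100)
print(stale)
```

Because the script in step 6 pushes every host into the same job group, this only shows when any host last pushed; telling hosts apart would need a per-host grouping key in the push URL.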

Last edited: 2018.01.11 10:13:31
