本文共 12212 字,大约阅读时间需要 40 分钟。
背景:
Open-Falcon 是小米运维部开源的一款互联网企业级监控系统解决方案,具体的安装和使用说明请见官网:。
监控:
1) MySQL 收集信息脚本(mysql_monitor.py)
#!/bin/env python#-- encoding: utf-8 --from future import division
import MySQLdbimport datetimeimport timeimport osimport sysimport fileinputimport requestsimport jsonimport reclass MySQLMonitorInfo():
def __init__(self,host,port,user,password): self.host = host self.port = port self.user = user self.password = passworddef stat_info(self): try: m = MySQLdb.connect(host=self.host,user=self.user,passwd=self.password,port=self.port,charset='utf8') query = "SHOW GLOBAL STATUS" cursor = m.cursor() cursor.execute(query) Str_string = cursor.fetchall() Status_dict = {} for Str_key,Str_value in Str_string: Status_dict[Str_key] = Str_value cursor.close() m.close() return Status_dict except Exception, e: print (datetime.datetime.now()).strftime("%Y-%m-%d %H:%M:%S") print e Status_dict = {} return Status_dict def engine_info(self): try: m = MySQLdb.connect(host=self.host,user=self.user,passwd=self.password,port=self.port,charset='utf8') _engine_regex = re.compile(ur'(History list length) ([0-9]+\.?[0-9]*)\n') query = "SHOW ENGINE INNODB STATUS" cursor = m.cursor() cursor.execute(query) Str_string = cursor.fetchone() a,b,c = Str_string cursor.close() m.close() return dict(_engine_regex.findall(c)) except Exception, e: print (datetime.datetime.now()).strftime("%Y-%m-%d %H:%M:%S") print e return dict(History_list_length=0)
if name == 'main':
open_falcon_api = 'http://192.168.200.86:1988/v1/push'db_list= []for line in fileinput.input(): db_list.append(line.strip())for db_info in db_list:
#host,port,user,password,endpoint,metric = db_info.split(',')
host,port,user,password,endpoint = db_info.split(',')timestamp = int(time.time()) step = 60
#tags = "port=%s" %port
tags = ""conn = MySQLMonitorInfo(host,int(port),user,password) stat_info = conn.stat_info() engine_info = conn.engine_info() mysql_stat_list = [] monitor_keys = [ ('Com_select','COUNTER'), ('Qcache_hits','COUNTER'), ('Com_insert','COUNTER'), ('Com_update','COUNTER'), ('Com_delete','COUNTER'), ('Com_replace','COUNTER'), ('MySQL_QPS','COUNTER'), ('MySQL_TPS','COUNTER'), ('ReadWrite_ratio','GAUGE'), ('Innodb_buffer_pool_read_requests','COUNTER'), ('Innodb_buffer_pool_reads','COUNTER'), ('Innodb_buffer_read_hit_ratio','GAUGE'), ('Innodb_buffer_pool_pages_flushed','COUNTER'), ('Innodb_buffer_pool_pages_free','GAUGE'), ('Innodb_buffer_pool_pages_dirty','GAUGE'), ('Innodb_buffer_pool_pages_data','GAUGE'), ('Bytes_received','COUNTER'), ('Bytes_sent','COUNTER'), ('Innodb_rows_deleted','COUNTER'), ('Innodb_rows_inserted','COUNTER'), ('Innodb_rows_read','COUNTER'), ('Innodb_rows_updated','COUNTER'), ('Innodb_os_log_fsyncs','COUNTER'), ('Innodb_os_log_written','COUNTER'), ('Created_tmp_disk_tables','COUNTER'), ('Created_tmp_tables','COUNTER'), ('Connections','COUNTER'), ('Innodb_log_waits','COUNTER'), ('Slow_queries','COUNTER'), ('Binlog_cache_disk_use','COUNTER') ] for _key,falcon_type in monitor_keys: if _key == 'MySQL_QPS': _value = int(stat_info.get('Com_select',0)) + int(stat_info.get('Qcache_hits',0)) elif _key == 'MySQL_TPS': _value = int(stat_info.get('Com_insert',0)) + int(stat_info.get('Com_update',0)) + int(stat_info.get('Com_delete',0)) + int(stat_info.get('Com_replace',0)) elif _key == 'Innodb_buffer_read_hit_ratio': try: _value = round((int(stat_info.get('Innodb_buffer_pool_read_requests',0)) - int(stat_info.get('Innodb_buffer_pool_reads',0)))/int(stat_info.get('Innodb_buffer_pool_read_requests',0)) * 100,3) except ZeroDivisionError: _value = 0 elif _key == 'ReadWrite_ratio': try: _value = round((int(stat_info.get('Com_select',0)) + int(stat_info.get('Qcache_hits',0)))/(int(stat_info.get('Com_insert',0)) + int(stat_info.get('Com_update',0)) + int(stat_info.get('Com_delete',0)) + int(stat_info.get('Com_replace',0))),2) except ZeroDivisionError: _value = 0 else: _value = int(stat_info.get(_key,0)) falcon_format = { 'Metric': '%s' % (_key), 'Endpoint': endpoint, 'Timestamp': timestamp, 'Step': step, 'Value': _value, 'CounterType': falcon_type, 'TAGS': tags } mysql_stat_list.append(falcon_format) #_key : History list length for _key,_value in engine_info.items(): _key = "Undo_Log_Length" falcon_format = { 'Metric': '%s' % (_key), 'Endpoint': endpoint, 'Timestamp': timestamp, 'Step': step, 'Value': int(_value), 'CounterType': "GAUGE", 'TAGS': tags } mysql_stat_list.append(falcon_format) print json.dumps(mysql_stat_list,sort_keys=True,indent=4) requests.post(open_falcon_api, data=json.dumps(mysql_stat_list)) 指标说明:收集指标里的COUNTER表示每秒执行次数,GAUGE表示直接输出值。
指标 类型 说明
Undo_Log_Length GAUGE 未清除的Undo事务数Com_select COUNTER select/秒=QPSCom_insert COUNTER insert/秒Com_update COUNTER update/秒Com_delete COUNTER delete/秒Com_replace COUNTER replace/秒MySQL_QPS COUNTER QPSMySQL_TPS COUNTER TPS ReadWrite_ratio GAUGE 读写比例Innodb_buffer_pool_read_requests COUNTER innodb buffer pool 读次数/秒Innodb_buffer_pool_reads COUNTER Disk 读次数/秒Innodb_buffer_read_hit_ratio GAUGE innodb buffer pool 命中率Innodb_buffer_pool_pages_flushed COUNTER innodb buffer pool 刷写到磁盘的页数/秒Innodb_buffer_pool_pages_free GAUGE innodb buffer pool 空闲页的数量Innodb_buffer_pool_pages_dirty GAUGE innodb buffer pool 脏页的数量Innodb_buffer_pool_pages_data GAUGE innodb buffer pool 数据页的数量Bytes_received COUNTER 接收字节数/秒Bytes_sent COUNTER 发送字节数/秒Innodb_rows_deleted COUNTER innodb表删除的行数/秒Innodb_rows_inserted COUNTER innodb表插入的行数/秒Innodb_rows_read COUNTER innodb表读取的行数/秒Innodb_rows_updated COUNTER innodb表更新的行数/秒Innodb_os_log_fsyncs COUNTER Redo Log fsync次数/秒 Innodb_os_log_written COUNTER Redo Log 写入的字节数/秒Created_tmp_disk_tables COUNTER 创建磁盘临时表的数量/秒Created_tmp_tables COUNTER 创建内存临时表的数量/秒Connections COUNTER 连接数/秒Innodb_log_waits COUNTER innodb log buffer不足等待的数量/秒Slow_queries COUNTER 慢查询数/秒Binlog_cache_disk_use COUNTER Binlog Cache不足的数量/秒使用说明:读取配置到都数据库列表执行,配置文件格式如下(mysqldb_list.txt):
IP,Port,User,Password,endpoint
192.168.2.21,3306,root,123,mysql-21:3306
192.168.2.88,3306,root,123,mysql-88:3306最后执行:python mysql_monitor.py mysqldb_list.txt
2) Redis 收集信息脚本(redis_monitor.py)
#!/bin/env python
#-- coding:utf-8 --import json
import timeimport reimport redisimport requestsimport fileinputimport datetimeclass RedisMonitorInfo():
def __init__(self,host,port,password): self.host = host self.port = port self.password = passworddef stat_info(self): try: r = redis.Redis(host=self.host, port=self.port, password=self.password) stat_info = r.info() return stat_info except Exception, e: print (datetime.datetime.now()).strftime("%Y-%m-%d %H:%M:%S") print e return dict()def cmdstat_info(self): try: r = redis.Redis(host=self.host, port=self.port, password=self.password) cmdstat_info = r.info('Commandstats') return cmdstat_info except Exception, e: print (datetime.datetime.now()).strftime("%Y-%m-%d %H:%M:%S") print e return dict()
if name == 'main':
open_falcon_api = 'http://192.168.200.86:1988/v1/push'db_list= []for line in fileinput.input(): db_list.append(line.strip())for db_info in db_list:
#host,port,password,endpoint,metric = db_info.split(',')
host,port,password,endpoint = db_info.split(',')timestamp = int(time.time()) step = 60 falcon_type = 'COUNTER'
#tags = "port=%s" %port
tags = ""conn = RedisMonitorInfo(host,port,password) #查看各个命令每秒执行次数 redis_cmdstat_dict = {} redis_cmdstat_list = [] cmdstat_info = conn.cmdstat_info() for cmdkey in cmdstat_info: redis_cmdstat_dict[cmdkey] = cmdstat_info[cmdkey]['calls'] for _key,_value in redis_cmdstat_dict.items(): falcon_format = { 'Metric': '%s' % (_key), 'Endpoint': endpoint, 'Timestamp': timestamp, 'Step': step, 'Value': int(_value), 'CounterType': falcon_type, 'TAGS': tags } redis_cmdstat_list.append(falcon_format) #查看Redis各种状态,根据需要增删监控项,str的值需要转换成int redis_stat_list = [] monitor_keys = [ ('connected_clients','GAUGE'), ('blocked_clients','GAUGE'), ('used_memory','GAUGE'), ('used_memory_rss','GAUGE'), ('mem_fragmentation_ratio','GAUGE'), ('total_commands_processed','COUNTER'), ('rejected_connections','COUNTER'), ('expired_keys','COUNTER'), ('evicted_keys','COUNTER'), ('keyspace_hits','COUNTER'), ('keyspace_misses','COUNTER'), ('keyspace_hit_ratio','GAUGE'), ('keys_num','GAUGE'), ] stat_info = conn.stat_info() for _key,falcon_type in monitor_keys: #计算命中率 if _key == 'keyspace_hit_ratio': try: _value = round(float(stat_info.get('keyspace_hits',0))/(int(stat_info.get('keyspace_hits',0)) + int(stat_info.get('keyspace_misses',0))),4)*100 except ZeroDivisionError: _value = 0 #碎片率是浮点数 elif _key == 'mem_fragmentation_ratio': _value = float(stat_info.get(_key,0)) #拿到key的数量 elif _key == 'keys_num': _value = 0 for i in range(16): _key = 'db'+str(i) _num = stat_info.get(_key) if _num: _value += int(_num.get('keys')) _key = 'keys_num' #其他的都采集成counter,int else: try: _value = int(stat_info[_key]) except: continue falcon_format = { 'Metric': '%s' % (_key), 'Endpoint': endpoint, 'Timestamp': timestamp, 'Step': step, 'Value': _value, 'CounterType': falcon_type, 'TAGS': tags } redis_stat_list.append(falcon_format) load_data = redis_stat_list+redis_cmdstat_list print json.dumps(load_data,sort_keys=True,indent=4) requests.post(open_falcon_api, data=json.dumps(load_data)) 指标说明:收集指标里的COUNTER表示每秒执行次数,GAUGE表示直接输出值。
指标 类型 说明
connected_clients GAUGE 连接的客户端个数blocked_clients GAUGE 被阻塞客户端的数量used_memory GAUGE Redis分配的内存的总量used_memory_rss GAUGE OS分配的内存的总量mem_fragmentation_ratio GAUGE 内存碎片率,used_memory_rss/used_memorytotal_commands_processed COUNTER 每秒执行的命令数,比较准确的QPSrejected_connections COUNTER 被拒绝的连接数/秒expired_keys COUNTER 过期KEY的数量/秒 evicted_keys COUNTER 被驱逐KEY的数量/秒keyspace_hits COUNTER 命中KEY的数量/秒keyspace_misses COUNTER 未命中KEY的数量/秒keyspace_hit_ratio GAUGE KEY的命中率keysnum GAUGE KEY的数量cmd* COUNTER 各种名字都执行次数/秒使用说明:读取配置到都数据库列表执行,配置文件格式如下(redisdb_list.txt):IP,Port,Password,endpoint
192.168.1.56,7021,zhoujy,redis-56:7021
192.168.1.55,7021,zhoujy,redis-55:7021最后执行:python redis_monitor.py redisdb_list.txt
3) MongoDB 收集信息脚本(mongodb_monitor.py)
...后续添加
4)其他相关的监控(需要装上agent),比如下面的指标:
告警项 触发条件 备注
load.1min all(#3)>10 Redis服务器过载,处理能力下降cpu.idle all(#3)<10 CPU idle过低,处理能力下降df.bytes.free.percent all(#3)<20 磁盘可用空间百分比低于20%,影响从库RDB和AOF持久化mem.memfree.percent all(#3)<15 内存剩余低于15%,Redis有OOM killer和使用swap的风险mem.swapfree.percent all(#3)<80 使用20% swap,Redis性能下降或OOM风险net.if.out.bytes all(#3)>94371840 网络出口流量超90MB,影响Redis响应net.if.in.bytes all(#3)>94371840 网络入口流量超90MB,影响Redis响应disk.io.util all(#3)>90 磁盘IO可能存负载,影响从库持久化和阻塞写转载于:https://blog.51cto.com/ershao/2118760