This section focuses on:
- designing the pre_query project configuration file
- copying the log files with Ansible
- parsing the log files and identifying heavy queries
Create a new Python project: pre_query
Design the configuration file
- config.yaml
prome_query_log:
  prome_log_path: /App/logs/prometheus_query.log # path to the Prometheus query log file
  heavy_query_threhold: 5.0 # heavy query threshold, in seconds
  py_name: parse_prome_query_log.py # main script name
  local_work_dir: /App/tgzs/conf_dir/prome_heavy_expr_parse/all_prome_query_log # local dir where the parser saves the fetched query logs
  check_heavy_query_api: http://localhost:9090 # a Prometheus query endpoint used to double-check that a record really is heavy, to avoid false additions
redis:
  host: localhost # Redis host
  port: 6379
  redis_set_key: hke:heavy_query_set
  redis_one_key_prefix: hke:heavy_expr # heavy_query key prefix
  high_can_result_key: high_can_result_key
consul:
  host: localhost # Consul host
  port: 8500
  consul_record_key_prefix: prometheus/records # prefix for record keys stored in Consul
# all scrape targets, used to fetch high-cardinality metrics
scrape_promes:
  - 1.1.1.1:9090
  - 1.1.1.2:9090
  - 1.1.1.3:9090
  - 1.1.1.4:9090
heavy_blacklist_metrics: # blacklisted metric names
  - kafka_log_log_logendoffset
  - requests_latency_bucket
  - count(node_cpu_seconds_total)
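As a quick illustration of how this file feeds the parser, here is a minimal loading sketch (assuming PyYAML; LOCAL_WORK_DIR and REDIS_SET_KEY are illustrative names, while the other upper-case constants mirror the ones referenced in parse_prome_query_log.py below):

import yaml

# minimal sketch: load config.yaml into the module-level constants used by the parser
with open("config.yaml") as f:
    CONFIG = yaml.safe_load(f)

HEAVY_QUERY_THREHOLD = CONFIG["prome_query_log"]["heavy_query_threhold"]
LOCAL_WORK_DIR = CONFIG["prome_query_log"]["local_work_dir"]
HEAVY_BLACKLIST_METRICS = CONFIG["heavy_blacklist_metrics"]
REDIS_ONE_KEY_PREFIX = CONFIG["redis"]["redis_one_key_prefix"]
REDIS_SET_KEY = CONFIG["redis"]["redis_set_key"]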
Copy the log files with Ansible
The variables live in config.yaml
prome_query_log:
  prome_log_path: /App/logs/prometheus_query.log # path to the Prometheus query log file
  heavy_query_threhold: 5.0 # heavy query threshold, in seconds
  py_name: parse_prome_query_log.py # main script name
  local_work_dir: /App/tgzs/conf_dir/prome_heavy_expr_parse/all_prome_query_log # local dir where the parser saves the fetched query logs
  check_heavy_query_api: http://localhost:9090 # a Prometheus query endpoint used to double-check that a record really is heavy, to avoid false additions
The playbook that performs the copy
- prome_heavy_expr_parse.yaml
- It uses the fetch module to pull every Prometheus instance's query log into the local working directory
- name: fetch log and push expr to cache
  hosts: all
  user: root
  gather_facts: false
  vars_files:
    - config.yaml
  tasks:
    - name: fetch query log
      fetch: src={{ prome_query_log.prome_log_path }} dest={{ prome_query_log.local_work_dir }}/{{ inventory_hostname }}_query.log flat=yes validate_checksum=no
      register: result
    - name: Show debug info
      debug: var=result verbosity=0
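With an inventory file listing all Prometheus nodes (hypothetically named hosts here), the playbook is run in the usual way:

ansible-playbook -i hosts prome_heavy_expr_parse.yaml

Afterwards local_work_dir contains one <inventory_hostname>_query.log file per node, ready for parsing.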
Parse the log files and look for heavy queries
Parser: parse_prome_query_log.py
The parse_log_file function
- Code
import json
import logging
from datetime import datetime

# HEAVY_QUERY_THREHOLD, HEAVY_BLACKLIST_METRICS and REDIS_ONE_KEY_PREFIX are
# module-level constants loaded from config.yaml; get_str_md5 is an md5 helper
# (a sketch of it is shown after the walkthrough below).


def parse_log_file(log_f):
    '''
    Each line of the query log is one JSON object, e.g.:
    {
        "httpRequest":{
            "clientIP":"1.1.1.1",
            "method":"GET",
            "path":"/api/v1/query_range"
        },
        "params":{
            "end":"2020-04-09T06:20:00.000Z",
            "query":"api_request_counter{job="kubernetes-pods",kubernetes_namespace="sprs",app="model-server"}/60",
            "start":"2020-04-02T06:20:00.000Z",
            "step":1200
        },
        "stats":{
            "timings":{
                "evalTotalTime":0.467329174,
                "resultSortTime":0.000476303,
                "queryPreparationTime":0.373947928,
                "innerEvalTime":0.092889708,
                "execQueueTime":0.000008911,
                "execTotalTime":0.467345411
            }
        },
        "ts":"2020-04-09T06:20:28.353Z"
    }
    :param log_f: path to one fetched query log file
    :return: dict mapping record_name (md5-based key) -> heavy query expression
    '''
    heavy_expr_set = set()
    heavy_expr_dict = dict()
    record_expr_dict = dict()
    with open(log_f) as f:
        for line in f:
            try:
                x = json.loads(line.strip())
            except ValueError:
                # skip lines that are not valid JSON
                continue
            if not isinstance(x, dict):
                continue
            # only analyze range queries; instant queries are ignored
            httpRequest = x.get("httpRequest", {})
            path = httpRequest.get("path")
            if path != "/api/v1/query_range":
                continue
            params = x.get("params", {})
            start_time = params.get("start")
            end_time = params.get("end")
            timings = x.get("stats", {}).get("timings", {})
            evalTotalTime = timings.get("evalTotalTime")
            execTotalTime = timings.get("execTotalTime")
            queryPreparationTime = timings.get("queryPreparationTime")
            execQueueTime = timings.get("execQueueTime")
            innerEvalTime = timings.get("innerEvalTime")
            # if the queried time range is longer than 6 hours, do not treat it as a heavy query
            if not start_time or not end_time:
                continue
            start_time = datetime.strptime(start_time, '%Y-%m-%dT%H:%M:%S.%fZ').timestamp()
            end_time = datetime.strptime(end_time, '%Y-%m-%dT%H:%M:%S.%fZ').timestamp()
            if end_time - start_time > 3600 * 6:
                continue
            # if both timings are below the threshold, this is not a heavy query
            c = (queryPreparationTime < HEAVY_QUERY_THREHOLD) and (innerEvalTime < HEAVY_QUERY_THREHOLD)
            if c:
                continue
            # skip entries where any timing exceeds 40 seconds
            if queryPreparationTime > 40:
                continue
            if execQueueTime > 40:
                continue
            if innerEvalTime > 40:
                continue
            if evalTotalTime > 40:
                continue
            if execTotalTime > 40:
                continue
            query = params.get("query").strip()
            # skip queries that touch blacklisted metric names
            is_bl = False
            for bl in HEAVY_BLACKLIST_METRICS:
                if bl in query:
                    is_bl = True
                    break
            if is_bl:
                continue
            # skip queries that already reference a heavy-expr record, to avoid recording them twice
            if REDIS_ONE_KEY_PREFIX in query:
                continue
            # \r\n must not appear in the query; strip it
            if "\r\n" in query:
                query = query.replace("\r\n", "")
            # \n must not appear in the query; strip it
            if "\n" in query:
                query = query.replace("\n", "")
            # a leading '-' comes from Grafana network-out panels; strip it
            if query.startswith("-"):
                query = query.replace("-", "", 1)
            md5_str = get_str_md5(query.encode("utf-8"))
            record_name = "{}:{}".format(REDIS_ONE_KEY_PREFIX, md5_str)
            record_expr_dict[record_name] = query
            heavy_expr_set.add(query)
            # deduplicate: keep the largest evalTotalTime seen for the same query
            last_time = heavy_expr_dict.get(query)
            this_time = evalTotalTime
            if last_time and last_time > this_time:
                this_time = last_time
            heavy_expr_dict[query] = this_time
    logging.info("log_file:{} get :{} heavy expr".format(log_f, len(record_expr_dict)))
    return record_expr_dict
- Check whether the request is a range query; instant queries are not analyzed
if path != "/api/v1/query_range":
    continue
- Parse the timing fields from each query log entry
- If the queried time range is longer than 6 hours, it is not considered a heavy query
# if the queried time range is longer than 6 hours, do not treat it as a heavy query
if not start_time or not end_time:
    continue
start_time = datetime.strptime(start_time, '%Y-%m-%dT%H:%M:%S.%fZ').timestamp()
end_time = datetime.strptime(end_time, '%Y-%m-%dT%H:%M:%S.%fZ').timestamp()
if end_time - start_time > 3600 * 6:
    continue
- If both timings are below the threshold, it is not a heavy query; with the default threshold of 5.0s, the sample entry in the docstring above (queryPreparationTime 0.37s, innerEvalTime 0.09s) would be skipped
# if both timings are below the threshold, this is not a heavy query
c = (queryPreparationTime < HEAVY_QUERY_THREHOLD) and (innerEvalTime < HEAVY_QUERY_THREHOLD)
if c:
    continue
- Deduplicate with a dict and a set, because the log may contain several lines for the same heavy query expression; keep the largest evalTotalTime seen
last_time = heavy_expr_dict.get(query)
this_time = evalTotalTime
if last_time and last_time > this_time:
    this_time = last_time
heavy_expr_dict[query] = this_time
- Compute the md5 of each heavy query expression to build the record key, keep the expression itself as the value, and return the resulting dict (a sketch of the md5 helper follows below)
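get_str_md5 itself is not shown in this section; below is a minimal sketch assuming it simply wraps hashlib, together with a hypothetical driver (run_all and its parameters are illustrative) that parses every fetched log and pushes the results into Redis with redis-py, using the keys from config.yaml:

import glob
import hashlib
import os

import redis


def get_str_md5(b):
    # assumed implementation: hex md5 digest of the already-encoded query string
    return hashlib.md5(b).hexdigest()


def run_all(local_work_dir, redis_host, redis_port, redis_set_key):
    # hypothetical driver: parse each fetched <host>_query.log and cache the heavy exprs
    r = redis.Redis(host=redis_host, port=redis_port)
    for log_f in glob.glob(os.path.join(local_work_dir, "*_query.log")):
        record_expr_dict = parse_log_file(log_f)
        for record_name, query in record_expr_dict.items():
            r.set(record_name, query)           # hke:heavy_expr:<md5> -> query expression
            r.sadd(redis_set_key, record_name)  # membership set of all heavy-query keys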
Key takeaways from this section:
- pre_query project configuration file design
- copying the log files with Ansible
- parsing the log files and identifying heavy queries