现象
2020年7月24号是周五, 晚上8点半开始, 很多HTTP接口出现超时, 查看监控发现RPC调用出现了大量的超时情况. 通过监控日志打印出来的业务逻辑处理耗时来看,业务逻辑耗时并不长。 一度怀疑是RPC框架在处理过高的QPS时出现性能问题.
MySQL监控
直到看到MySQL的监控信息才发现, MySQL实例的线程数和CPU都飙升了非常高. 其中运行中的线程数从正常情况的5~6飙升至峰值431,MySQL实例的CPU飙升至峰值1547%,慢查询也出现了15次,而事务处理的TPS却并没有提高多少,只是从6升到了9.图一 QPS飙升至431.图二 TPS从6升到9图三 CPU从 27%逐渐飙升至1547%图四 慢查询出现15次图五运行线程飙升至431.
原因追溯
明白了应该是MySQL实例发生了慢查询,导致拖慢了业务逻辑的处理,而这些慢的处理请求占用了RPC服务端线程池,导致其它的RPC请求无法被处理,所以在客户端侧看到了大量的超时请求。之所以通过日志没有观察到,是因为只观察到了日志中的部分处理耗时,漏掉了关键的一些导致业务处理变慢的处理耗时。
到底是什么请求导致了MySQL的慢查询呢?通过MySQL提供的一键综合诊断功能,对集群当前的锁进行检测发现如下死锁。图六 数据库实例综合诊断
摘录其中一条具体的SQL如下:
UPDATE creation_red_envelope_lottery t1 SET t1.user_id =559276217082693, t1.activity_id =310, t1.status =1, t1.reward_id ='836552413749253875' WHERE t1.id = (SELECT id FROM (SELECT id FROM creation_red_envelope_lottery t2 WHERE (t2.deleted = 0 AND t2.red_envelope_setting_id =168 AND t2.strategy_id =0) AND (SELECT ifnull(SUM(t3.price),0) FROM creation_red_envelope_lottery t3 WHERE to_days(modified_time) = to_days(now()) AND t3.status <>0) < 510000 AND (SELECT ifnull(SUM(t4.price),0) FROM creation_red_envelope_lottery t4 WHERE DATE_FORMAT(modified_time, '%Y%m') = DATE_FORMAT(CURDATE() , '%Y%m') AND t4.status <>0) < 6000000 AND (SELECT ifnull(SUM(t5.price),0) FROM creation_red_envelope_lottery t5 WHERE DATE_FORMAT(modified_time, '%Y%m') = DATE_FORMAT(CURDATE() , '%Y%m') AND t5.user_id =559276217082693 AND t5.status <>0) < 80000 AND NOT EXISTS (SELECT 1 FROM creation_red_envelope_lottery t6 WHERE t6.reward_id ='836552413749253875')AND t2.status = 0 LIMIT 1) a)
根据如上的SQL语句,原来是拆红包的逻辑出现了锁等待。
进一步诊断如下图七 智能DBA诊断
摘出其中出现死锁的两条SQL语句的执行情况如下:
{
"host": "10.49.56.123",
"port": 3533,
"deadlock_time": "2020-07-25 00:04:11",
"deadlock_id": "0x7f5e69863700",
"has_deaklock": 1,
"trx": [
{
"state": "starting index read",
"id": "157506765",
"thread_id": "155553100",
"query_id": "17616972479",
"client_ip": "10.62.40.110",
"db": "muses",
"sql": "UPDATE creation_red_envelope_lottery t1 SET t1.user_id =833216680297260, t1.activity_id =336, t1.status =1, t1.reward_id =\\'836557418986779319\\' WHERE t1.id = (SELECT id FROM (SELECT id FROM creation_red_envelope_lottery t2 WHERE (t2.deleted = 0 AND t2.red_envelope_setting_id =168 AND t2.strategy_id =3) AND (SELECT ifnull(SUM(t3.price),0) FROM creation_red_envelope_lottery t3 WHERE to_days(modified_time) = to_days(now()) AND t3.status <>0) < 510000 AND (SELECT ifnull(SUM(t4.price),0) FROM creation_red_envelope_lottery t4 WHERE DATE_FORMAT(modified_time, \\'%Y%m\\') = DATE_FORMAT(CURDATE() , \\'%Y%m\\') AND t4.status <>0) < 6000000 AND (SELECT ifnull(SUM(t5.price),0) FROM creation_red_envelope_lottery t5 WHERE DATE_FORMAT(modified_time, \\'%Y%m\\') = DATE_FORMAT(CURDATE() , \\'%Y%m\\') AND t5.user_id =833216680297260 AND t5.status <>0) < 80000 AND NOT EXISTS (SELECT 1 FROM creation_red_envelope_lottery t6 WHERE t6.rew",
"execute_result": "normal"
},
{
"state": "starting index read",
"id": "157506794",
"thread_id": "155551800",
"query_id": "17616974824",
"client_ip": "10.52.40.66",
"db": "muses",
"sql": "UPDATE creation_red_envelope_lottery t1 SET t1.user_id =833216680297260, t1.activity_id =336, t1.status =1, t1.reward_id =\\'836557419240534136\\' WHERE t1.id = (SELECT id FROM (SELECT id FROM creation_red_envelope_lottery t2 WHERE (t2.deleted = 0 AND t2.red_envelope_setting_id =168 AND t2.strategy_id =3) AND (SELECT ifnull(SUM(t3.price),0) FROM creation_red_envelope_lottery t3 WHERE to_days(modified_time) = to_days(now()) AND t3.status <>0) < 510000 AND (SELECT ifnull(SUM(t4.price),0) FROM creation_red_envelope_lottery t4 WHERE DATE_FORMAT(modified_time, \\'%Y%m\\') = DATE_FORMAT(CURDATE() , \\'%Y%m\\') AND t4.status <>0) < 6000000 AND (SELECT ifnull(SUM(t5.price),0) FROM creation_red_envelope_lottery t5 WHERE DATE_FORMAT(modified_time, \\'%Y%m\\') = DATE_FORMAT(CURDATE() , \\'%Y%m\\') AND t5.user_id =833216680297260 AND t5.status <>0) < 80000 AND NOT EXISTS (SELECT 1 FROM creation_red_envelope_lottery t6 WHERE t6.rewa",
"execute_result": "rollback"
}
]
}
可以清晰地看到是这两条SQL更新语句的查询条件是有重合的部分。由于两条语句是同时执行,先后锁住了不同的区域,而被锁住的区域又是对方SQL语句想要锁住的区域,因此出现了死锁。
由于我们的MySQL实例是多个项目共享的,因此单个项目的慢查询导致实例的CPU升高之后,几乎造成了所有的项目写入MySQL都变慢超时。
回顾一下问题出现之后我们线上处理的步骤:
1)在明确了是数据库死锁导致的问题之后,紧急联系DBA同学开启MySQL服务端的锁检测。当检测到死锁或者锁等待导致的慢查询时会自动杀死线程。
2)修复导致慢查询的SQL语句,将一次大的查询SQL语句修改为两次查询。
3)重新上线。