前几天分析了一个死锁的问题,有一个网友看了以后,就发了邮件给我问一个问题。一般来说,能够发送邮件提出问题的同学,都是很认真的,因为他要准备好日志,准备好操作过程,准备好他已经在做的事情。所以这类问题,我都会认真的分析一下,如果没有结果,那就继续分析再等等,掐指一算,有很多问题已经拖了好久了。
这位网友提的一个问题,我看了以后感觉很是奇怪,因为有些颠覆我对MySQL锁的一些认识。这该如何是好。
这个环境的事务隔离级别是RR,存在主键,存在范围查询。
如何复现这个问题,网友提供了信息。
创建表mysql> create table tt(a int not null primary key) engine=innodb; mysql> insert into tt values(10),(20),(30),(40),(50);
复现这个问题可以参考:session1: mysql> set session tx_isolation='repeatable-read'; mysql> begin; mysql> select * from tt where a > 15 and a < 35 for update; +----+ | a | +----+ | 20 | | 30 | +----+
session2:mysql> insert into tt select 1; 此时这个操作会被阻塞,如果你按照这个思路来看,总是会感觉不对劲。 怎么MySQL这么矫情了。 我带着疑问在新搭建的一套MySQL 5.7环境上做了测试,结果还真是。 接下来的任务就是如何说服我,然后我理解了来说服这个网友。
结果这样一个操作下来,我连连测试了5个场景,如何SQL稍作改变,结果又会大大不同。
#for update的场景1
先来做一个基于主键的操作。先来验证一个最基本的情况,稳定下自己的情绪。
#session1
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> select *from tt;
+----+
| a |
+----+
| 10 |
| 20 |
| 30 |
| 40 |
| 50 |
+----+
5 rows in set (0.00 sec)
mysql> select * from tt where a =10 for update;
+----+
| a |
+----+
| 10 |
+----+
1 row in set (0.00 sec)
#session2
mysql> insert into tt select 1;
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
这是一个最为保守的使用方法,如果这个还有问题,那就明显证明数据库有问题了,基于主键,去掉范围扫描,肯定妥妥的。
#for update的场景2
这个场景里面我们修改下范围,原来的(15,35)修改为(10,30),结果差别就很大了。有些阻塞的语句我直接就手工取消了。由此也可以看出其中的差别来,不过可能会看得有点懵了。
#session1
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from tt where a <30 and a>10 for update;
+----+
| a |
+----+
| 20 |
+----+
1 row in set (0.00 sec)
session2:
mysql> insert into tt select 35;
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into tt select 31;
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into tt select 30;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 5;
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into tt select 10;
ERROR 1062 (23000): Duplicate entry '10' for key 'PRIMARY'
mysql> insert into tt select 11;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
这里可以明显感觉到和当前环境的数据分布关系很微妙,范围只有一个20,但是似乎和0有着一定的联系,至少,我不能保证我的查询一定得按照这个精确的范围。
#for update 场景3
这个场景我把最开始碰到的问题做了一些扩展,看看其它范围的数据是否也有类似的情况。我扩大了数据范围,结果很明显的,结果让我有些意料之外。
session1:
mysql> select *from tt;
+----+
| a |
+----+
| 10 |
| 20 |
| 30 |
| 40 |
| 50 |
+----+
5 rows in set (0.00 sec)
mysql> begin;select * from tt where a <35 and a>15 for update;
Query OK, 0 rows affected (0.00 sec)
+----+
| a |
+----+
| 20 |
| 30 |
+----+
2 rows in set (0.00 sec)
session2:
mysql> insert into tt select 1;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 9;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 10;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 35;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 36;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 40;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 50;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
#for update 场景4
尽管这个时候我已经有些乱了,但是我还是耐着性子测试了另外几个场景。我把范围有(15,35)修改为(15,30),结果让我很意外。原本阻塞的insert就可以了。
session1
mysql> begin;select * from tt where a <30 and a>15 for update;
Query OK, 0 rows affected (0.00 sec)
+----+
| a |
+----+
| 20 |
+----+
1 row in set (0.00 sec)
session2
mysql> insert into tt select 1;
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into tt select 15;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 15;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 16;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 14;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 13;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 11;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 10;
ERROR 1062 (23000): Duplicate entry '10' for key 'PRIMARY'
mysql> insert into tt select 9;
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into tt select 30;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql> insert into tt select 31;
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings:
这一切的对比,从直观感受来看是和表里的数据分布是有一定的关系的。
比如场景4,如果我把范围由(15,35)修改为(15,30),这个数据的情况有什么特别之处吗,从我的猜测来看,应该是和里面的索引存储有一定的关系,我查看了Information_schema.innodb_trx,innodb_locks的细节,里面都是指向了同一行。
mysql> select * from INFORMATION_SCHEMA.innodb_locks\G;*************************** 1. row ***************************
lock_id: 4081:36:3:2
lock_trx_id: 4081
lock_mode: X,GAP
lock_type: RECORD
lock_table: `test`.`tt`
lock_index: PRIMARY
lock_space: 36
lock_page: 3
lock_rec: 2
lock_data: 10
*************************** 2. row ***************************
lock_id: 4078:36:3:2
lock_trx_id: 4078
lock_mode: X
lock_type: RECORD
lock_table: `test`.`tt`
lock_index: PRIMARY
lock_space: 36
lock_page: 3
lock_rec: 2
lock_data: 10
2 rows in set, 1 warning (0.00 sec)
通过上面的信息可以看到,都是只想了页面3的数据第2行,这个明显就不对应啊。
但是MySQL 5.7中出现这个问题,自己还是带着一丝的侥幸心理,在MGR上测试了一把,能够复现,结果今天继续耐着性子看了下这个问题,在5.6上模拟了一下,5.6全然没有这个问题,问题到了这里,就有了柳暗花明的一面,能够肯定的是这个问题在MySQL 5.7中可以复现,在MySQL 5.6中是正常的。
如此一来,问题的定论就有了方向,很快就在bugs.mysql.com里面找到了一个相关的bug(85749)
里面也做了类似的测试,能够复现,MySQL官方做了确认。
[31 Mar 18:10] Sinisa Milivojevic
Hi!I have run your test case and got the same results as you have. Upon further analysis, I concluded that this is a bug. A small bug , but a bug. Verified. 而有看点的是问题的提出者定位到了相关的代码,还是希望文档的部分能够把间隙锁的部分补充一下。 No locks are released in this case, but we do request X lock on the gap before the next, non-matching record when non-unique secondary index is used. Check code starting from this line (https://github.com/mysql/mysql-server/blob/71f48ab393bce80a59e5a2e498cd1f46f6b43f9a/storag...): /* Try to place a gap lock on the next index record to prevent phantoms in ORDER BY ... DESC queries */ const rec_t* next_rec = page_rec_get_next_const(rec); offsets = rec_get_offsets(next_rec, index, offsets, ULINT_UNDEFINED, &heap); err = sel_set_rec_lock(pcur, next_rec, index, offsets, prebuilt->select_lock_type, LOCK_GAP, thr, &mtr); in row_search_mvcc(). See the (potential) reason to set this gap lock in the comment above. Maybe there is another reason for the behavior we see. Then it should be also documented.
如果感觉分析的不过瘾,我们看看国内一个做源码的朋友一个相关问题的解读,在地铁上看了之后,一下子想起了这个问题,关联起来一看还真是。