1 BLOG文档结构图
2 前言部分
2.1 导读和注意事项
各位技术爱好者,看完本文后,你可以掌握如下的技能,也可以学到一些其它你所不知道的知识,~O(∩_∩)O~:
① enq: TX - allocate ITL entry等待事件的解决
② 一般等待事件的解决办法
③ 队列等待的基本知识
④ ITL死锁解决
⑤ ITL死锁模拟
⑥ Merge语句的非关联形式的查询优化
3 故障分析及解决过程
3.1 故障环境介绍
项目 | source db |
db 类型 | RAC |
db version | 11.2.0.3.0 |
db 存储 | ASM |
OS版本及kernel版本 | AIX 64位 7.1.0.0 |
3.2 故障发生现象及故障分析解决
早上刚来上班,同事就发了一个SQL过来,说是有锁,然后我就查了查系统里的锁,结果一个锁都没得。好吧,还是得干点事的,先看看SQL语句:
MERGE INTO TLHR.TLHRBOKBAL S
USING (SELECT A.BOOKACCOUNT AS BOOKACCOUNT,
(A.CURRBALANCE + NVL(B.BAL, 0.00)) AS BANKAMT
FROM TLHR.TLHRBOKBAL_TMP A,
(SELECT T1.BOOKACCOUNT AS BOOKACCOUNT,
SUM(DECODE(T1.DCFLAG, 'D', -T1.AMT, 'C', T1.AMT, 0)) AS BAL
FROM TLHR.TLHRBOKBALJN T1
WHERE T1.BOOKACCOUNT LIKE '13500000%'
AND T1.TRANDATE = '20150901'
AND (T1.REASON IN ('1', '2') OR
(T1.REASON = '0' AND T1.ONLINEFLAG = '1'))
GROUP BY T1.BOOKACCOUNT) B
WHERE A.BOOKACCOUNT = B.BOOKACCOUNT(+)
AND A.BOOKACCOUNT LIKE '13500000%') T
ON (S.BOOKACCOUNT = T.BOOKACCOUNT)
WHEN MATCHED THEN
UPDATE
SET S.LASTBALANCE = T.BANKAMT,
S.CURRBALANCE = T.BANKAMT,
S.DEBITAMT = 0.00,
S.CREDITAMT = 0.00;
看起来是一个MERGE语句,按照小麦苗以前的经验,这一类的SQL最好是修改为MERGE的非关联形式比较好,我们先看看执行计划有没有问题:
先找到SQL_ID为53qv858pwwwwb:
SELECT a.ELAPSED_TIME,a.EXECUTIONS,a.* FROM v$sql a WHERE a.SQL_TEXT LIKE '%MERGE INTO TLHRBOKBAL S%' AND A.SQL_TEXT LIKE '%13500000%' ;
查询历史执行计划:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR(SQL_ID => '53qv858pwwwwb' )) ;
SELECT COUNT(*) FROM TLHR.TLHRBOKBAL_TMP A WHERE A.BOOKACCOUNT LIKE '13500000%'; --306043/38998765可以看到,该执行计划的顺序为【7-->6-->5-->4-->9-->10-->8-->3-->2-->1-->0】,而耗费性能的地方在9、10、8这3个步骤上,走的是全表扫描,我们先看看2个大表的数据量:
SELECT COUNT(*) FROM TLHR.TLHRBOKBAL A WHERE A.BOOKACCOUNT LIKE '13500000%'; --306043/38826275
从3000万的数据里边取出30万的数据,还是比较少的,所以应该去走索引的,看了一下统计信息,也是最新收集的,好吧,算了,先修改一下SQL让其走索引扫描看看,:
MERGE INTO TLHR.TLHRBOKBAL S
USING (SELECT S.ROWID ROWIDS,
A.BOOKACCOUNT AS BOOKACCOUNT,
(A.CURRBALANCE + NVL(B.BAL, 0.00)) AS BANKAMT
FROM (SELECT /*+index(NB,PK_TLHRBOKBAL_TMP)*/NB.CURRBALANCE,NB.BOOKACCOUNT
FROM TLHR.TLHRBOKBAL_TMP NB
WHERE NB.BOOKACCOUNT LIKE '13500000%') A,
TLHR.TLHRBOKBAL S,
(SELECT T1.BOOKACCOUNT AS BOOKACCOUNT,
SUM(DECODE(T1.DCFLAG, 'D', -T1.AMT, 'C', T1.AMT, 0)) AS BAL
FROM TLHR.TLHRBOKBALJN T1
WHERE T1.BOOKACCOUNT LIKE '13500000%'
AND T1.TRANDATE = '20150901'
AND (T1.REASON IN ('1', '2') OR
(T1.REASON = '0' AND T1.ONLINEFLAG = '1'))
GROUP BY T1.BOOKACCOUNT) B
WHERE A.BOOKACCOUNT = B.BOOKACCOUNT(+)
AND S.BOOKACCOUNT = A.BOOKACCOUNT
AND S.BOOKACCOUNT LIKE '13500000%') T
ON (T.ROWIDS = S.ROWID)
WHEN MATCHED THEN
UPDATE
SET S.LASTBALANCE = T.BANKAMT,
S.CURRBALANCE = T.BANKAMT,
S.DEBITAMT = 0.00,
S.CREDITAMT = 0.00
执行计划中,基本都走了索引了,跑了一下,大约1分种多,但是里边有个HINTS,分析了一下表TLHRBOKBAL_TMP上的索引情况,发现是个主键索引,且有2个列(BOOKACCOUNT,CURRENCY),但是不包含列CURRBALANCE,可能是Oracle觉得回表读的耗费比较大吧,那这里可以使用虚拟索引测试一下索引的性能:
CREATE INDEX IX_VI01_ID ON TLHR.TLHRBOKBAL_TMP(CURRBALANCE, CURRENCY,BOOKACCOUNT) NOSEGMENT;
ALTER SESSION SET "_USE_NOSEGMENT_INDEXES"=TRUE;
EXPLAIN PLAN FOR MERGE INTO TLHR.TLHRBOKBAL S
USING (SELECT S.ROWID ROWIDS,
A.BOOKACCOUNT AS BOOKACCOUNT,
(A.CURRBALANCE + NVL(B.BAL, 0.00)) AS BANKAMT
FROM (SELECT NB.CURRBALANCE,NB.BOOKACCOUNT
FROM TLHR.TLHRBOKBAL_TMP NB
WHERE NB.BOOKACCOUNT LIKE '13500000%') A,
TLHR.TLHRBOKBAL S,
(SELECT T1.BOOKACCOUNT AS BOOKACCOUNT,
SUM(DECODE(T1.DCFLAG, 'D', -T1.AMT, 'C', T1.AMT, 0)) AS BAL
FROM TLHR.TLHRBOKBALJN T1
WHERE T1.BOOKACCOUNT LIKE '13500000%'
AND T1.TRANDATE = '20150901'
AND (T1.REASON IN ('1', '2') OR
(T1.REASON = '0' AND T1.ONLINEFLAG = '1'))
GROUP BY T1.BOOKACCOUNT) B
WHERE A.BOOKACCOUNT = B.BOOKACCOUNT(+)
AND S.BOOKACCOUNT = A.BOOKACCOUNT
AND S.BOOKACCOUNT LIKE '13500000%') T
ON (T.ROWIDS = S.ROWID)
WHEN MATCHED THEN
UPDATE
SET S.LASTBALANCE = T.BANKAMT,
S.CURRBALANCE = T.BANKAMT,
S.DEBITAMT = 0.00,
S.CREDITAMT = 0.00;
SELECT * FROM TABLE(DBMS_XPLAN.display);
说明创建3个列的索引是可以的。我们先将该虚拟索引删除DROP INDEX IX_VI01_ID;
3.2.1 ITL死锁问题解决
另外一个问题,是开发说上边的SQL语句产生了死锁,起初我还半信半疑,先去告警日志中用命令(more alert* | grep Deadlock)搜了一下:
结果发现很多的死锁,拿到相关的文件,看到如下一段:
user session for deadlock lock 0x7000008094d14e0 sid: 332 ser: 47221 audsid: 991000 user: 84/TLHR flags: (0x41) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x40009) -/-/INC pid: 101 O/S info: user: grid, term: UNKNOWN, ospid: 6489034 image: oracle@ZFTLHRDB1 client details: O/S info: user: TLHR, term: , ospid: 34406578 machine: ZFTLHRAP1 program: bat_CheckBookBal@ZFTLHRAP1 (TNS V1-V3) application name: bat_CheckBookBal@ZFTLHRAP1 (TNS V1-V3), hash value=446537749 current SQL: MERGE INTO TLHRBOKBAL S USING (SELECT A.BOOKACCOUNT .............................. DUMP LOCAL BLOCKER: initiate state dump for DEADLOCK possible owner[101.6489034] on resource TX-00EE0009-00005EA6 *** 2016-09-01 18:30:38.014 Submitting asynchronized dump request [28]. summary=[ges process stack dump (kjdglblkrdm1)]. Global blockers dump end:----------------------------------- Global Wait-For-Graph(WFG) at ddTS[0.2fe0] : BLOCKED 0x7000008e9c8bc28 3 wq 2 cvtops x1 TX 0x159001e.0x2379(ext 0x5,0x0)[1006-0065-0019365C] inst 1 BLOCKER 0x700000809ab4b28 3 wq 1 cvtops x28 TX 0x159001e.0x2379(ext 0x5,0x0)[2005-005E-00185E15] inst 2 BLOCKED 0x700000891b48708 3 wq 2 cvtops x1 TX 0x1c2001d.0x4b82(ext 0x2,0x0)[2005-005E-00185E15] inst 2 BLOCKER 0x7000008e9c8b148 3 wq 1 cvtops x28 TX 0x1c2001d.0x4b82(ext 0x2,0x0)[1004-004D-0000C03E] inst 1 BLOCKED 0x70000089a636970 3 wq 2 cvtops x1 TX 0x1c0000b.0x18f6(ext 0x2,0x0)[1004-004D-0000C03E] inst 1 BLOCKER 0x7000008e9c8b4e8 3 wq 1 cvtops x28 TX 0x1c0000b.0x18f6(ext 0x2,0x0)[1005-0058-000DD3D9] inst 1 BLOCKED 0x700000891d5fc50 3 wq 2 cvtops x1 TX 0xee0009.0x5ea6(ext 0x2,0x0)[1005-0058-000DD3D9] inst 1 BLOCKER 0x7000008094d14e0 3 wq 1 cvtops x28 TX 0xee0009.0x5ea6(ext 0x2,0x0)[1006-0065-0019365C] inst 1 * Cancel deadlock victim lockp 0x7000008e9c8bc28 *** 2016-09-01 22:51:33.104 2016-09-01 22:51:33.104615 : Setting 3-way CR grants to 1 global-lru off? 0 2016-09-02 10:28:55.429293 : Setting 3-way CR grants to 0 global-lru off? 0 |
果然,产生死锁的SQL还是上边分析优化的SQL,其中会话信息为:(332,47221),我们去DBA_HIST_ACTIVE_SESS_HISTORY视图里查询:
SELECT D.SQL_ID, D.CURRENT_OBJ#, D.EVENT, COUNT(1)
FROM DBA_HIST_ACTIVE_SESS_HISTORY D
WHERE D.SAMPLE_TIME BETWEEN
TO_DATE('2016-09-01 18:25:00', 'YYYY-MM-DD HH24:MI:SS') AND
TO_DATE('2016-09-01 18:45:00', 'YYYY-MM-DD HH24:MI:SS')
AND D.BLOCKING_SESSION_STATUS = 'VALID'
AND D.SESSION_ID = 332
AND D.SESSION_SERIAL# = 47221
GROUP BY D.SQL_ID, D.CURRENT_OBJ#, D.EVENT;
可以看到该会话的等待事件是enq: TX - allocate ITL entry。可以猜测是由于ITL事务槽引起的问题。
SELECT DISTINCT D.BLOCKING_SESSION, D.BLOCKING_SESSION_SERIAL#, D.SQL_ID
FROM DBA_HIST_ACTIVE_SESS_HISTORY D
WHERE D.SAMPLE_TIME BETWEEN
TO_DATE('2016-09-01 18:25:00', 'YYYY-MM-DD HH24:MI:SS') AND
TO_DATE('2016-09-01 18:45:00', 'YYYY-MM-DD HH24:MI:SS')
AND D.EVENT = 'enq: TX - allocate ITL entry'
AND D.BLOCKING_SESSION_STATUS = 'VALID'
AND D.SESSION_ID = 332
AND D.SESSION_SERIAL# = 47221;
可以看出会话(332,47221)共阻塞了3个会话,由于有死锁,那么我们看看上边查询出来的3个会话阻塞了哪些会话:
SELECT DISTINCT D.INSTANCE_NUMBER,
D.SESSION_ID,
D.SESSION_SERIAL#,
D.BLOCKING_INST_ID,
D.BLOCKING_SESSION,
D.BLOCKING_SESSION_SERIAL#,
D.SQL_ID
FROM DBA_HIST_ACTIVE_SESS_HISTORY D
WHERE D.SAMPLE_TIME BETWEEN
TO_DATE('2016-09-01 18:25:00', 'YYYY-MM-DD HH24:MI:SS') AND
TO_DATE('2016-09-01 18:45:00', 'YYYY-MM-DD HH24:MI:SS')
AND D.EVENT = 'enq: TX - allocate ITL entry'
AND D.BLOCKING_SESSION_STATUS = 'VALID'
AND ((D.SESSION_ID = 332 AND D.SESSION_SERIAL# = 47221) OR
(D.SESSION_ID = 2602 AND D.SESSION_SERIAL# = 4343) OR
(D.SESSION_ID = 2995 AND D.SESSION_SERIAL# = 46891) OR
(D.SESSION_ID = 1894 AND D.SESSION_SERIAL# = 30761));
可以看到,1894和2602相互阻塞(绿色表示),332和2602相互阻塞(红色表示),2995和332相互阻塞(粉色表示),这么多的相互阻塞就产生了死锁,这里由于SQL_ID不同,而且产生的等待事件是enq: TX - allocate ITL entry,所以推测出生成的是ITL死锁。
解决这类问题就是增大ini_trans和PCT_FREE的值。
SELECT * FROM DBA_TABLES D WHERE D.TABLE_NAME = 'TLHRBOKBAL';
可以看到,ini_trans和PCT_FREE的值都是默认的,太小了,根据MOS(Troubleshooting waits for 'enq: TX - allocate ITL entry' (Doc ID 1472175.1) 地址:http://blog.itpub.net/26736162/viewspace-2124531/)我们需要修改该参数,SQL如下:
ALTER TABLE TLHR.TLHRBOKBAL PCTFREE 20 INITRANS 16 ;
ALTER TABLE TLHR.TLHRBOKBAL MOVE NOLOGGING PARALLEL 12;
ALTER TABLE TLHR.TLHRBOKBAL LOGGING NOPARALLEL;
ALTER INDEX TLHR.PK_TLHRBOKBAL REBUILD PCTFREE 20 INITRANS 16 NOLOGGING PARALLEL 12;
ALTER INDEX TLHR.PK_TLHRBOKBAL LOGGING NOPARALLEL;
由于表里有3000W的数据量,开了并行,本来我预估的是5分钟,结果move表的时候10秒都不到还是比较快的。
调整之后的值:
SELECT * FROM DBA_TABLES D WHERE D.TABLE_NAME = 'TLHRBOKBAL';
SELECT * FROM dba_indexes d WHERE d.index_name='PK_TLHRBOKBAL';
修改已经生效,接下来就看开发那边是否还报死锁的错误,这个等待需要明天看了。
终于等到第2天了,看来没有报错了: