昨天客户反映说在下午某时间段有几个事务失败了,让我查下当时数据库系统的负载是否正常,看了下CPU的历史负载,很正常,于是看了下日志发现有几个timeout错误和一个ORA-02050错误。
Tue Feb 15 20:44:05 2011
Error 1013 trapped in 2PC on transaction 199.40.297416. Cleaning up.
Tue Feb 15 20:44:56 2011
Error stack returned to user:
ORA-02050: transaction 199.40.297416 rolled back, some remote DBs may be in-doubt
ORA-01013: user requested cancel of current operation
这个是分布式事务处理出错的提示,根据提示可知,可能是由于远程的数据库处理失败,导致事务号为199.40.297416的事务失败而回滚。原因可能是网络,也可能是远程节点的连接超时或是人为kill等。
Oracle给出的解释是:
Error: ORA 2050
Text: transaction <num> rolled back, some remote DBs may be in-doubt
-------------------------------------------------------------------------------
Cause:There was a network failure or a remote failure during the two-phase commit.
Action:Notify the database administrator.
The remote databases will automatically resynchronize when the failure is repaired.
大致意思为:一种可能为网络失败,另外一种是事务在两个阶段commit时,远程节点操作不成功,这时,它会记在dba_2pc_pending中,reco进程后续会自动尝试去继续完成这个事务。但如果能确定这个节点段时间内不能完成,可能就需要人工处理这个pending transaction了。
查询数据字典dba_2pc_pending和dba_2pc_neighbors均无任何记录,说明事务已经自动被恢复了。
附录:
dba_2pc_pending describes distributed transactions awaiting recovery.
dba_2pc_neighbors describes incoming and outgoing connections for pending transactions.
在我这个例子中,由于事务已经被自动恢复,所以无需人工干预,如果没有自动恢复的话,可以通过如下语句处理:
强制提交或回滚:
SQL>commit force '199.40.297416'; 或
SQL>rollback force '199.40.297416';
如果事务的状态为collecting,则不需要commit/rollback force,可这样处理,如下:
1、disable分布式恢复
SQL>alter system disable distributed recovery;
2、purge in-doubt transaction entry
SQL>exec dbms_transaction.purge_lost_db_entry('199.40.297416');
3、最后enable分布式恢复
SQL>alter system enable distributed recovery;