昨天下午1点,收到公司在客户现场的电话,数据库访问不了了,因为我也不在现场,当时那个操作系统是windows 2003 server,oracle版本是oracle10g win32版本(10.2.0.1.0),觉得是操作系统不稳定的问题(先天不良),也没有多想,告诉他重启一下机器看看,然后就没事了,快到下班的时候,再次出现了这个问题。
通过teamViewer等上这个系统,取下了alert_xx.log日志的,简单的see了一下,trace文件在bdump下,顺便把bdump文件夹取下来,一起分析。
alert_xx.log文件报错误如下:
Mon May 27 07:21:54 2013
Thread 1 advanced to log sequence 194
Current log# 1 seq# 194 mem# 0: F:\ORACLE\PRODUCT\10.2.0\ORADATA\KXONE\REDO01.LOG
Mon May 27 13:43:44 2013
Errors in file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
Mon May 27 13:43:45 2013
Errors in file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C95A308] [ADDR:0xC0C56E3] [UNABLE_TO_WRITE] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
Mon May 27 13:43:46 2013
Errors in file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C95A308] [ADDR:0xC0C56E3] [UNABLE_TO_WRITE] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
大概说明是:内存地址不可读取,同时ora-7445和ora-600的一样,是oracle两个严重的错误。
我们看到这是是5月27下午1点的报错,,也现场打电话过来的时间也一致,j000进程是oracle 的job调用进程,于是打看这个trace:
Dump file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc
Mon May 27 13:43:44 2013
ORACLE V10.2.0.1.0 - Production vsnsta=0
vsnsql=14 vsnxtr=3
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
Windows Server 2003 Version V5.2 Service Pack 2
CPU : 32 - type 586, 2 Physical Cores
Process Affinity : 0x00000000
Memory (Avail/Total): Ph:20800M/24553M, Ph+PgF:22741M/26282M, VA:1209M/2047M
Instance name: kxone
Redo thread mounted by this instance: 1
Oracle process number: 42
Windows thread id: 7236, image: ORACLE.EXE
*** 2013-05-27 13:43:44.810
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
Current SQL information unavailable - no session.
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
7C959F40 00000000
77C4E78F CALL??? 00000000
77C4E761 CALLrel 77C4E76D
77C468FF CALLrel 77C4E759
77C46216 CALLrel 77C46883
77C46196 CALLrel 77C461EA
77C4F860 CALLrel 77C4F87D
77C4F78C CALLrel 77C4F7C9
77C4F3BE CALLreg 00000000
77C4EECF CALLrel 77C4EEDD
77C4F61F CALLrel 77C4EEC5
77CB40DF CALLrel 77C4F600
76E350DC CALLrel 76E34365 76E34370 76E342B4 AA3E118
76E34FFC CALLrel 76E350C5
76E35102 CALLrel 76E33FD2
71A83F89 CALLrel 71A83E82
71A83F16 CALLrel 71A83F5F
71A83473 CALLrel 71A83E8D
71B66DBD CALLreg 00000000
71B66D9B CALLrel 71B66DA9
71B66D65 CALLrel 71B66D84
71B66D03 CALLrel 71B66D1D
71B678E9 CALLrel 71B66C91
71B67843 CALLrel 71B6788B
71B67D29 CALLrel 71B677DA
_sldmGetHostInfo+27 CALL??? 00000000
_sldmInit+97 CALLrel _sldmGetHostInfo+0 93D93F4 93D94F4 100 93D93F4
100
_ldmInit+318 CALLrel _sldmInit+0 93D93F0 8AF6618 93D93F0 0 23C
8E0F9A8 0 658
_keltnfy+148 CALLrel _ldmInit+0 AA3EA30 0 6BBE922C E73D908
当时没有sql在运行(current sql),这个是一个十分重要的信息,一般oracle的在各平台下的bug,大多可能是应用的sql或是dba的某些操作触发的,而函数调用为:_sldmGetHostInfo,_sldmInit,同时这个trace上面还有一个os的的极重要的内存信息:Memory (Avail/Total): Ph:20800M/24553M,即操作系统内存为24G。
把ora-7445 [ACCESS_VIOLATION]和ora-7445 _sldmGetHostInfo _sldmInit no session在mos上找了一下,没有相同的症状。之前也看了baidu了一下这方面的问题,在老杨博客上找到一篇文章(http://yangtingkun.net/?p=401):
查询MOS后,确认是Windows平台上的bug,详情参考MS-Windows: ORA-7445 On functions: ssbev_env, slzgetevar + no SGA [ID 1305096.1]。这个bug影响10.2.0.5以前的Windows平台下的10g,导致问题的原因是一个未公布的Bug:8592848,Oracle在10.2.0.5和11.2.0.1中fixed了这个bug。
但和我的还不一样,继续找alert.log发现了另一个trace文件:
Tue May 28 17:02:49 2013
Errors in file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
Tue May 28 17:02:49 2013
Errors in file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
Tue May 28 17:02:50 2013
Errors in file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
这是5月28日的,打开 f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc
Dump file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc
Tue May 28 17:02:49 2013
ORACLE V10.2.0.1.0 - Production vsnsta=0
vsnsql=14 vsnxtr=3
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
Windows Server 2003 Version V5.2 Service Pack 2
CPU : 32 - type 586, 2 Physical Cores
Process Affinity : 0x00000000
Memory (Avail/Total): Ph:22687M/24553M, Ph+PgF:24657M/26282M, VA:1243M/2047M
Instance name: kxone
Redo thread mounted by this instance: 0 <none>
Oracle process number: 0
Windows thread id: 4176, image: ORACLE.EXE
*** 2013-05-28 17:02:49.437
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
Current SQL information unavailable - no SGA.
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
7C959F40 00000000
7C95A913 CALLrel 7C959E46
7C963BBF CALL??? 00000000
7C830BC2 CALL??? 00000000
_ssbev_env+40 CALL??? 00000000
_slzgetevar+278 CALLrel _ssbev_env+0
_kpummpin+745 CALLrel _slzgetevar+0 894F3A8 607B5174 1A 894F388
这个和上面 ID 1305096.1 描述的一致,大致判断为:10.2.0.1.0在windows 32 bit上的一个bug,根本解决办法是升级oracle到10.2.0.5.0。
这时,我打开了一开始为确认现场环境相关情况时要求对方给我发一个操作信息截图:
这和上面的trace文件中的描述基本一致,但通常情况下,服务器的内存为8G,16G,32G的为多,一个24G的内存,这时又得到一个信息,就是两台相同的机器,那台是32G的,会不会是内存故障,此时 ,在我远程修改sga_max_size,sga_target的时候(oracle给出的解决方案中也有增大和减小sga的方案),修改完之后,进行了一个startup force,接关oracle就hang住了 ,但在这之前,有个操作系统的报错:
同样的内存不可读,会不会是内存硬件故障?
打电话到现场让他们查看,相关的指示灯,发现ibm的这台server上,mem的灯已经开始告警,同时确认了这台机器上原来有32G的内存,可操作系统和oracle只识别出了24G,基本确认是内存物理故障。
同时在alert.log里也有一点东西证实了这一点:
就是数据库从open到故障的时间间隔很短,当然这只是一个佐证。
后记:这台机器的操作系统是windows server 2003 32位,raid可能是raid5,没有热备盘,数据库oracle 10.2.0.1单机环境,无集群,做为一个准备上线的重要生产环境,两个重要问题:
操作系统先天不良,windows定位问题困难,oracle的版本至少要patch到10.2.0.5.0。
最好有rac环境,独立存储。
最后感谢:dba群里笑笑提供的mos相关支持。