昨天下午1点,收到公司在客户现场的电话,数据库访问不了了,因为我也不在现场,当时那个操作系统是windows 2003 server,oracle版本是oracle10g win32版本(10.2.0.1.0),觉得是操作系统不稳定的问题(先天不良),也没有多想,告诉他重启一下机器看看,然后就没事了,快到下班的时候,再次出现了这个问题。

通过teamViewer等上这个系统,取下了alert_xx.log日志的,简单的see了一下,trace文件在bdump下,顺便把bdump文件夹取下来,一起分析。

alert_xx.log文件报错误如下:



Mon May 27 07:21:54 2013
Thread 1 advanced to log sequence 194
  Current log# 1 seq# 194 mem# 0: F:\ORACLE\PRODUCT\10.2.0\ORADATA\KXONE\REDO01.LOG
Mon May 27 13:43:44 2013
Errors in file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []

Mon May 27 13:43:45 2013
Errors in file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C95A308] [ADDR:0xC0C56E3] [UNABLE_TO_WRITE] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []

Mon May 27 13:43:46 2013
Errors in file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C95A308] [ADDR:0xC0C56E3] [UNABLE_TO_WRITE] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []



大概说明是:内存地址不可读取,同时ora-7445和ora-600的一样,是oracle两个严重的错误。

我们看到这是是5月27下午1点的报错,,也现场打电话过来的时间也一致,j000进程是oracle 的job调用进程,于是打看这个trace:



Dump file f:\oracle\product\10.2.0\admin\kxone\bdump\kxone_j000_7236.trc
Mon May 27 13:43:44 2013
ORACLE V10.2.0.1.0 - Production vsnsta=0
vsnsql=14 vsnxtr=3
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
Windows Server 2003 Version V5.2 Service Pack 2
CPU                 : 32 - type 586, 2 Physical Cores
Process Affinity    : 0x00000000
Memory (Avail/Total): Ph:20800M/24553M, Ph+PgF:22741M/26282M, VA:1209M/2047M
Instance name: kxone

Redo thread mounted by this instance: 1

Oracle process number: 42

Windows thread id: 7236, image: ORACLE.EXE


*** 2013-05-27 13:43:44.810
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
Current SQL information unavailable - no session.
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
7C959F40                      00000000             
77C4E78F             CALL???  00000000             
77C4E761             CALLrel  77C4E76D             
77C468FF             CALLrel  77C4E759             
77C46216             CALLrel  77C46883             
77C46196             CALLrel  77C461EA             
77C4F860             CALLrel  77C4F87D             
77C4F78C             CALLrel  77C4F7C9             
77C4F3BE             CALLreg  00000000             
77C4EECF             CALLrel  77C4EEDD             
77C4F61F             CALLrel  77C4EEC5             
77CB40DF             CALLrel  77C4F600             
76E350DC             CALLrel  76E34365             76E34370 76E342B4 AA3E118
76E34FFC             CALLrel  76E350C5             
76E35102             CALLrel  76E33FD2             
71A83F89             CALLrel  71A83E82             
71A83F16             CALLrel  71A83F5F             
71A83473             CALLrel  71A83E8D             
71B66DBD             CALLreg  00000000             
71B66D9B             CALLrel  71B66DA9             
71B66D65             CALLrel  71B66D84             
71B66D03             CALLrel  71B66D1D             
71B678E9             CALLrel  71B66C91             
71B67843             CALLrel  71B6788B             
71B67D29             CALLrel  71B677DA             
_sldmGetHostInfo+27  CALL???  00000000             
_sldmInit+97         CALLrel  _sldmGetHostInfo+0   93D93F4 93D94F4 100 93D93F4
                                                   100
_ldmInit+318         CALLrel  _sldmInit+0          93D93F0 8AF6618 93D93F0 0 23C
                                                   8E0F9A8 0 658
_keltnfy+148         CALLrel  _ldmInit+0           AA3EA30 0 6BBE922C E73D908



 当时没有sql在运行(current sql),这个是一个十分重要的信息,一般oracle的在各平台下的bug,大多可能是应用的sql或是dba的某些操作触发的,而函数调用为:_sldmGetHostInfo,_sldmInit,同时这个trace上面还有一个os的的极重要的内存信息:Memory (Avail/Total): Ph:20800M/24553M,即操作系统内存为24G。

把ora-7445 [ACCESS_VIOLATION]和ora-7445 _sldmGetHostInfo _sldmInit no session在mos上找了一下,没有相同的症状。之前也看了baidu了一下这方面的问题,在老杨博客上找到一篇文章(http://yangtingkun.net/?p=401):



查询MOS后,确认是Windows平台上的bug,详情参考MS-Windows: ORA-7445 On functions: ssbev_env, slzgetevar + no SGA [ID 1305096.1]。这个bug影响10.2.0.5以前的Windows平台下的10g,导致问题的原因是一个未公布的Bug:8592848,Oracle在10.2.0.5和11.2.0.1中fixed了这个bug。



但和我的还不一样,继续找alert.log发现了另一个trace文件:



Tue May 28 17:02:49 2013
Errors in file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []

Tue May 28 17:02:49 2013
Errors in file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []

Tue May 28 17:02:50 2013
Errors in file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C94BB52] [ADDR:0xFF016AFF] [UNABLE_TO_READ] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []



这是5月28日的,打开 f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc



Dump file f:\oracle\product\10.2.0\db_1\rdbms\trace\kxone_ora_4176.trc
Tue May 28 17:02:49 2013
ORACLE V10.2.0.1.0 - Production vsnsta=0
vsnsql=14 vsnxtr=3
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
Windows Server 2003 Version V5.2 Service Pack 2
CPU                 : 32 - type 586, 2 Physical Cores
Process Affinity    : 0x00000000
Memory (Avail/Total): Ph:22687M/24553M, Ph+PgF:24657M/26282M, VA:1243M/2047M
Instance name: kxone

Redo thread mounted by this instance: 0 <none>

Oracle process number: 0

Windows thread id: 4176, image: ORACLE.EXE


*** 2013-05-28 17:02:49.437
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C959F40] [ADDR:0x52004D] [UNABLE_TO_WRITE] []
Current SQL information unavailable - no SGA.
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
7C959F40                      00000000             
7C95A913             CALLrel  7C959E46             
7C963BBF             CALL???  00000000             
7C830BC2             CALL???  00000000             
_ssbev_env+40        CALL???  00000000             
_slzgetevar+278      CALLrel  _ssbev_env+0         
_kpummpin+745        CALLrel  _slzgetevar+0        894F3A8 607B5174 1A 894F388



这个和上面 ID 1305096.1 描述的一致,大致判断为:10.2.0.1.0在windows 32 bit上的一个bug,根本解决办法是升级oracle到10.2.0.5.0。

这时,我打开了一开始为确认现场环境相关情况时要求对方给我发一个操作信息截图:

ebgp建立不起来的原因_oracle

这和上面的trace文件中的描述基本一致,但通常情况下,服务器的内存为8G,16G,32G的为多,一个24G的内存,这时又得到一个信息,就是两台相同的机器,那台是32G的,会不会是内存故障,此时 ,在我远程修改sga_max_size,sga_target的时候(oracle给出的解决方案中也有增大和减小sga的方案),修改完之后,进行了一个startup force,接关oracle就hang住了 ,但在这之前,有个操作系统的报错:

ebgp建立不起来的原因_oracle_02

同样的内存不可读,会不会是内存硬件故障?

打电话到现场让他们查看,相关的指示灯,发现ibm的这台server上,mem的灯已经开始告警,同时确认了这台机器上原来有32G的内存,可操作系统和oracle只识别出了24G,基本确认是内存物理故障。

同时在alert.log里也有一点东西证实了这一点:

ebgp建立不起来的原因_数据库_03

就是数据库从open到故障的时间间隔很短,当然这只是一个佐证。

 

后记:这台机器的操作系统是windows server 2003 32位,raid可能是raid5,没有热备盘,数据库oracle 10.2.0.1单机环境,无集群,做为一个准备上线的重要生产环境,两个重要问题:

操作系统先天不良,windows定位问题困难,oracle的版本至少要patch到10.2.0.5.0。

最好有rac环境,独立存储。

最后感谢:dba群里笑笑提供的mos相关支持。