参考文档:
Exadata - Database reports corrupted blocks errors where content of blocks have pattern BADFDAT (Doc ID 1579291.1)
ABMR: How to test Automatic Block Recover Feature (Doc ID 1266059.1)
RDBMS 11.2.0.4 ,一套RAC主库,带多套RAC备库,其中一套RAC备库出现坏块
坏块记录,最严重的时候,坏块导致MRP进程停止,随后启动MRP进程,数据同步正常,但是此时没有考虑去查询坏块。
Reread of blocknum=2811006, file=+DATA_ABC/prod/datafile/AAAts_01.291.1107879227. found same corrupt data
Fri Jul 22 07:06:39 2022
Automatic block media recovery requested for (file# 94, block# 2810890)
Fri Jul 22 07:07:40 2022
Automatic block media recovery requested for (file# 94, block# 2810893)
Fri Jul 22 07:08:41 2022
Errors in file /u01/PROD/oracle/diag/rdbms/prod/PROD1/trace/PROD1_ora_289384.trc (incident=3912730):
Ÿå (æ–‡ä»¶å· 94, åå· 2810890)
®åæ
ORA-01578: ORACLE æ•°æ
•å…ƒ: 9952
å
ORA-27616: ASM 分é…
®æ–‡ä»¶ 94: '+DATA_AAA/prod/datafile/AAAts_01.291.1107879227'
ORA-01110: æ•°æ
Incident details in: /u01/PROD/oracle/diag/rdbms/prod/PROD1/incident/incdir_3912730/PROD1_ora_289384_i3912730.trc
Fri Jul 22 07:08:42 2022
Dumping diagnostic data in directory=[cdmp_20220722070842], requested by (instance=1, osid=289384), summary=[incident=3912730].
Fri Jul 22 07:08:44 2022
Sweep [inc][3912730]: completed
Sweep [inc2][3912730]: completed
Sweep [inc2][3912729]: completed
Fri Jul 22 07:09:07 2022
Automatic block media recovery requested for (file# 94, block# 2810880)
Fri Jul 22 07:10:08 2022
Automatic block media recovery requested for (file# 94, block# 2810892)
Fri Jul 22 07:11:09 2022
Automatic block media recovery requested for (file# 94, block# 2810883)
Fri Jul 22 07:12:09 2022
Automatic block media recovery requested for (file# 94, block# 2810886)
Fri Jul 22 07:13:10 2022
Errors in file /u01/PROD/oracle/diag/rdbms/prod/PROD1/trace/PROD1_ora_289384.trc (incident=3912731):
Ÿå (æ–‡ä»¶å· 94, åå· 2810880)
®åæ
ORA-01578: ORACLE æ•°æ
®æ–‡ä»¶ 94: '+DATA_AAA/prod/datafile/AAAts_01.291.1107879227'
Incident details in: /u01/PROD/oracle/diag/rdbms/prod/PROD1/incident/incdir_3912731/PROD1_ora_289384_i3912731.trc
Reread of blocknum=2810989, file=+DATA_AAA/prod/datafile/AAAts_01.291.1107879227. found same corrupt data
Reread of blocknum=2810989, file=+DATA_AAA/prod/datafile/AAAts_01.291.1107879227. found same corrupt data
Reread of blocknum=2810989, file=+DATA_AAA/prod/datafile/AAAts_01.291.1107879227. found same corrupt data
Reread of blocknum=2810989, file=+DATA_AAA/prod/datafile/AAAts_01.291.1107879227. found same corrupt data
Reread of blocknum=2810989, file=+DATA_AAA/prod/datafile/AAAts_01.291.1107879227. found same corrupt data
Hex dump of (file 94, block 2810992) in trace file /u01/PROD/oracle/diag/rdbms/prod/PROD1/incident/incdir_3912729/PROD1_m000_71496_i3912729_a.trc
Corrupt block relative dba: 0x17aae470 (file 94, block 2810992)
Bad header found during validation
在ASM alert中出现的告警,这个告警,可以参考这个文档 :Exadata - Database reports corrupted blocks errors where content of blocks have pattern BADFDAT (Doc ID 1579291.1)
Thu Jul 07 08:10:31 2022
NOTE: starting media scrubbing repair for group 1/0x6fc765b6 (DATA_AAA) in R000
WARNING: group 1, file 285, extent 1983: filling extent with BADFDA7A
NOTE: completed media scrubbing repair for group 1/0x6fc765b6 (DATA_AAA) (relocated 1 extents)
Thu Jul 14 08:10:13 2022
NOTE: starting media scrubbing repair for group 1/0x6fc765b6 (DATA_AAA) in R000
WARNING: group 1, file 287, extent 3809: filling extent with BADFDA7A
NOTE: completed media scrubbing repair for group 1/0x6fc765b6 (DATA_AAA) (relocated 1 extents)
Thu Jul 21 08:09:10 2022
NOTE: starting media scrubbing repair for group 1/0x6fc765b6 (DATA_AAA) in R000
WARNING: group 1, file 283, extent 6766: filling extent with BADFDA7A
NOTE: completed media scrubbing repair for group 1/0x6fc765b6 (DATA_AAA) (relocated 1 extents)
受上面文档的启发,查看坏块,果然有坏块 ,后面发现,在没停止mrp进程的时候,修复坏块,居然还有别的文件的坏块生成。
SYS@PROD1> select * from V$DATABASE_BLOCK_CORRUPTION;
FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
94 2810880 98 0 CORRUPT
94 2810980 1 0 CORRUPT
94 2810992 1 0 CORRUPT
94 2811002 1 0 CORRUPT
48 1015296 50 0 CORRUPT
48 1015347 4 0 CORRUPT
48 1015352 13 0 CORRUPT
48 1015367 7 0 CORRUPT
48 1015375 49 0 CORRUPT
94 2810982 1 0 CORRUPT
94 2810984 1 0 CORRUPT
94 2810987 3 0 CORRUPT
94 2810998 1 0 CORRUPT
94 2811005 2 0 CORRUPT
14 rows selected.
尝试修复坏块,失败,因为没有备份 ,此时也没有停掉mrp进程,结果花费了大量的时间,还没有成功。
RMAN> RECOVER CORRUPTION LIST;
Starting recover at 22-JUL-22
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=5272 instance=PROD1 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=6024 instance=PROD1 device type=DISK
allocated channel: ORA_DISK_3
channel ORA_DISK_3: SID=8281 instance=PROD1 device type=DISK
finished primary search, recovered 1 blocks
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 07/22/2022 15:13:32
RMAN-06026: some targets not found - aborting restore
RMAN-06023: no backup or copy of datafile 94 found to restore
RMAN-06023: no backup or copy of datafile 48 found to restore
同时查看alert log中的日志 ,会看到大量的 “Automatic block media recovery requested for”,同时会发现RECOVER CORRUPTION LIST很慢
Automatic block media recovery requested for (file# 48, block# 1015297)
Fri Jul 22 11:47:10 2022
Hex dump of (file 94, block 2811005) in trace file /u01/PROD/oracle/diag/rdbms/prod/PROD1/trace/PROD1_pr02_87305.trc
Corrupt block relative dba: 0x17aae47d (file 94, block 2811005)
Bad header found during media recovery
Data in bad block:
type: 122 format: 2 rdba: 0xbadfda7a
last change scn: 0xda7a.badfda7a seq: 0xdf flg: 0xba
spare1: 0xdf spare2: 0xba spare3: 0xbadf
consistency value in tail: 0xbadfda7a
check value in block header: 0xda7a
block checksum disabled
Reading datafile '+DATA_AAA/prod/datafile/AAAts_01.291.1107879227' for corruption at rdba: 0x17aae47d (file 94, block 2811005)
Reread (file 94, block 2811005) found same corrupt data (no logical check)
Automatic block media recovery requested for (file# 94, block# 2811005)
Automatic block media recovery successful for (file# 94, block# 2811005)
Fri Jul 22 11:47:46 2022
Automatic block media recovery requested for (file# 48, block# 1015298)
Fri Jul 22 11:48:47 2022
Automatic block media recovery requested for (file# 48, block# 1015299)
Fri Jul 22 11:49:47 2022
Automatic block media recovery requested for (file# 48, block# 1015300)
Fri Jul 22 11:50:47 2022
Automatic block media recovery requested for (file# 48, block# 1015301)
Fri Jul 22 11:51:47 2022
Automatic block media recovery requested for (file# 48, block# 1015303)
Fri Jul 22 11:52:47 2022
Automatic block media recovery requested for (file# 48, block# 1015304)
Fri Jul 22 11:53:48 2022
Automatic block media recovery requested for (file# 48, block# 1015305)
查看提示个别自动修复成功,但是大部分都没有自动修复成功,因为备库上没有备份
Automatic block media recovery requested for (file# 90, block# 1950310)
Automatic block media recovery successful for (file# 90, block# 1950310)
Thu Jul 21 15:53:11 2022
Hex dump of (file 46, block 3464276) in trace file /u01/PROD/oracle/diag/rdbms/prod/PROD1/trace/PROD1_pr0e_87329.trc
Corrupt block relative dba: 0x0bb4dc54 (file 46, block 3464276)
Bad header found during media recovery
Data in bad block:
type: 122 format: 2 rdba: 0xbadfda7a
last change scn: 0xda7a.badfda7a seq: 0xdf flg: 0xba
spare1: 0xdf spare2: 0xba spare3: 0xbadf
consistency value in tail: 0xbadfda7a
check value in block header: 0xda7a
block checksum disabled
Reading datafile '+DATA_AAA/prod/datafile/AAAtsx.283.1107879225' for corruption at rdba: 0x0bb4dc54 (file 46, block 3464276)
Reread (file 46, block 3464276) found same corrupt data (no logical check)
Automatic block media recovery requested for (file# 46, block# 3464276)
Automatic block media recovery successful for (file# 46, block# 3464276)
Thu Jul 21 16:28:47 2022
尝试恢复某一个块,成功了,但是很耗费时间(因为此时没有关闭MRP),使用的是7天前的全备的备份集进行恢复的
RMAN> recover datafile 94 block 2810982 2> ;
Starting recover at 22-JUL-22
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=2261 instance=PROD1 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=3014 instance=PROD1 device type=DISK
allocated channel: ORA_DISK_3
channel ORA_DISK_3: SID=3767 instance=PROD1 device type=DISK
channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00094
channel ORA_DISK_1: reading from backup piece /mntbackupdata/rman/AAA_bak/PROD_incr0_116327_20220717
channel ORA_DISK_1: piece handle=/mntbackupdata/rman/AAA_bak/PROD_incr0_116327_20220717 tag=DB_INCR0_BAK
channel ORA_DISK_1: restored block(s) from backup piece 1
channel ORA_DISK_1: block restore complete, elapsed time: 01:10:25
starting media recovery
media recovery failed
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 07/22/2022 16:38:51
ORA-00283: recovery session canceled due to errors
ORA-01153: an incompatible media recovery is active
RMAN>
停止掉MRP,使用命令backup validate database 重新验证数据库中所有的坏块(过程略), 然后全部修复。
停掉mrp进程,重新恢复快块(相对于打开MRP的情况下,恢复块,很快)
RMAN> BLOCKRECOVER CORRUPTION LIST;
Starting recover at 23-JUL-22
using channel ORA_DISK_1
using channel ORA_DISK_2
using channel ORA_DISK_3
channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00048
channel ORA_DISK_1: reading from backup piece /mntbackupdata/rman/AAA_bak/PROD_incr0_116319_20220717
channel ORA_DISK_1: piece handle=/mntbackupdata/rman/AAA_bak/PROD_incr0_116319_20220717 tag=DB_INCR0_BAK
channel ORA_DISK_1: restored block(s) from backup piece 1
channel ORA_DISK_1: block restore complete, elapsed time: 00:40:05
channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00090
channel ORA_DISK_1: reading from backup piece /mntbackupdata/rman/AAA_bak/PROD_incr0_116323_20220717
channel ORA_DISK_1: piece handle=/mntbackupdata/rman/AAA_bak/PROD_incr0_116323_20220717 tag=DB_INCR0_BAK
channel ORA_DISK_1: restored block(s) from backup piece 1
channel ORA_DISK_1: block restore complete, elapsed time: 01:07:45
channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00094
channel ORA_DISK_1: reading from backup piece /mntbackupdata/rman/AAA_bak/PROD_incr0_116327_20220717
channel ORA_DISK_1: piece handle=/mntbackupdata/rman/AAA_bak/PROD_incr0_116327_20220717 tag=DB_INCR0_BAK
channel ORA_DISK_1: restored block(s) from backup piece 1
channel ORA_DISK_1: block restore complete, elapsed time: 01:11:46
channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00046
channel ORA_DISK_1: reading from backup piece /mntbackupdata/rman/AAA_bak/PROD_incr0_116328_20220717
channel ORA_DISK_1: piece handle=/mntbackupdata/rman/AAA_bak/PROD_incr0_116328_20220717 tag=DB_INCR0_BAK
channel ORA_DISK_1: restored block(s) from backup piece 1
channel ORA_DISK_1: block restore complete, elapsed time: 01:04:05
starting media recovery
media recovery failed
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 07/23/2022 04:53:40
ORA-00283: recovery session canceled due to errors
ORA-01122: database file 46 failed verification check
ORA-01110: data file 46: '+DATA_AAA/prod/datafile/AAAtsx.283.1107879225'
ORA-01207: file is more recent than control file - old control file
RMAN>
--日志,alert log中没有 Automatic block media recovery requested for 之类的信息
-- 开启mrp进程,同步正常
-- 发现备库有gap,,以及主库的日志上有一下信息
Wed Jul 20 02:28:04 2022
Errors in file /u01/PROD/oracle/diag/rdbms/x5AAAdb/AAAdb2/trace/AAAdb2_nsa4_143985.trc:
ORA-00338: log 19 of thread 2 is more recent than control file
ORA-00312: online log 19 thread 2: '+DATA_mnt/x5AAAdb/onlinelog/group_19.842.1103275941'
ORA-00338: log 19 of thread 2 is more recent than control file
ORA-00312: online log 19 thread 2: '+DATA_mnt/x5AAAdb/onlinelog/group_19.848.1103275941'
Errors in file /u01/PROD/oracle/diag/rdbms/x5AAAdb/AAAdb2/trace/AAAdb2_nsa4_143985.trc:
ORA-00338: log 19 of thread 2 is more recent than control file
ORA-00312: online log 19 thread 2: '+DATA_mnt/x5AAAdb/onlinelog/group_19.842.1103275941'
ORA-00338: log 19 of thread 2 is more recent than control file
ORA-00312: online log 19 thread 2: '+DATA_mnt/x5AAAdb/onlinelog/group_19.848.1103275941'
Archived Log entry 72662 added for thread 2 sequence 128054 ID 0x1ea2d4be dest 1:
LNS: Standby redo logfile selected for thread 2 sequence 128055 for destination LOG_ARCHIVE_DEST_4
LNS: Standby redo logfile selected for thread 2 sequence 128055 for destination LOG_ARCHIVE_DEST_5
RFS[1]: Selected log 8 for thread 1 sequence 139132 dbid 123456789 branch 832687336
Sat Jul 23 00:16:52 2022
Archived Log entry 5819 added for thread 1 sequence 139131 ID 0x1ea2d4be dest 1:
Sat Jul 23 00:17:02 2022
Fetching gap sequence in thread 2, gap sequence 128365-128365
Sat Jul 23 00:19:01 2022
FAL[client]: Failed to request gap sequence
GAP - thread 2 sequence 128365-128365
DBID 123456789 branch 832687336
FAL[client]: All defined FAL servers have been attempted.
原因:当前的归档日志已经到了128369,但是下一个归档的归档日志是128364 。同时,查看asm磁盘组上的归档日志发现只到128363,很明显,日志切换太频繁。控制文件中认为当前的归档日志是128369,
认为128364、128365已经生成了,实际上还没有生成。一段时间等待后,gap消失。archive log list正常 。
SQL> archive log list
Database log mode Archive Mode
Automatic archival Enabled
Archive destination +RECO_mnt
Oldest online log sequence 128364
Next log sequence to archive 128364
Current log sequence 128369
SQL>
再次检查同步情况,坏块情况。坏块消失。
另一个问题。在使用srvctl 命令重启db的时候,关闭掉了db,但是打开的时候提示这个错误
SYS@PROD1> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DATA_ABC/PROD/spfilePROD.ora'
ORA-17503: ksfdopn:10 Failed to open file +DATA_ABC/PROD/spfilePROD.ora
ORA-15056: additional error message
ORA-15173: entry 'spfileprod.ora' does not exist in directory 'prod'
ORA-06512: at line 4
SYS@PROD1>
参数文件没有了,查看asm磁盘组上,有一个参数文件,但是不敢确定,还有好几个控制文件。
如果是主库的话,直接自己造个pfile就启动了(只要直到数据库名),但是这个是备库,好多参数要设置,而且控制文件还好几个。
最终,在alert log中,查看数据库的启动过程,可以看到是使用那个spfile启动的。随后修改initPRODX.ora文件,启动正常。
查看initPROD1.ora中的参数,发现是这样的,initPRODX.ora被srvctl命令修改了 ?看下面的# line added by Agent 应该是被srvctl命令修改了。
[oracle@abcdb01 dbs]$ more initPROD1.ora
SPFILE='+DATA_ABC/PROD/spfilePROD.ora' # line added by Agent
[oracle@abcdb01 dbs]$
[oracle@abcdb02 dbs]$ more initPROD2.ora
#SPFILE='+DATA_ABC/PROD/spfileprod.ora' # line added by Agent
#SPFILE='+data_ABC/prod/PARAMETERFILE/spfile.256.1107452321'
SPFILE='+DATA_ABC/PROD/spfilePROD.ora' # line added by Agent
[oracle@abcdb02 dbs]$
查看了下集群中db的配置,果然,集群中的spfile是 spfile<SID>.ora。 原来做备库的时候,恢复出来的是spfile.256.1107452321
[grid@abcdb01 ~]$ srvctl config database -d prod
Database unique name: PROD
Database name: PROD
Oracle home: /u01/PROD/oracle/product/11.2.0.4
Oracle user: oracle
Spfile: +DATA_ABC/PROD/spfilePROD.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: PROD
Database instances: PROD1,PROD2
Disk Groups: DATA_ABC,RECO_ABC
Mount point paths:
Services:
Type: RAC
Database is administrator managed
[grid@abcdb01 ~]$
END。