ORA-00600: [kcbsor_2], [3], [2]

Recently, one node of a 3-node 11.2.0.3 RAC database crashed. The alert log showed –

Thu Apr 26 00:15:16 2012
Hex dump of (file 488, block 345039) in trace file /u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_dbw1_16218.trc
Corrupt block relative dba: 0x7a0543cf (file 488, block 345039)
Bad header found during preparing block for write
Data in bad block:
 type: 6 format: 2 rdba: 0x644ebb3d
 last change scn: 0x0b7a.f77b7353 seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x73530601
 check value in block header: 0x80ae
 computed block checksum: 0x0
Thu Apr 26 00:15:16 2012
Block Corruption Recovery: buf=0x453d5e9b8 seqno=1
Block recovery from logseq 58471, block 224143 to scn 12695053893631
Recovery of Online Redo Log: Thread 3 Group 19 Seq 58471 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot3g19m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot3g19m2.rdo
Hex dump of (file 488, block 345039) in trace file /u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_pmon_16123.trc
Reading datafile '/u10/oradata/matrix/test_idx_01_177.dbf' for corruption at rdba: 0x7a0543cf (file 488, block 345039)
Reread (file 488, block 345039) found same corrupt data (logically corrupt)
Errors in file /u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_pmon_16123.trc  (incident=48017):
ORA-00600: internal error code, arguments: [kcbsor_2], [3], [2], [], [], [], [], [], [], [], [], []
Errors in file /u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_pmon_16123.trc:
ORA-00600: internal error code, arguments: [kcbsor_2], [3], [2], [], [], [], [], [], [], [], [], []
PMON (ospid: 16123): terminating the instance due to error 472
Thu Apr 26 00:15:18 2012
opiodr aborting process unknown ospid (30802) as a result of ORA-1092
Thu Apr 26 00:15:18 2012
opiodr aborting process unknown ospid (15384) as a result of ORA-1092
Thu Apr 26 00:15:18 2012
opiodr aborting process unknown ospid (1661) as a result of ORA-1092
Thu Apr 26 00:15:18 2012
System state dump requested by (instance=3, osid=16123 (PMON)), summary=[abnormal instance termination].
Thu Apr 26 00:15:30 2012
Termination issued to instance processes. Waiting for the processes to exit
Thu Apr 26 00:15:33 2012
ORA-1092 : opitsk aborting process

The trace file “/u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_pmon_16123.trc” shows –

PMON: received BCR msg

*** 2012-04-26 00:15:16.373
Block Corruption Recovery: buf=0x453d5e9b8 seqno=1 tries=1
Trying to recover a block. Dump of buffer header.
----------------------------------------------------------
BH (0x453d5e9b8) file#: 488 rdba: 0x7a0543cf (488/345039) class: 1 ba: 0x45321a000
  set: 38 pool: 3 bsz: 8192 bsi: 0 sflg: 1 pwc: 0,25
  dbwrid: 1 obj: 61329 objn: 61329 tsn: 7 afn: 488 hint: f
  hash: [0x83f1fa230,0x83f1fa230] lru-req: [0x8345f9390,0x8345f9390]
  obj-flags: object_ckpt_list
  ckptq: [0x547bd2c78,0xb3cfff70] fileq: [0xdbc48778,0x8395dec88] objq: [0x287d85f68,0x7c833f750] objaq: [0x46bd59140,0x303b53fc8]
  st: XCURRENT md: NULL fpin: 'kdiwh22: kdifind' tch: 1 le: 0x40fdd8170
  flags: buffer_dirty being_written block_written_once corrupt_being_recovered
          redo_since_read remote_transfered
  LRBA: [0xe467.36b8f.0] LSCN: [0xb8b.cc2d7fff] HSCN: [0xb8b.cc2d7fff] HSUB: [1]
GLOBAL CACHE ELEMENT DUMP (address: 0x40fdd8170):
  id1: 0x543cf id2: 0x1e8 pkey: OBJ#61329 block: (488/345039)
  lock: XG rls: 0x0 acq: 0x0 latch: 10
  flags: 0x20 fair: 0 recovery: 0 fpin: 'kdiwh22: kdifind'
  bscn: 0xb8b.cc2bc912 bctx: (nil) write: 0 scan: 0x0
  lcp: (nil) lnk: [NULL] lch: [0x453d5eae8,0x453d5eae8]
  seq: 3347 hist: 225 502 72 257 324 21 143:0 19 16 352 32 197 48 121
  LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
    flg: 0x0aa00021 state: XCURRENT tsn: 7 tsh: 1 foq: 4
      addr: 0x453d5e9b8 obj: 61329 cls: DATA bscn: 0xb7a.f77b7353
Trying to recover a block with no users.
Doing block recovery for file 488 block 345039
Block header before block recovery:
buffer tsn: 7 rdba: 0x7a0543cf (488/345039)
scn: 0x0b8b.c8d4d9f4 seq: 0x01 flg: 0x04 tail: 0xd9f40601
frmt: 0x02 chkval: 0xb26b type: 0x06=trans data
Resuming block recovery (PMON) for file 488 block 345039
Block recovery from logseq 58471, block 224143 to scn 12695053893631

*** 2012-04-26 00:15:16.692
Recovery of Online Redo Log: Thread 3 Group 19 Seq 58471 Reading mem 0

*** 2012-04-26 00:15:16.933
CHANGE IN FUTURE OF BLOCK
CHANGE #2 TYP:0 CLS:1 AFN:488 DBA:0x7a0543cf OBJ:61329 SCN:0x0b8b.cc2bc912 SEQ:1 OP:10.2 ENC:0 RBL:0
buffer tsn: 7 rdba: 0x7a0543cf (488/345039)
scn: 0x0b8b.c8d4d9f4 seq: 0x01 flg: 0x04 tail: 0xd9f40601
frmt: 0x02 chkval: 0xb26b type: 0x06=trans data
GLOBAL CACHE ELEMENT DUMP (address: 0x40fdd8170):
  id1: 0x543cf id2: 0x1e8 pkey: OBJ#61329 block: (488/345039)
  lock: X rls: 0x0 acq: 0x0 latch: 10
  flags: 0x20 fair: 0 recovery: 0 fpin: 'kdiwh22: kdifind'
  bscn: 0xb8b.cc2bc912 bctx: (nil) write: 0 scan: 0x0
  lcp: (nil) lnk: [NULL] lch: [0x453d5eae8,0x453d5eae8]
  seq: 3350 hist: 104 147:0 68 225 502 72 257 324 21 143:0 19 16
  LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
    flg: 0x0aa00005 state: XCURRENT tsn: 7 tsh: 1 mode: EXCL foq: 4
      pin: 'ktuwh61: ktugus:ktucmt'
      addr: 0x453d5e9b8 obj: 61329 cls: DATA bscn: 0xb8b.c8d4d9f4
kcra_dump_redo_internal: skipped for critical process
Hex dump of (file 488, block 345039)
Dump of memory from 0x000000045321A000 to 0x000000045321C000
45321A000 0000A206 7A0543CF C8D4D9F4 04010B8B  [......z........]
........................
45321BFF0 73736563 FFFFFFFF 4E01FFFF D9F40601  [c.s.......N....]
Reading datafile '/u10/oradata/matrix/test_idx_01_177.dbf' for corruption at rdba: 0x7a0543cf (file 488, block 345039)
Reread (file 488, block 345039) found same corrupt data (logically corrupt)
Reread for stuck block recovery failed
Incident 48017 created, dump file: /u01/oradiag/diag/rdbms/matrix_adc/matrix3/incident/incdir_48017/matrix3_pmon_16123_i48017.trc
ORA-00600: internal error code, arguments: [kcbsor_2], [3], [2], [], [], [], [], [], [], [], [], []

error 472 detected in background process
ORA-00600: internal error code, arguments: [kcbsor_2], [3], [2], [], [], [], [], [], [], [], [], []
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53<-ksuitm()+1332<-ksbrdp()+3344<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+252<
-main()+201<-__libc_start_main()+244<-_start()+37
----- End of Abridged Call Stack Trace -----

*** 2012-04-26 00:15:18.418
PMON (ospid: 16123): terminating the instance due to error 472

*** 2012-04-26 00:15:20.990
ksuitm: waiting up to [5] seconds before killing DIAG(16141)

This matches Bug 9451143, in which PMON starts online block recovery from a wrong LOW RBA read from the block header. As per MOS note 9451143.8, the issue is fixed in 11.2.0.3, yet we still took a hit. Instance startup was hung at –

Thu Apr 26 00:47:41 2012
Started redo application at
 Thread 1: logseq 86819, block 358549
 Thread 2: logseq 62762, block 134200
 Thread 3: logseq 58471, block 195623
Recovery of Online Redo Log: Thread 1 Group 2 Seq 86819 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot1g2m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot1g2m2.rdo
Recovery of Online Redo Log: Thread 2 Group 6 Seq 62762 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot2g6m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot2g6m2.rdo
Recovery of Online Redo Log: Thread 3 Group 19 Seq 58471 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot3g19m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot3g19m2.rdo
Thu Apr 26 00:47:42 2012
Hex dump of (file 488, block 345039) in trace file /u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_p000_30168.trc
Reading datafile '/u10/oradata/matrix/test_idx_01_177.dbf' for corruption at rdba: 0x7a0543cf (file 488, block 345039)
Reread (file 488, block 345039) found same corrupt data (logically corrupt)
RECOVERY OF THREAD 3 STUCK AT BLOCK 345039 OF FILE 488

“Logical Corruption happens when a data block has a valid checksum, etc., but the block contents are logically inconsistent. Logical block corruption can also occur when the structure below the beginning of the block (below the block header) is corrupt. In this case, the block checksum is correct but the block structures may be corrupt. Logical corruption can also result from a lost write in which a version of the block that was sent to disk somehow was not actually written. The result is that the version of that block on disk is older than the version in the buffer cache. Lost writes are usually caused by bugs in the operating system or hardware.”
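RMAN can look for this kind of intra-block logical inconsistency explicitly; a minimal sketch against the affected file (file number taken from the alert log above). Note that a pure lost write can still pass these checks, since the stale block image is internally consistent – catching lost writes is what the DB_LOST_WRITE_PROTECT parameter and a standby comparison are for.

RMAN> VALIDATE CHECK LOGICAL DATAFILE 488;

-- any blocks flagged by VALIDATE then show up in:
SELECT file#, block#, blocks, corruption_change#, corruption_type
FROM   v$database_block_corruption;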

We first tried block media recovery of datafile 488, block 345039 –

RMAN>  RECOVER DATAFILE 488 BLOCK  345039;

Starting recover at 26-APR-12
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=2550 instance=matrix1 device type=DISK

starting media recovery
media recovery complete, elapsed time: 00:00:00

Finished recover at 26-APR-12
Thu Apr 26 01:13:25 2012
alter database recover datafile list clear
Completed: alter database recover datafile list clear

There wasn’t any error, but “media recovery complete” with an elapsed time of 00:00:00 made us doubtful. We tried starting the instance, and again it got stuck at the same “RECOVERY OF THREAD 3 STUCK AT BLOCK 345039 OF FILE 488”. Meanwhile the other two instances also got evicted because of the high load average on the server, causing a complete outage. We then performed a datafile recovery –

Thu Apr 26 01:22:17 2012
ALTER DATABASE RECOVER  datafile '/u10/oradata/matrix/test_idx_01_177.dbf'
Media Recovery Start
Serial Media Recovery started
Recovery of Online Redo Log: Thread 1 Group 2 Seq 86819 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot1g2m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot1g2m2.rdo
Recovery of Online Redo Log: Thread 2 Group 5 Seq 62761 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot2g5m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot2g5m2.rdo
Recovery of Online Redo Log: Thread 3 Group 22 Seq 58470 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot3g22m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot3g22m2.rdo
Thu Apr 26 01:22:50 2012
Recovery of Online Redo Log: Thread 3 Group 19 Seq 58471 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot3g19m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot3g19m2.rdo
Recovery of Online Redo Log: Thread 2 Group 6 Seq 62762 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot2g6m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot2g6m2.rdo
Thu Apr 26 01:23:22 2012
Recovery of Online Redo Log: Thread 1 Group 3 Seq 86820 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot1g3m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot1g3m2.rdo
Media Recovery Complete (matrix1)
Completed: ALTER DATABASE RECOVER  datafile '/u10/oradata/matrix/test_idx_01_177.dbf'
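This time recovery actually read through redo from all three threads before completing. Before restarting the instances, the suspect block can be re-checked at block granularity (a sketch, reusing the file/block numbers from above; 11g RMAN supports VALIDATE down to individual blocks):

RMAN> VALIDATE DATAFILE 488 BLOCK 345039;

-- the entry for file 488 should be gone once the block is repaired:
SELECT * FROM v$database_block_corruption WHERE file# = 488;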

The datafile was recovered and the instance started. All three nodes were brought back up.
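For completeness, the segment owning the corrupt block can be identified either from the OBJ#/objn (61329) in the trace or from the file/block pair; a standard lookup sketch:

SELECT owner, object_name, object_type
FROM   dba_objects
WHERE  object_id = 61329;

-- or, without the trace in hand:
SELECT owner, segment_name, segment_type
FROM   dba_extents
WHERE  file_id = 488
AND    345039 BETWEEN block_id AND block_id + blocks - 1;

Had this resolved to a plain index (the datafile name test_idx_01_177.dbf suggests an index tablespace), dropping and rebuilding the index would have been an alternative to media recovery.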

From 11gR2 we have the Automatic Block Media Recovery (ABMR) feature, but as far as I know it helps only in physical block corruption scenarios. ABMR requires a Data Guard standby in real-time apply mode, with either SYNC or ASYNC redo transport; a minimal prerequisite check is sketched after the links below. For more, read –

http://docs.oracle.com/cd/E11882_01/server.112/e17157/unplanned.htm#BABCECIC

http://emrebaransel.blogspot.com.au/2012/02/automatic-block-media-recovery-on.html
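A minimal sketch of that prerequisite check (the dest_id of the standby destination is an assumption here, not taken from this system):

-- on the standby: enable real-time apply
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
USING CURRENT LOGFILE DISCONNECT FROM SESSION;

-- on the primary: the standby destination should report MANAGED REAL TIME APPLY
SELECT dest_id, status, recovery_mode
FROM   v$archive_dest_status
WHERE  dest_id = 2;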
