ORA-00600 – [kfnsInstanceReg00]

An interesting case of ORA-600 🙂

I was paged for standby lag and started looking into the issue. The db version was 11.1.0.7.0 version. Standby was lagging behind by 36 archives

SQL> @mrp

   INST_ID PROCESS   STATUS          THREAD#  SEQUENCE#     BLOCK# DELAY_MINS
---------- --------- ------------ ---------- ---------- ---------- ----------
         1 MRP0      WAIT_FOR_LOG          1     568330          0          0 


   INST_ID PROCESS   STATUS          THREAD#  SEQUENCE#     BLOCK# DELAY_MINS
---------- --------- ------------ ---------- ---------- ---------- ----------
         1 RFS       IDLE                  0          0          0          0
         1 RFS       IDLE                  0          0          0          0
         1 RFS       WRITING               1     568330          1          0 
         1 RFS       IDLE                  0          0          0          0
         1 RFS       IDLE                  1     568366     309529          0

After sometime checked back again and still on the same sequence#. Status of MRP and RFS

SQL> @mrp

   INST_ID PROCESS   STATUS          THREAD#  SEQUENCE#     BLOCK# DELAY_MINS
---------- --------- ------------ ---------- ---------- ---------- ----------
         1 MRP0      WAIT_FOR_LOG          1     568330          0          0


   INST_ID PROCESS   STATUS          THREAD#  SEQUENCE#     BLOCK# DELAY_MINS
---------- --------- ------------ ---------- ---------- ---------- ----------
         1 RFS       IDLE                  0          0          0          0
         1 RFS       IDLE                  0          0          0          0
         1 RFS       WRITING               1     568330          1          0 
         1 RFS       IDLE                  0          0          0          0
         1 RFS       IDLE                  1     568366     795006          0

No error was reported in alert log or was in v$archive_dest_status.Archive log genertaion on Primary was usual like any other day, nothing changed.Seq# 568330 was present on Primary and size was 497Mb whereas the same seq# size on standby was 513Mb. I am not sure why!!!

So, i thought to copy the archive from the primary, stop mrp , perform manual recovery using ‘recover standby standby’ as other 35 archives were present on disk and then start real time apply.

Once the copy completed, i stopped the MRP and started manual recovery

SQL> alter database recover managed standby database cancel;

Database altered.

SQL> select name,open_mode,database_role from v$database;

NAME      OPEN_MODE  DATABASE_ROLE
--------- ---------- ----------------
TEST      MOUNTED    PHYSICAL STANDBY

SQL> recover standby database;
ORA-00279: change 69956910076 generated at 10/26/2014 21:46:36 needed for
thread 1
ORA-00289: suggestion :
/oraclebackup/fra/TEST/archivelog/2014_10_26/o1_mf_1_568330_b4vdl546_.arc
ORA-00280: change 69956910076 for thread 1 is in sequence #568330


Specify log: {=suggested | filename | AUTO | CANCEL}
AUTO
ORA-16145: archival for thread# 1 sequence# 568330 in progress 
SQL> @mrp

no rows selected


   INST_ID PROCESS   STATUS          THREAD#  SEQUENCE#     BLOCK# DELAY_MINS
---------- --------- ------------ ---------- ---------- ---------- ----------
         1 RFS       IDLE                  0          0          0          0
         1 RFS       IDLE                  0          0          0          0
         1 RFS       WRITING               1     568330          1          0 
         1 RFS       IDLE                  0          0          0          0
         1 RFS       IDLE                  1     568372     563688          0

At this point i thought to bounce the standby and then perform recovery. I issued shutdown immediate and it was hung. Almost after 10mins alert log showed

Active call for process 45660 user 'oracle' program 'oracle@test.com'
SHUTDOWN: waiting for active calls to complete.

I checked the PID 45660 and as it showed LOCAL=NO, did kill -9

[oracle@test:~/app/oracle (test)]$ ps -ef | grep 45660
oracle   41463 16050  0 00:43 pts/0    00:00:00 grep 45660
oracle   45660     1  1 Oct23 ?        01:09:22 oracletest (LOCAL=NO)
[oracle@test:~/app/oracle (test)]$ kill -9 45660 
[oracle@test:~/app/oracle (test)] ps -ef | grep 45660
oracle   42181 16050  0 00:44 pts/0    00:00:00 grep 45660
oracle   45660     1  1 Oct23 ?        01:09:22 oracletest (LOCAL=NO)

Kill -9 didn’t kill the process. As it was standby, i thought to go ahead and Abort the instance and this is what happend when i tried startup mount

SQL> startup mount
ORACLE instance started.

Total System Global Area 6.1466E+10 bytes
Fixed Size                  2174600 bytes
Variable Size            2.0535E+10 bytes
Database Buffers         4.0802E+10 bytes
Redo Buffers              126640128 bytes

ORA-03113: end-of-file on communication channel
Process ID: 47331
Session ID: 995 Serial number: 3

Alert log showed

Mon Oct 27 00:48:09 2014
ALTER DATABASE MOUNT
Mon Oct 27 00:53:10 2014
System State dumped to trace file /home/oracle/app/oracle/diag/rdbms/test_TEST_db/test/trace/test_ckpt_47252.trc
Mon Oct 27 00:53:11 2014
Errors in file /home/oracle/app/oracle/diag/rdbms/test_TEST_db/test/trace/test_asmb_47260.trc (incident=491160):
ORA-00600: internal error code, arguments: [600], [ORA_NPI_ERROR], [ORA-00600: internal error code, arguments: [kfnsInstanceReg00], [test:test_TEST_DB], [30000], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
Incident details in: /home/oracle/app/oracle/diag/rdbms/test_TEST_db/test/incident/incdir_491160/test_asmb_47260_i491160.trc
Errors in file /home/oracle/app/oracle/diag/rdbms/test_TEST_db/test/trace/test_asmb_47260.trc:
ORA-15064: communication failure with ASM instance

ORA-00600: internal error code, arguments: [600], [ORA_NPI_ERROR], [ORA-00600: internal error code, arguments: [kfnsInstanceReg00], [test:test_TEST_DB], [30000], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
Mon Oct 27 00:53:12 2014
Mon Oct 27 00:53:12 2014
Sweep Incident[491160]: completed
Mon Oct 27 00:53:12 2014
Trace dumping is performing id=[cdmp_20141027005312]
Errors in file /home/oracle/app/oracle/diag/rdbms/test_TEST_db/prod/trace/test_ckpt_47252.trc:
ORA-15083: failed to communicate with ASMB background process
CKPT (ospid: 47252): terminating the instance due to error 15083
Mon Oct 27 00:53:20 2014
Instance terminated by CKPT, pid = 47252

The CKPT trace showed


15083: Timeout waiting for ASMB to be ready (state=1)
ASMB is ALIVE
—– Abridged Call Stack Trace —–

SYSTEM STATE (level=10, with short stacks)
————
System global information:
processes: base 0xe90ea0088, size 900, cleanup 0xea0ec6598
allocation: free sessions 0xeb8f53d68, free calls (nil)
control alloc errors: 0 (process), 0 (session), 0 (call)
PMON latch cleanup depth: 0
seconds since PMON's last scan for dead processes: 1
system statistics:
22 logons cumulative
19 logons current
5 opened cursors cumulative
1 opened cursors current
0 user commits
0 user rollbacks
51 user calls
19 recursive calls
0 recursive cpu usage
0 session logical reads
0 session stored procedure space
0 CPU used when call started
0 CPU used by this session
1 DB time

Looking at the ASM instance alert log

System State dumped to trace file /home/oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_33024.trc
Errors in file /home/oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_33024.trc:

more /home/oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_33024.trc


*** ACTION NAME:() 2014-10-27 02:18:39.197

kfnmsg_wait: KSR failure reaping message (0xffffff83)

*** 2014-10-27 02:18:39.197
Communication failure sending message to UFGs (0xffffff83)
kfnmsg_wait: opcode=35 msg=0x6b3f0928
2014-10-27 02:18:39.197762 :kfnmsg_Dump(): kfnmsg_SendtoUFGs: (kfnmsg) op=35 (KFNMS_GROUP_RLS) (gnum) [1.0]
CommunicationFailure 0xffffff83 after 300s

The server was hosting 2 standby db instances and 1 ASM instance. We thought to bring down the other standby db instance and then the ASM instance and then finally restart all. The shutdown immediate for 2nd standby instance was hung for long time so we aborted it. After ASM was brought down, still we could see below 5 pids

$ ps -ef | grep oracle
oracle 12032 16050 0 02:44 pts/0 00:00:00 grep arc
oracle 23135 1 0 Oct18 ? 00:02:15 ora_arc5_test
oracle 25390 1 0 Oct18 ? 00:01:19 ora_arc1_test2
oracle 25392 1 0 Oct18 ? 00:00:50 ora_arc2_test2

oracle 45660 1 1 Oct23 ? 01:09:22 oracleprod (LOCAL=NO)

Even after all the instances were down on server we could see few archiver Pids and a Pid with LOCAL=NO. Kill -9 also didn’t kill the pid.

Started the ASM instance and then when i started the standby instance TEST , ASM crashed

NOTE: check client alert log.
NOTE: Trace records dumped in trace file /home/oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_57380.trc
Errors in file /home/oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_57380.trc (incident=280159):
ORA-00600: internal error code, arguments: [kfupsRelease01], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /home/oracle/app/oracle/diag/asm/+asm/+ASM/incident/incdir_280159/+ASM_ora_57380_i280159.trc

At this point escalated to client for server reboot. After the server reboot was done, all the instances came up properly. Started MRP and standby came insync.

Also you can reference to the below Metalink documents

Database startup Reports ORA-00600:[kfnsInstanceReg00] (Doc ID 1552422.1)
Bug 6934636 – Hang possible for foreground processes in ASM instance (Doc ID 6934636.8)
ORA-600 [kfnsInstanceReg00] and long waits on “ASM FILE METADATA OPERATION” (Doc ID 1480320.1)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s