12cR1 RAC Installation on OEL7

To build an Oracle Clusterware database at home, I believe RAC ATTACK is the best place to learn. It is a free curriculum and platform for hands-on learning labs related to Oracle RAC. While reviewing the material, I thought of performing a 12cR1 RAC installation on OEL 7.2.

Attached is the document :- 12c_RAC_on_OEL7

The attached article is inspired by

RAC ATTACK :- https://en.wikibooks.org/wiki/RAC_Attack_-_Oracle_Cluster_Database_at_Home/RAC_Attack_12c

Tim Hall’s article :- https://oracle-base.com/articles/12c/oracle-db-12cr1-rac-installation-on-oracle-linux-7-using-virtualbox 

Deploying Oracle RAC Database 12c on RHEL 7 – Best Practices :- https://www.redhat.com/en/resources/deploying-oracle-rac-database-12c-rhel-7-best-practices

Big Thank you to RAC Attack members!!!

I hope the document helps some of you. Please feel free to comment.

It's all about learning 🙂

ORA-00600:[kcbsor_2], [3], [2]

Recently we had a crash of one node in a 3-node 11.2.0.3 RAC database configuration. The alert log showed –

Thu Apr 26 00:15:16 2012
Hex dump of (file 488, block 345039) in trace file /u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_dbw1_16218.trc
Corrupt block relative dba: 0x7a0543cf (file 488, block 345039)
Bad header found during preparing block for write
Data in bad block:
 type: 6 format: 2 rdba: 0x644ebb3d
 last change scn: 0x0b7a.f77b7353 seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x73530601
 check value in block header: 0x80ae
 computed block checksum: 0x0
Thu Apr 26 00:15:16 2012
Block Corruption Recovery: buf=0x453d5e9b8 seqno=1
Block recovery from logseq 58471, block 224143 to scn 12695053893631
Recovery of Online Redo Log: Thread 3 Group 19 Seq 58471 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot3g19m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot3g19m2.rdo
Hex dump of (file 488, block 345039) in trace file /u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_pmon_16123.trc
Reading datafile '/u10/oradata/matrix/test_idx_01_177.dbf' for corruption at rdba: 0x7a0543cf (file 488, block 345039)
Reread (file 488, block 345039) found same corrupt data (logically corrupt)
Errors in file /u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_pmon_16123.trc  (incident=48017):
ORA-00600: internal error code, arguments: [kcbsor_2], [3], [2], [], [], [], [], [], [], [], [], []
Errors in file /u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_pmon_16123.trc:
ORA-00600: internal error code, arguments: [kcbsor_2], [3], [2], [], [], [], [], [], [], [], [], []
PMON (ospid: 16123): terminating the instance due to error 472
Thu Apr 26 00:15:18 2012
opiodr aborting process unknown ospid (30802) as a result of ORA-1092
Thu Apr 26 00:15:18 2012
opiodr aborting process unknown ospid (15384) as a result of ORA-1092
Thu Apr 26 00:15:18 2012
opiodr aborting process unknown ospid (1661) as a result of ORA-1092
Thu Apr 26 00:15:18 2012
System state dump requested by (instance=3, osid=16123 (PMON)), summary=[abnormal instance termination].
Thu Apr 26 00:15:30 2012
Termination issued to instance processes. Waiting for the processes to exit
Thu Apr 26 00:15:33 2012
ORA-1092 : opitsk aborting process

The trace file “/u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_pmon_16123.trc” shows –

PMON: received BCR msg

*** 2012-04-26 00:15:16.373
Block Corruption Recovery: buf=0x453d5e9b8 seqno=1 tries=1
Trying to recover a block. Dump of buffer header.
----------------------------------------------------------
BH (0x453d5e9b8) file#: 488 rdba: 0x7a0543cf (488/345039) class: 1 ba: 0x45321a000
  set: 38 pool: 3 bsz: 8192 bsi: 0 sflg: 1 pwc: 0,25
  dbwrid: 1 obj: 61329 objn: 61329 tsn: 7 afn: 488 hint: f
  hash: [0x83f1fa230,0x83f1fa230] lru-req: [0x8345f9390,0x8345f9390]
  obj-flags: object_ckpt_list
  ckptq: [0x547bd2c78,0xb3cfff70] fileq: [0xdbc48778,0x8395dec88] objq: [0x287d85f68,0x7c833f750] objaq: [0x46bd59140,0x303b53fc8]
  st: XCURRENT md: NULL fpin: 'kdiwh22: kdifind' tch: 1 le: 0x40fdd8170
  flags: buffer_dirty being_written block_written_once corrupt_being_recovered
          redo_since_read remote_transfered
  LRBA: [0xe467.36b8f.0] LSCN: [0xb8b.cc2d7fff] HSCN: [0xb8b.cc2d7fff] HSUB: [1]
GLOBAL CACHE ELEMENT DUMP (address: 0x40fdd8170):
  id1: 0x543cf id2: 0x1e8 pkey: OBJ#61329 block: (488/345039)
  lock: XG rls: 0x0 acq: 0x0 latch: 10
  flags: 0x20 fair: 0 recovery: 0 fpin: 'kdiwh22: kdifind'
  bscn: 0xb8b.cc2bc912 bctx: (nil) write: 0 scan: 0x0
  lcp: (nil) lnk: [NULL] lch: [0x453d5eae8,0x453d5eae8]
  seq: 3347 hist: 225 502 72 257 324 21 143:0 19 16 352 32 197 48 121
  LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
    flg: 0x0aa00021 state: XCURRENT tsn: 7 tsh: 1 foq: 4
      addr: 0x453d5e9b8 obj: 61329 cls: DATA bscn: 0xb7a.f77b7353
Trying to recover a block with no users.
Doing block recovery for file 488 block 345039
Block header before block recovery:
buffer tsn: 7 rdba: 0x7a0543cf (488/345039)
scn: 0x0b8b.c8d4d9f4 seq: 0x01 flg: 0x04 tail: 0xd9f40601
frmt: 0x02 chkval: 0xb26b type: 0x06=trans data
Resuming block recovery (PMON) for file 488 block 345039
Block recovery from logseq 58471, block 224143 to scn 12695053893631

*** 2012-04-26 00:15:16.692
Recovery of Online Redo Log: Thread 3 Group 19 Seq 58471 Reading mem 0

*** 2012-04-26 00:15:16.933
CHANGE IN FUTURE OF BLOCK
CHANGE #2 TYP:0 CLS:1 AFN:488 DBA:0x7a0543cf OBJ:61329 SCN:0x0b8b.cc2bc912 SEQ:1 OP:10.2 ENC:0 RBL:0
buffer tsn: 7 rdba: 0x7a0543cf (488/345039)
scn: 0x0b8b.c8d4d9f4 seq: 0x01 flg: 0x04 tail: 0xd9f40601
frmt: 0x02 chkval: 0xb26b type: 0x06=trans data
GLOBAL CACHE ELEMENT DUMP (address: 0x40fdd8170):
  id1: 0x543cf id2: 0x1e8 pkey: OBJ#61329 block: (488/345039)
  lock: X rls: 0x0 acq: 0x0 latch: 10
  flags: 0x20 fair: 0 recovery: 0 fpin: 'kdiwh22: kdifind'
  bscn: 0xb8b.cc2bc912 bctx: (nil) write: 0 scan: 0x0
  lcp: (nil) lnk: [NULL] lch: [0x453d5eae8,0x453d5eae8]
  seq: 3350 hist: 104 147:0 68 225 502 72 257 324 21 143:0 19 16
  LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
    flg: 0x0aa00005 state: XCURRENT tsn: 7 tsh: 1 mode: EXCL foq: 4
      pin: 'ktuwh61: ktugus:ktucmt'
      addr: 0x453d5e9b8 obj: 61329 cls: DATA bscn: 0xb8b.c8d4d9f4
kcra_dump_redo_internal: skipped for critical process
Hex dump of (file 488, block 345039)
Dump of memory from 0x000000045321A000 to 0x000000045321C000
45321A000 0000A206 7A0543CF C8D4D9F4 04010B8B  [......z........]
........................
45321BFF0 73736563 FFFFFFFF 4E01FFFF D9F40601  [c.s.......N....]
Reading datafile '/u10/oradata/matrix/test_idx_01_177.dbf' for corruption at rdba: 0x7a0543cf (file 488, block 345039)
Reread (file 488, block 345039) found same corrupt data (logically corrupt)
Reread for stuck block recovery failed
Incident 48017 created, dump file: /u01/oradiag/diag/rdbms/matrix_adc/matrix3/incident/incdir_48017/matrix3_pmon_16123_i48017.trc
ORA-00600: internal error code, arguments: [kcbsor_2], [3], [2], [], [], [], [], [], [], [], [], []

error 472 detected in background process
ORA-00600: internal error code, arguments: [kcbsor_2], [3], [2], [], [], [], [], [], [], [], [], []
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53<-ksuitm()+1332<-ksbrdp()+3344<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+252<
-main()+201<-__libc_start_main()+244<-_start()+37
----- End of Abridged Call Stack Trace -----

*** 2012-04-26 00:15:18.418
PMON (ospid: 16123): terminating the instance due to error 472

*** 2012-04-26 00:15:20.990
ksuitm: waiting up to [5] seconds before killing DIAG(16141)

This relates to Bug 9451143, where PMON starts online block recovery from a wrong LOW RBA provided in the block header. As per MOS note 9451143.8, the issue is fixed in 11.2.0.3, though we still had a hit. Instance startup was hung at –

Thu Apr 26 00:47:41 2012
Started redo application at
 Thread 1: logseq 86819, block 358549
 Thread 2: logseq 62762, block 134200
 Thread 3: logseq 58471, block 195623
Recovery of Online Redo Log: Thread 1 Group 2 Seq 86819 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot1g2m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot1g2m2.rdo
Recovery of Online Redo Log: Thread 2 Group 6 Seq 62762 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot2g6m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot2g6m2.rdo
Recovery of Online Redo Log: Thread 3 Group 19 Seq 58471 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot3g19m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot3g19m2.rdo
Thu Apr 26 00:47:42 2012
Hex dump of (file 488, block 345039) in trace file /u01/oradiag/diag/rdbms/matrix_adc/matrix3/trace/matrix3_p000_30168.trc
Reading datafile '/u10/oradata/matrix/test_idx_01_177.dbf' for corruption at rdba: 0x7a0543cf (file 488, block 345039)
Reread (file 488, block 345039) found same corrupt data (logically corrupt)
RECOVERY OF THREAD 3 STUCK AT BLOCK 345039 OF FILE 488

“Logical Corruption happens when a data block has a valid checksum, etc., but the block contents are logically inconsistent. Logical block corruption can also occur when the structure below the beginning of the block (below the block header) is corrupt. In this case, the block checksum is correct but the block structures may be corrupt. Logical corruption can also result from a lost write in which a version of the block that was sent to disk somehow was not actually written. The result is that the version of that block on disk is older than the version in the buffer cache. Lost writes are usually caused by bugs in the operating system or hardware.”
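
As a side note, a proactive way to surface this kind of corruption is an RMAN logical check followed by a look at V$DATABASE_BLOCK_CORRUPTION – a minimal sketch only, reusing the same datafile number from above:

RMAN> BACKUP VALIDATE CHECK LOGICAL DATAFILE 488;

SQL> -- any blocks flagged by the validation are listed here
SQL> select file#, block#, blocks, corruption_type from v$database_block_corruption;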

We tried block recovery for datafile 488, block 345039 –

RMAN>  RECOVER DATAFILE 488 BLOCK  345039;

Starting recover at 26-APR-12
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=2550 instance=matrix1 device type=DISK

starting media recovery
media recovery complete, elapsed time: 00:00:00

Finished recover at 26-APR-12
Thu Apr 26 01:13:25 2012
alter database recover datafile list clear
Completed: alter database recover datafile list clear

There wasn’t any error, but the media recovery complete message showed an elapsed time of 00:00:00, which made us doubtful. We tried starting the instance and again it got stuck at the same “RECOVERY OF THREAD 3 STUCK AT BLOCK 345039 OF FILE 488”. Meanwhile the other 2 instances also got evicted because of high load average on the server, causing downtime. We then performed datafile recovery –

Thu Apr 26 01:22:17 2012
ALTER DATABASE RECOVER  datafile '/u10/oradata/matrix/test_idx_01_177.dbf'
Media Recovery Start
Serial Media Recovery started
Recovery of Online Redo Log: Thread 1 Group 2 Seq 86819 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot1g2m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot1g2m2.rdo
Recovery of Online Redo Log: Thread 2 Group 5 Seq 62761 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot2g5m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot2g5m2.rdo
Recovery of Online Redo Log: Thread 3 Group 22 Seq 58470 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot3g22m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot3g22m2.rdo
Thu Apr 26 01:22:50 2012
Recovery of Online Redo Log: Thread 3 Group 19 Seq 58471 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot3g19m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot3g19m2.rdo
Recovery of Online Redo Log: Thread 2 Group 6 Seq 62762 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot2g6m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot2g6m2.rdo
Thu Apr 26 01:23:22 2012
Recovery of Online Redo Log: Thread 1 Group 3 Seq 86820 Reading mem 0
  Mem# 0: /u02/oraredo/matrix/redot1g3m1.rdo
  Mem# 1: /u05/oraredo/matrix/redot1g3m2.rdo
Media Recovery Complete (matrix1)
Completed: ALTER DATABASE RECOVER  datafile '/u10/oradata/matrix/test_idx_01_177.dbf'

The datafile was recovered and the instance started. All 3 nodes were brought up.

From 11gR2 we have the Automatic Block Media Recovery (ABMR) feature, but to my knowledge it is useful only in physical block corruption scenarios. ABMR requires Data Guard to be in real-time apply mode, with SYNC or ASYNC transport. For more, read –

http://docs.oracle.com/cd/E11882_01/server.112/e17157/unplanned.htm#BABCECIC

http://emrebaransel.blogspot.com.au/2012/02/automatic-block-media-recovery-on.html
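
As a quick sanity check of the real-time apply prerequisite mentioned above, something like the following can be run – a sketch, using the standard 11.2 v$archive_dest_status columns:

SQL> -- RECOVERY_MODE should show MANAGED REAL TIME APPLY for the standby destination
SQL> select dest_id, status, recovery_mode from v$archive_dest_status where status = 'VALID';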

hasGetClusterStatus Status verification failed due to cluvfy execution failure for node(s) – 10gR2 RAC

Received a mail from a colleague reporting the below error from EM Grid Control Metric Collection Errors –

Target		crs_node1
Type		Cluster
Metric		Clusterware Status
Collection Timestamp	Dec 14, 2011 5:35:02 AM
Error Type		Collection Problem
Message		WARN|has::Common::hasGetClusterStatus Status verification failed due to cluvfy execution failure for node(s) node3:EFAIL,NODE_STATUS::node2:EFAIL,NODE_STATUS::node1:EFAIL,NODE_STATUS::node4:EFAIL,OVERALL_STATUS: 

The first thing I did was run cluvfy –


node1: (matrix1) /app/oracle/crs/bin> ./cluvfy stage -post crsinst -n node1,node2,node3,node4 -verbose

Performing post-checks for cluster services setup

Checking node reachability...

Check: Node reachability from node "node1"
  Destination Node                      Reachable?
  ------------------------------------  ------------------------
  node3                                yes
  node1                                yes
  node4                                yes
  node2                                yes
Result: Node reachability check passed from node "node1".


Checking user equivalence...

Check: User equivalence for user "oracle"
  Node Name                             Comment
  ------------------------------------  ------------------------
  node3                                passed
  node2                                passed
  node1                                passed
  node4                                passed
Result: User equivalence check passed for user "oracle".

ERROR:
The location "/tmp/CVU_10.2.0.5.0.1_dba/" is owned by another user on nodes:
        node3,node1,node2
Verification will proceed with nodes:
        node4

ERROR:
CRS is not installed on any of the nodes.
Verification cannot proceed.


Post-check for cluster services setup was unsuccessful on all the nodes.

The cluvfy verification failed. Checked the permissions on /tmp/CVU_10.2.0.5.0.1_dba –

drwxr-----  3 em  dba      4096 May 27  2010 CVU_10.2.0.5.0.1_dba

Changed the permission to 770

drwxrwx---  3 em  dba      4096 May 27  2010 CVU_10.2.0.5.0.1_dba
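
The change itself was just a chmod on each node – a minimal sketch of what was run (the node loop is illustrative, not the exact commands used):

$ for node in node1 node2 node3; do
>   ssh $node "chmod 770 /tmp/CVU_10.2.0.5.0.1_dba"
> done
$ ls -ld /tmp/CVU_10.2.0.5.0.1_dba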

After changing the permissions of /tmp/CVU_10.2.0.5.0.1_dba/ on all the nodes, ran cluvfy again –

node1: (matrix1) /app/oracle/crs/bin> ./cluvfy stage -post crsinst -n node1,node2,node3,node4 -verbose

Performing post-checks for cluster services setup

Checking node reachability...

Check: Node reachability from node "node1"
  Destination Node                      Reachable?
  ------------------------------------  ------------------------
  node3                                yes
  node1                                yes
  node4                                yes
  node2                                yes
Result: Node reachability check passed from node "node1".


Checking user equivalence...

Check: User equivalence for user "oracle"
  Node Name                             Comment
  ------------------------------------  ------------------------
  node3                                passed
  node2                                passed
  node1                                passed
  node4                                passed
Result: User equivalence check passed for user "oracle".

WARNING:
CRS is not installed on nodes:
        node4
Verification will proceed with nodes:
        node3,node2,node1
.........................

Post-check for cluster services setup was unsuccessful.
Checks did not pass for the following node(s):
        node4

olsnodes showed all 4 nodes, but cluvfy reported “CRS is not installed on nodes: node4”.

To find out more, dug into the cluvfy log in $CRS_HOME/cv/log –

[main] [5:36:32:622] [OUIData.readInventoryData:393]  ==== CRS home added: Oracle home properties:
Name     : OraCrs10g_home
Type     : CRS-HOME
Location : /app/oracle/crs
Node list: [node1, node2, node3]
...............................
[main] [5:43:44:502] [Stage.verify:359]  m_currentTaskSet.size=1; Wed Dec 14 05:43:44 CST 2011
[main] [5:43:44:502] [TaskNodeAppCreation.performTask:157]  Performing NodeApp Creation Verification Task... ; Wed Dec 14 05:43:44 CST 2011
[main] [5:43:44:502] [sVerificationUtil.getInventoryFileLocation:133]  Inventory Config File's name is:'/etc/oraInst.loc'; Wed Dec 14 05:43:44 CST 2011
[main] [5:43:44:503] [sVerificationUtil.getInventoryFileLocation:168]  inventory_loc=/app/oracle/crs/oraInventory; Wed Dec 14 05:43:44 CST 2011
[main] [5:43:44:503] [sVerificationUtil.getInventoryFileLocation:170]  Inventory File Location is-->/app/oracle/crs/oraInventory/ContentsXML/inventory.xml; Wed Dec 14 05:43:44 CST 2011
[main] [5:43:44:504] [VerificationUtil.isCRSInstalled:1208]  CRS found installed  on node: node3; Wed Dec 14 05:43:44 CST 2011
[main] [5:43:44:504] [VerificationUtil.isCRSInstalled:1208]  CRS found installed  on node: node2; Wed Dec 14 05:43:44 CST 2011
[main] [5:43:44:504] [VerificationUtil.isCRSInstalled:1208]  CRS found installed  on node: node1; Wed Dec 14 05:43:44 CST 2011
[main] [5:43:44:504] [TaskNodeAppCreation.performTask:178]  ==== Nodes with CRS installed is: 3; Wed Dec 14 05:43:44 CST 2011

..................

Checked the inventory.xml in /app/oracle/crs/oraInventory/ContentsXML/, which showed –

<HOME NAME="OraCrs10g_home" LOC="/app/oracle/crs" TYPE="O" IDX="1" CRS="true">
   <NODE_LIST>
      <NODE NAME="node1"/>
      <NODE NAME="node2"/>
      <NODE NAME="node3"/>
   </NODE_LIST>
</HOME>

Added NODE NAME “node4” to the inventory.xml using

node1: (matrix1) /app/oracle/crs/oui/bin> ./runInstaller -updateNodeList -silent "CLUSTER_NODES={node1,node2,node3,node4}" ORACLE_HOME=$CRS_HOME CRS=true

Starting Oracle Universal Installer...

No pre-requisite checks found in oraparam.ini, no system pre-requisite checks will be executed.
The inventory pointer is located at /etc/oraInst.loc
The inventory is located at /app/oracle/crs/oraInventory
'UpdateNodeList' was successful.

After it got completed, inventory.xml showed

<HOME_LIST>
<HOME NAME="OraCrs10g_home" LOC="/app/oracle/crs" TYPE="O" IDX="1" CRS="true">
   <NODE_LIST>
      <NODE NAME="node1"/>
      <NODE NAME="node2"/>
      <NODE NAME="node3"/>
      <NODE NAME="node4"/>
   </NODE_LIST>
</HOME>

Ran cluvfy once more to verify –

node1: (matrix1) /app/oracle/crs/bin> ./cluvfy stage -post crsinst -n node1,node2,node3,node4 -verbose

Performing post-checks for cluster services setup

..................
..................
Result: Check passed.


Post-check for cluster services setup was successful.

Extract keeps ABENDING – No valid archive log dest for Thread N – OGG

Recently we refreshed one of the 2-node UAT databases from the 3-node production. The 2-node UAT acts as the source for GoldenGate. After the refresh of the source and the initial load on the target, we started the extract and replicat processes. As the source was a 2-node RAC, the extract was added with 2 threads –

ADD EXTRACT ext, TRANLOG, THREADS 2, BEGIN NOW

Everything ran well for almost 2 weeks, when all of a sudden the extract kept abending and the GoldenGate error log started showing –

2011-11-14 03:44:47  INFO    OGG-01629  Oracle GoldenGate Capture for Oracle, ext.prm:  BOUNDED RECOVERY: PERSISTED OBJECTS RECOVERED: <>.
2011-11-14 03:44:51  WARNING OGG-01423  Oracle GoldenGate Capture for Oracle, ext.prm:  No valid default archive log destination directory found for thread 3.

Though the source was a 2-node RAC, the log showed – No valid default archive log destination directory found for thread 3.

Looking at the source db more carefully showed thread 3 in v$thread. Someone had mistakenly added redo log groups with thread 3 and they were in UNUSED state. Once we dropped the log groups for thread 3 and restarted the extract, everything worked fine.
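
For reference, the checks and cleanup involved look roughly like this – a sketch only, with the thread-3 group numbers assumed:

SQL> select thread#, status, enabled from v$thread;
SQL> select thread#, group#, status from v$log order by thread#, group#;

SQL> -- if an extra thread shows up, disable it and drop its unused groups
SQL> alter database disable thread 3;
SQL> alter database drop logfile group 11;   -- group numbers assumed
SQL> alter database drop logfile group 12;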

Now, suppose you have a source with 3 threads and you mistakenly added the extract with 2 threads. You can get the same error in the future; in that case (a worked sketch follows the steps) –

1. Take the showch output of the existing extract
2. Consider the earliest Recovery Checkpoint timestamp to begin the extract from
3. ggsci> delete ext
4. ggsci> add extract ext TRANLOG, THREADS 3, BEGIN {Earliest Recovery Checkpoint timestamp taken in step 2}
5. ggsci> add exttrail {path}, extract ext, megabytes XXX
6. ggsci> start ext
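
Put together, the GGSCI session would look something like the following – a sketch, with the trail path, trail size and checkpoint timestamp purely illustrative:

GGSCI> INFO EXTRACT ext, SHOWCH
GGSCI> DELETE EXTRACT ext
GGSCI> ADD EXTRACT ext, TRANLOG, THREADS 3, BEGIN 2011-11-14 03:00
GGSCI> ADD EXTTRAIL ./dirdat/aa, EXTRACT ext, MEGABYTES 100
GGSCI> START EXTRACT ext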

In case you do not wish to capture transactions from thread N, edit the extract parameter file –

THREADOPTIONS PROCESSTHREADS EXCEPT N –> N is the thread number.
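
In the parameter file it sits with the other extract parameters, roughly like this – a minimal sketch, with the user, trail path and schema purely illustrative:

EXTRACT ext
USERID gguser, PASSWORD ********
EXTTRAIL ./dirdat/aa
THREADOPTIONS PROCESSTHREADS EXCEPT 3
TABLE scott.*;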

OHASD doesn’t start – 11gR2

A few weeks back we had an issue where the 2nd node of a 4-node RAC got evicted, and the alert log showed the below errors before the instance was evicted –

Errors in file /u04/oraout/matrix/diag/rdbms/matrix_adc/matrix2/trace/matrix2_ora_8418.trc  (incident=16804):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 169.254.*.* not found. Check output from ifconfig command
Sat Oct 22 23:54:41 2011

ORA-29740: evicted by instance number 2, group incarnation 24
LMON (ospid: 29328): terminating the instance due to error 29740
Sun Oct 23 00:00:01 2011
Instance terminated by LMON, pid = 29328

We tried starting the instance with srvctl and manually using the startup command, but both failed. During the startup, the interesting thing I noticed was –

Private Interface 'bond2' configured from GPnP for use as a private interconnect.
  [name='bond2', type=1, ip=144.xx.xx.xxx, mac=xx-xx-xx-xx-xx-xx, net=144.20.xxx.xxx/xx, mask=255.255.x.x, use=cluster_interconnect/6]

But in normal cases it should have been like

Private Interface 'bond2:1' configured from GPnP for use as a private interconnect.
  [name='bond2:1', type=1, ip=169.254.*.*, mac=xx-xx-xx-xx-xx-xx, net=169.254.x.x/xx, mask=255.255.x.x, use=haip:cluster_interconnect/62]

Now the question comes up: what is “HAIP”? HAIP is High Availability IP.

Grid automatically picks free link local addresses from the reserved 169.254.*.* subnet for HAIP. According to RFC 3927, the link local subnet 169.254.*.* should not be used for any other purpose. With HAIP, by default, interconnect traffic is load balanced across all active interconnect interfaces, and the corresponding HAIP address is failed over transparently to other adapters if one fails or becomes non-communicative.

The number of HAIP addresses is decided by how many private network adapters are active when Grid comes up on the first node in the cluster. If there is only one active private network, Grid will create one. Grid Infrastructure can activate a maximum of four private network adapters at a time, even if more are defined.

A few commands to check –

$ oifcfg iflist -p -n

$ crsctl stat res -t -init   --> ora.cluster_interconnect.haip must be ONLINE

$ oifcfg getif

SQL> select inst_id, name, ip_address from gv$cluster_interconnects;

We got the network team involved, but as per them everything was fine on the network side, so we finally decided to go for a server reboot, after which the OHAS daemon wasn’t coming up automatically, though –

$ cat crsstart
enable

TEST:oracle> (matrix2:11.2.0.2_matrix) /etc/oracle/scls_scr/test/root
$ cat ohasdstr
enable

No logs in $GRID_HOME/log/test/ were getting updated, so it was a little difficult to diagnose. As ohasd.bin is responsible for starting all other clusterware processes directly or indirectly, it needs to start up properly for the rest of the stack to come up, which wasn’t happening.
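
A couple of quick OS-level checks for the OHASD startup chain – a sketch, assuming the standard 11.2 Linux setup where init respawns init.ohasd:

$ # init should have a respawn entry for the OHASD framework script
$ grep ohasd /etc/inittab

$ # check whether init.ohasd and ohasd.bin are actually running
$ ps -ef | egrep 'init.ohasd|ohasd.bin' | grep -v grep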

One of the reasons for ohasd not coming up is an rc Snn script stuck at the OS level –

 root      2744     1  0 02:20 ?        00:00:00 /bin/bash /etc/rc.d/rc 3
 root      4888  2744  0 02:30 ?        00:00:00 /bin/sh /etc/rc3.d/S98gcstartup start

This S98gcstartup was stuck. Checked the script, which turned out to be related to OMS startup. Renamed the file and rebooted the server, after which OHASD and all other resources came up successfully.

$ ls -lrt /etc/rc3.d/old_S98gcstartup
lrwxrwxrwx 1 root root 27 Jun  1 07:09 /etc/rc3.d/old_S98gcstartup -> /etc//rc.d/init.d/gcstartup

There are a few other reasons too, like an inaccessible/corrupted OLR, CRS autostart being disabled, etc.
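
Those two can be checked quickly as root – a sketch using the standard 11.2 commands:

$ ocrcheck -local        # verifies OLR integrity
$ crsctl config crs      # shows whether OHAS autostart is enabled
$ crsctl enable crs      # re-enables autostart if it was disabled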

But still I was unable to find out why we got “additional information: requested interface 169.254.*.* not found” all of a sudden when things had been running fine.

crsctl start crs doesn’t start crs – 10gR2

This morning a box got rebooted, and CRS (10.2.0.5 version) didn’t come up automatically. A manual crsctl start crs command didn’t update any log file in crsd, cssd, or even the alertxx4040.log –

xx4040: (test1) /etc/oracle/scls_scr/test1/root> cat crsstart
enable
xx4040: (test1) /etc/oracle/scls_scr/test1/root>

So, as crsstart is enabled, it should have come up automatically. Ran cluvfy –

xx4040: (test1) /home/oracle>cluvfy stage -pre crsinst -n test1,test2 -verbose

Performing pre-checks for cluster services setup

Checking node reachability...

Check: Node reachability from node "test1"
  Destination Node                      Reachable?
  ------------------------------------  ------------------------
  test1                                yes
  test2                                yes
Result: Node reachability check passed from node "test1".


Checking user equivalence...

Check: User equivalence for user "oracle"
  Node Name                             Comment
  ------------------------------------  ------------------------
  test2                                passed
  test1                                passed
Result: User equivalence check passed for user "oracle".

ERROR:
Path "/tmp/" is not a writable directory on nodes:
        test1,test2
Please choose a different work area using CV_DESTLOC.
Verification cannot proceed.

Pre-check for cluster services setup was unsuccessful on all the nodes.

/var/log/messages showed –

Oct 25 00:10:12 xx4040 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.5232.
Oct 25 00:10:12 xx4040 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.4628.
Oct 25 00:10:12 xx4040 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.5230.

Checked the ownership and permissions on /tmp –

drwxr-xr-x    14 root   root   4096 Oct 25 00:08 tmp

Changed the permissions on /tmp to 777 and started the CRS, and all started well :).

Update from the comments by Frits Hoogland –> The required permission for /tmp is 1777, where the 1 is the sticky bit.
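
So the cleaner fix is the sticky-bit variant – a quick sketch:

$ chmod 1777 /tmp        # 777 plus the sticky bit
$ ls -ld /tmp            # should now show drwxrwxrwt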

While searching on the internet, found another article which could be helpful –

http://blogs.oracle.com/gverma/entry/crsctl_start_crs_does_not_work

ORA-01567: dropping log N would leave less than 2 log files for instance UNNAMED_INSTANCE_N (thread N)

Recently we had a database refresh where a standalone db got refreshed from a 4-node RAC. After the completion of the refresh, saw 10 redo log groups. As it was a single instance database and wouldn’t have much workload, thought of dropping 6 groups.


SQL> select GROUP# from v$log;

    GROUP#
----------
         1
         2
         3
         4
         5
         6
         7
         8
         9
        10

SQL> alter database drop logfile group 10;
alter database drop logfile group 10
*
ERROR at line 1:
ORA-01567: dropping log 10 would leave less than 2 log files for instance UNNAMED_INSTANCE_4 (thread 4)
ORA-00312: online log 10 thread 4: '/u05/flashback/matrix/MATRIX/onlinelog/o1_mf_10_79t3s884_.log'

As it was a single instance database, was a little astonished to see “UNNAMED_INSTANCE_4 (thread 4)”. So started checking for some more clues –

SQL> show parameter thread

NAME                                 TYPE                              VALUE
------------------------------------ --------------------------------- ------------------------------
parallel_threads_per_cpu             integer                           2
thread                               integer                           0

SQL> show parameter cluster

NAME                                 TYPE                              VALUE
------------------------------------ --------------------------------- ------------------------------
cluster_database                     boolean                           FALSE
cluster_database_instances           integer                           1

Parameter settings seem fine. Let’s check what v$thread has to say –


SQL> select thread#,status from v$thread;

   THREAD# STATUS
---------- ------------------
         1 OPEN
         2 CLOSED
         3 CLOSED
         4 CLOSED

This is where the problem is. So, disabled the thread, which in turn removed the row from v$thread –


SQL> alter database disable thread 4;

Database altered.

SQL> alter database drop logfile group 10;

Database altered.

SQL>  alter database drop logfile group 9;

Database altered.

Similarly, disabled threads 2 and 3 and dropped their redo log groups (see the sketch below).
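
The remaining cleanup looked roughly like this – a sketch, with the group-to-thread mapping assumed; the actual groups belonging to each thread come from v$log:

SQL> select thread#, group# from v$log order by thread#, group#;

SQL> alter database disable thread 3;
SQL> alter database drop logfile group 7;    -- groups of thread 3 (assumed numbers)
SQL> alter database drop logfile group 8;

SQL> alter database disable thread 2;
SQL> alter database drop logfile group 5;    -- groups of thread 2 (assumed numbers)
SQL> alter database drop logfile group 6;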


SQL> select thread#,status from v$thread;

   THREAD# STATUS
---------- ------------------
         1 OPEN