OHASD doesn’t start – 11gR2
A few weeks back we had an issue where the 2nd node of a 4-node RAC got evicted, and the alert log showed the errors below before the instance was terminated:
Errors in file /u04/oraout/matrix/diag/rdbms/matrix_adc/matrix2/trace/matrix2_ora_8418.trc (incident=16804):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 169.254.*.* not found. Check output from ifconfig command
Sat Oct 22 23:54:41 2011
ORA-29740: evicted by instance number 2, group incarnation 24
LMON (ospid: 29328): terminating the instance due to error 29740
Sun Oct 23 00:00:01 2011
Instance terminated by LMON, pid = 29328
We tried starting the instance with srvctl and manually with the startup command, but both failed. During the startup, the interesting thing I noticed was this line:
Private Interface 'bond2' configured from GPnP for use as a private interconnect. [name='bond2', type=1, ip=144.xx.xx.xxx, mac=xx-xx-xx-xx-xx-xx, net=144.20.xxx.xxx/xx, mask=255.255.x.x, use=cluster_interconnect/6]
In the normal case it should have looked like this:
Private Interface 'bond2:1' configured from GPnP for use as a private interconnect. [name='bond2:1', type=1, ip=169.254.*.*, mac=xx-xx-xx-xx-xx-xx, net=169.254.x.x/xx, mask=255.255.x.x, use=haip:cluster_interconnect/62]
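To tell the two cases apart quickly when scanning an alert log, the ip= field of the GPnP line can be pulled out and checked against the 169.254 link-local range. This is only a minimal sketch: the function name classify_gpnp_line is hypothetical, and it assumes the line has the same field layout as the two lines shown above.

```shell
#!/bin/sh
# Hypothetical helper: classify a "Private Interface ... configured from GPnP"
# alert-log line by the address in its ip= field.
# Prints "haip" for a 169.254.x.x address, "no-haip" for any other address,
# and "unknown" if no ip= field could be found.
classify_gpnp_line() {
    # Extract everything between "ip=" and the next comma.
    ip=$(printf '%s\n' "$1" | sed -n 's/.*ip=\([^,]*\),.*/\1/p')
    case "$ip" in
        "")        echo "unknown" ;;
        169.254.*) echo "haip" ;;
        *)         echo "no-haip" ;;
    esac
}
```

Passing the first line above (ip=144.xx.xx.xxx) prints no-haip, while the second (ip=169.254.*.*) prints haip.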
Now the question comes up: what is “haip”? HAIP stands for High Availability IP.
Grid automatically picks free link-local addresses from the reserved 169.254.*.* subnet for HAIP. According to RFC 3927, the link-local subnet 169.254.*.* should not be used for any other purpose. With HAIP, interconnect traffic is by default load balanced across all active interconnect interfaces, and the corresponding HAIP address fails over transparently to another adapter if one fails or becomes non-communicative.
The number of HAIP addresses is decided by how many private network adapters are active when Grid comes up on the first node in the cluster. If there’s only one active private network, Grid will create one. Grid Infrastructure can activate a maximum of four private network adapters at a time, even if more are defined.
A few commands to check:
$ oifcfg iflist -p -n
$ crsctl stat res -t -init    --> ora.cluster_interconnect.haip must be ONLINE
$ oifcfg getif
select inst_id,name,ip_address from gv$cluster_interconnects;
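The HAIP check can also be scripted rather than read off the table by eye. The sketch below is only an illustration: the function name haip_state is hypothetical, and it assumes that running crsctl stat res without -t prints NAME=, TYPE=, TARGET= and STATE= lines, with STATE looking like "ONLINE on <node>". Verify that against your own output before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: read "crsctl stat res <resource> -init" output on stdin
# (assumed to be KEY=VALUE lines) and print the first word of STATE.
haip_state() {
    awk -F= '$1 == "STATE" { split($2, a, " "); print a[1]; exit }'
}

# Intended usage on a cluster node (not run here):
#   crsctl stat res ora.cluster_interconnect.haip -init | haip_state
```

Anything other than ONLINE for ora.cluster_interconnect.haip deserves a closer look.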
We got the network team involved, but according to them everything was fine on the network side, so we finally decided to reboot the server. After the reboot, the OHAS daemon did not come up automatically, even though autostart was enabled:
$ cat crsstart
enable
TEST:oracle> (matrix2:184.108.40.206_matrix) /etc/oracle/scls_scr/test/root
$ cat ohasdstr
enable
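The same two files can be checked in one go. This is a minimal sketch, assuming only what the output above shows: that crsstart and ohasdstr live in the scls_scr directory for the host and contain the single word enable when autostart is on. The function name check_autostart_flags is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the contents of ohasdstr and crsstart in the
# given directory and return non-zero unless both contain "enable".
check_autostart_flags() {
    dir=$1
    status=0
    for f in ohasdstr crsstart; do
        if [ ! -r "$dir/$f" ]; then
            echo "$f: missing or unreadable"
            status=1
            continue
        fi
        # Strip whitespace and newlines so "enable\n" compares cleanly.
        value=$(tr -d '[:space:]' < "$dir/$f")
        echo "$f: $value"
        if [ "$value" != "enable" ]; then
            status=1
        fi
    done
    return $status
}

# Intended usage, with the directory from the listing above (not run here):
#   check_autostart_flags /etc/oracle/scls_scr/test/root
```

In our case both flags were fine, so this only rules out one cause.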
No logs in $GRID_HOME/log/test/ were getting updated, so it was a little difficult to diagnose. Since ohasd.bin is responsible for starting all the other clusterware processes, directly or indirectly, it has to start properly for the rest of the stack to come up, and that wasn’t happening.
One of the reasons ohasd may not come up is an rc Snn script (Snn<command>) stuck at the OS level, since init.ohasd is only spawned once the rc scripts for the runlevel have finished:
root 2744 1 0 02:20 ? 00:00:00 /bin/bash /etc/rc.d/rc 3
root 4888 2744 0 02:30 ? 00:00:00 /bin/sh /etc/rc3.d/S98gcstartup start
This S98gcstartup script was stuck. Checking the script showed that it was related to OMS startup. We renamed the file and rebooted the server, after which OHASD and all other resources came up successfully.
$ ls -lrt /etc/rc3.d/old_S98gcstartup
lrwxrwxrwx 1 root root 27 Jun 1 07:09 /etc/rc3.d/old_S98gcstartup -> /etc//rc.d/init.d/gcstartup
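To spot an rc start script that is still running long after boot, the process list can be filtered for /etc/rcN.d/Snn entries. This is a rough sketch: the function name find_running_rc_scripts is hypothetical, and it assumes the ps -ef column layout shown earlier (UID PID PPID C STIME TTY TIME CMD).

```shell
#!/bin/sh
# Hypothetical helper: read "ps -ef" output on stdin and print
# "PID STIME script" for every process whose command line includes
# an /etc/rcN.d/Snn... start script.
find_running_rc_scripts() {
    awk '{
        # The command starts at field 8 in ps -ef output.
        for (i = 8; i <= NF; i++) {
            if ($i ~ /^\/etc\/rc[0-9]\.d\/S[0-9][0-9]/) {
                print $2, $5, $i
                break
            }
        }
    }'
}

# Intended usage (not run here):
#   ps -ef | find_running_rc_scripts
```

A script that still shows up here well after its start time, like S98gcstartup in our case, is a likely suspect.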
There are a few other possible reasons too, such as an inaccessible or corrupted OLR, CRS autostart being disabled, etc.
But I was still unable to find out why we suddenly got “additional information: requested interface 169.254.*.* not found” when things had been running fine.