Clusterware version consistency failed

Recently we rolled back (qtree snapshots were restored) an 11.2.0.3 2-node RAC to 10.2.0.5 after testing a successful upgrade. It was again time for a mock upgrade to 11.2.0.3 before performing it on production. We started runInstaller for the CRS upgrade and, after the “Prerequisite Checks”, it showed “clusterware version consistency failed” for both the nodes.

The to-be new 11gR2 GRID_HOME –> /u01/app/grid/11.2.0
The existing 10gR2 CRS_HOME –> /u01/app/oracle/product/crs

xy4000: (node1)crsctl query crs softwareversion
CRS software version on node [xy4000] is [10.2.0.5.0]
xy4000: (node1)crsctl query crs activeversion
CRS active version on the cluster is [10.2.0.5.0]

xy4001: (node2) /u01/app/grid> crsctl query crs activeversion
CRS active version on the cluster is [10.2.0.5.0]
xy4001: (node2) /u01/app/grid> crsctl query crs softwareversion
CRS software version on node [xy4001] is [10.2.0.5.0]

So the active and software versions look correct on both the nodes. We started runInstaller in debug mode and saw –

[pool-1-thread-1] [ 2012-03-13 02:18:32.093 CDT ] [UnixSystem.getCRSHome:2762]  remote copy file result=1| :successful
[pool-1-thread-1] [ 2012-03-13 02:18:32.094 CDT ] [UnixSystem.getCRSHome:2786]  configFile=/tmp/olr.loc13316231114317895692168542383724.tmp
[pool-1-thread-1] [ 2012-03-13 02:18:32.097 CDT ] [Utils.getPropertyValue:241]  keyName=olrconfig_loc props.val=/u01/app/grid/11.2.0/cdata/xy4001.olr propValue=/u01/app/grid/11.2.0/cdata/xy4001.olr
[pool-1-thread-1] [ 2012-03-13 02:18:32.100 CDT ] [Utils.getPropertyValue:241]  keyName=crs_home props.val=/u01/app/grid/11.2.0 propValue=/u01/app/grid/11.2.0
[pool-1-thread-1] [ 2012-03-13 02:18:32.103 CDT ] [Utils.getPropertyValue:301]  propName=crs_home propValue=/u01/app/grid/11.2.0
[pool-1-thread-1] [ 2012-03-13 02:18:32.106 CDT ] [UnixSystem.getCRSHome:2794]  crs_home=/u01/app/grid/11.2.0

This gave us the clue that the installer was reading the OLR (Oracle Local Registry – new in 11gR2). We checked /etc/oracle, and olr.loc existed there – a leftover from the earlier 11gR2 upgrade test (note its timestamp compared to the other files).

xy4000: (node1) /etc/oracle> ls -lrt
total 2244
drwxr-xr-x  3 root dba    4096 Apr 19  2009 scls_scr
-rw-r--r--  1 root dba     131 Apr 19  2009 ocr.loc
-rw-r--r--  1 root dba      82 Feb 29 01:06 olr.loc

We renamed olr.loc on both the nodes and started runInstaller again, and all went fine after it 🙂
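
For reference, the rename on each node was along these lines (run as root; the backup name is just illustrative) –

# mv /etc/oracle/olr.loc /etc/oracle/olr.loc.bkp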

OHASD doesn’t start – 11gR2

A few weeks back we had an issue where the 2nd node of a 4-node RAC got evicted, and the alert log showed the below errors before the instance was evicted –

Errors in file /u04/oraout/matrix/diag/rdbms/matrix_adc/matrix2/trace/matrix2_ora_8418.trc  (incident=16804):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 169.254.*.* not found. Check output from ifconfig command
Sat Oct 22 23:54:41 2011

ORA-29740: evicted by instance number 2, group incarnation 24
LMON (ospid: 29328): terminating the instance due to error 29740
Sun Oct 23 00:00:01 2011
Instance terminated by LMON, pid = 29328

We tried starting the instance with srvctl and manually using the STARTUP command, but both failed. During the startup, the interesting thing I noticed was –

Private Interface 'bond2' configured from GPnP for use as a private interconnect.
  [name='bond2', type=1, ip=144.xx.xx.xxx, mac=xx-xx-xx-xx-xx-xx, net=144.20.xxx.xxx/xx, mask=255.255.x.x, use=cluster_interconnect/6]

But in normal cases it should have looked like –

Private Interface 'bond2:1' configured from GPnP for use as a private interconnect.
  [name='bond2:1', type=1, ip=169.254.*.*, mac=xx-xx-xx-xx-xx-xx, net=169.254.x.x/xx, mask=255.255.x.x, use=haip:cluster_interconnect/62]

Now the question comes up – what is “haip”? HAIP is High Availability IP.

Grid automatically picks free link-local addresses from the reserved 169.254.*.* subnet for HAIP. According to RFC 3927, the link-local subnet 169.254.*.* should not be used for any other purpose. With HAIP, by default, interconnect traffic is load balanced across all active interconnect interfaces, and the corresponding HAIP address is failed over transparently to another adapter if one fails or becomes non-communicative.

The number of HAIP addresses is decided by how many private network adapters are active when Grid comes up on the first node in the cluster. If there is only one active private network, Grid will create one HAIP address. Grid Infrastructure can activate a maximum of four private network adapters at a time, even if more are defined.

A few commands to check –

$ oifcfg iflist -p -n

$ crsctl stat res -t -init   --> ora.cluster_interconnect.haip must be ONLINE

$ oifcfg getif

select inst_id,name,ip_address from gv$cluster_interconnects;
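
An additional OS-level check is to confirm that the 169.254.*.* HAIP address is actually plumbed on the private interface; a quick sketch on Linux (the interface name bond2 is from our setup) –

$ /sbin/ifconfig bond2:1
$ ip addr show bond2 | grep 169.254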

We got the network team involved, but as per them everything was fine on the network side, so we finally decided to go for a server reboot, after which the OHAS daemon wasn’t coming up automatically, even though autostart was enabled –

TEST:oracle> (matrix2:11.2.0.2_matrix) /etc/oracle/scls_scr/test/root
$ cat crsstart
enable

$ cat ohasdstr
enable

No logs in $GRID_HOME/log/test/ were getting updated, so it was a little difficult to diagnose. As ohasd.bin is responsible for starting up all other clusterware processes, directly or indirectly, it needs to start properly for the rest of the stack to come up, which wasn’t happening.

One of the reasons for ohasd not coming up is an rc Snn script stuck at the OS level –

 root      2744     1  0 02:20 ?        00:00:00 /bin/bash /etc/rc.d/rc 3
 root      4888  2744  0 02:30 ?        00:00:00 /bin/sh /etc/rc3.d/S98gcstartup start
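
The elapsed-time column from ps is a quick way to confirm that such a script has been hanging for a while rather than just running slowly (a sketch; adjust the pattern to your runlevel layout) –

$ ps -eo pid,etime,args | grep "rc[0-9].d/S" | grep -v grep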

This S98gcstartup script was stuck. We checked the script, which showed it was related to OMS startup. We renamed the file and got the server rebooted, after which OHASD and all other resources came up successfully.

$ ls -lrt /etc/rc3.d/old_S98gcstartup
lrwxrwxrwx 1 root root 27 Jun  1 07:09 /etc/rc3.d/old_S98gcstartup -> /etc//rc.d/init.d/gcstartup

There are a few other reasons too, like an inaccessible/corrupted OLR, CRS autostart being disabled, etc.
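
A couple of quick checks for those cases (run as root on 11gR2 Grid Infrastructure) –

# crsctl config crs       --> reports whether Oracle High Availability Services autostart is enabled
# ocrcheck -local         --> verifies that the OLR is accessible and consistent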

But still, I was unable to find why we got “additional information: requested interface 169.254.*.* not found” all of a sudden when things had been running fine.

ERROR OGG-00446 – Oracle GoldenGate Capture for Oracle: No valid log files for current redo sequence XXX, thread X

I was trying to set up Oracle GoldenGate (OGG) on my test box, which has an 11gR2 database with ASM. Starting the Extract process on the source showed –

GGSCI (anand-lab) 6> START EXTRACT ext1

Sending START request to MANAGER ...
EXTRACT EXT1 starting


GGSCI (anand-lab) 7> INFO EXTRACT ext1

EXTRACT    EXT1      Initialized   2011-07-17 14:22   Status STOPPED
Checkpoint Lag       00:00:00 (updated 01:22:49 ago)
Log Read Checkpoint  Oracle Redo Logs
                     2011-07-17 14:22:50  Seqno 0, RBA 0

The Extract process stopped. I checked the log, which showed –

2011-07-17 15:44:35  INFO    OGG-00975  Oracle GoldenGate Manager for Oracle, MGR.prm:  EXTRACT EXT1 starting.
2011-07-17 15:44:36  INFO    OGG-00992  Oracle GoldenGate Capture for Oracle, EXT1.prm:  EXTRACT EXT1 starting.
2011-07-17 15:44:36  INFO    OGG-01635  Oracle GoldenGate Capture for Oracle, EXT1.prm:  BOUNDED RECOVERY: reset to initial or altered checkpoint.
2011-07-17 15:44:39  INFO    OGG-01515  Oracle GoldenGate Capture for Oracle, EXT1.prm:  Positioning to begin time Jul 17, 2011 2:22:50 PM.
2011-07-17 15:45:00  ERROR   OGG-00446  Oracle GoldenGate Capture for Oracle, EXT1.prm:  No valid log files for current redo sequence 140, thread 1, error retrieving redo file name for sequence 140, archived = 0, use_alternate = 0. Not able to establish initial position for begin time 2011-07-17 14:22:50.
2011-07-17 15:45:00  ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, EXT1.prm:  PROCESS ABENDING.

The parameter file for the Online Extract group ext1

EXTRACT ext1
USERID gg_owner, PASSWORD gg123
RMTHOST anand-lab, MGRPORT 7809
RMTTRAIL /media/sf_database/gg/dirdat/rt
TABLE hr.jobs;
TABLE scott.emp;

As the redo log files were stored under ASM and the process was unable to connect to ASM, it led to the error. So, for the Extract process to run successfully, specify a user that can connect to the ASM instance by adding the below to the Extract parameter file –

TRANLOGOPTIONS ASMUSER {user}@{ASM_TNS_ALIAS} ASMPASSWORD {password}

I edited my Extract parameter file with the required parameter, and the Extract started running without errors –

EXTRACT ext1
USERID gg_owner, PASSWORD gg123
TRANLOGOPTIONS ASMUSER sys@ASM ASMPASSWORD sysasm123
RMTHOST anand-lab, MGRPORT 7809
RMTTRAIL /media/sf_database/gg/dirdat/rt
TABLE hr.jobs;
TABLE scott.emp;
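
One thing to note – the ASM_TNS_ALIAS (ASM above) needs a working entry in the tnsnames.ora used by GoldenGate. A minimal sketch for my test box (host and port are assumptions; depending on the listener setup you may need (UR=A), as below, or a statically registered SID to reach the blocked +ASM service) –

ASM =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = anand-lab)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = +ASM)
      (UR = A)
    )
  )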

ERROR: No checkpoint table specified for ADD REPLICAT – Oracle GoldenGate

This is just a quick note for myself to remember and reference, in case someone comes across this error while setting up Oracle GoldenGate.

While configuring online synchronization replication, a default checkpoint table is created. The table’s name is mentioned in a file named GLOBALS –

GGSCI (anand-lab) 29> EDIT PARAMS ./GLOBALS

GGSCHEMA GG_OWNER
CHECKPOINTTABLE GG_OWNER.CKPTAB

GGSCI (anand-lab) 31> DBLOGIN USERID gg_owner,PASSWORD gg123
Successfully logged into database.

GGSCI (anand-lab) 33> ADD CHECKPOINTTABLE CKPTAB

Successfully created checkpoint table CKPTAB.

Now, when you try to create the Replicat group, the ADD REPLICAT command might produce an error –

GGSCI (anand-lab) 41> ADD REPLICAT rep1, EXTTRAIL /media/sf_database/gg/dirdat/rt
ERROR: No checkpoint table specified for ADD REPLICAT.

The solution is simply to exit that GGSCI session and start another one before issuing ADD REPLICAT. The ADD REPLICAT command fails if issued from the same session in which the GLOBALS file was created using the GGSCI command “EDIT PARAMS ./GLOBALS”. This is because GGSCI reads the name of the checkpoint table from GLOBALS, and the session in which GLOBALS was created cannot read the file.
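
So the working sequence is roughly as below (the prompt numbers are just illustrative; DBLOGIN is re-issued in the new session before ADD REPLICAT) –

GGSCI (anand-lab) 42> EXIT

$ ./ggsci

GGSCI (anand-lab) 1> DBLOGIN USERID gg_owner, PASSWORD gg123

GGSCI (anand-lab) 2> ADD REPLICAT rep1, EXTTRAIL /media/sf_database/gg/dirdat/rt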

Reference – MOS 965256.1