
Log Files for Troubleshooting Oracle RAC issues

The cluster has a number of log files that can be examined to gain insight into problems as they occur. A good place to start diagnosing cluster problems is $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log. All clusterware log files are stored under the $ORA_CRS_HOME/log/ directory.

1. alert<nodename>.log : Important clusterware alerts are stored in this log file. It is stored as $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log.

2. crsd.log : CRS logs are stored in the $ORA_CRS_HOME/log/<hostname>/crsd/ directory. The crsd.log file is archived every 10MB as crsd.101, crsd.102, and so on.

3. cssd.log : CSS logs are stored in the $ORA_CRS_HOME/log/<hostname>/cssd/ directory. The cssd.log file is archived every 20MB as cssd.101, cssd.102, and so on.

4. evmd.log : EVM logs are stored in the $ORA_CRS_HOME/log/<hostname>/evmd/ directory.

5. OCR logs : OCR (ocrdump, ocrconfig, ocrcheck) log files are stored in the $ORA_CRS_HOME/log/<hostname>/client/ directory.

6. SRVCTL logs : srvctl logs are stored in two locations, $ORA_CRS_HOME/log/<hostname>/client/ and $ORACLE_HOME/log/<hostname>/client/.

7. RACG logs : The high availability trace files are stored in two locations, $ORA_CRS_HOME/log/<hostname>/racg/ and $ORACLE_HOME/log/<hostname>/racg/. RACG contains log files for node applications such as VIP, ONS, etc.

ONS log filename = ora.<hostname>.ons.log
VIP log filename = ora.<hostname>.vip.log

Each RACG executable has a subdirectory assigned exclusively to it:

racgeut : $ORA_CRS_HOME/log/<hostname>/racg/racgeut/
racgevtf : $ORA_CRS_HOME/log/<hostname>/racg/racgevtf/
racgmain : $ORA_CRS_HOME/log/<hostname>/racg/racgmain/
racgeut : $ORACLE_HOME/log/<hostname>/racg/racgeut/
racgmain : $ORACLE_HOME/log/<hostname>/racg/racgmain/
racgmdb : $ORACLE_HOME/log/<hostname>/racg/racgmdb/
racgimon : $ORACLE_HOME/log/<hostname>/racg/racgimon/

As in a normal single-instance Oracle environment, a RAC environment also contains the standard RDBMS log files. These files are located by the following parameters:

background_dump_dest contains the alert log and background process trace files.

user_dump_dest contains any trace files generated by user processes.

core_dump_dest contains core files that are generated due to a core dump in a user process.
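For example, to watch the clusterware alert log on the local node (assuming the standard location described above), something like the following can be used:

% tail -f $ORA_CRS_HOME/log/`hostname -s`/alert`hostname -s`.log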

RAC - Issues & Troubleshooting

Whenever a node is having issues rejoining the cluster after a reboot, here is a quick checklist I would suggest:

/var/log/messages
ifconfig
ip route
/etc/hosts
/etc/sysconfig/network-scripts/ifcfg-eth*
ethtool
mii-tool
cluvfy
$ORA_CRS_HOME/log
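As a brief illustration of working through this list, the interconnect NIC and overall node connectivity can be checked as follows; eth1 as the private interconnect interface is only an assumption, substitute your own device and node names:

# ethtool eth1
# ip route
% $ORA_CRS_HOME/bin/cluvfy comp nodecon -n all -verbose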

Let us now take a closer look at specific issues with examples and the steps taken for their resolution. These were all tested on an Oracle 10.2.0.4 database on RHEL4 U8 x86-64.

1. srvctl is not able to start the Oracle instance, but sqlplus is able to start it

a. Check the racg log for the actual error message.
% more $ORACLE_HOME/log/`hostname -s`/racg/ora.{DBNAME}.{INSTANCENAME}.inst.log

b. Check if srvctl is configured to use the correct parameter file (pfile/spfile).
% srvctl config database -d {DBNAME} -a

You can also validate the parameter file using sqlplus to see the exact error message.

c. Check the ownership of $ORACLE_HOME/log. If this is owned by root, srvctl won't be able to start the instance as the oracle user.
# chown -R oracle:dba $ORACLE_HOME/log
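If step b shows srvctl pointing at a wrong or missing server parameter file, it can be corrected with srvctl modify; the spfile path below is only a placeholder:

% srvctl modify database -d {DBNAME} -p /shared/oradata/{DBNAME}/spfile{DBNAME}.ora
% srvctl config database -d {DBNAME} -a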

2. VIP has failed over to another node but is not coming back to the original node

Fix: On the node where the VIP has failed over, bring the VIP interface down manually as root.
Example: # ifconfig eth0:2 down
PS: Be careful to bring down only the VIP. A small typo may bring down your public interface :)
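Once the failed-over VIP alias has been brought down, the node applications can be restarted on the original node. A minimal sketch, with test-server1 as a placeholder for the home node:

% srvctl status nodeapps -n test-server1
% srvctl start nodeapps -n test-server1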

3. Moving OCR to a different location

PS: This can be done as root while CRS is up. While trying to move the OCR or the OCR mirror to a new location, ocrconfig complains if the target file does not exist. The fix is to touch the new file first.
Example:
# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile
PROT-21: Invalid parameter

# touch /crs_new/cludata/ocrfile
# chown root:dba /crs_new/cludata/ocrfile
# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile

Verify:
a. Validate using "ocrcheck". Device/File Name should point to the new location, with the integrity check succeeded.
b. Ensure the OCR inventory is updated correctly.
# cat /etc/oracle/ocr.loc
ocrconfig_loc and ocrmirrorconfig_loc should point to the correct locations.
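For reference, after the move /etc/oracle/ocr.loc would be expected to look something like the following; the primary OCR path here is only a placeholder, the mirror path is from the example above:

ocrconfig_loc=/crs/cludata/ocrfile
ocrmirrorconfig_loc=/crs_new/cludata/ocrfile
local_only=FALSE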

4. Moving Voting Disk to a different location

PS: CRS must be down while moving the voting disk.

The idea is to add new voting disks and delete the older ones. Below are sample errors and their fix.
# crsctl add css votedisk /crs_new/cludata/cssfile_new
Cluster is not in a ready state for online disk addition

We need to use the force option. However, before using it, ensure CRS is down. If CRS is up, DO NOT use the force option, or it may corrupt your OCR.

# crsctl add css votedisk /crs_new/cludata/cssfile_new -force
Now formatting voting disk: /crs_new/cludata/cssfile_new
successful addition of votedisk /crs_new/cludata/cssfile_new.

Verify using "crsctl query css votedisk" and then delete the old voting disks. While deleting too, you'll need to use the force option.

Also verify the permissions of the voting disk files. They should be oracle:dba. If the voting disks were added as root, the ownership should be changed to oracle:dba.
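A minimal sketch of the verification and cleanup, assuming /crs/cludata/cssfile is the old voting disk path (a placeholder) and that CRS is still down:

# crsctl query css votedisk
# crsctl delete css votedisk /crs/cludata/cssfile -force
# chown oracle:dba /crs_new/cludata/cssfile_new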

5. Manually registering the listener resource in the OCR

The listener was registered manually with the OCR, but srvctl was unable to bring up the listener. Let us first see an example of how to do this manually. From an existing available node, print the listener resource profile:

% crs_stat -p ora.test-server2.LISTENER_TEST-SERVER2.lsnr > /tmp/res
% cat /tmp/res
NAME=ora.test-server2.LISTENER_TEST-SERVER2.lsnr
TYPE=application
ACTION_SCRIPT=/orahome/ora10g/product/10.2.0/db_1/bin/racgwrap
ACTIVE_PLACEMENT=0
AUTO_START=1
CHECK_INTERVAL=600
DESCRIPTION=CRS application for listener on node
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=test-server2
OPTIONAL_RESOURCES=
PLACEMENT=restricted

REQUIRED_RESOURCES=ora.test-server2.vip
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=600
START_TIMEOUT=0
STOP_TIMEOUT=0
UPTIME_THRESHOLD=7d
USR_ORA_ALERT_NAME=
USR_ORA_CHECK_TIMEOUT=0
USR_ORA_CONNECT_STR=/ as sysdba
USR_ORA_DEBUG=0
USR_ORA_DISCONNECT=false
USR_ORA_FLAGS=
USR_ORA_IF=
USR_ORA_INST_NOT_SHUTDOWN=
USR_ORA_LANG=
USR_ORA_NETMASK=
USR_ORA_OPEN_MODE=
USR_ORA_OPI=false
USR_ORA_PFILE=
USR_ORA_PRECONNECT=none
USR_ORA_SRV=
USR_ORA_START_TIMEOUT=0
USR_ORA_STOP_MODE=immediate
USR_ORA_STOP_TIMEOUT=0
USR_ORA_VIP=

Modify the relevant parameters in the resource file to point to the correct node and listener, and rename it as resourcename.cap:
% mv /tmp/res /tmp/ora.test-server1.LISTENER_TEST-SERVER1.lsnr.cap

Register it with the OCR:
% crs_register ora.test-server1.LISTENER_TEST-SERVER1.lsnr -dir /tmp/

Start the listener:
% srvctl start listener -n test-server1

While trying to start the listener, srvctl throws errors like "Unable to read from listener log file", even though the listener log file exists. If the resource was registered as root, srvctl won't be able to start it as the oracle user. So all of the above operations for registering the listener manually should be done as the oracle user.
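If the resource has already been registered as root, one alternative to dropping and re-registering it (a sketch, not the fix described above) is to hand ownership back to oracle with crs_setperm, run as root:

# crs_setperm ora.test-server1.LISTENER_TEST-SERVER1.lsnr -o oracle
# crs_setperm ora.test-server1.LISTENER_TEST-SERVER1.lsnr -g dba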

6. Services

While checking the status of a service, it says "not running". If we try to start it using srvctl, the error message is "No such service exists" or "already running". If we try to add a service with the same name, it says "already exists". This happens because the service is in an "Unknown" state in the OCR. Using crs_stat, check if any related resource for the service (resource names ending with .srv and .cs) is still lying around.

srvctl remove service -f has been tried and the issue persists. Here is the fix:
# crs_stop -f {resourcename}
# crs_unregister {resourcename}
Now the service can be added and started correctly.
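As an illustration, for a database testdb with a stuck service svc_batch (both names hypothetical), the leftover resources would typically be named ora.testdb.svc_batch.cs and ora.testdb.svc_batch.testdb1.srv:

% crs_stat | grep svc_batch
# crs_stop -f ora.testdb.svc_batch.testdb1.srv
# crs_stop -f ora.testdb.svc_batch.cs
# crs_unregister ora.testdb.svc_batch.testdb1.srv
# crs_unregister ora.testdb.svc_batch.cs
% srvctl add service -d testdb -s svc_batch -r testdb1
% srvctl start service -d testdb -s svc_batch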

7. Post host reboot, CRS is not starting

After a host reboot, CRS was not coming up, and there were no CRS logs in $ORA_CRS_HOME. Check /var/log/messages:
"Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.9559"
No logs were seen in /tmp/crsctl.*

Run cluvfy to identify the issue:
% $ORA_CRS_HOME/bin/cluvfy stage -post crsinst -n {nodename}

/tmp was not writable

/etc/fstab was incorrect and was fixed to make /tmp available again.

If you see messages like "Shutdown CacheLocal. my hash ids don't match" in the CRS log, then check whether /etc/oracle/ocr.loc is the same across all nodes of the cluster.
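A quick way to compare ocr.loc across nodes, assuming user equivalence (ssh) is set up and node1/node2 are placeholders for your node names:

% for h in node1 node2; do echo "== $h =="; ssh $h cat /etc/oracle/ocr.loc; done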

8. CRS binaries restored by copying from an existing node in the cluster

CRS is not starting, with the following message in /var/log/messages:
"Id "h1" respawning too fast: disabled for 5 minutes"

CRSD log showing "no listener"

If the CRS binaries are restored by copying from an existing node in the cluster, then you need to ensure:
a. Hostnames are modified correctly in $ORA_CRS_HOME/log.
b. You may need to clean up the socket files in /var/tmp/.oracle (a sketch follows the note below).

PS: Exercise caution while working with the socket files. If CRS is up, you should never touch those files, otherwise a reboot may be inevitable.
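A sketch of the socket file cleanup, to be attempted only after confirming that no clusterware daemons are running on the node:

% ps -ef | egrep 'crsd.bin|ocssd.bin|evmd.bin' | grep -v grep
# rm -f /var/tmp/.oracle/*

The ps command must return nothing before the rm is run.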

9. Node rebooting frequently, initiated by oprocd

Check the logs in /etc/oracle/oprocd/ and grep for "Rebooting". Check /var/log/messages and grep for "restart". If the timestamps match, this confirms the reboots are being initiated by the oprocd process.

% ps -ef | grep oprocd
root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f

-t 1000 means oprocd wakes up every 1000 ms.
-m 500 means a margin of error of up to 500 ms is allowed.
Basically, with these options, if oprocd wakes up after more than 1.5 seconds it is going to force a reboot.

This is conceptually analogous to what the hangcheck-timer module used to do in pre-10.2.0.4 Oracle releases on Linux.

The fix is to set CSS diagwait to 13:
# crsctl set css diagwait 13 -force

# /oracle/product/crs/bin/crsctl get css diagwait
13

This actually changes the parameters oprocd runs with:
% ps -ef | grep oprocd
root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f

Note that the margin has now changed to 10000 ms, i.e. 10 seconds, in place of the default 0.5 seconds.

PS: Setting diagwait requires a full shutdown of Oracle Clusterware on ALL nodes.
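A sketch of the full sequence, run as root (the stop and start are repeated on every node of the cluster; the set command is run on one node only):

# crsctl stop crs
# crsctl set css diagwait 13 -force
# crsctl start crs
# crsctl get css diagwait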

10. Cluster hung. All SQL queries on GV$ views are hanging.

The alert logs from all instances have messages like the ones below:
INST1: IPC Send timeout detected. Receiver ospid 1650

INST2: IPC Send timeout detected. Sender: ospid 24692 Receiver: inst 1 binc 150 ospid 1650

INST3: IPC Send timeout detected. Sender: ospid 12955 Receiver: inst 1 binc 150 ospid 1650

The receiver ospid reported by all instances belongs to LCK0, the Lock Process, on instance 1. In case of inter-instance lock issues, it's important to identify the instance where the problem originates. As seen above, INST1 is the one that needs to be fixed. Identify the process that is causing the row cache lock and kill it; otherwise, reboot node 1.
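As a quick confirmation on node 1 that the reported receiver really is LCK0 before deciding what to kill (ospid 1650 is taken from the alert log excerpt above; the output line is only illustrative):

% ps -ef | grep 1650 | grep -v grep
oracle    1650     1  0 Feb27 ?  00:05:12 ora_lck0_INST1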

11. Inconsistent OCR with invalid permissions

% srvctl add db -d testdb -o /oracle/product/10.2
PRKR-1005 : adding of cluster database testdb configuration failed, PROC-5: User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]

crs_stat doesn't show any trace of it, so utilities like crs_setperm/crs_unregister/crs_stop won't work in this case.

ocrdump shows:
[DATABASE.LOG.testdb]
UNDEF :
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}

[DATABASE.LOG.testdb.INSTANCE]
UNDEF :
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}

These keys are owned by root, and that's the problem. It means the resource was probably added to the OCR as root. Although it has since been removed by root, it now cannot be added by the oracle user unless we get rid of the leftover keys above.

Shut down the entire cluster and either restore from a previous good backup of the OCR using:
# ocrconfig -restore backupfilename

You can get the list of backups using:
# ocrconfig -showbackup

If you are not sure of the last good backup, you can also do the following. Take an export backup of the OCR using:
# ocrconfig -export /tmp/export -s online

Edit /tmp/export and remove those 2 lines pointing to DATABASE.LOG.testdb and DATABASE.LOG.testdb.INSTANCE owned by root

Import it back now:
# ocrconfig -import /tmp/export

After starting the cluster, verify using ocrdump. The OCRDUMPFILE should not have any trace of those leftover log entries owned by root.

Troubleshooting Oracle Clusters and Oracle RAC


Oracle Clusterware has its moments. There are times when it does not want to start for various reasons. In this section we talk about diagnosing the health of the cluster, collecting diagnostic information, and trying to correct problems with Oracle Clusterware.

Checking the Health of the Cluster


Once you have configured a new cluster you will probably want to check the health of that cluster. Additionally you might want to check the health of the cluster after you have added or removed a node, or if there is something about the cluster that causes you to suspect it is suffering from some problem. Follow these steps to give your cluster a full health check:

1. Use the crsctl check has command to check that OHASD is running on the local node and that it is healthy:

[oracle@rac1 admin]$ crsctl check has
CRS-4638: Oracle High Availability Services is online

2. Use the crsctl check crs command to check the OHASD, CRSD, ocssd and EVM daemons.

[oracle@rac1 admin]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

3. Use the crsctl check cluster -all command to check the daemons on all nodes of the cluster.

[oracle@rac1 admin]$ crsctl check cluster -all
********************************************************
rac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

4. Check the cluster logs for any error messages that might have been logged.

If these steps do not indicate a problem but you still feel there is a problem with the cluster, you can do the following:

1. Stop the cluster (crsctl stop cluster).
2. Start the cluster (crsctl start cluster), monitoring the startup messages to see if any errors occur; a sketch of the commands follows this list.
3. If errors do occur or you still feel there is a problem, check the cluster logs for error messages.
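A minimal sketch of that restart, run as root from any node; the -all flag acts on every node of the cluster:

# crsctl stop cluster -all
# crsctl start cluster -all
# crsctl check cluster -all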

Log Files, Collecting Diagnostic Information and Trouble Resolution


Because Oracle Clusterware 11g Release 2 consists of a number of different components, it follows that there are a number of different log files associated with these processes. In this section we will first document the log files associated with the various Clusterware processes. We will then discuss a method of collecting the data in these logs into a single source that you can reference when doing trouble diagnosis. Oracle Clusterware 11g Release 2 generates a number of different log files that can be used to troubleshoot Clusterware problems. Oracle Clusterware 11g adds a new environment variable called GRID_HOME to reference the base of the Oracle Clusterware software home. The Clusterware log files are typically stored under the GRID_HOME directory in a sub-directory called log. Under that directory is another directory with the host name, and then a directory that indicates the Clusterware component that the specific logs are associated with. For example, GRID_HOME/log/myrac1/crsd stores the log files associated with CRSD for the host myrac1. The following table lists the log file directories and the contents of those directories:

Directory Path : Contents
GRID_HOME/log/<host>/alert<host>.log : Clusterware alert log
GRID_HOME/log/<host>/diskmon : Disk Monitor Daemon
GRID_HOME/log/<host>/client : OCRDUMP, OCRCHECK, OCRCONFIG, CRSCTL
GRID_HOME/log/<host>/ctssd : Cluster Time Synchronization Service
GRID_HOME/log/<host>/gipcd : Grid Interprocess Communication Daemon
GRID_HOME/log/<host>/ohasd : Oracle High Availability Services Daemon
GRID_HOME/log/<host>/crsd : Cluster Ready Services Daemon
GRID_HOME/log/<host>/gpnpd : Grid Plug and Play Daemon
GRID_HOME/log/<host>/mdnsd : Multicast Domain Name Service Daemon
GRID_HOME/log/<host>/evmd : Event Manager Daemon
GRID_HOME/log/<host>/racg/racgmain : RAC RACG
GRID_HOME/log/<host>/racg/racgeut : RAC RACG
GRID_HOME/log/<host>/racg/racgevtf : RAC RACG
GRID_HOME/log/<host>/racg : RAC RACG (only used if a pre-11.1 database is installed)
GRID_HOME/log/<host>/cssd : Cluster Synchronization Service Daemon
GRID_HOME/log/<host>/srvm : Server Manager
GRID_HOME/log/<host>/agent/ohasd/oraagent_oracle11 : HA Service Daemon Agent
GRID_HOME/log/<host>/agent/ohasd/oracssdagent_root : HA Service Daemon CSS Agent
GRID_HOME/log/<host>/agent/ohasd/oracssdmonitor_root : HA Service Daemon ocssdMonitor Agent
GRID_HOME/log/<host>/agent/ohasd/orarootagent_root : HA Service Daemon Oracle Root Agent
GRID_HOME/log/<host>/agent/crsd/oraagent_oracle11 : CRS Daemon Oracle Agent
GRID_HOME/log/<host>/agent/crsd/orarootagent_root : CRS Daemon Oracle Root Agent
GRID_HOME/log/<host>/agent/crsd/ora_oc4j_type_oracle11g : CRS Daemon Oracle OC4J Agent
GRID_HOME/log/<host>/gnsd : Grid Naming Service Daemon

The following diagram provides additional detail as to the location of the Oracle Clusterware log files:

Oracle Clusterware will rotate logs over time. This is known as a rollover of the log. Rollover log files will typically have the same name as the log file but with a version number attached to the end. This helps to maintain control of space utilization in the GRID_HOME directory. Each log file type has its own rotation time frame. An example of rolling of log files can be seen in this listing of the GRID_HOME/log/rac1/agent/crsd/oraagent_oracle directory. In the listing, note that there is the current oraagent_oracle log file with an extension of .log, and then there are the additional oraagent_oracle log files with extensions from l01 to l10. These latter log files are the backup log files, of which 10 are maintained.

[oracle@rac1 oraagent_oracle]$ pwd
/ora01/app/11.2.0/grid/log/rac1/agent/crsd/oraagent_oracle
[oracle@rac1 oraagent_oracle]$ ls -al
total 109320
drwxr-xr-t 2 oracle oinstall     4096 Jun 10 20:02 .
drwxrwxrwt 5 root   oinstall     4096 Jun  8 10:29 ..
-rw-r--r-- 1 oracle oinstall 10565073 Jun 10 20:02 oraagent_oracle.l01
-rw-r--r-- 1 oracle oinstall 10583355 Jun 10 13:35 oraagent_oracle.l02
-rw-r--r-- 1 oracle oinstall 10583346 Jun 10 07:13 oraagent_oracle.l03
-rw-r--r-- 1 oracle oinstall 10583397 Jun 10 00:51 oraagent_oracle.l04
-rw-r--r-- 1 oracle oinstall 10583902 Jun  9 18:29 oraagent_oracle.l05
-rw-r--r-- 1 oracle oinstall 10584515 Jun  9 13:17 oraagent_oracle.l06
-rw-r--r-- 1 oracle oinstall 10584397 Jun  9 09:26 oraagent_oracle.l07
-rw-r--r-- 1 oracle oinstall 10584344 Jun  9 05:37 oraagent_oracle.l08
-rw-r--r-- 1 oracle oinstall 10584126 Jun  9 01:50 oraagent_oracle.l09
-rw-r--r-- 1 oracle oinstall 10539847 Jun  8 21:09 oraagent_oracle.l10
-rw-r--r-- 1 oracle oinstall  5955542 Jun 10 23:38 oraagent_oracle.log
-rw-r--r-- 1 oracle oinstall        0 Jun  8 10:29 oraagent_oracleOUT.log
-rw-r--r-- 1 oracle oinstall        6 Jun 10 21:20 oraagent_oracle.pid

Collecting Clusterware Diagnostic Data


Oracle provides utilities that make it easier to determine the status of the Cluster and collect the Clusterware log files for problem diagnosis. In this section we will review the diagcollection.pl script which is used to collect logfile information. We will then look at the Cluster Verify Utility (CVU).

Using Diagcollection.pl

Clearly Oracle Clusterware has a number of log files. Often when troubleshooting problems you will want to review several of the log files. This can involve traversing directories which can be tedious at best. Additionally Oracle support might well ask that you collect up all the Clusterware log files so they can diagnose the problem that you are having. The diagcollection.pl script comes with the following options:

--collect : Collect diagnostic information.
--clean : Clean the directory of files created by previous runs of diagcollection.pl.
--crs / --core / --all : Collect only specific information; --all is the default setting.

To make collection of the Clusterware log data easier, Oracle provides a program called diagcollection.pl which is contained in $GRID_HOME/bin. This script will collect Clusterware log files and other helpful diagnostic information. The script has a collect option that you invoke to collect the diagnostic information. When invoked, the script creates four files in the local directory. These four files are gzipped tarballs and are listed in the following table:

File Name : Contents
coreData*tar.gz : Core files and related analysis files.
crsData*tar.gz : Log files from the GRID_HOME/log/<host> directory structure.
ocrData*tar.gz : The results of an execution of ocrdump and ocrcheck. Current OCR backups are also listed.
osData*tar.gz : /var/log/messages and other related files.
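A sketch of a typical run, assuming the Grid home path shown in the earlier directory listing and that the script is run as root from a scratch directory:

# cd /tmp
# /ora01/app/11.2.0/grid/bin/diagcollection.pl --collect
# ls *.tar.gz
# /ora01/app/11.2.0/grid/bin/diagcollection.pl --clean

The --clean run removes the tarballs once they have been uploaded to Oracle Support.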

Oracle Cluster Verification Utility (CVU)

The CVU is used to verify that there are no configuration issues with the cluster. CVU is located in the GRID_HOME/bin directory and also in $ORACLE_HOME/bin. CVU supports Oracle Clusterware versions 10gR1 onwards. You can also run CVU from the Oracle 11g Release 2 install media. In this case, call the program runcluvfy.sh which calls CVU. Prior to Oracle Clusterware 11g Release 2 you would need to download it from OTN. In Oracle Clusterware 11g Release 2 CVU is installed as a part of Oracle Clusterware. CVU can be run in various situations including:

- During various phases of the install of the initial cluster, to confirm that key components are in place and operational (such as SSH). For example, OUI makes calls to the CVU during the creation of the cluster to ensure that prerequisites were met.
- After you have completed the initial creation of the cluster.
- After you add or remove a node from the cluster.
- If you suspect there is a problem with the cluster.

CVU diagnoses and verifies specific components. Components are groupings based on functionality; examples are free space, cluster integrity, OCR integrity, clock synchronization, and so on. You can use CVU to check one or all components of the cluster. In some cases, when problems are detected, CVU can create fixup scripts designed to correct them.
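For example, a few component checks that map to the groupings above, run as the Grid Infrastructure owner (node names rac1 and rac2 follow the earlier examples):

$ cluvfy comp ocr -n all -verbose
$ cluvfy comp clocksync -n all -verbose
$ cluvfy comp crs -n rac1,rac2 -verbose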

Checking the Oracle Cluster Registry

Node evictions or other problems can be caused by corruption in the OCR. The ocrcheck program provides a way to check the integrity of the OCR: it performs checksum operations on the blocks within the OCR to ensure they are not corrupt. Here is an example of running the ocrcheck program:

[oracle@rac1 admin]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       2580
         Available space (kbytes) :     259540
         ID                       :  749518627
         Device/File Name         :      +DATA
                                    Device/File integrity check succeeded
         Cluster registry integrity check succeeded
         Logical corruption check bypassed due to non-privileged user

Oracle Clusterware Trouble Resolution


When dealing with difficult Clusterware issues that befuddle you, there are some basic first steps to perform. These steps are:

1. Check and double-check that your RAC database backups are current. If they are not and at least one node survives, back up your database. If you have a good backup, backing up the archived redo logs is also a very good idea. The bottom line is that you have an unstable environment. Make sure you have protected your data should the whole thing go bottom up.
2. Open an SR with Oracle Support.
3. After opening the SR, search My Oracle Support (MOS, formerly MetaLink) for the problem you are experiencing.
4. If you find nothing on MOS, do a Google search for the problem you are experiencing.
5. Using the diagcollection.pl script, collect the Clusterware logs. Review the logs for error messages that might give you some insight into the problem at hand.

The truth is that Oracle Clusterware is a very complex beast. For the DBA who does not deal with solving Clusterware problems on a day-in-day-out basis, determining the nature and resolution of a problem can be an overwhelming challenge. In your attempts to solve the problem, you can cause additional problems and damage to the cluster. It is far better, if you do not know the solution, to let Oracle Support work with you on a solution. That's what you pay them for. Note that the number one step is to back up any RAC databases on the cluster. Keep in mind that one possible problem your cluster could be starting to experience is issues with the storage infrastructure. Consider this carefully when performing a backup on an unstable cluster. It may be that you will want to back up to some other storage medium (NAS for example) that uses a different hardware path (for example, does not use your HBAs) if possible. If the cluster is starting to have issues, there is a lot that can go wrong and a lot of damage that can occur (this is true with a non-clustered database too).

Dealing With Node Evictions

Node evictions can be hard to diagnose. There are many possible causes of node evictions, some of which might be obvious and some which might not be. In this section we address dealing with node evictions. First we ask the question, what can cause a node eviction. We then discuss finding out what actually caused our node eviction.

What Can Cause an Eviction?

A common problem that DBAs have to face with Clusterware is node evictions, which usually lead to a reboot of the node that was evicted. With Oracle Clusterware 11g Release 2 there are two main processes that can cause node evictions:

- Oracle Clusterware Kill Daemon (Oclskd): Used by CSS to reboot a node when the reboot is requested by one or more other nodes.
- CSSDMONITOR: The OCSSD daemon is monitored by cssdmonitor. If a hang is indicated (say, the ocssd daemon is lost), then the node will be rebooted.

Prior to Oracle Clusterware 11g Release 2, the hangcheck-timer module was configured and could also be a cause of nodes rebooting. As of Oracle Clusterware 11g Release 2 this module is no longer needed and should not be enabled.

Finding What Caused the Eviction

Very often with node evictions you will need to engage Oracle Support. Clusterware is complex enough that it will take the support tools Oracle Support has available to diagnose the problem. However, there are some initial things you can do that might help to solve some basic problems, like node misconfigurations. Some things you might want to do are (see the sketch after this list):

1. Determine the time the node rebooted, using the uptime UNIX command for example. This will help you determine where in the various logs you will want to look for additional information.
2. Check the following log files to begin with:
   a. /var/log/messages
   b. GRID_HOME/log/<host>/cssd/ocssd.log
   c. GRID_HOME/log/<host>/alert<host>.log
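A minimal first pass might look like the following; the grid home path and host name follow the examples earlier in this section:

$ uptime
$ grep -i evict /ora01/app/11.2.0/grid/log/rac1/cssd/ocssd.log | tail
$ egrep -i 'reboot|restart' /var/log/messages | tail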

Perhaps the biggest causes for node evictions are:

1. Node time coordination: We have found that even though Oracle Clusterware 11g Release 2 does not indicate that NTP is a requirement, Clusterware does seem to be more stable when it is enabled and working correctly. We recommend that you configure NTP for all nodes.

2. Interconnect issues: A common cause of node eviction issues is that the interconnect is not completely isolated from other network traffic. Ensure that the interconnect is completely isolated from all other network traffic. This includes the switches that the interconnect is attached to.

3. Configuration/certification issues: Ensure that the hardware and software you are using are certified by Oracle. This includes the specific version and even patchset number of each component.

4. Patches: Ensure that all patch sets are installed as required.
5. OS software components: Ensure that all OS software components have been installed as directed by Oracle. Don't decide not to install a component just because you don't think you are going to need it. Make sure that you are installing the correct revision of those components. The Oracle documentation and MOS provide a complete list of all required patch sets that must be installed for Clusterware and RAC to work correctly.
6. Software bugs

The biggest piece of advice that can be given to avoid instability within a cluster is to get the setup and configuration of that cluster right the first time. Take the time to ensure that you have the correct patch sets installed and that you have followed the install directions carefully. If in doubt about any step of the installation, contact Oracle for support.
