The cluster has a number of log files that can be examined to gain insight into occurring problems. A good place to start diagnosing cluster problems is $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log. All clusterware log files are stored under the $ORA_CRS_HOME/log/ directory:

1. alert<nodename>.log : Important clusterware alerts are stored in this log file, located at $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log.
2. crsd.log : CRS logs are stored in the $ORA_CRS_HOME/log/<hostname>/crsd/ directory. The crsd.log file is archived every 10MB as crsd.l01, crsd.l02, ...
3. cssd.log : CSS logs are stored in the $ORA_CRS_HOME/log/<hostname>/cssd/ directory. The cssd.log file is archived every 20MB as cssd.l01, cssd.l02, ...
4. evmd.log : EVM logs are stored in the $ORA_CRS_HOME/log/<hostname>/evmd/ directory.
5. OCR logs : OCR tool (ocrdump, ocrconfig, ocrcheck) log files are stored in the $ORA_CRS_HOME/log/<hostname>/client/ directory.
6. SRVCTL logs : srvctl logs are stored in two locations, $ORA_CRS_HOME/log/<hostname>/client/ and $ORACLE_HOME/log/<hostname>/client/.
7. RACG logs : The high availability trace files are stored in two locations, $ORA_CRS_HOME/log/<hostname>/racg/ and $ORACLE_HOME/log/<hostname>/racg/. RACG contains log files for node applications such as the VIP and ONS:
ONS log filename = ora.<hostname>.ons.log
VIP log filename = ora.<hostname>.vip.log
Each RACG executable has a subdirectory assigned exclusively to that executable:
racgeut : $ORA_CRS_HOME/log/<hostname>/racg/racgeut/
racgevtf : $ORA_CRS_HOME/log/<hostname>/racg/racgevtf/
racgmain : $ORA_CRS_HOME/log/<hostname>/racg/racgmain/
racgeut : $ORACLE_HOME/log/<hostname>/racg/racgeut/
racgmain : $ORACLE_HOME/log/<hostname>/racg/racgmain/
racgmdb : $ORACLE_HOME/log/<hostname>/racg/racgmdb/
racgimon : $ORACLE_HOME/log/<hostname>/racg/racgimon/

As in a normal single-instance Oracle environment, a RAC environment contains the standard RDBMS log files, located by these parameters:
background_dump_dest contains the alert log and background process trace files.
user_dump_dest contains any trace files generated by user processes.
core_dump_dest contains core files that are generated due to a core dump in a user process.
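The log locations above lend themselves to a quick scan script. The following is a minimal sketch, assuming the 10g path layout described above; the error patterns are illustrative, not exhaustive:

```shell
#!/bin/sh
# Scan the main 10g clusterware logs under one base directory for
# error-looking lines. Call as:
#   scan_crs_logs "$ORA_CRS_HOME/log/$(hostname -s)"
scan_crs_logs() {
    base="$1"
    for f in "$base"/alert*.log "$base"/crsd/crsd.log \
             "$base"/cssd/cssd.log "$base"/evmd/evmd.log; do
        [ -f "$f" ] || continue
        matches=$(grep -i -E 'error|fatal|fail' "$f" | tail -5)
        # only report files that actually contain something suspicious
        [ -n "$matches" ] && printf '== %s ==\n%s\n' "$f" "$matches"
    done
}
```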
Whenever a node is having issues joining the cluster back post reboot, here is a quick check list I would suggest:
Let us now take a closer look at specific issues, with examples and the steps taken for their resolution. These were all tested with an Oracle 10.2.0.4 database on RHEL4 U8 x86-64.
1. srvctl not able to start Oracle instance, but sqlplus able to start it

a. Check the racg log for the actual error message:
% more $ORACLE_HOME/log/`hostname -s`/racg/ora.{DBNAME}.{INSTANCENAME}.inst.log
b. Check whether srvctl is configured to use the correct parameter file (pfile/spfile):
% srvctl config database -d {DBNAME} -a

You can also validate the parameter file by starting the instance with sqlplus to see the exact error message.

c. Check the ownership of $ORACLE_HOME/log. If it is owned by root, srvctl won't be able to start the instance as the oracle user:
# chown -R oracle:dba $ORACLE_HOME/log
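The ownership check in step (c) can be generalized. Here is a small sketch that lists anything under a log tree not owned by the expected user; pass oracle on a real cluster:

```shell
#!/bin/sh
# Print every path under a directory that is NOT owned by the expected
# user; empty output means ownership looks right for srvctl.
find_bad_owners() {
    dir="$1"; owner="$2"
    find "$dir" ! -user "$owner" -print 2>/dev/null
}
# Example on a real node:
#   find_bad_owners "$ORACLE_HOME/log" oracle
```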
2. VIP has failed over to another node but is not coming back to the original node

Fix: On the node where the VIP has failed over, bring it down manually as root. Example:
# ifconfig eth0:2 down
PS: Be careful to bring down only the VIP. A small typo may bring down your public interface. :)
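To guard against exactly that typo, you can insist that the label being downed is an interface alias (contains a colon, like eth0:2) before touching it. A minimal sketch, with the down command swappable so the guard can be exercised without root:

```shell
#!/bin/sh
# Refuse to down anything that does not look like a VIP alias (eth0:2);
# DOWN_CMD defaults to ifconfig but can be overridden for a dry run.
down_vip_iface() {
    case "$1" in
        *:*) ${DOWN_CMD:-ifconfig} "$1" down ;;
        *)   echo "refusing: '$1' looks like a physical interface" >&2
             return 1 ;;
    esac
}
```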
3. Moving OCR to a different location

PS: This can be done while CRS is up, as root. While trying to move the OCR or the OCR mirror to a new location, ocrconfig complains:
# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile
PROT-21: Invalid parameter
The fix is to touch the new file first and then re-run the command.
Verify:
a. Validate using "ocrcheck". Device/File Name should point to the new location with "integrity check succeeded".
b. Ensure the OCR inventory is updated correctly:
# cat /etc/oracle/ocr.loc
ocrconfig_loc and ocrmirrorconfig_loc should point to the correct locations.
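Step (b) can be scripted. A small sketch that extracts the recorded OCR locations from an ocr.loc-style inventory file, so the values can be compared across nodes:

```shell
#!/bin/sh
# Print the ocrconfig_loc and ocrmirrorconfig_loc values from an
# ocr.loc inventory file (one per line).
ocr_locations() {
    sed -n 's/^ocrconfig_loc=//p; s/^ocrmirrorconfig_loc=//p' "$1"
}
# Example on a real node:
#   ocr_locations /etc/oracle/ocr.loc
```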
4. Moving Voting Disk to a different location PS: CRS must be down while moving the voting disk.
The idea is to add new voting disks and delete the old ones. Below are sample errors and their fixes.
# crsctl add css votedisk /crs_new/cludata/cssfile_new
Cluster is not in a ready state for online disk addition
We need to use force option. However, before using force option, ensure CRS is down. If CRS is up, DO NOT use force option else it may corrupt your OCR.
# crsctl add css votedisk /crs_new/cludata/cssfile_new -force Now formatting voting disk: /crs_new/cludata/cssfile_new successful addition of votedisk /crs_new/cludata/cssfile_new.
Verify using "crsctl query css votedisk" and then delete the old votedisks. While deleting too, you'll need to use force option.
Also verify the permissions of the voting disk files: they should be oracle:dba. If the voting disks were added as root, the ownership should be changed to oracle:dba.
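That permission check can be scripted as well. A hedged sketch using GNU stat (swap in your platform's equivalent):

```shell
#!/bin/sh
# Report any voting disk file whose owner:group differs from the
# expected value (oracle:dba on the clusters described here).
check_votedisk_perms() {
    expect="$1"; shift
    for f in "$@"; do
        got=$(stat -c '%U:%G' "$f") || return 1
        [ "$got" = "$expect" ] || echo "$f is $got, expected $expect"
    done
}
```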
5. Manually registering listener resource to OCR

The listener was registered manually with the OCR, but srvctl was unable to bring it up. Let us first see an example of how to do this manually. From an existing available node, print the listener resource profile:

% crs_stat -p ora.test-server2.LISTENER_TEST-SERVER2.lsnr > /tmp/res
% cat /tmp/res
NAME=ora.test-server2.LISTENER_TEST-SERVER2.lsnr
TYPE=application
ACTION_SCRIPT=/orahome/ora10g/product/10.2.0/db_1/bin/racgwrap
ACTIVE_PLACEMENT=0
AUTO_START=1
CHECK_INTERVAL=600
DESCRIPTION=CRS application for listener on node
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=test-server2
OPTIONAL_RESOURCES=
PLACEMENT=restricted
REQUIRED_RESOURCES=ora.test-server2.vip
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=600
START_TIMEOUT=0
STOP_TIMEOUT=0
UPTIME_THRESHOLD=7d
USR_ORA_ALERT_NAME=
USR_ORA_CHECK_TIMEOUT=0
USR_ORA_CONNECT_STR=/ as sysdba
USR_ORA_DEBUG=0
USR_ORA_DISCONNECT=false
USR_ORA_FLAGS=
USR_ORA_IF=
USR_ORA_INST_NOT_SHUTDOWN=
USR_ORA_LANG=
USR_ORA_NETMASK=
USR_ORA_OPEN_MODE=
USR_ORA_OPI=false
USR_ORA_PFILE=
USR_ORA_PRECONNECT=none
USR_ORA_SRV=
USR_ORA_START_TIMEOUT=0
USR_ORA_STOP_MODE=immediate
USR_ORA_STOP_TIMEOUT=0
USR_ORA_VIP=
Modify the relevant parameters in the resource file to point to the correct node, and rename it as <resourcename>.cap:
% mv /tmp/res /tmp/ora.test-server1.LISTENER_TEST-SERVER1.lsnr.cap
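The node-name edits in the profile are mechanical, so they can be done with sed. A sketch using the example hosts from the text; review the output before registering it:

```shell
#!/bin/sh
# Clone an existing listener resource profile for another node by
# rewriting the node name (both lower and upper case forms).
retarget_cap() {
    # $1 = source profile, $2 = old node, $3 = new node
    old_up=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
    new_up=$(printf '%s' "$3" | tr '[:lower:]' '[:upper:]')
    sed "s/$2/$3/g; s/$old_up/$new_up/g" "$1"
}
# Usage on a real cluster (after reviewing the output):
#   retarget_cap /tmp/res test-server2 test-server1 \
#     > /tmp/ora.test-server1.LISTENER_TEST-SERVER1.lsnr.cap
```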
While trying to start the listener, srvctl throws errors like "Unable to read from listener log file", even though the listener log file exists. If the resource was registered as root, srvctl won't be able to start it as the oracle user. So all the aforementioned operations for registering the listener manually should be done as the oracle user.
6. Services

While checking the status of a service, it says "not running". Trying to start it with srvctl gives "No such service exists" or "already running"; trying to add a service with the same name gives "already exists". This happens because the service is in an "Unknown" state in the OCR. Using crs_stat, check whether any related resources for the service (resource names ending in .srv and .cs) are still lying around.
srvctl remove service -f has been tried and the issue persists. Here is the fix:
# crs_stop -f {resourcename}
# crs_unregister {resourcename}
Now the service can be added and started correctly.
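To find the leftover .cs/.srv resources for a given service, you can filter crs_stat output. A minimal sketch that reads "NAME=..." lines on stdin:

```shell
#!/bin/sh
# List service-related resource names (.cs composites and .srv members)
# for one service, suitable for feeding to crs_stop -f / crs_unregister.
service_resources() {
    svc="$1"
    sed -n 's/^NAME=//p' | grep "\.$svc\." | grep -E '\.(cs|srv)$'
}
# Example on a real node:
#   crs_stat -p | service_resources myserv
```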
7. Post host reboot, CRS is not starting

After the host reboot, CRS was not coming up and there were no CRS logs in $ORA_CRS_HOME. Check /var/log/messages:
"Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.9559"
No logs were seen in /tmp/crsctl.*
Run cluvfy to identify the issue $ORA_CRS_HOME/bin/cluvfy stage -post crsinst -n {nodename}
/etc/fstab was incorrect and was fixed to make /tmp available.
If you see messages like "Shutdown CacheLocal. my hash ids don't match" in the CRS log, then check whether /etc/oracle/ocr.loc is the same across all nodes of the cluster.
8. CRS not starting, with the following messages in /var/log/messages:
"Id "h1" respawning too fast: disabled for 5 minutes"
If the CRS binaries were restored by copying from an existing node in the cluster, then you need to ensure:
a. Hostnames are modified correctly in $ORA_CRS_HOME/log
b. You may need to clean up socket files from /var/tmp/.oracle
PS: Exercise caution while working with the socket files. If CRS is up, you should never touch those files; otherwise a reboot may be inevitable.
9. CRS rebooting frequently, initiated by oprocd

Check the logs under /etc/oracle/oprocd/ and grep for "Rebooting". Check /var/log/messages and grep for "restart". If the timestamps match, this confirms the reboots are being initiated by the oprocd process.
% ps -ef | grep oprocd
root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f
-t 1000 means oprocd wakes up every 1000ms; -m 500 allows up to a 500ms margin of error. Basically, with these options, if oprocd wakes up after more than 1.5 seconds it is going to force a reboot.
This is conceptually analogous to what the hangcheck timer used to do in pre-10.2.0.4 Oracle releases on Linux.
Setting diagwait (as root: # crsctl set css diagwait 13 -force) changes the parameters oprocd runs with:
% ps -ef | grep oprocd
root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f
Note that the margin has now changed to 10000ms, i.e. 10 seconds, in place of the default 0.5 seconds.
PS: Setting diagwait requires a full shutdown of Oracle Clusterware on ALL nodes.
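The timing rules above reduce to simple arithmetic. A sketch, assuming the commonly documented relationship that a diagwait of N seconds gives oprocd a margin of (N - reboottime) seconds, with reboottime defaulting to 3:

```shell
#!/bin/sh
# oprocd forces a reboot if it oversleeps past interval + margin.
oprocd_threshold_ms() {
    echo $(( $1 + $2 ))   # $1 = -t interval ms, $2 = -m margin ms
}
# Margin implied by a diagwait setting (seconds -> milliseconds).
diagwait_margin_ms() {
    echo $(( ($1 - ${2:-3}) * 1000 ))   # $1 = diagwait s, $2 = reboottime s
}
# Defaults: 1000 + 500 = 1500ms; diagwait 13 -> 10000ms margin.
```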
10. Cluster hung: all SQL queries on GV$ views are hanging

The alert logs from all instances have messages like the below:
INST1: IPC Send timeout detected. Receiver ospid 1650
INST2: IPC Send timeout detected. Sender: ospid 24692 Receiver: inst 1 binc 150 ospid 1650
INST3: IPC Send timeout detected. Sender: ospid 12955 Receiver: inst 1 binc 150 ospid 1650
The ospids on all instances belong to LCK0, the lock process. In case of inter-instance lock issues, it's important to identify the instance where they originate. As seen above, INST1 is the one that needs to be fixed. Identify the process that is causing the row cache lock and kill it; otherwise, reboot node 1.
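Pulling the receiver information out of the alert logs can be mechanized. A sketch that reads alert-log lines on stdin and prints the receiver fields; the line format is the one shown above:

```shell
#!/bin/sh
# Extract "Receiver: inst N binc N ospid N" fields from IPC Send timeout
# lines; the instance that appears as receiver everywhere is the suspect.
ipc_timeout_receivers() {
    grep 'IPC Send timeout' | grep -o 'Receiver: inst [0-9]* binc [0-9]* ospid [0-9]*'
}
```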
11. Inconsistent OCR with invalid permissions % srvctl add db -d testdb -o /oracle/product/10.2 PRKR-1005 : adding of cluster database testdb configuration failed, PROC-5: User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]
crs_stat doesn't have any trace of it, so utilities like crs_setperm/crs_unregister/crs_stop won't work in this case.
ocrdump shows: [DATABASE.LOG.testdb] UNDEF : SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}
[DATABASE.LOG.testdb.INSTANCE] UNDEF : SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}
These keys are owned by root, and that's the problem: the resource was perhaps added to the OCR as root. Though it has since been removed by root, it cannot now be added by the oracle user until we get rid of the leftover keys above.
Shut down the entire cluster and either restore from a previous good backup of the OCR:
# ocrconfig -restore backupfilename
Or, if you are not sure of the last good backup, take an export backup of the OCR:
# ocrconfig -export /tmp/export -s online
Then edit /tmp/export, remove the two lines pointing to the DATABASE.LOG.testdb and DATABASE.LOG.testdb.INSTANCE keys owned by root, and import the edited file back:
# ocrconfig -import /tmp/export
After starting the cluster, verify using ocrdump. The OCRDUMPFILE should not have any trace of those leftover log entries owned by root.
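That verification can be scripted: count leftover DATABASE.LOG keys for the database in an ocrdump file. The key format is the one in the ocrdump output above:

```shell
#!/bin/sh
# Count leftover [DATABASE.LOG.<db>...] keys in ocrdump output read
# from stdin; 0 means the cleanup worked.
leftover_keys() {
    grep "^\[DATABASE\.LOG\.$1" | wc -l
}
# Example on a real node, after running ocrdump:
#   leftover_keys testdb < OCRDUMPFILE
```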
1. Use the crsctl check has command to check whether OHASD is running on the local node and is healthy:
[oracle@rac1 admin]$ crsctl check has
CRS-4638: Oracle High Availability Services is online
2. Use the crsctl check crs command to check the OHASD, CRSD, ocssd, and EVM daemons.
[oracle@rac1 admin]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
3. Use the crsctl check cluster -all command to check the daemons on all nodes of the cluster.
[oracle@rac1 admin]$ crsctl check cluster -all
********************************************************
rac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
********************************************************
rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
********************************************************
4. Check the cluster logs for any error messages that might have been logged.
If these steps do not indicate a problem and you feel there is a problem with the cluster, you can do the following:
1. Stop the cluster (crsctl stop cluster).
2. Start the cluster (crsctl start cluster), monitoring the startup messages to see whether any errors occur.
3. If errors do occur or you still suspect a problem, check the cluster logs for error messages.
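When scripting these checks across many nodes, it helps to reduce crsctl output to a yes/no. A sketch that treats the stack as healthy only if every CRS- status line ends in "online"; it reads the output on stdin, so it can be tried offline:

```shell
#!/bin/sh
# Return 0 (healthy) iff no "CRS-..." line in the input fails to end
# with "online". Note "offline" does not end in "online", so it fails.
stack_healthy() {
    ! grep '^CRS-' | grep -v -q 'online$'
}
# Example on a real node:
#   crsctl check crs | stack_healthy && echo OK
```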
GRID_HOME/log/<host>/diskmon
GRID_HOME/log/<host>/client
GRID_HOME/log/<host>/ctssd
GRID_HOME/log/<host>/gipcd
GRID_HOME/log/<host>/ohasd
GRID_HOME/log/<host>/crsd
GRID_HOME/log/<host>/gpnpd
GRID_HOME/log/<host>/mdnsd
GRID_HOME/log/<host>/evmd
GRID_HOME/log/<host>/racg/racgmain
GRID_HOME/log/<host>/racg/racgeut
GRID_HOME/log/<host>/racg/racgevtf
GRID_HOME/log/<host>/racg
GRID_HOME/log/<host>/cssd
GRID_HOME/log/<host>/srvm
GRID_HOME/log/<host>/agent/ohasd/oraagent_oracle11
GRID_HOME/log/<host>/agent/ohasd/oracssdagent_root
The following diagram provides additional detail as to the location of the Oracle Clusterware log files:
[Diagram: Clusterware log file locations, including the CRS daemon Oracle agent, CRS daemon Oracle root agent, CRS daemon Oracle OC4J agent, and Grid Naming Service daemon directories]
Oracle Clusterware will rotate logs over time; this is known as a rollover of the log. Rolled-over log files typically have the same name as the log file with a version number appended to the end, which helps control space utilization in the GRID_HOME directory. Each log file type has its own rotation time frame. An example of log rollover can be seen in this listing of the GRID_HOME/log/rac1/agent/crsd/oraagent_oracle directory. Note the current oraagent_oracle log file with the .log extension, and then the additional oraagent_oracle files with extensions l01 through l10: these are the backup log files, of which 10 are maintained.
[oracle@rac1 oraagent_oracle]$ pwd /ora01/app/11.2.0/grid/log/rac1/agent/crsd/oraagent_oracle [oracle@rac1 oraagent_oracle]$ ls -al total 109320 drwxr-xr-t 2 oracle oinstall 4096 Jun 10 20:02 . drwxrwxrwt 5 root oinstall 4096 Jun 8 10:29 .. -rw-r--r-- 1 oracle oinstall 10565073 Jun 10 20:02 oraagent_oracle.l01 -rw-r--r-- 1 oracle oinstall 10583355 Jun 10 13:35 oraagent_oracle.l02 -rw-r--r-- 1 oracle oinstall 10583346 Jun 10 07:13 oraagent_oracle.l03 -rw-r--r-- 1 oracle oinstall 10583397 Jun 10 00:51 oraagent_oracle.l04 -rw-r--r-- 1 oracle oinstall 10583902 Jun 9 18:29 oraagent_oracle.l05 -rw-r--r-- 1 oracle oinstall 10584515 Jun 9 13:17 oraagent_oracle.l06 -rw-r--r-- 1 oracle oinstall 10584397 Jun 9 09:26 oraagent_oracle.l07 -rw-r--r-- 1 oracle oinstall 10584344 Jun 9 05:37 oraagent_oracle.l08 -rw-r--r-- 1 oracle oinstall 10584126 Jun 9 01:50 oraagent_oracle.l09 -rw-r--r-- 1 oracle oinstall 10539847 Jun 8 21:09 oraagent_oracle.l10 -rw-r--r-- 1 oracle oinstall 5955542 Jun 10 23:38 oraagent_oracle.log -rw-r--r-- 1 oracle oinstall 0 Jun 8 10:29 oraagent_oracleOUT.log -rw-r--r-- 1 oracle oinstall 6 Jun 10 21:20 oraagent_oracle.pid
Using Diagcollection.pl
Clearly Oracle Clusterware has a number of log files. Often when troubleshooting you will want to review several of them, which can involve tediously traversing directories. Additionally, Oracle support may well ask you to collect all of the Clusterware log files so they can diagnose the problem you are having. To make collection of the Clusterware log data easier, Oracle provides a script called diagcollection.pl, located in GRID_HOME/bin, which collects the Clusterware log files and other helpful diagnostic information. The script comes with the following options:

--collect : Collect diagnostic information.
--clean : Clean the directory of files created by previous runs of diagcollection.pl.
Options to collect only specific information: --crs, --core, or --all (the default).

When invoked with the collect option, the script creates four gzipped tarballs in the local directory:

coreData*.tar.gz : Core files and related analysis files.
crsData*.tar.gz : Log files from the GRID_HOME/log/<host> directory structure.
ocrData*.tar.gz : The results of running ocrdump and ocrcheck; current OCR backups are also listed.
osData*.tar.gz : /var/log/messages and other related files.
The CVU (Cluster Verification Utility) is used to verify that there are no configuration issues with the cluster. CVU is located in the GRID_HOME/bin directory and also in $ORACLE_HOME/bin, and it supports Oracle Clusterware versions 10g Release 1 onwards. You can also run CVU from the Oracle 11g Release 2 install media; in that case, call the runcluvfy.sh script, which invokes CVU. Prior to Oracle Clusterware 11g Release 2 you needed to download CVU from OTN; from 11g Release 2 onwards, CVU is installed as part of Oracle Clusterware. CVU can be run in various situations, including:
During various phases of the initial cluster install, to confirm that key components are in place and operational (such as SSH). For example, OUI calls CVU during creation of the cluster to ensure that prerequisites were met.
After you have completed the initial creation of the cluster.
After you add or remove a node from the cluster.
If you suspect there is a problem with the cluster.
CVU diagnoses and verifies specific components. Components are groupings based on functionality; examples are space, integrity of the cluster, OCR integrity, clock synchronization, and so on. You can use CVU to check one or all components of the cluster. In some cases, when problems are detected, CVU can create fixup scripts designed to correct them.
Node evictions and other problems can be caused by corruption in the OCR. The ocrcheck program provides a way to check the integrity of the OCR: it performs checksum operations on the blocks within the OCR to ensure they are not corrupt. Here is an example of running the ocrcheck program:
[oracle@rac1 admin]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       2580
         Available space (kbytes) :     259540
         ID                       :  749518627
         Device/File Name         :      +DATA
                                    Device/File integrity check succeeded
         Cluster registry integrity check succeeded
         Logical corruption check bypassed due to non-privileged user
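The interesting numbers can be pulled out with awk. A sketch that reads ocrcheck output on stdin:

```shell
#!/bin/sh
# Summarize ocrcheck output: available space plus any integrity verdicts.
ocrcheck_summary() {
    awk '/Available space/ { print "available_kb=" $NF }
         /integrity check/ { print }'
}
# Example on a real node:
#   ocrcheck | ocrcheck_summary
```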
The truth is that Oracle Clusterware is a very complex beast. For the DBA who does not deal with solving Clusterware problems on a day-in-day-out basis, determining the nature of a problem and its resolution can be an overwhelming challenge, and in your attempts to solve the problem you can cause additional problems and damage to the cluster. If you do not know the solution, it is far better to let Oracle support work with you on one; that's what you pay them for. Note that the number one step is a backup of any RAC databases on the cluster. Keep in mind that one possible problem your cluster could be starting to experience is issues with the storage infrastructure; consider this carefully when performing a backup on an unstable cluster. You may want to back up to some other storage medium (NAS, for example) that uses a different hardware path (one that does not use your HBAs) if possible. If the cluster is starting to have issues, there is a lot that can go wrong and a lot of damage that can occur (this is true of a non-clustered database too).
Node evictions can be hard to diagnose. There are many possible causes of node evictions, some of which might be obvious and some which might not be. In this section we address dealing with node evictions. First we ask the question, what can cause a node eviction. We then discuss finding out what actually caused our node eviction.
A common problem that DBAs have to face with Clusterware is node evictions, which usually lead to a reboot of the evicted node. With Oracle Clusterware 11g Release 2 there are two main processes that can cause node evictions:
Oracle Clusterware Kill Daemon (oclskd) : Used by CSS to reboot a node when the reboot is requested by one or more other nodes.
cssdmonitor : The ocssd daemon is monitored by cssdmonitor. If a hang is indicated (say the ocssd daemon is lost), the node will be rebooted.
Previous to Oracle Clusterware 11g Release 2 the hangcheck-timer module was configured and could also be a cause of nodes rebooting. As of Oracle Clusterware 11g Release 2 this module is no longer needed and should not be enabled.
Very often with node evictions you will need to engage Oracle support; Clusterware is complex enough that diagnosing the problem will take the support tools that Oracle Support has available. However, there are some initial things you can do that might help to solve basic problems, like node misconfigurations:
1. Determine the time the node rebooted, using the uptime UNIX command for example. This will help you determine where in the various logs to look for additional information.
2. Check the log files around that time; the clusterware alert log, the ocssd log, and /var/log/messages are good starting points.
1. Node time coordination : We have found that even though Oracle Clusterware 11g Release 2 does not indicate that NTP is a requirement, Clusterware does seem to be more stable when it is enabled and working correctly. We recommend that you configure NTP on all nodes.
2. Interconnect issues : A common cause of node evictions is an interconnect that is not completely isolated from other network traffic. Ensure that the interconnect is completely isolated from all other network traffic, including on the switches that the interconnect is attached to.
3. Configuration/certification issues : Ensure that the hardware and software you are using are certified by Oracle. This includes the specific version and even patchset number of each component.
4. Patches : Ensure that all patch sets are installed as required.
5. OS software components : Ensure that all OS software components have been installed as directed by Oracle. Don't decide not to install a component just because you don't think you are going to need it, and make sure that you are installing the correct revision of those components. The Oracle documentation and OMS provide a complete list of all required patch sets that must be installed for Clusterware and RAC to work correctly.
6. Software bugs
The biggest piece of advice that can be given for avoiding instability within a cluster is to get the setup and configuration of that cluster right the first time. Take the time to ensure that you have the correct patch sets installed and that you have followed the install directions carefully. If in doubt about any step of the installation, contact Oracle for support.