Confessions of an Oracle Database Junkie - Arup Nanda The opinions expressed here are mine and mine alone. They do not necessarily reflect those of my employers and customers, past or present. The comments left by the reviewers are theirs alone and may not reflect my opinion, whether implied or not. None of the advice is warranted to be free of errors and omissions. Please use at your own risk and only after thorough testing in your environment.
SATURDAY, AUGUST 23, 2008
$ sqlplus -prelim
SQL>

Note, it didn't say anything familiar like "Connected to Oracle Database 10.2.0.3"; all it showed was the SQL> prompt. That was because it didn't actually connect to the database.

(2) Then I used the oradebug utility to analyze the SGA:

SQL> oradebug setmypid
SQL> oradebug hanganalyze 12

This produced a tracefile in the user_dump_dest directory. The file wasn't difficult to find, since it was the last file created. Even if I hadn't found it that way, I could have used the process ID to locate it: the file would have been named crmprd1_ora_13392.trc, assuming 13392 was the process ID.

(3) Let's examine the file. Here are the first few lines:

*** 2008-08-23 01:21:44.200
==============
HANG ANALYSIS:
==============
Found 163 objects waiting for
    <0/226/17/0x1502dab8/16108/no>
Open chains found:
Chain 1 : :
    <0/226/17/0x1502dab8/16108/no>
    <0/146/1/0x1503e898/19923/latch:>

This tells me a lot. First, it shows that SID 146 Serial# 1 is waiting for a library cache latch. The blocking session is SID 226 Serial# 17; the latter is not itself waiting for anything of a blocking nature. I also noted the OS process IDs of these sessions: 16108 and 19923.

(4) Next I checked for two more tracefiles with these OS PIDs in their names:

crmprd1_ora_16108.trc
crmprd1_ora_19923.trc

(5) I opened the first one, the blocker's.
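The chain entries have the form <instance/sid/serial#/state-object/ospid/flag>, so the interesting numbers can be pulled out mechanically. A minimal sketch using awk (the sample line is the blocker entry from the trace above; the field positions are an assumption based on that format):

```shell
# Extract SID, Serial# and OS PID from a hanganalyze chain entry.
# Fields inside <...> are instance/sid/serial#/state-object/ospid/flag.
line='<0/226/17/0x1502dab8/16108/no>'

echo "$line" | awk -F'/' '{
    gsub(/[<>]/, "")          # strip the angle brackets
    printf "sid=%s serial=%s ospid=%s\n", $2, $3, $5
}'
```

This prints one line per chain entry, which is enough to know which tracefiles to open next.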
Here are the first few lines:

*** 2008-08-23 01:08:18.840
*** SERVICE NAME:(SYS$USERS) 2008-08-23 01:08:18.781
*** SESSION ID:(226.17) 2008-08-23 01:08:18.781

LIBRARY OBJECT HANDLE: handle=c0000008dc703810 mtx=c0000008dc703940(8000) cdp=32737
name=UPDATE DW_ETL.FRRS_PROFILER SET CONSUMER_LINK = :"SYS_B_0", ADDRESS_LINK = :"SYS_B_1", ADDRESS_MATCH = :"SYS_B_2", PROCESSED=:"SYS_B_3" WHERE RNUM = :"SYS_B_4"
hash=a029fce7bb89655493e7e51a544592a4 timestamp=08-23-2008 00:10:23
namespace=CRSR flags=RON/KGHP/TIM/OBS/PN0/MED/KST/DBN/MTX/[504100d0]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=10 hpc=0058 hlc=0058
lwt=c0000008dc7038b8[c0000008dc7038b8,c0000008dc7038b8] ltm=c0000008dc7038c8[c0000008dc7038c8,c0000008dc7038c8]
pwt=c0000008dc703880[c0000008dc703880,c0000008dc703880] ptm=c0000008dc703890[c0000008dc703890,c0000008dc703890]
ref=c0000008dc7038e8[c0000008dc7038e8,c0000008dc7038e8]
lnd=c0000008dc703900[c0000008dc703900,c0000008dc703900]

LOCK OWNERS:
    lock             user             session          count mode flags
---------------- ---------------- ---------------- ----- ---- ----
c0000008d079f1b8 c0000006151744d8 c0000006151744d8    16 N    [00]
c0000008d4e90c40 c0000006151bcb58 c0000006151bcb58    16 N    [00]
c0000008d0812c40 c0000008151a0438 c0000008151a0438    16 N    [00]

(6) This is a treasure trove of information for debugging. First, it shows the SID and Serial# (226.17), which confirms the session we identified earlier. It shows the exact SQL statement being executed. Finally, it shows all the locks. I didn't particularly care about the specifics of the locks, but this gave me enough information to prove that SID 226 was causing a wait for a whole lot of other sessions.

(7) My investigation was not done; I needed to find the sessions waiting on this one. So I searched the file for a section called PROCESS STATE. Here is a snippet from the file:

PROCESS STATE
-------------
Process global information:
process: c00000081502dab8, call: c000000817167890, xact: 0000000000000000, curses: c00000081519ef88, usrses: c00000081519ef88
----------------------------------------
SO: c00000081502dab8, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=370, calls cur/top: c000000817167890/c000000817167890, flag: (0)
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 115 0 4
    last post received-location: kslpsr
    last process to post me: c000000615002038 1 6
    last post sent: 0 0 24
    last post sent-location: ksasnd
    last process posted by me: c000000615002038 1 6
(latch info) wait_event=0 bits=20
    holding (efd=4) c0000008d7b69598 Child library cache level=5 child#=10
    Location from where latch is held: kglhdgc: child:: latch
    Context saved from call: 13
    state=busy, wlstate=free
    waiters [orapid (seconds since: put on list, posted, alive check)]:
    291 (197, 1219468295, 197)
    279 (197, 1219468295, 197)
    374 (197, 1219468295, 197)
    267 (197, 1219468295, 197)
    372 (197, 1219468295, 197)
    ... several lines snipped ...
    307 (15, 1219468295, 15)
    181 (6, 1219468295, 6)
    waiter count=58
Process Group: DEFAULT, pseudo proc: c0000008e03150d8
O/S info: user: oracrmp, term: UNKNOWN, ospid: 16108
OSD pid info: Unix process pid: 16108, image: oracle@sdwhpdb1

(8) This told me everything I needed to know: there were 58 sessions waiting on the library cache latch held by SID 226. I also knew the OS process ID and the SQL statement of the blocking session.
(9) At that point we engaged the application owner to explain what was going on. As he explained it, he issues the update statement in a loop; and that's not all, he executes it in 8 different threads. No wonder we had library cache latch contention. So we had to track down 8 sessions, not just one. We trudged on. All the sessions had dumped their information, so I searched the directory for other tracefiles showing the same statement:

$ grep "UPDATE DW_ETL" *.trc

(10) And I found 9 more sessions (or, rather, processes). Here is a snippet from another file:

    350 (167, 1219470122, 167)
    197 (167, 1219470122, 167)
    waiter count=185
Process Group: DEFAULT, pseudo proc: c0000008e03150d8
O/S info: user: oracrmp, term: UNKNOWN, ospid: 16114

This process had 185 waiters. Ouch!

(11) Now came a decision point. I knew who was blocking and who was being blocked, although I didn't yet know exactly which latch was being contended for. I could have dumped the library cache latches to get that information, but the application owner volunteered to terminate the sessions. The application, fortunately, was restartable. So we decided to kill all of these errant sessions at the unix prompt:

$ kill -9

(12) After killing a few processes, the database started responding. After killing all of them, the database wait events came back to completely normal; connections were established and applications started behaving normally. After step 11, I could have used a library cache dump to examine the exact library cache object at the center of the contention, but that's a topic for another blog.

Takeaways

(1) When something seems to be hung, don't get hung up on that. A session almost always waits for something; rarely is it just hung. You should check what it is waiting for by selecting the EVENT column of V$SESSION (10g) or V$SESSION_WAIT (9i).

(2) When you can't log on to the database to get that information, try using the oradebug command.

(3) To use oradebug, you should use SQL*Plus.
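With several blockers in play, it helps to rank the tracefiles by how many waiters each process had. A small sketch that scans a directory of tracefiles for the "waiter count" lines shown above and sorts them, highest first (the sample files and counts here are fabricated to mirror the traces in this post):

```shell
# Rank candidate blockers by waiter count across tracefiles.
# Create two sample tracefiles mimicking the dumps shown above.
dir=$(mktemp -d)
printf 'waiter count=58\nospid: 16108\n'  > "$dir/crmprd1_ora_16108.trc"
printf 'waiter count=185\nospid: 16114\n' > "$dir/crmprd1_ora_16114.trc"

# Print "<waiters> <tracefile>", worst offender first.
for f in "$dir"/*.trc; do
    count=$(sed -n 's/.*waiter count=//p' "$f")
    printf '%s %s\n' "$count" "$f"
done | sort -rn
```

The first line of output points at the process worth investigating (or killing) first.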
Since you can't log in normally, use sqlplus -prelim to get the SQL prompt.

(4) Use oradebug setmypid to start the oradebug session, and then use oradebug hanganalyze to create a dumpfile of all hang-related issues.

(5) Use oradebug help to see all oradebug commands.
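The V$SESSION check in takeaway (1) can be as simple as the following sketch (10g column names; filtering out idle waits via WAIT_CLASS is my own convention, not something from the case above):

```sql
-- What is each non-idle session actually waiting for?
SELECT sid, serial#, event, seconds_in_wait
  FROM v$session
 WHERE wait_class != 'Idle';
```

If library cache latch waits dominate this output, the hanganalyze-and-tracefile routine described in this post is a reasonable next step.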