Recently, I got an opportunity to look into cache buffers chains (CBC) latch contention on a 9.2.0.8 database at one of the client's sites. As the number of users connecting to the database increased, its performance degraded, along with poor application response times.
The first thing was to check the Statspack report. Let's see what the "Top 5 Wait Events" had to say:
Top 5 Wait Events
Event                                          Waits    Time (s) Ela Time
----------------------------------------- ------------ --------- --------
latch free                                   1,509,709   280,494    77.44
CPU time                                                  36,721    10.14
db file scattered read                      12,965,784    21,092     5.82
db file sequential read                      8,060,280    11,007     3.04
buffer busy waits                              171,452     7,484     2.07
A few more pieces of information:
Wait Events for DB (Top 3)

                                                       Total Wait    Avg wait    Waits
Event                           Waits     Timeouts       Time (s)        (ms)     /txn
------------------------- ------------ ----------- -------------- ----------- --------
latch free                   1,509,709     755,962        280,494         186     31.9
db file scattered read      12,965,784           0         21,092           2    274.0
db file sequential read      8,060,280           0         11,007           1    170.3

Latch Sleep breakdown for DB  Instance:  Snaps: 812-813
-> ordered by misses desc

                                     Get                  Spin &
Latch Name                      Requests       Misses     Sleeps
----------------------- ---------------- ------------ ----------
multiblock read objects       39,774,699    2,237,038    192,388
cache buffers chains         914,095,905    2,147,734    745,743
library cache                 95,843,429    1,742,253    346,113
Checking for the top SQL query in "SQL ordered by Gets":
                                                     CPU      Elapsd
  Buffer Gets   Executions  Gets per Exec  %Total  Time (s)  Time (s) Hash Value
--------------- ----------- -------------- ------ --------- --------- ----------
     66,719,592         575      116,034.1   14.9   7142.96 ######### 674460927
Checking for the top SQL query in "SQL ordered by Reads":
                                                     CPU      Elapsd
 Physical Reads  Executions Reads per Exec  %Total  Time (s)  Time (s) Hash Value
--------------- ----------- -------------- ------ --------- --------- ----------
     37,702,363         575       65,569.3   57.3   7142.96 ######### 674460927
The SQL with hash value 674460927 is at the top of both "ordered by Gets" and "ordered by Reads". "Top 5 Physical Reads per Segment for DB" showed the table used in that SQL at the very top, with 60% of total physical reads during the one-hour snapshot.
Interestingly, checking the execution plan for that particular SQL query (using EXPLAIN PLAN FOR) showed:
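To see which statement a Statspack hash value actually belongs to, the cursor cache can be queried directly. A quick check along these lines (assuming the cursor is still in the shared pool; in 9i, v$sql is keyed on hash_value):

```sql
-- Look up the offending statement and its workload by Statspack hash value
select sql_text, executions, buffer_gets, disk_reads
  from v$sql
 where hash_value = 674460927;
```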
--------------------------------------------------------------------------------------------
| Id  | Operation                     | Name       | Rows  | Bytes | Cost  |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |            |     3 |    99 |    18 |
|*  1 |  FILTER                       |            |       |       |       |
|   2 |   NESTED LOOPS                |            |     3 |    99 |    18 |
|*  3 |    TABLE ACCESS BY INDEX ROWID| TABLE_A    |     3 |    69 |    12 |
|*  4 |     INDEX RANGE SCAN          | IN_TAB_A   |    10 |       |     3 |
|*  5 |    TABLE ACCESS BY INDEX ROWID| TABLE_B    |     1 |    10 |     2 |
|*  6 |     INDEX UNIQUE SCAN         | IN_TAB_B   |     1 |       |     1 |
--------------------------------------------------------------------------------------------
But checking the plan from v$sql_plan showed a full table scan on TABLE_A. The same was confirmed by TKPROF.
select id, parent_id,
       lpad(' ', level-1) || operation || ' ' || options operation,
       object_name, cost
  from v$sql_plan
 where hash_value = 674460927;

       ID  PARENT_ID OPERATION                                OBJECT_NAME        COST
--------- ---------- ---------------------------------------- ------------ ----------
        0            SELECT STATEMENT                                            10753
        1          0 NESTED LOOPS                                                10753
        2          1  TABLE ACCESS FULL                       TABLE_A            10747
        3          1  TABLE ACCESS BY INDEX ROWID             TABLE_B                2
        4          3   INDEX UNIQUE SCAN                      IN_TAB_B               1
I won't forget to mention and thank Tanel Poder (http://blog.tanelpoder.com/) for the awesome "LatchProf and LatchProfX" scripts (http://www.tanelpoder.com/files/scripts/latchprof.sql).
The TKPROF output for the query with hash value 674460927 showed:
call     count       cpu    elapsed       disk      query    current       rows
------- ------  -------- ---------- ---------- ---------- ----------  ---------
Parse        4      0.00       0.00          0          0          0          0
Execute      4      0.00       0.00          0          4          0          0
Fetch        4      0.00      62.26     471166     472065          0          1
------- ------  -------- ---------- ---------- ---------- ----------  ---------
total       12      0.00      62.27     471166     472069          0          1
To fetch just 1 row from the database, the query performed 472,069 logical I/Os and 471,166 physical reads, which was definitely far too much.
About CBC LATCH
This latch is needed when user processes scan the SGA for database buffer cache buffers. If the same block is needed by multiple sessions, each of them must obtain the cache buffers chains latch to walk the same hash chain and reach the same block. Access to a block is normally a fast operation, but if concurrent users hit a block frequently enough, that simple access can become a bottleneck. The most common occurrence of CBC (cache buffers chains) latch contention is multiple users running nested loop joins on a table and accessing the inner (driven-into) table via an index.
Reducing contention for the cache buffers chains latch usually requires reducing logical I/O rates, i.e. tuning and minimizing the I/O requirements of the SQL involved.
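One quick way to confirm CBC contention on 9i is to rank the "cache buffers chains" child latches by sleeps, and then map the hottest child latch address to the buffers hanging off its chain via x$bh. This is a diagnostic sketch only (x$bh requires SYS access, and the substitution variable is a placeholder for an ADDR value from the first query):

```sql
-- Top 10 hottest CBC child latches by sleeps
select *
  from (select addr, gets, misses, sleeps
          from v$latch_children
         where name = 'cache buffers chains'
         order by sleeps desc)
 where rownum <= 10;

-- As SYS: buffers living on one hot chain, ordered by touch count (tch)
select obj, dbarfil, dbablk, tch
  from x$bh
 where hladdr = '&hot_latch_addr'
 order by tch desc;
```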
The application code was wrapped, so no changes could be made to it. So I thought of playing with the index statistics. The clustering factor of the index IN_TAB_A was almost equal to the number of rows in the table, and hence the index was not being used. Also note the "COST" in the plans shown above: the EXPLAIN PLAN output showed the index being used with a cost of 18, while the actual plan (from v$sql_plan as well as TKPROF) showed a cost of 10,753. So was it a case of EXPLAIN PLAN lying???
I tried to play with the index statistics, but that was not of much use. So, based on the query, I created a composite index and, without gathering statistics on the newly created index, took a 10046 trace of the session running the application. The TKPROF output for the same SQL query (hash value 674460927) showed:
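The clustering factor comparison can be made straight from the dictionary. A sketch (index and table names are the ones from this article; a clustering factor close to NUM_ROWS rather than BLOCKS makes the optimizer price an index range scan almost as high as a full scan):

```sql
-- Compare the index clustering factor against table rows and blocks
select i.index_name, i.clustering_factor, t.num_rows, t.blocks
  from dba_indexes i, dba_tables t
 where i.table_name = t.table_name
   and i.owner      = t.owner
   and i.index_name = 'IN_TAB_A';
```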
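For reference, a 10046 trace of a session's own workload is typically taken along these lines (a standard sketch; the trace file name is a placeholder):

```sql
-- Level 12 = wait events + bind variables in the raw trace file
alter session set events '10046 trace name context forever, level 12';
-- ... run the application workload in this session ...
alter session set events '10046 trace name context off';
```

The raw trace file (written to user_dump_dest) is then formatted with something like `tkprof ora_1234.trc out.prf sys=no`.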
call     count       cpu    elapsed       disk      query    current       rows
------- ------  -------- ---------- ---------- ---------- ----------  ---------
Parse        3      0.00       0.00          0          0          0          0
Execute      3      0.00       0.00          0          4          0          0
Fetch        3      0.00       2.01       1404       4212          0          1
------- ------  -------- ---------- ---------- ---------- ----------  ---------
total        9      0.00       2.02       1404       4216          0          1
The Row Source Operation section in the TKPROF also confirmed that the new composite index was being used, with no more db file scattered read waits. The SQL query was also no longer in the Statspack report.
The reason "without gathering the stats" is in bold above is that the moment statistics were gathered on that new composite index, it was no longer used, and the query went back to a full table scan with excessive logical I/O.
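If leaving an index with no statistics feels too fragile, one alternative (my own suggestion, not what was done here; owner, index name, and value are hypothetical) is to set the index statistics manually, so the optimizer costs range scans against a friendlier clustering factor:

```sql
-- Hypothetical names and value: clstfct is the clustering factor the
-- optimizer will use when costing range scans on this index
begin
  dbms_stats.set_index_stats(
    ownname => 'APP_OWNER',
    indname => 'IN_TAB_A_COMP',
    clstfct => 10000);
end;
/
```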
The latch free wait event dropped from the top of the "Top 5 Wait Events" to fifth position, with 8-15% of total elapsed time. Another major thing noticed was the run queue length on the server: it was around 100-150 during peak hours, so a more powerful server was recommended to the client. A few days later the client bought a new server, and now there is no more latch free wait event in the Statspack report. 🙂
In the end, I would like to thank a few Oracle gurus who were helpful in one way or another:
1. Randolf Geist
2. Tanel Poder
3. ViveK Sharma
4. Kyle Hailey
5. Riyaj Shamsudeen
References:
http://sites.google.com/site/embtdbo/wait-event-documentation/oracle-latch-cache-buffers-chains
http://viveklsharma.wordpress.com/2010/05/13/cache-buffers-chains-latch-contention/
http://www.pythian.com/news/1135/tuning-latch-contention-cache-buffers-chain-latches/
http://tech.e2sn.com/oracle/troubleshooting/latch-contention-troubleshooting
http://orainternals.wordpress.com/2008/07/30/tuning-latch-contention-cache-buffers-chain-latches/
Have you finally solved this problem for your client?
If you want to change the plan of this SQL, you can manually change the clustering factor of the index, or use a stored outline / SQL profile / SQL plan baseline, etc.
Hi,
Thanks for visiting the blog. Yes, the issue is resolved. I tried playing with the clustering factor, but was not able to find an accurate value that would make the query use the index. So I then created a composite index, checked the TKPROF for the module, and it worked. Also, changing the server helped in reducing the run queue length.
Regards,
Anand