I had an interesting encounter with latch: cbc contention early this week. During my oncall I received page for Load of 206.81 exceeded threshold of 150. After I logged into server , the server load average was continously increasing and all the top PIDs were of oracle processes. After logging into database, I saw multiple sessions waiting on latch: cache buffer chains wait event
load average: 258.52, 244.27, 226.15 select username,sql_id,event,count(*) from v$session where event like 'latch%' group by username,sql_id,event order by 4; USERNAME SQL_ID EVENT COUNT(*) ------------------------- ------------- ------------------------------ ---------- USER_EAS4 5j9xrj07xqysz latch free 1 USER_EAS7 4f6dfhuxbz18y latch free 1 USER_EAS7 a1wwm6tpukyh8 latch free 1 USER_EAS7 c1sa78t1rvxkq latch free 1 USER_EAS7 gc4xfk544n98v latch free 2 USER_EAS7 bmhmdqvt9c2qz latch free 2 USER_EAS7 4ycfn45y3vj5p latch free 8 USER_EAS7 4ycfn45y3vj5p latch: cache buffers chains 23 USER_EAS7 9cg18w6mfqphn latch: cache buffers chains 51 USER_EAS7 9cg18w6mfqphn latch free 144
sql monitor report of one session for sqlid 9cg18w6mfqphn
Global Stats ======================================================= | Elapsed | Cpu | Concurrency | Other | Buffer | | Time(s) | Time(s) | Waits(s) | Waits(s) | Gets | ======================================================= | 2243 | 2132 | 0.00 | 111 | 260M | ======================================================= SQL Plan Monitoring Details (Plan Hash Value=2956480150) ============================================================================================================================================================= | Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail | | | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) | ============================================================================================================================================================= | 0 | SELECT STATEMENT | | | | | | 1 | | | | | 1 | FILTER | | | | | | 1 | | | | | 2 | PARTITION RANGE SINGLE | | 18 | 133 | | | 1 | | | | | 3 | TABLE ACCESS BY LOCAL INDEX ROWID | MESSAGE_DEMO_CTC | 18 | 133 | | | 1 | | | | | -> 4 | INDEX RANGE SCAN | MESSAGE_DEMO_CTC_ID6 | 18 | 133 | 2246 | +1 | 1 | 0 | 100.00 | Cpu (2194) | =============================================================================================================================================================
AWR performance trend for the sqlid for past 5 days:-
@awr_sqlid_perf_trend 9cg18w6mfqphn 5 .25 TIME EXECUTIONS ET_S_1EXEC CPU_TIME_S_1EXEC IOWAIT_S_1EXEC CLWAIT_S_1EXEC APWAIT_S_1EXEC CCWAIT_S_1EXEC ROWS_PROCESSED_1EXEC BUFFER_GETS_1EXEC DISK_READS_1EXEC ------------------- ----------- ------------ ---------------- -------------- -------------- -------------- -------------- -------------------- ----------------- ----------------- 23.09.2015 00:15:00 12 117.923 117.667 .001 .000 .000 .000 .000 23049569.333 .083 23.09.2015 00:30:00 14 1284.493 1280.963 .002 .000 .000 .000 .000 243817703.571 1.214 23.09.2015 00:45:00 26 1579.532 1574.306 .016 .000 .000 .000 .000 .000 5.962 23.09.2015 01:00:00 41 1729.918 1718.699 .002 .000 .000 .000 .000 26916357.415 .317 23.09.2015 01:15:00 26 3850.767 3739.324 .011 .000 .000 .000 .000 .000 1.731 23.09.2015 01:30:00 27 4565.527 3868.424 .017 .000 .000 .000 .000 82060590.481 4.407
At the first glance it seemed to me as the sql was newly introduced to the database as AWR showed it from the same day, even though I had asked from 5 days snapshots.
The buffer gets per execution for the sql was too high and with multiple sessions with the same sqlid caused latch cbc and in turn high load. So, had to find what was causing this high buffer gets.
The execution plan for sql looked fine. The stats were fine and the index picked by the optimizer was fine too.At this point came in “latchprofx” by Tanel Poder 🙂
SELECT * FROM ( SELECT event , TRIM(TO_CHAR(p1, 'XXXXXXXXXXXXXXXX')) latch_addr , TRIM(ROUND(RATIO_TO_REPORT(COUNT(*)) OVER () * 100, 1))||'%' PCT , COUNT(*) FROM v$active_session_history WHERE event = 'latch: cache buffers chains' AND session_state = 'WAITING' GROUP BY event , p1 ORDER BY COUNT(*) DESC ) WHERE ROWNUM <= 10 / EVENT LATCH_ADDR PCT COUNT(*) ----------------------------------- ----------------- ---------------- ---------- latch: cache buffers chains 7B9A57DB50 92.2% 448 latch: cache buffers chains 7AE4EBCD68 2.5% 12 latch: cache buffers chains 7B682705B8 1% 5 latch: cache buffers chains 7B6FB0AFD0 .8% 4 latch: cache buffers chains 7ADA39D6B8 .4% 2 latch: cache buffers chains 7AE4D206F8 .2% 1 latch: cache buffers chains 7B4A1A04B0 .2% 1 latch: cache buffers chains 7B475726C0 .2% 1 latch: cache buffers chains 7ACB819888 .2% 1 latch: cache buffers chains 7AC6958BA8 .2% 1 @latchprofx sid,name,sqlid,object % 7B9A57DB50 100000 -- LatchProfX 1.21 by Tanel Poder ( http://www.tanelpoder.com ) SID NAME SQLID OBJECT Held Gets Held % Held ms Avg hold ms ----- ----------------------------------- --------------- ----------------- ---------- ---------- ------- ----------- ----------- 7155 cache buffers chains 9cg18w6mfqphn 1ECA461B 4 4 .00 2.025 .506 1696 cache buffers chains 4ycfn45y3vj5p 1ECA461B 3 3 .00 1.519 .506 8689 cache buffers chains 9cg18w6mfqphn 1ECA461B 3 3 .00 1.519 .506 3809 cache buffers chains 9cg18w6mfqphn 1ECA461B 3 3 .00 1.519 .506 3969 cache buffers chains 9cg18w6mfqphn 1ECA461B 3 3 .00 1.519 .506 2742 cache buffers chains 9cg18w6mfqphn 1ECA461B 2 2 .00 1.013 .506 2482 cache buffers chains 4ycfn45y3vj5p 1ECA461B 2 2 .00 1.013 .506 439 cache buffers chains 9cg18w6mfqphn 1ECA461B 2 2 .00 1.013 .506 1514 cache buffers chains 9cg18w6mfqphn 1ECA461B 2 2 .00 1.013 .506 6894 cache buffers chains 9cg18w6mfqphn 1ECA461B 2 2 .00 1.013 .506 6524 cache buffers chains 9cg18w6mfqphn 1ECA461B 2 2 .00 1.013 .506 6101 cache buffers chains 9cg18w6mfqphn 1ECA461B 2 2 .00 1.013 .506 3545 cache buffers chains 9cg18w6mfqphn 1ECA461B 2 2 .00 1.013 .506
All rows point to object “1ECA461B”
@dba 1ECA461B RFILE# BLOCK# DUMP_CMD ---------- ---------- ------------------------------------------------ 123 673307 -- alter system dump datafile 123 block 673307 select name from v$datafile where file#=123; NAME --------------------------------------------------------------- +DATA01/usrprd04/datafile/undotbs1.1134.818027637
Ok, so something related to undo!!!!.
The alert log showed ORA-01555 for an Insert SQL in the same table
Wed Sep 23 02:44:32 2015 ORA-01555 caused by SQL statement below (SQL ID: 72pwjk9bxyygw, Query Duration=13521 sec, SCN: 0x0891.8fff8587): INSERT INTO message_demo_ctc (.....) select ........ Wed Sep 23 02:45:26 2015 Completed checkpoint up to RBA [0x2a16.2.10], SCN: 9421305667060 User/Pid Megs Used MINUTES_ACTIVE MODULE Recs upd/del -------------------- --------------- -------------- -------------------------------------------------- ------------ USER_EAS7/(3858) 392.82 264 cli@mesagae_demo_ctc.php 3,945,675
Once the rollback completed, eventhing was back to normal. The buffer gets for sqlid ‘9cg18w6mfqphn’ was down to 24,950 from ~652617k. We had waited for almost 30-40mins for the rollback, could have killed the SID and SMON would have made rollback using parallel processes, which could have been faster.
Nice analysis !! Thanks
Thanks Advait for visiting the blog. Cheers!!