Friday, November 18, 2011

CBC Latch Diagnosis & Acquisition Patterns


Cache buffer chain (CBC) latch contention is a common top Oracle wait event. There are a number of interrelated causes but also a number of solutions. The trick is to properly diagnose the problem which results in a short list of solutions.

The CBCs are created as a hashing structure and are primarily used to determine if a block currently resides in the buffer cache. (More->>) As you can image, even the smallest Oracle systems ask, "Is a block in the buffer cache?" a ga-zillion times each day. If CBC access continues to intensify, at some point the time to acquire the desired CBC latch will be a performance problem.

Is there a CBC issue?

When CBC latch contention is raging, you're system is likely to have a crippling CPU bottleneck because the application SQL is concurrently and repeatedly checking if specific blocks reside in the buffer cache. If the answer to, "Is the block in the buffer cache?" is usually, "Yes" then IO reads are minimized and memory structure access is maximized...hence the crippling CPU bottleneck and CBC latch contention.

The OraPub System Monitor (OSM and OSM) script I use to interactively determine overall time situation is rtpctx.sql. Here's an example of a 707 second interval.
Figure 1. Typical CBC latch contention result based on rtsysx.sql, response time report.

In Figure 1, the rtsysx.sql output the "% WT" column shows the percentage of wait time by wait event over the report interval. The "% RT" column shows the percentage of the total response time (CPU time and Wait time). The values in Figure 1 are typical when there is a very serious CBC latch contention issue. An AWR or Statspack report will tell a similar story; the top wait event being "latch: cache buffers chains" and most of the system's CPU resources being consumed by Oracle.

While there are a number of causes for CBC latch contention, I tend to see two CBC acquisition patterns. The first pattern is when many CBC latches are very active, that is, the access pattern is very disperse. The second pattern when a single CBC latch is very active. So once you know there is a significant CBC issue, the next step is to determine the acquisition pattern characteristic. Read on!

Determining CBC Wait Pattern

To determine the CBC wait pattern, you can run a very simple script like this:
select p1 latch_addr,
       p2 latch_#
from   v$session
where  status    = 'ACTIVE'
  and  wait_time = 0
  and  event     = 'latch: cache buffers chains'
/

LATCH_ADDR    LATCH_#
---------- ----------
1993834384        150
1993834384        150
1993834384        150
1993834384        150

4 rows selected.
Notice that all four sessions are sleeping (i.e., they are posting the wait event) while trying to acquire the same CBC latch (note latch address is identical). While the above snippet and the result are interesting, you could easily misled from this single sample. A more statically sound method is to gather multiple samples. Using my OSM script, latchchild.sql we can a sample each second to glean from statistical analysis. The latchchild.sql script essentially does this:
def latch=150
def sleep=300
drop table op_interim
/
create table op_interim as
select addr,gets,sleeps
from   v$latch_children
where  latch# = &latch
/
exec dbms_lock.sleep(&sleep);
select t1.addr,
       t1.gets-t0.gets delta_gets,
       t1.sleeps-t0.sleeps delta_sleeps
from   op_interim t0,
       (
         select addr,gets,sleeps
         from   v$latch_children
         where  latch# = &latch
       ) t1
where  t0.addr = t1.addr
order by 3,2
/
Below is some actual output. The "delta" columns are simply the difference between the beginning and ending values for gets and sleeps. Notice there is not a massive gap between the delta_gets and delta_sleeps  and there is not a single (or a few) latch that is significantly more active then the others. This would be classified as dispersed CBC latch contention
ADDR             DELTA_GETS     DELTA_SLEEPS
-------- ------------------ ----------------
...
76D1ABC4            30,2356                5
76D23210            30,9631                5
76C8413C            32,4284                5
76C75428            23,2780                6
76CFCFAC            24,7324                6
76DB79B0            24,7332                6
76BFE7A8            25,3808                6
76DB9C98            28,1330                6
76C534AC            32,3395                6
76D0BCC0            33,7938                6
76C17DF0            20,3694                7
76C04DF4            24,0050                7
76DE64CC            29,5872                7
76BF5EF0            23,2782                8
76D05864            27,4886                8

1024 rows selected.
To really grasp the situation, a visual histogram based on the sleeps is very useful.

Figure 2. Histogram of the number of CBC latch address and their respective sleep activity.
Figure 2 is a standard histogram I copied from the statistical analysis Mathematica notepad (you can download below). Just over 600 CBC latches have zero sleeps while only two CBC latches had eight sleeps. Notice that while there are differences in the number of sleeps, we don't see a pattern with a massive jump like; 0,0,0,1,2,3,5,6,7,1021. Again, this is an example of dispersed CBC latch contention. But sometimes the situation is not very dispersed, but singular towards just a couple or perhaps even one single CBC latch!
Figure 3. Histogram of the number of CBC latch address and their respective sleep activity.
Figure 3 is the result, as we'll see, of three very popular buffers which are each related to three different three buffer chains. While nearly 1000 CBC latches have zero sleeps (far left vertical bar in histogram), just as important is there is an obvious large delta_sleeps gap near the most active delta_sleeps CBC latches. For some people, the histogram tells a better story, but other others the below numeric snippet better captures the situation. (I personally like to use both.)
ADDR             DELTA_GETS     DELTA_SLEEPS
-------- ------------------ ----------------
...
76C33818            19,8038                3
76C644A8             8,1535                4
76C7759C             9,6993                4
76C2D1CC            14,5096                4
76E03FEC            14,9355                4
76CDD394            16,0718                4
76C68904            18,2300                4
76D69374             6,5250                5
76D7800C            13,3134                5
76DA2650            15,6578                5
76C02B88            15,8293                5
76CE59E0            14,5169                7
76C7B9F8            15,8243                7
76CFF120             6,5228                9
76CFCF30            14,8187               11
76D38668            96,6345               62
76CDF508            98,1384               96
76DAF26C          1,96,2752              187

1024 rows selected.
What is important to us is that there is a substantial delta_sleeps gap separating the top three CBC latches from the rest of the pack.

I'm hoping you can see the differences in these access patterns. I would suggest running the latchchild.sql script on one of your systems to see this for yourself.

If you want to see the latchchild.sql output (latchchild.txt) and the statistical details for above CBC activity, here are the links:
  1. Disperse CBC latch contention (latchchild.txt, PDF, Mathematica Notepad)
  2. Singular CBC latch contention (latchchild.txt, PDF, Mathematica Notepad)
Solutions for Disperse CBC Latch Contention

Figure 2 and the code snippet directly above Figure 2 show a typical example of what you'll likely see when many CBC latches are active enough to cause a significant performance problem. Here's the likely situation: There is an intense CBC latch situation along with a raging CPU bottleneck, and you can probably easily see the heavy logical IO (sysstat.session logical IO) SQL as well. There are a number of solutions, with some of them listed below.

An Oracle focused solution is to increase the number of CBC latches by increasing the hidden instance parameter, _db_block_hash_latches.

An application focused solution is to find the most logical IO intensive SQL and reduce the LIO's by executing it less often or tuning it. Either way, your objective is to reduce the LIOs generated during times of critical performance.

An operating system focused solution is to increase CPU resources by removing CPU consuming process if possible, adding more CPU cores, or increasing CPU speed.

There are of course other solutions, but I think you get the idea.

Solutions for Singular CBC Latch Contention

Figure 3 and the output snippet directly below it are typical when there is intense CBC contention focused on one or perhaps a few CBC latches. When this is the situation, additional diagnosis is needed to determine specifically why the intense singular activity is occuring. This is the topic of my next posting...

Summary

The Cache buffer chain (CBC) structure is used to answer the question, "Is this block in the buffer cache?" At some point, this question can get asked enough to cause significant performance problems known as CBC latch contention. While there are a number of causes for CBC latch contention, I tend to see two CBC acquisition patterns: The first pattern is when many CBC latches are very active, that is, the access pattern is very disperse. The second pattern when a single CBC latch is very active. So once you know there is a significant CBC issue, the next step is to determine the acquisition pattern characteristic.

In this posting I focused on how to determine the CBC latch contention acquisition pattern; disperse or singular. I then presented some disperse CBC latch contention solutions. In my next posting I'll focus on additional steps to diagnose singular CBC latch contention, two common situations, and some possible solutions.

Thanks for reading!

Craig.

Friday, November 11, 2011

Are you sure it's the index root block?

The Situation

Suppose you want to check if a specific Oracle block is an index root block. Why? Here are two very real situations. You notice a specific block is very active and want to know if it's an index root block. Even more common is, perhaps there is a very active cache buffer chain latch related to a specific block/buffer and you want to know if this hot buffer is an index root block. Besides these very real examples, it's also an interesting journey into Oracle internals!

Folklore States...

Some very respectable blogs and a simple test I ran indicate an index root block is the block after it's segment header block.

Figure 1. Diagram of an Oracle index segment, highlighting the index root block.
Figure 1 is a diagram of an Oracle index segment. If it wasn't for the index root block, Figure 1 would be a good diagram for any Oracle segment. The light blue colored block is the segment header block. Notice the orange colored index root block follows the segment header bock.

As mentioned above, folklore says if the segment is indeed an index, then the orange block will be the index root block. And not just now, but for the life of the index! Wow... This is a pretty strong statement and one that needs to be tested. So that's what I did and what this posting is all about.

It's Kind of Complicated

We need to determine if the block following an index segment header block is the index root block... for always and forever until the index is dropped. First, just dump the index and locate the root block's data block address (DBA). Second, get the DBA for the block following the index segment header block. And finally, compare them. If they match, then we have shown a situation where the block following the index segment header block is indeed the index root block. So let's do that.

Once we get the object_id from dba_segments, here's how to dump an index:
alter session set events 'immediate trace name treedump level :ObjectId';
And here's a snippet of the trace file from near the top.
...
----- begin tree dump
branch: 0x4c5461 5002337 (0: nrow: 8, level: 2)
   branch: 0x4c575e 5003102 (-1: nrow: 141, level: 1)
      leaf: 0x4c5462 5002338 (-1: nrow: 96 rrow: 96)
      leaf: 0x4c63b7 5006263 (0: nrow: 78 rrow: 78)
...     
      leaf: 0x4c554d 5002573 (139: nrow: 100 rrow: 100)
   branch: 0x4c63c7 5006279 (0: nrow: 213, level: 1)
      leaf: 0x4c629d 5005981 (-1: nrow: 88 rrow: 88)
      leaf: 0x4c554e 5002574 (0: nrow: 60 rrow: 60)
      leaf: 0x4c62a0 5005984 (1: nrow: 54 rrow: 54)
...
The first/top mentioned "branch" block is the index's root block. In this case, the index root block has a data block address (DBA) of 5002337. Now let's get the data block address for the block after the index's segment header block. But first we need to get the file number and block number of the index segment header block.
SQL> select header_file, header_block
  2  from dba_segments
  3  where segment_name = 'CH_6_IRB_I';

HEADER_FILE HEADER_BLOCK
----------- ------------
          1       808032
Now let's get the data block address (DBA) for the block just following the header block. We must remember to add one to the header block number, so the block number we are interested in is 808033.
SQL> select dbms_utility.make_data_block_address(1,808033) from dual;

DBMS_UTILITY.MAKE_DATA_BLOCK_ADDRESS(1,808033)
----------------------------------------------
                                       5002337
Do you see it? The DBA just above (with the header block + 1) matches the first/top "branch" block's DBA (5002337) from the index trace file! So now we know how to check if the block following the index's segment header block is truly the index root block.

Now the question becomes, does it always remain this way? For example, what if create the table, create the index, and then insert rows into the table? Or what if we create the table, then insert rows, and finally create the index? If that's not enough, how about this: What if the index grows and splits? Or how about if we delete all the table's rows, insert rows until the index splits? Or how about if it we truncate the related table? As you can see, there are an infinite number of possibilities and there is no way we can test all of them.

The Experimental Setup

I created a number of tests that could be repeatedly run and easily modified and extended. There are two related scripts. The driving script is a SQL script called, doRbExpr.sql and takes a single argument, called the prefix. This prefix is the begining name of all the objects the script creates. This allows you to quickly and easily re-run the script without first removing all the objects from the previous run. The second script, getIdxRtBlk.sql, retrieves the index root block's DBA from both the data dictionary and by dumping the index, and then nicely displays them so you can easily see if there is a difference. I also show the index depth (blevel) as an added test to help ensure I'm looking at the current statistics.

The Experimental Results

Click here to see the results. As you can see, in every case the DBA of the index segment header block plus one, matches the index trace file's root block. I have rerun this test many times, and the results are always the same.

What Does This Prove?

Actually the experiments prove very little, yet they yield a tremendous value. The experiments clearly and repeatedly demonstrate that I have not found a way to disprove an index root block is the block immediately following its segment header block. All it would take is just one of my experiments to break the "block after" rule... but I could not break the rule!  If you can devise a situation to break the rule, please let me know and I'll post it.

So next time you need to check if a particular block is an index root block, simply get it's segment header file and block number, add one to the block number, and compare. In my opinion, that's much easier and faster than dumping the index, parsing it, etc.

Thanks for reading!


If you enjoyed this blog entry, I suspect you'll get a lot out of my courses; Oracle Performance Firefighting and Advanced Oracle Performance Analysis. I teach these classes around the world multiple times each year. For the latest schedule, click here. I also offer on-site training and consulting services.

P.S. If you want me to respond with a comment or you have a question, please feel free to email me directly at craig@orapub .com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email. Another option is to send an email to OraPub's general email address, which is currently orapub.general@gmail .com.