Monday, November 12, 2012

Total Time and Bits of Time

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

Our Experience Tells Us So...

We know from hard-fought experience that the more work we push through a system the more processing time involved. That's not rocket science... but to mathematically represent the relationship between work and time is rocket science.

Decades ago Operations Research queuing theory researchers quantitatively established the relationship between work and time. They put into math what we all have experienced. In our DBA world this means that as the system activity increases, then so does CPU consumption and eventually Oracle non-idle wait time.

Find The Big Bar

The classic way to graphically represent all the database related time is by a vertical bar. (Currently in vogue is relating the bar to the number of average active sessions.) We have all seen Oracle performance monitors (some much better than others) representing the database related time over a period of time (e.g., 10 minutes or one hour) as big vertical bar. Many of us have been told when there is a performance issue, look for the "big bar." Why? Because "the database is busier." And usually this works. But there is a hidden assumption that is not accounted for in this line of thinking which can through off our analysis.

Just because the bar gets bigger doesn't mean there is a performance problem. It can simply mean the system is processing more work; which requires more CPU resources and perhaps there is more wait time involved. So it's more than just the "big bar." It's about both the processed work and the associated time.

Little Bits of Time

For each little piece of work processed (think: buffer get) some CPU is consumed and perhaps there is some associated non-idle wait time. An operations research person would call this response time, that is the time it takes to process a single piece of work (think: buffer get). Upon closer inspection, which I won't go into here, the CPU consumed can be represented by what is called service time and the wait time can be represented by what is called queue time. Therefore, response time is queue time plus service time.

Rt = Qt + St = Qt (time/work) + St (time/work) = Qt (wait time/work) + St (cpu time/work)

Summing the Littles Bits of Time

The big bar, that is the total time, is the sum of all the little bits of time or some interval, such as 10 minutes or perhaps an hour. This can be represented a number of ways, such as:

Total Time = Work Processed X Time to process one piece of work
Total Time = Work Processed X Response Time
Total Time = Work X Time/Work = Time to process all the work

When you think about it, the big bar is made up of little bits of time all stacked on top of each other. Each time a piece of work is processed the bar will get a little taller. Because each time a piece of work is processed a little CPU is consumed and there may be some associated wait time.

Suppose when each piece of work is processed 10 ms of CPU is consumed along with 5 ms of non-idle wait time. It is logical to assume/predict/anticipate that processing 20 pieces of work will take 300 ms. ( 300 ms = 20 work pieces * ( 10 ms/ work piece + 5 ms / work piece ) I demonstrated this in my post entitled Anticipated SQL Elapsed Times.

Workload Intensity

The workload intensity is a rate. It is the amount of work processed over a specific interval of time. For example, if 10500 pieces of work are processed over a 10 minute interval, the average workload intensity is 17.5 pieces of work per second.  ( 10500 pieces of work / 10 minute interval) In the land of queuing this is called the arrival rate. (Technically the arrival rate is what actually arrives into the system, but in an operationally stable system what arrives equals what departs.) By the way, the reason you see the arrival rate represented by the letter L is because when convenient it is always shown as the Greek letter lambda.

What's In The Big Bar

Suppose little pieces of work arrive into the system at a rate of 50 per ms, that is 50 work/ms. It follows that after 1000 ms, 50000 pieces of work will have entered the system. Also, assume each piece of work takes 15 ms to process (it's response time). Therefore the accumulated service time and non-idle wait time will be 750000 ms. ( 750000 ms = 1000 ms X 50 work/ms X 15 ms/work) So the "big bar" will be 750000 ms "tall."

Relating Work with Time

Decades ago, the queuing theory guys mathematically described the relationship between work and time. When there is a single queue (think: CPU run queue, not run queues) sending requests to many "servers" (think: CPU cores) then response time is represented like this:

Rt = St / ( 1 - (( St * L )/M)^M )

where:

St is service time
L is the workload intensity, called the arrival rate.
M is the number of "servers" processing, that is, servicing the work.

Let's related the time to process a single piece of work (Rt) to the total time (big bar). Suppose over a 10 minute snapshot interval 10500 pieces of work (e.g., logical IOs) where processed and it took 15 ms to process each piece of work. Therefore, the total time (big bar height) would be:

Total Time = Work Processed X Rt = 10500 lio X 15 ms/lio = 157500 ms = 157.500 sec

So the height of the "big bar" for this 10 minute interval would be 158 seconds "tall."

Seeing is Believing

Math is great, but pictures are better! If what I have written is correct and the math correct, I should be able to show the Total Time changing as the Work Processed changes. And wouldn't it be great if there was an animated tool so I can visually see and experience this relationship? Well... that's what I did. I created a relatively simple tool to visually show the relationship between work and time.

You can download this free tool from the OraPub.com web site. Just do an OraPub search for "time tool". Below is a picture of the tool as it was initially released (it has seen be updated).
Looking at the figure above, the chart on the left is a classic response time curve. The plotted point is the time it takes to process a single piece of work at the associated arrival rate. The curve represents the response time path the point will take as the arrival rate changes. We know in our gut that as the arrival rate increases (point moves from left to right) eventually the system will not be able to respond as quickly... and this is when the point enters into the "elbow" of the curve.

Technically what is occurring is while the service time is constant, the queue time begins to increase. Think of it like this: If it takes 3 minutes to cook a burger (i.e., service time), it will still take 3 minutes to cook a burger regardless of how many people are waiting (i.e., queuing) to get a burger.

Looking at the figure above, the chart on the right is the total time chart, for a single time interval. Which could be, for example 10 ms or 1 hour. (However in the tool, I set the time interval to one unit of arrival rate time.) What is important to understand is that as the arrival rate increases more CPU is consumed (the green bar gets bigger)... even though the CPU required to service a single piece of work remains constant! (more below) Eventually the system can't process all the arrivals immediately and the pieces of work begin to wait, that is queue. This is when we start seeing the red appear in the bar. And is also when the response time begins to increase.

Looking at the "Big Bar"

What I think is really interesting is while the service time (think: time to cook a burger) does not change regardless of the workload, the total time service (think: time to cook lots of burgers) increases as the workload increases. This is why the green bar continues to increase as the arrival rate increases.

The three figures below show the situation at three different arrival rates; 3.25, 3.50, and 3.75. If you look closely as the arrival rate increases the green bar does in fact get bigger, just not that much.
The response time situation at an arrival rate of 3.25

The response time situation at an arrival rate of 3.50

The response time situation at an arrival rate of 3.75

Looking at and comparing the three above figures, notice once the point enters the elbow of the response time curve the big bar's red components begins to rapidly increase! While the green continues to increase, is does not increase nearly as much as the red bar! This is a graphical way to communicate to others why you do not want your systems operating in the elbow of the curve; relatively slow and highly variable response times means potentially very slow and inconsistent SQL elapsed times and batch job durations.

Creating a number of pictures is great, but it's awesome to watch the change before your eyes. If you download this tool (URL at top of this section) you can change the snapshot interval, the arrival rate, and watch the situation change before your eyes. I created a short video using the 1b version of the tool; while making some comments and then uploaded it to YouTube. I also posted the video into this blog below.

video
Pretty cool, eh? I love seeing what I have experienced and know in my gut is backed up with mathematics. That gives me so much satisfaction!

The Take-Aways

There are a number of take-aways that we can apply to our work and to reinforce what we have personally experienced.
  • There is a relationship between work and time.
  • As the workload increases, so does the total time to process the work.
  • Operations Research queuing theory mathematically describes the relationship between work and time.
  • Response time is the time to process a single piece of work.
  • The response time curve is a graphical way to represent the the relationship between workload intensity (arrival rate) and the time to process a single piece of work (response time).
  • Total time is the time to process many pieces of work, over a specified interval.
  • The "big bar" chart is a great way to visually see the total time and it's components.
  • When in the "elbow" of the response time curve, small changes in the arrival rate result in big changes in the response time; erratic response time.
There are many, many more take-aways but that's good enough for now!

Thanks for reading!

Craig.

If you enjoy my blog, I suspect you'll get a lot out of my courses; Oracle Performance Firefighting,  Advanced Oracle Performance Analysis, and my one-day Oracle Performance Research Seminar. I teach these classes around the world multiple times each year. For the latest schedule, click here. I also offer on-site training and consulting services.

P.S. If you want me to respond to a comment or you have a question, please feel free to email me directly at craig@orapub .com. Another option is to send an email to OraPub's general email address, which is currently orapub.general@comcast .net. 

Monday, September 17, 2012

How to create LOTS of child cursors.

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

The Back Story

With all the blog postings and forum mishegas about mutex issues, bind mismatch, and child cursors, you would think it would be a simple thing to create hundreds if not thousands of child cursors. Well... it wasn't.

My original intention for this posting was to gather experimental evidence about the performance impact of having hundreds or thousands of child cursors. While it was easy to create a few child cursors (details below), it was difficult to create hundreds of child cursors. I'm also noticing what many DBAs think what is a "bind mismatch" is really not a bind mismatch. So I figured it was worth a separate posting about how to create child cursors using different methods and also how to easily create lots of child cursors.

Why So Difficult?

It was actually very frustrating. Sure it's easy to create a few child cursors. But it was not easy to create hundreds or to cause the infamous "bind mismatch." Why?

I suspect it's a combination of using current versions of Perl's DBI, PLSQL, adaptive cursor sharing, and more current Oracle releases. Oracle knows having hundreds or thousands of child cursors can cause severe performance issues, so I suspect (perhaps hope) they have been and are still working on ways to reduce the number of child cursors.

Plus I'm not using Java in my experiments. When I have talked with DBAs who have personally experienced this mutex contention related to hundreds of child cursors, sloppy Java coding always seems to be at the center.

Seeing What's Happening

This entry is centered on the shared pool's library cache, cursors, and child cursors. For a good introduction including jaw-dropping visuals, check out this blog posting. Every one of our production SQL statements has a related cursor consuming memory in the library cache along with at least one child cursor. Here is a simple example of how you can check this out.
SQL> ALTER SYSTEM FLUSH SHARED_POOL;

System altered.

SQL> select count(*) from dba_objects where object_id=123;

  COUNT(*)
----------
         1

SQL> select sql_id, plan_hash_value, child_number, executions
  2  from   v$sql
  3  where  sql_text like 'select count%%object_id=123';

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS
------------- --------------- ------------ ----------
bx8xqbbk0q4hh      3299497030            0          1
Let's create another child cursor... it's easy: change the optimizer mode and then re-run the exact same SQL statement.
SQL> alter session set optimizer_mode=first_rows_1;

Session altered.

SQL> select count(*) from dba_objects where object_id=123;

  COUNT(*)
----------
         1

SQL> select sql_id, plan_hash_value, child_number, executions
  2  from   v$sql
  3  where  sql_text like 'select count%%object_id=123';

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS
------------- --------------- ------------ ----------
bx8xqbbk0q4hh      3299497030            0          1
bx8xqbbk0q4hh      3299497030            1          1
But why was another child cursor created? Let's ask Oracle. Each SQL ID child cursor has a row in v$sql_shared_cursor. For each cursor other than the initial (child number 0), one of the 63 possible flags will be set to a Y.

For example, let's find out, by the way of v$sql_shared_cursor, why the above example SQL statement has two cursors.
SQL> select *
  2  from   v$sql_shared_cursor
  3  where  sql_id='bx8xqbbk0q4hh';

SQL_ID        ADDRESS          CHILD_ADDRESS    CHILD_NUMBER U S O O S L F E B P I S T A B D L T B I I R L I O E M U T N F
------------- ---------------- ---------------- ------------ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
A I T D L D B P C S C P T M B M R O P M F L P L A F L R L H P B
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
bx8xqbbk0q4hh 00000002B3A6BE50 00000002B3AAF320            0 N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N
N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N

bx8xqbbk0q4hh 00000002B3A6BE50 00000002B393A718            1 N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N
N N N N N N N N N N N N N N N N N Y N N N N N N N N N N N N N N
Notice the first child cursor (0) is all N's. That's because you will always have a child cursor for our SQL statements. (I wrote "our" because it is common for LC objects to not have a child cursor. If you do a library cache dump, you'll notice this surprising phenomenon.)

If you look closely at child cursor 1, you will notice the "Y" is at the 49th position. What is the 49th position mean? Look at the 49th reason. I added some numbers to the columns to make finding the reason (column name) easier:
SQL> desc v$sql_shared_cursor
 Name                       Type
 -------------------------- -------------
 SQL_ID                     VARCHAR2(13)
 ADDRESS                    RAW(8)
 CHILD_ADDRESS              RAW(8)
 CHILD_NUMBER               NUMBER
1 UNBOUND_CURSOR            VARCHAR2(1)
 SQL_TYPE_MISMATCH          VARCHAR2(1)
 OPTIMIZER_MISMATCH         VARCHAR2(1)
 OUTLINE_MISMATCH           VARCHAR2(1)
 STATS_ROW_MISMATCH         VARCHAR2(1)
 LITERAL_MISMATCH           VARCHAR2(1)
 FORCE_HARD_PARSE           VARCHAR2(1)
 EXPLAIN_PLAN_CURSOR        VARCHAR2(1)
 BUFFERED_DML_MISMATCH      VARCHAR2(1)
10 PDML_ENV_MISMATCH        VARCHAR2(1)
 INST_DRTLD_MISMATCH        VARCHAR2(1)
 SLAVE_QC_MISMATCH          VARCHAR2(1)
 TYPECHECK_MISMATCH         VARCHAR2(1)
 AUTH_CHECK_MISMATCH        VARCHAR2(1)
 BIND_MISMATCH              VARCHAR2(1)
 DESCRIBE_MISMATCH          VARCHAR2(1)
 LANGUAGE_MISMATCH          VARCHAR2(1)
 TRANSLATION_MISMATCH       VARCHAR2(1)
 BIND_EQUIV_FAILURE         VARCHAR2(1)
20 INSUFF_PRIVS             VARCHAR2(1)
 INSUFF_PRIVS_REM           VARCHAR2(1)
 REMOTE_TRANS_MISMATCH      VARCHAR2(1)
 LOGMINER_SESSION_MISMATCH  VARCHAR2(1)
 INCOMP_LTRL_MISMATCH       VARCHAR2(1)
 OVERLAP_TIME_MISMATCH      VARCHAR2(1)
 EDITION_MISMATCH           VARCHAR2(1)
 MV_QUERY_GEN_MISMATCH      VARCHAR2(1)
 USER_BIND_PEEK_MISMATCH    VARCHAR2(1)
 TYPCHK_DEP_MISMATCH        VARCHAR2(1)
30 NO_TRIGGER_MISMATCH      VARCHAR2(1)
 FLASHBACK_CURSOR           VARCHAR2(1)
 ANYDATA_TRANSFORMATION     VARCHAR2(1)
 INCOMPLETE_CURSOR          VARCHAR2(1)
 TOP_LEVEL_RPI_CURSOR       VARCHAR2(1)
 DIFFERENT_LONG_LENGTH      VARCHAR2(1)
 LOGICAL_STANDBY_APPLY      VARCHAR2(1)
 DIFF_CALL_DURN             VARCHAR2(1)
 BIND_UACS_DIFF             VARCHAR2(1)
 PLSQL_CMP_SWITCHS_DIFF     VARCHAR2(1)
40 CURSOR_PARTS_MISMATCH    VARCHAR2(1)
 STB_OBJECT_MISMATCH        VARCHAR2(1)
 CROSSEDITION_TRIGGER_MISMATCH VARCHAR2(1)
 PQ_SLAVE_MISMATCH          VARCHAR2(1)
 TOP_LEVEL_DDL_MISMATCH     VARCHAR2(1)
 MULTI_PX_MISMATCH          VARCHAR2(1)
 BIND_PEEKED_PQ_MISMATCH    VARCHAR2(1)
 MV_REWRITE_MISMATCH        VARCHAR2(1)
 ROLL_INVALID_MISMATCH      VARCHAR2(1)
 OPTIMIZER_MODE_MISMATCH    VARCHAR2(1)
50 PX_MISMATCH              VARCHAR2(1)
 MV_STALEOBJ_MISMATCH       VARCHAR2(1)
 FLASHBACK_TABLE_MISMATCH   VARCHAR2(1)
 LITREP_COMP_MISMATCH       VARCHAR2(1)
 PLSQL_DEBUG                VARCHAR2(1)
 LOAD_OPTIMIZER_STATS       VARCHAR2(1)
 ACL_MISMATCH               VARCHAR2(1)
 FLASHBACK_ARCHIVE_MISMATCH VARCHAR2(1)
 LOCK_USER_SCHEMA_FAILED    VARCHAR2(1)
 REMOTE_MAPPING_MISMATCH    VARCHAR2(1)
60 LOAD_RUNTIME_HEAP_FAILED VARCHAR2(1)
 HASH_MATCH_FAILED          VARCHAR2(1)
 PURGED_CURSOR              VARCHAR2(1)
 BIND_LENGTH_UPGRADEABLE    VARCHAR2(1)
The 49th column is, "OPTIMIZER_MODE_MISMATCH". So the child cursor was created because we set the optimizer mode to something different than when the initial child cursor was created.

For the more hard core DBAs, a library cache dump is even more satisfying!
SQL> alter session set max_dump_file_size=unlimited;

Session altered.

SQL> alter session set events 'immediate trace name library_cache level 10';

Session altered.

SQL> select tracefile from v$process where 
  2  addr=(select paddr from v$session where sid=(select sid from v$mystat where rownum=1));

TRACEFILE
-----------------------------------------------------------------------------
/home/oracle/base/diag/rdbms/prod23/prod23/trace/prod23_ora_23631.trc
Looking in trace file and searching for "object_id=123", I found this (and removed most of the contents):
Bucket: #=70160 Mutex=a8958e00(0, 20, 0, 6)
  LibraryHandle:  Address=b3a6be50 Hash=e40b1210 LockMode=0 PinMode=0 LoadLockMode=0 Status=VALD
    ObjectName:  Name=select count(*) from dba_objects where object_id=123

      FullHashValue=158e40ec5d962fc9bea3b65ae40b1210 Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=3825930768 OwnerIdn=5
    LibraryObject:  Address=9fea1c60 HeapMask=0000-0001-0001 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000]
      DataBlocks:
        Block:  #='0' name=PCUR^e40b1210 pins=0 Change=NONE
          Heap=b3ae3c38 Pointer=9fea1d48 Extent=9fea1c30 Flags=I/-/P/A/-/-
          FreedLocation=0 Alloc=1.515625 Size=4.000000 LoadTime=8700560730
      ChildTable:  size='16'
        Child:  id='0' Table=9fea2b60 Reference=9fea25b0 Handle=b3aaf320
        Child:  id='1' Table=9fea2b60 Reference=9fea28d0 Handle=b393a718
Do you see the above two child cursor entries? I find seeing this in the LC dump very satisfying. Much more so than looking at a v$ view.

While the "optimizer mode mismatch" is an interesting and very easy way to learn about child cursors, it's not likely to be a performance issue.

The BIND_EQUIV_FAILURE situation

While trying to cause the infamous bind mismatch reason for creating child cursors, I stumbled across the bind equivalent failure situation. This occurs, in part, when a child cursor cursor was created with a column value selectivity of x and now a new column value is supplied with a very different selectivity. For example, suppose there are 99 rows where column c1 equals "F" and 1 row where column C1 equals "M". If the initial query is something like "...where c1=:b1" where :b1 is "F" and then another execution has the bind variable :b1 set to "M" a bind_equiv_failure can result... but not always based on my testing. (Hope that made sense.)

Here's how I caused this:
SQL> set tab off
SQL> set linesize 180
SQL> drop table bogus;
SQL> create table bogus as select * from dba_objects;
SQL> exec DBMS_STATS.GATHER_TABLE_STATS('SYSTEM', 'BOGUS', METHOD_OPT => 'FOR COLUMNS SIZE 3 OBJECT_TYPE');
SQL> select object_type,count(*) from bogus group by object_type order by 2 desc;
OBJECT_TYPE           COUNT(*)
------------------- ----------
SYNONYM                  27800
JAVA CLASS               22920
VIEW                      5074
INDEX                     4156
TABLE                     3084
TYPE                      2783
PACKAGE                   1313
PACKAGE BODY              1253
LOB                        900
JAVA RESOURCE              834
TRIGGER                    613
JAVA DATA                  328
FUNCTION                   303
INDEX PARTITION            300
TYPE BODY                  237
SEQUENCE                   232
LIBRARY                    183
PROCEDURE                  160
TABLE PARTITION            144
OPERATOR                    55
XML SCHEMA                  51
QUEUE                       39
CONSUMER GROUP              25
RULE SET                    23
PROGRAM                     19
JOB                         14
EVALUATION CONTEXT          14
JOB CLASS                   13
CLUSTER                     10
RESOURCE PLAN               10
UNDEFINED                    9
DIRECTORY                    9
INDEXTYPE                    9
WINDOW                       9
CONTEXT                      7
DIMENSION                    5
SCHEDULER GROUP              4
SCHEDULE                     3
MATERIALIZED VIEW            3
JAVA SOURCE                  2
DESTINATION                  2
LOB PARTITION                1
EDITION                      1
RULE                         1

44 rows selected.

SQL> ALTER SYSTEM FLUSH SHARED_POOL;

System altered.

SQL> variable obj_type_var varchar2(100);
SQL> exec :obj_type_var := 'SYNONYM';

PL/SQL procedure successfully completed.

SQL> select count(*), max(object_id) from bogus where object_type = :obj_type_var ;

  COUNT(*) MAX(OBJECT_ID)
---------- --------------
     27800          74718

SQL> select sql_id, plan_hash_value, child_number, executions
  2  from   v$sql
where  sql_text like 'select count%bogus%';

  3  
SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS
------------- --------------- ------------ ----------
b4bbq6byn8dsd      3200977001            0          1
There are 27800 rows where the object_type is "SYNONYM". Now let's try and force the creation of a second child cursor by setting the bind variable :obj_type_var to something where there are no rows! Like this:
SQL> exec :obj_type_var := 1;

PL/SQL procedure successfully completed.

SQL> select count(*), max(object_id) from bogus where object_type = :obj_type_var ;

  COUNT(*) MAX(OBJECT_ID)
---------- --------------
         0

SQL> select sql_id, plan_hash_value, child_number, executions
  2  from   v$sql
  3  where  sql_text like 'select count%bogus%';

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS
------------- --------------- ------------ ----------
b4bbq6byn8dsd      3200977001            0          2
Nope... Didn't happen. Let's try again with the bind variable value of "JOB" with which there are 14 rows.
SQL> exec :obj_type_var := 'JOB';

PL/SQL procedure successfully completed.

SQL> select count(*), max(object_id) from bogus where object_type = :obj_type_var ;

  COUNT(*) MAX(OBJECT_ID)
---------- --------------
        14          57372

SQL> select sql_id, plan_hash_value, child_number, executions
  2  from   v$sql
  3  where  sql_text like 'select count%bogus%';

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS
------------- --------------- ------------ ----------
b4bbq6byn8dsd      3200977001            0          2
b4bbq6byn8dsd      3200977001            1          1
Great! Another child cursor was created. (Interestingly, if I switch the order of the bind variable settings (JOB vs 1), the exact same thing will occur.) Why did Oracle decide to create another the child cursor? Let's check using v$sql_shared_cursor. If you count starting at the N's you'll notice the Y occurs in child cursor 1 at position 19.
SQL> select * from v$sql_shared_cursor where sql_id='b4bbq6byn8dsd';

SQL_ID        ADDRESS          CHILD_ADDRESS    CHILD_NUMBER U S O O S L F E B P I S T A B D L T B I I R L I O E M U T N F A I T D
------------- ---------------- ---------------- ------------ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
L D B P C S C P T M B M R O P M F L P L A F L R L H P B
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
b4bbq6byn8dsd 00000002B7D2F810 00000002B7B98410            0 N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N
N N N N N N N N N N N N N N N N N N N Y N N N N N N N N

b4bbq6byn8dsd 00000002B7D2F810 00000002B7C4C210            1 N N N N N N N N N N N N N N N N N N Y N N N N N N N N N N N N N N N N
N N N N N N N N N N N N N N N N N N N N N N N N N N N N
Looking at the describe of v$sql_shared_cursor above, position 19 is "BIND_EQUIV_FAILURE". So, this is yet another way to create child cursors. But still I suspect this is unlikely to cause crazy poor performance.

Since counting N's is actually pretty irritating, here is another simpler query.
SQL> select sql_id, child_number, bind_equiv_failure
  2  from   v$sql_shared_cursor 
  3  where  sql_id='b4bbq6byn8dsd';

SQL_ID       CHILD_NUMBER B
------------- ------------ -
b4bbq6byn8dsd            0 N
b4bbq6byn8dsd            1 Y

The BIND_LENGTH_UPGRADEABLE situation

When trying to force a bind mismatch, I also stumbled across what's called a bind length upgradeable. There is a lot on the internet about this, so I won't get into the details. But I will say this: When a child cursor is created Oracle appears to pre-allocate some internal bind variable value space based on the variable definition. If the existing bind variable (variable b1 varchar2(5)) is redefined (variable b1 varchar2(33)) and the space is significantly larger (and that's what the other blog entries get into), a new child cursor will be created. Interestingly, the trigger lengths are around 1, 33, 129, and 2001. Here is a simple example of this:
SQL> alter system flush shared_pool;

System altered.

SQL> variable b1 varchar2(5);
SQL> exec :b1 := 'abc';

PL/SQL procedure successfully completed.

SQL> update bogus set owner='zzz' where object_type = :b1;

0 rows updated.

SQL> select sql_id, plan_hash_value, child_number, executions
  2  from   v$sql
  3  where  sql_text like 'update bogus set%';

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS
------------- --------------- ------------ ----------
bcbtftw6dur97      3773790677            0          1
Now I will redefine bind variable :b1 but notice the contents is exactly the same ("abc").
SQL> variable b1 varchar2(33);
SQL> exec :b1 := 'abc';

PL/SQL procedure successfully completed.

SQL> update bogus set owner='zzz' where object_type = :b1;

0 rows updated.

SQL> select sql_id, plan_hash_value, child_number, executions
  2  from   v$sql
  3  where  sql_text like 'update bogus set%';

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS
------------- --------------- ------------ ----------
bcbtftw6dur97      3773790677            0          1
bcbtftw6dur97      3773790677            1          1

SQL> select sql_id, child_number, BIND_LENGTH_UPGRADEABLE
  2  from   v$sql_shared_cursor 
  3  where  sql_id='bcbtftw6dur97';

SQL_ID       CHILD_NUMBER B
------------- ------------ -
bcbtftw6dur97            0 N
bcbtftw6dur97            1 Y
I can see this situation perhaps could cause a problem, if you had many columns and different programs with the same exact SQL but defining the variable many different lengths. While this may seem like a potential problem, it's not as bad as you might think. Once a variable size has been increased, resetting it smaller does not appear to shrink it back down. This will eliminate the exponential situation that we'll see in the bind variable mismatch situation below. To prove this non-exponential child cursor growth point, consider the code below:
outfile=doit.sql
rm $outfile

echo "set tab off" >> $outfile
echo "set linesize 130" >> $outfile
echo "ALTER SYSTEM FLUSH SHARED_POOL;" >> $outfile

for b1 in 1 33 129 2001
do
  for b2 in 1 33 129 2001
  do
        echo "variable b1 varchar2($b1);" >> $outfile
        echo "variable b2 varchar2($b2);" >> $outfile
        echo "exec :b1 := 'abc';" >> $outfile
        echo "exec :b2 := 'abc';" >> $outfile
        echo "update bogus set owner='zzz' where object_type = :b1 or object_type = :b2;" >> $outfile
  done
done
echo ""
echo "select sql_id, plan_hash_value, child_number, executions from v\$sql
where sql_text like 'update bogus set%';" >> $outfile
echo "exit;" >> $outfile
echo "------------------------------------------------------"
sqlplus system/manager @$outfile
If you run this, this is what you'll see:
SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS
------------- --------------- ------------ ----------
9kxv8qtnq8mmw      3773790677            0          1
9kxv8qtnq8mmw      3773790677            1          1
9kxv8qtnq8mmw      3773790677            2          1
9kxv8qtnq8mmw      3773790677            3          1
9kxv8qtnq8mmw      3773790677            4          4
9kxv8qtnq8mmw      3773790677            5          4
9kxv8qtnq8mmw      3773790677            6          4

7 rows selected.
Most people would incorrectly assume (as I did) there would be 2^4 or 4^2 child cursors. So it turns out that the bind length upgradable situation is not likely to cause a severe performance problem.

The Bind Mismatch Situation

Over the past few years, there has been a lot of talk about a SQL statement cursor having hundreds or even thousands of child cursors. Why? Because performance comes to a standstill and sometimes the instance has to be restarted. Yes, it's that bad.

Simply stated, a bind variable mismatch occurs when a bind variable is defined one way and then another, but the "another" is not different enough to cause an error yet it is different enough to cause Oracle to create an additional child cursor.

This may seen like an easy situation to cause, but it is not. PLSQL and Perl are very good at changing or adapting a bind variable type to the currently existing definition. What I'm told is that with Java, it's actually very easy to cause the problem. In fact, when I talk with DBAs who have experienced mutex contention because of hundreds of child cursors, sloppy Java coding always seems to be somehow involved.

But why a performance problem?

To understand why a mismatch can cause severe performance problem, we need to understand what the server process is doing. To parse a SQL statement the server process must find, and if not found then create, a suitable child cursor. But here is the issue: When a SQL statement is hashed, it is hashed to the parent cursor (mutex in shared mode required) and then it appears each existing child cursor is sequentially accessed trying to find a suitable match (mutex in shared mode required). [Note: my next blog entry will address this "it appears" issue specifically.] If a suitable child cursor is not found, then a new child cursor is created, which requires parent and child mutex access in exclusive mode. So if there are hundreds or thousands of child cursors involved you can see that even with very fast CPU cores a popular SQL statement could result in a nightmarish performance situation.

Here is how to quickly demonstrate a bind variable mismatch. Save the below code snippet to t1.sql:
select count(*), max(object_id) 
from   bogus 
where  object_id = :obj_id1_var or object_id = :obj_id2_var ;
Now in SQL*Plus do this:
ALTER SYSTEM FLUSH SHARED_POOL;
variable obj_id1_var nchar(5);
exec :obj_id1_var := '1';
variable obj_id2_var nchar(5);
exec :obj_id2_var := 1;
@t1.sql
Now run this:
SQL> select sql_id,plan_hash_value,child_number, executions "execs"
  2  from   v$sql
  3  where  sql_text like 'select count%max%bogus%';

SQL_ID       PLAN_HASH_VALUE CHILD_NUMBER execs
------------- --------------- ------------ ----------
f5qpt4fmb4hkk      3200977001            0          1
Just what we expected; parent cursor with a single child cursor. Now lets redefine bind variable :obj_id2_var, run the select statement, and check the child cursor situation like this:
variable obj_id2_var number;
exec :obj_id2_var := 1;
@t1.sql

SQL> select sql_id,plan_hash_value,child_number, executions "execs"
  2  from   v$sql
  3  where  sql_text like 'select count%max%bogus%';

SQL_ID       PLAN_HASH_VALUE CHILD_NUMBER execs
------------- --------------- ------------ ----------
f5qpt4fmb4hkk      3200977001            0          1
f5qpt4fmb4hkk      3200977001            1          1
Why was the new child cursor created?
SQL>select * from v$sql_shared_cursor where sql_id='f5qpt4fmb4hkk';
 
SQL_ID       ADDRESS        CHILD_ADDRESS CHILD_NUMBER U S O O S L F E B P I S T A B D L T B I I R L I O E M U T N F
------------- ---------------- ---------------- ------------ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
A I T D L D B P C S C P T M B M R O P M F L P L A F L R L H P B
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
f5qpt4fmb4hkk 00000002B3B30280 00000002B3A93F18     0 N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N
N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N

f5qpt4fmb4hkk 00000002B3B30280 00000002B3AA80D8     1 N N N N N N N N N N N N N N Y N N N N N N N N N N N N N N N N
N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N
Notice the "Y" is at position 15 which, referring back near the top of the blog, is the "BIND_MISMATCH" reason for the child cursor creation!

Here is a simpler query:
SQL> select sql_id, child_number, BIND_MISMATCH
  2  from   v$sql_shared_cursor 
  3  where  sql_id='f5qpt4fmb4hkk';

SQL_ID       CHILD_NUMBER B
------------- ------------ -
f5qpt4fmb4hkk            0 N
f5qpt4fmb4hkk            1 Y
This clearly does NOT demonstrate a problem. It simply demonstrates that in this closely orchestrated situation I was able to cause two child cursors by changing the bind variable definition enough (not it's length as with BIND_LENGTH_UPGRADEABLE situation) to cause another child cursor creation but not changed enough to cause a SQL statement failure.

The real question is, could this situation become an exponential situation? For example, if there were 3 bind variable could the situation have the potential of causing a 2^3 child cursors. The only way to really know this is create a very simple and very repeatable experiment that anyone could run for themselves... and that's what I've done below.

Here is what I did and what you can do. I basically automated the steps I used above to create a bind mismatch resulting in two child cursors and scaled it up to 5 bind variables, not just two.

Here what I did: As the Oracle Linux user, I copy and pasted the below text on the Linux command line:
outfile=doit.sql
rm $outfile

echo "set tab off" >> $outfile
echo "set linesize 130" >> $outfile
echo "ALTER SYSTEM FLUSH SHARED_POOL;" >> $outfile

for b1 in number nchar\(5\)
do
  for b2 in number nchar\(5\)
  do
    for b3 in number nchar\(5\)
    do
      for b4 in number nchar\(5\)
      do
        for b5 in number nchar\(5\)
        do
          echo "variable b1 $b1;" >> $outfile
          echo "exec :b1 := '1';" >> $outfile
          echo "variable b2 $b2;" >> $outfile
          echo "exec :b2 := '1';" >> $outfile
          echo "variable b3 $b3;" >> $outfile
          echo "exec :b3 := '1';" >> $outfile
          echo "variable b4 $b4;" >> $outfile
          echo "exec :b4 := '1';" >> $outfile
          echo "variable b5 $b5;" >> $outfile
          echo "exec :b5 := '1';" >> $outfile
echo "
select count(*), max(object_id) 
from   bogus 
where  object_id = :b1 or object_id = :b2 or object_id = :b3 or object_id = :b4 or object_id = :b5;
" >> $outfile
        done
      done
    done
  done
done

echo ""
echo "select sql_id,plan_hash_value,child_number, executions from v\$sql where sql_text like 'select count%bogus%';" >> $outfile
echo "exit;" >> $outfile
sqlplus system/manager @$outfile
The final select statement will look something like this:
SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS
------------- --------------- ------------ ----------
acxkxa5d11ffz      3200977001            0          1
acxkxa5d11ffz      3200977001            1          1
. . . 
acxkxa5d11ffz      3200977001           30          1
acxkxa5d11ffz      3200977001           31          1

32 rows selected.
Is that cool or what! But it gets better because look at why Oracle created the 32 child cursors.
SQL> select sql_id,address,child_number,child_address,bind_mismatch
  2  from   v$sql_shared_cursor
  3  where  sql_id='acxkxa5d11ffz';

SQL_ID                 ADDRESS CHILD_NUMBER CHILD_ADDRESS    B
------------- ---------------- ------------ ---------------- -
acxkxa5d11ffz 00000002B7C57E60            0 00000002B75F7448 N
acxkxa5d11ffz 00000002B7C57E60            1 00000002B7B39998 Y
. . .
acxkxa5d11ffz 00000002B7C57E60           30 00000002B7C400C0 Y
acxkxa5d11ffz 00000002B7C57E60           31 00000002B7C498E0 Y

32 rows selected.
The potentially performance killer point here is with only 5 bind variables, 32 child cursors were created. Let's put a formula to this.

child cursors = bind variable definitions ^ number of bind variables

Which in our case was 2^5 which equals 32. If you're getting nervous, I understand. Anything to the power of something should catch our attention.

Scaling up to production...

There are three points I want to highlight:

1. While 32 child cursors is not likely to cause a performance problem, it could. In very high concurrency situations the shared mutex acquisition and child cursor scanning could result in CPU subsystem saturation and mutex contention.

2. Based on my experiments and the difficulty in creating bind mismatch using PLSQL or Perl, I will always ask if there is Java code involved in the mutex constrained application.

3. A more likely and very serious performance situation is when there is a SQL statement with perhaps 10, 20, 30, or perhaps 40 bind variables written in Java where the SQL is the same but the bind variable definition is inconsistent resulting in only 2 variations. This means where could be up to 2^10 , 2^20 , 2^30 or 2^40 that is, 1024, 1048576, 1073741824, or 1099511627776 child cursors! Unless Oracle limits the number of child cursors (which I'm told it does in certain Oracle 11g versions to 1024 child cursors), you're in serious trouble.

What does this mean for us?

There are many ways multiple child cursors are created. If you have mutex related contention, check the version count of your active SQL statements (i.e., being executed) AND determine why Oracle is creating the child cursors. Don't assume it's a bind mismatch, check in v$sql_shared_cursor.

This begs the questions: How many child cursors are OK? And are there other factors, like concurrency? And then what are our options? These questions are what my next posting will focus on.

But before I could study this I needed a simple, repeatable, and verifiable way to create lots and lots of child cursors. As usual I learned more than I expected and I hope enjoyed this posting!

Thanks for reading!

Craig.

Thursday, August 30, 2012

Variable Load Impact on Hard Parsing

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

Over 1100 Times Slower is... Bad!

In my previous posting I experimentally demonstrated hard parsing was not just bad, but really bad. In fact, my non-load test showed the initial parse (hard parse) was 180 times longer than the second parse (soft parse)! Also, once session cache cursors kicked in after the third parse, parse time dropped over 1100 times when compared to the initial hard parse! While that's useful (and interesting), rarely is parsing an issue in a no-load situation, that is, single session situation. That brings us to this posting...

This posting is focused on parse time when there is an intense load that completely consumes all the available CPU combined with massive library cache mutex related contention.

I should also note that the load placed on the system is not from the SQL statement under observation but from other SQL statements (detailed below).

Review Parsing at Various Levels of Intensity

From a performance perspective, I categorize parsing as follows:

No parse. There is actually a "no parse." I have not actually tested this, so I'll say nothing more about it. (I feel another blog posting...)

Softer parse. Question: Does the statement exist in my session cache cursor memory? If so, where is the cursor in the library cache? (i.e., What's the address?) This does not require a LC latch or mutex because the server process's private memory is being accessed, not the shared LC searching (i.e., hashing) structure.  If not, then continue to Soft Parse.

Soft parse. Question: Does the statement already exist in the library cache? If so, where is it? This requires LC latch and/or mutex access. If not, continue to Hard parse.

Hard parse. No questions, just get to it! But there are more steps involved. Allocate shared pool memory (requires shared pool sub-pool latch(es), create the cursor parent/child (requires mutex and/or LC latch), bring needed information into the shared pool (think: recursive SQL) (requires mutex and/or LC latches), and create all the associated memory links (requires mutex and/or LC latches). Wow, that's a lot of stuff and massive serialization control.

As one would expect, my previous posting documented that my experiments showed hard parsing elapsed time took 180 times longer than soft parsing and over 1100 times longer than softer parsing. But again, that was in a no-load situation in a pretty serial-like situation. This experiment and posting interjects competition for serialization control structures and for CPU resources.

Experimental Setup

In summary, I gathered the elapsed parse time when a statement is executed; once, twice, ... seven times. However, I also injected an increasing load so we could see the impact of a load on the initial hard parse (execution one) and the soft-type parses.

The load I placed on the system was very simple. A number of sessions ran the below script:
def sleepT=&1
declare
  i number;
  cursor cur_one is
    select count(*) from dba_objects where 1=0;
begin
  loop
    for i in 0..9999
    loop
      open cur_one;
      close cur_one;
    end loop;
    sys.dbms_lock.sleep(exp(&sleepT*dbms_random.normal));
  end loop;
end;
/
The strange looking sleep time determination above creates a much more realistic sleep/think time. I detail this in my May 2012 posting entitled, Creating Realistic Think Times for Research Projects.

Opening the cursor forces a parse to occur requiring serialization control via mutexes (cursor: pin S). Just to double check, I SQL traced a very simple open and close, clearly revealing there was indeed parse time related to the open and close.

As I increased the number of sessions concurrently running the above script, CPU consumption increased, CPU utilization increased, and finally mutex related wait time began to appear. Once the experimental load reached a load level of 10, the timing situation looked like this:


The above script rtpctx.sql comes from my free OSM toolkit. The "% RT" column is the percentage of time related to the total CPU and non-idle wait time over the interval (38 seconds in this situation). The "% WT" is the percentage of time related to all the non-idle wait time over the interval. You can see the wait event "cursor: pin S" was 99.36% of all the non-idle wait time and 63% of all the CPU plus wait time. And as you might expect, the 6 CPU cores where averaging 100% busy!

The experiment was run at 10 different load levels, signified as level 3 to level 13. (At load level 1 and 2, nothing much was happening.) The load level is a relative term. Each load unit is related to 7 concurrent sessions each with a different sleep time (see code above).

About the SQL under observations: At each load level a similar yet unique SQL statement (different objects but otherwise exactly the same SQL) was run seven times; the first is the hard parse, the second is the soft parse... At each load level and for each of the seven execution times, I took 100 samples. Not 30, but 100! I gathered 30 the first time, but when I created a few histograms they looked really interesting, so I decided to gather more samples. Because the SQL statements execute so quickly, gathering all the data only took around an hour.

The experimental hardware and software is as follows: Oracle 11.2.0.1.0 Enterprise Edition running on (uname -a): Linux sixcore 2.6.32-300.3.1.el6uek.x86_64 #1 SMP Fri Dec 9 18:57:35 EST 2011 x86_64 x86_64 x86_64 GNU/Linux.

Experimental Results

This experiment generated interesting plots, histograms, and numerics. Everything has been placed on-line for you spend countless hours analyzing. However, I will highlight just a few things below.

If you want to look at the experimental details; including all the scripts, output including the data, plots, Mathematica notepad, etc. click here (3.6 MB).

I'll report my findings through a series of questions.

Does hard parse time increase as the load increases?

Yes, but it's even more interesting than you might think! First, remember the load on the system was competing for CPU resources and hard parsing requires CPU resources. If there was an IO bottleneck, the hard parse time might not increase as the load increases. Here is a plot which clearly shows as the load increases, so does the hard parse time.
Each dot (and many are stacked) represents one of the 100 samples taken at each load... for the initial execution, that is, the hard parse. For example, at load 3 while it may look like there are only perhaps five samples, there are actually 100... not a relatively large variance or data spread.

The above plot clearly shows two things:

1. Hard parse elapsed time increased as the load increased. Once the load level hit 10, CPU utilization averaged around 100% and 99% of the wait time was cursor: pin S. Here are the median elapsed parse times (ms) starting at load level 3: 24.5 , 25.5 ,  31.3 , 55.3 , 81.0 , 76.4 , 132.1 , 140.0 , 152.0 , 134.2 , 190.5 .

2. Hard parse elapsed time variance increased as the load increased. The plot above clearly shows the spread of sample time increases as the load increases. It's actually much more than I expected. Keep in mind, that because the samples can overlap, looking at the above plot it's not possible to get a good understanding of the distribution. For that, we need a histogram as I show below.
The two tall above smoothed histograms are the hard parse at Load 3 and Load 4. They are so tall because their spread is relatively small so most of the samples end up being stacked on each other. Load 5 peaks far below Load 3 and Load 4 but does have a peak (i.e., mode). Load 5 and Load 6 are tremendously spread out and are barely visible. The spread was massive for Loads 7 to 13, I did not show them in the histogram. Large data spreads produce inconsistent performance... something that users do not enjoy.

I also wanted to know if the relationship between hard parse elapsed time to load is linear. All the needed data is in the Mathematica notepad, but looking at the plot below we can see the relationship appears to be linear. 

The relationship between hard parse elapsed time and system load appears to be linear.

Certainly there is not an exponential-type bend to the curve. I incorrectly assumed that as the load increased the hard parse time would increasingly increase resulting in a non-linear relationship between hard parse times and load. Here's the plot.

Is there is significant difference between hard, soft, and softer parse times at the various loads?

Yes. Below are the plots clearly demonstrating this! Going left to right and top to bottom, the upper left plot are the elapsed parse times for the initial parse (hard) at various load levels. And yes, this is the same chart at the one above. The bottom right plot is the seventh parse (very soft) at various load levels.
Just by looking at the above plots we can easily see:

1. The hard parse is always much longer than a soft parse, regardless of the load level. In fact, here are the median hard parse times (ms) at load level six: 48.7 (hard), 0.142 (soft), 0.026 (soft), 0.020 (softer), 0.020 (softer), 0.020 (softer), 0.020 (softer). At load level six, the difference between the hard (first execution) and soft parse (second execution) is 348 times! (348=486.8ms/1.4ms) In my previous posting related to a no-load system, the difference hard parsing was 180 times longer than soft parsing (second execution), so a load definitely accelerates the situation... about double in this situation.

2. At load level six, the difference between the hard parse and the softer parse (execution 4, lower left plot) is 2435 times! (2435=48.7/0.020)  In my previous posting related to a no-load system, the hard parsing was 1129 times longer than softer parsing (fourth execution), so a load definitely accelerates the situation... about double in this situation.

When does the elapsed parse time stabilize?

The fourth execution is about the best Oracle can do in terms of elapsed parse time. While there are significant drops from the first (hard) to the second (soft) parse and also from the second (soft) and third (soft) parses, once the fourth execution (softer) occurs the parse time stabilizes. This is exactly what we saw during the no-load experiment and demonstrates the the session cached cursor performance is the best Oracle can do. This is why we don't continue to see a parse time drop when comparing the fourth to the fifth, sixth, and seventh executions.
In the above plot, the third execution (soft parse) elapsed times are the top round blue circles. The fourth to seventh executions are clumped together fairly consistently below the third execution soft parses.

Is soft(er) parsing elapsed time immune to the system load?

Yes, in my specific experiment. If you look at the above plot and one above it (with the seven plots), you'll notice that the elapsed parse time is very consistent  and did not increase much (it does a little) as the load increased. The obvious exception is the hard parse. This is another example of why we want soft parsing to occur and just how performance harming hard parsing can be.

Summary

While there are lots of insights we can glean from this experiment, to me these are the most important:

1. Hard parsing elapsed time increases as the (CPU) system load increases. That is, it gets worse and worse! While the relationship is linear, in my experiment hard parse times increased from around 25 ms to 190 ms. That is around a 7.6X increase or if you prefer, a 760% increase!

2. By sharing SQL, Oracle is able to soft and softer parse dramatically decreasing parse time, reducing CPU consumption, and reducing memory serialization control issues (e.g., mutex wait events). At an intense CPU workload (load 6), the parse time drop from a hard parse (first execution) to a softer parse (fourth execution) was around 2435 times! Note: There are rare and very specific situations where we want a similar SQL statement to be slightly textually different reducing the likelihood of many sessions executing the same exact SQL statement. I will focus on this in a future blog post.

3. Once a statement is executed four times, parse time is unlikely to decrease any further and is not materially affected by the CPU load.

In my next parsing related posting, I am going to delve into the situation where hundreds and hundreds of child cursors can be created causing massive serialization and parse time issues. If you're not familiar with this situation it can seem obscure and unusual. But it occurs more often than people think and can be a very serious performance issue.

Thanks for reading!

Craig.

Tuesday, August 7, 2012

What I do in my spare time...

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

This is very unusual indeed. This post is not about Oracle, but about one of the things that I do besides Oracle performance analysis.

Many people ask me what I do besides Oracle and I reply with a long list of things like coffee roasting, beekeeping, spiritually focused youth work, and the list goes on (just ask my wife).

One of the things I love is riding my motorcycle and this is what this post is about. I just returned from a 14 day trip from where I live (Portland, OR USA area) riding down south about 1000 miles to a big motorcycle race (MotoGP at Laguna Seca) and then back home. I took some video and pictures and made comments along the way. I put it all together into a three part series and posted it on youtube for anyone to see.

If you want to take a look and see "the real Craig" (not even one mention of "oracle" or "orapub"), here are the direct links:


Monday, July 16, 2012

How Bad Is Oracle SQL Hard Parsing?

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

Could Someone Please Quantify, "Bad"...

It seems like everywhere I turn someone says, "You don't want to hard parse. It's bad." But how bad is the bad? And why is it bad? And what can I do about it? That's what I want to turn my attention to in the next few postings. This will lead us down the path to some very interesting Oracle performance topics; including issues relating to hundreds or thousands of child cursors, mutexes, etc. So it should be exciting!

Visualizing the Library Cache (LC)

Before we look at the experimental evidence (you know there's got to be some in this posting), we need to review just what occurs during parsing. While there are many ways to abstract and explain parsing, in summary: parsing takes raw text and makes it ready to be shared and executed. Wow, that is a massive summary for sure!
Cool picture, eh? This is a highly abstracted view of the library cache. Cursors are in blue, child cursors in read, and in the two massive clusters are the tables are white, which are being referrenced/shared by many child cursors. If you want to see more visual pictures of the library cache, check out my January, 2011 posting entitled, Library Cache Visualization Tool: How To. It's also great introduction to the LC. Plus the tool I use to create the jaw-dropping visualizations is freely available on my web-site.

Hashing to Find Something

Key to understanding parsing is to understand the Oracle LC search structure/method. Regardless of whether latches, mutexes, or both are used, Oracle uses a hashing structure to find things in the LC; cursors, child cursors, tables, sequences, views, packages, etc. If you do a search for "hash" on my blog, you'll see a number of references.

Essentially, hashing can be an extremely fast way to find something. Picture the traditional library and how you find a book using the card catalog and you have a very good conceptual view of Oracle's LC. Also, if you understand Oracle's buffer cache, cache buffer chain structure you have a good conceptual view of Oracle's LC. For a wonderful and relatively short explanation of LC hashing, look in the "It's Searched" section in my January 2011 posting entitled, Library Cache Visualization Tool: How To.
Above is a abstracted visualization of the library cache with the focus on the searching, hence the hashing structure. Here's a quick summary of how the searching works. Suppose a server process wants to know if "select status from employee" is already in the library cache. As I mention below, if the cursor is not found in the server process's session cursor cache, the SQL text will be hashed to one of the far left green "LC xB" (library cache chain number x beginning), the appropriate serialization structure retrieved (think: latch or mutex), and then the associated chain is searched until the cursor is found or the end of the chain is reached (think: "LC xE" is reached). So if we are looking for "CSR 2" will be hashed to LC chain "LC 1B" and then look for "CSR 2" on this chain. Once we hit "CSR 2" we have location information about the cursor and then its child cursor(s). But this posting is about the impact of hard parsing, so let's move on...

Parsing at Various Levels of Intensity

From a performance perspective, I tend to focus on these steps:

No parse. There is actually a "no parse." I have not actually tested this, so I'll say nothing more about it. (I feel another blog posting...)

Softer parse. Question: Does the statement exist in my session cache cursor memory? If so, where is the cursor in the library cache? (i.e., What's the address?) This does not require a LC latch or mutex because the server process's private memory is being accessed, not the shared LC searching (i.e., hashing) structure.  If not, then continue to Soft Parse.

Soft parse. Question: Does the statement already exist in the library cache? If so, where is it? This requires LC latch and/or mutex access. If not, continue to Hard parse.

Hard parse. No questions, just get to it! But there are more steps involved. Allocate shared pool memory (requires shared pool sub-pool latch(es), create the cursor parent/child (requires mutex and/or LC latch), bring needed information into the shared pool (think: recursive SQL) (requires mutex and/or LC latches), and create all the associated memory links (requires mutex and/or LC latches). Wow, that's a lot of stuff and massive serialization control.

One would think that hard parsing would require significantly more resources and control than soft and softer parsing. But how much "significantly more" and is this a true performance impact, and does the impact change as concurrency increases? I am going to save the "as concurrency increases" for the next posting. It's time to look at some experimental evidence, so read on...

Experimental Setup

In summary, I gathered parse time when a statement is repeatedly executed. The first execution is clearly a hard parse and the subsequent parses will either be soft or softer. By comparing their parse times, I can tell the difference between hard and soft(er) parsing.

While the experimental setup may seem simple, it's rarely like that... In fact, since I'm going to be doing considerable research related to parsing over the next few months (and will be speaking about this in my to-be-announced one day Performance Research Seminar), I took extra time to develop a nice process and facility to gather the experimental data. Kinda like when real scientists look for dark matter... a little less intense I guess.

I originally gathered the parse times from v$sess_time_model. But I was not convinced the view of data was fresh enough. I use v$sys_time_model a lot when gathering session and system CPU consumption, but I'm usually either looking at the system over multiple minutes and not focusing only on just parse time. Because the parse time is relatively small (perhaps micro seconds), I'm looking at a specific session for a single SQL statement, and I'm not confident the view data is current, I opted to trace the session. But that meant I had more work to do. Because I want to easily repeat the experiment and use this setup in the future, I created the tools to determine the session's trace file, get the trace file, parse out the parse time, and log it.

There is much I could say about this process and the associated tools, but I won't. If you want to look at the experimental details, here are links to the various scripts.

Main experimental file (notes, code snippets, etc.) here.
Driving shell script, here.
Perl script to parse out parse time from an Oracle trace file, here.

The experimental hardware and software is as follows: Oracle 11.2.0.1.0 Enterprise Edition running on (uname -a): Linux sixcore 2.6.32-300.3.1.el6uek.x86_64 #1 SMP Fri Dec 9 18:57:35 EST 2011 x86_64 x86_64 x86_64 GNU/Linux.

But designing and creating the experiment was not that easy or simple...

Experimental Design Challenges

This happens every time! I think I've narrowed the experimental scope down to something achievable and simple... but then once I get into it, all sorts of questions and issues come up. And these issues have to be addressed or the experiment really won't mean anything or worse mislead me and you. But the good news is, these challenges always enlighten me and force me to deal with difficult issues from a statistical, experimental, coding, and Oracle internals perspective.

I will mention a few interesting challenges I had to overcome.

- SQL statement must contain a decent amount of text. Part of the soft parse is creating a hash value based, in part at least, on the actual SQL text. I wanted something a little more realistic, so I created 1000+ character long SQL statements.

- SQL must reference a multiple objects. Again, "select * from dual" just wouldn't cut it, so I created over 50 tables, each two indexes, and used them in the SQL. Again, I wanted to ensure some realistic soft parsing occurred.

- SQL statement must be unique to insure a hard parse. I had to create textually unique SQL or a soft parse would occur when I wanted a hard parse.

- SQL statements must be similar to compare fairly. If I made the SQL very different, then I couldn't gather a bunch of samples to get a fair comparison. My solution was to use the same SQL except for bind filtering and table name. But the tables are exactly the same and contain no rows. As you'll from the results, the times from one SQL statement to the next are very similar.

- Flushing the shared pool. One way to ensure a hard parse is to flush the shared pool. However, the next statement forces Oracle to bring in all sorts of optimization information, like histograms, etc. I think much of this information is used for both my SQL and other SQL. (This is easily seen as the resulting trace file includes lots of recursive SQL.) I felt this was not a fare test because in real production systems the hard parse issue is not focused on after flushing the shared pool, but during normal production operation. So I did not flush the shared pool, but relied on unique (but similar) SQL statements to ensure a hard parse.

- Gathering only the parse time. I couldn't simply gather the time before and after the SQL statement ran (i.e., elapsed time). I needed a specific component of the statement's elapsed time; the parse time. To do this I had two basic options; use the parse time columns from v$sess_time_model or SQL trace and grab from the trace file the parse time for the specific statement. For reasons I mentioned above, I opted for the trace file approach.

- Parse time; CPU and/or/versus elapsed. Keeping in mind from an Oracle perspective a server process is either consuming CPU or waiting, I'm interested in both. As you'll see from the experimental results in this posting (and the next post when I crank up the concurrency), the results are intriguing.

- Concurrency (i.e., competing load). The experiment in this posting is really a test of Oracle's parsing speed in a best case scenario, which is basically a single user system. That's nice because concurrency brings variance and is messy. But concurrency is also reality. So while the experiment in this posting is single user system, the next will contain the results based on various levels of concurrency and load.

It's questions like these that cause what appeared to be a simple experiment into a multi-day ordeal. It's not easy or simple to create good experiments that mean something.

Experimental Results

The raw experimental data is available as raw insert statements or the SQL selected data after the inserts where completed. But a much more interesting and enlightening view requires some statistics and visualization.
[Note: All but the "Elapsed time P-Value" columns should be familiar. A P-value greater than 0.025 indicates the sample set is likely to be normally distributed. For this blog posting, it's not important.]

Clearly the first execution, which is a hard parse, takes much longer than subsequent executions. In fact, the second execution is around 180 times faster than the first execution and the third execution is around 860 times faster than the initial execution. That is a massive difference!

So now when some says that hard parsing is expensive, you can confidently add:

Craig's no-load test showed the second (soft parse) execution was over 180 times faster than the first (hard parse) execution.

But we can say more! Supposedly (i.e., I don't know the source) the softer (not soft but softer) parse kicks in after a session has run the same cursor three times. This means we can't expect an additional parse time reduction until the fourth execution. If you look closely at the above data, you'll notice the fourth, fifth, six, and seventh (which is the last test) execution are all within 1100 to 1200 times faster than the initial execution.

Craig's no-load test showed softer parsing was over 1100 times faster than hard parsing!

In fact, statistically there is no difference between the fourth through the seventh executions. So now you can also confidently state that:

Craig's no-load test showed once a statement is executed four times, the parsing time is no longer significantly reduced. This is, at least in part, a demonstration of Oracle's session cache cursor facility.

That's good stuff!

How the Results Substantiate these Claims

If you're wondering why I can state there is or is not a significant difference between execution X and Y, it's because I ran a statistical significance test for every execution sample set combination, which is results in a 7 by 7 matrix. This matrix is shown below. In each cell is the test's P-value.

Because all but the first (hard parse) sample sets are not normally distributed, I used what's called a location test to determine the P-value. I took a P-value less than 0.025 as meaning the sample sets are truly different. For more details about this plus other statistical and visual goodies, download my Mathematica Notepad in either PDF or native format. I also provide P-value information in my previous blog entries: Consider doing a search for "p-value".
Here's how to interpret the above matrix. If a P-value is greater than 0.025 then it's likely the two sample sets came from the same population. This is more correct way of saying, there is no real difference between them. If a P-value is less than 0.025 then it's likely the two samples did NOT come from the same population. This is a more correct way of saying, their difference can not be explained by the random picking which caused our sample sets to appear similar. It's more than that... they are different! (This is of course, a non statistician way of explaining this.)

For example, when compare sample set 1 to sample set 1 (yes, itself), the P-value is greater than 0.025 and therefore they likely came from the same population. Dah... Now compare sample set three to sample set four through seven: The third execution, row 3, columns 4 to 7. Notice the P-values are all less than 0.025, actually 0! This means the sample set values are significantly different. But there's more!

Notice when comparing the P-values for sample sets four through seven to each other (i.e., 4 to 5, 5 to 7, etc.), their P-values are always greater than 0.025. This means while there is a numeric difference, statistically the difference is not significant! This is why I could state above, "After four executions, parse times are no longer significantly reduced." Cool stuff!

I like pictures. And below is a smooth histogram showing the difference between the third to the seventh execution.

If you feel the difference between the third execution and the fourth or fifth execution is not significant, remember to check the statistical significance test P-values shown in the above matrix.

Summary

There is a lot to take away from this posting. First and foremost, it's very clear that an initial SQL parse, that is the hard parse, takes many times longer than subsequent parses. My non-load test showed the second parse (soft parse) was 180 times longer than the initial (hard) parse! Second, once session cache cursors kick in after the third parse, parse time drop over 1100 times when compared to the initial hard parse!

Reality perspective: I need to mention that even though hard parsing is clearly much more resource intensive and takes much longer than soft and softer parsing, if a statement takes multiple seconds to run, it's a valid question to ask, "Who cares?!" I mean, do we really care if parsing takes 23 ms compared to 0.023 ms if the statement elapsed time is 10 seconds? Perhaps if concurrency stretches these numbers, but otherwise perhaps not. The true answer is to analyze performance to see if hard parsing is impacting performance.

Keep in mind my testing was performed on a no-load system. This is not a bad thing, but by design. I wanted a brut-force non-concurrency code algorithm test without the distraction of serious resource competition.

In my next posting, I'll perform the same tests but with different levels of concurrency intensity. Can't wait to share the results!

Thanks for reading!

Craig.

Thursday, May 31, 2012

Row versus Set Processing, Surprise!

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

You're row processing on the Java client? Surprise!


Two weeks ago I joined a good friend on-site at one of his clients. Upon arriving a discouraged developer wanted to know why his report was currently running so slow, when previously it had run just fine. For sure it was the database, right?

Long story short, the report sucked data from a couple of database tables into a Java client front end placing this data into memory structures, and then the Java code did all sorts of (I'm sure) amazing sequential row/record processing. This worked great with smaller quantities of data, but as the developer confessed, this past run was processing a larger volume of data. The database processing only took a few seconds, while the Java portion took much longer. We're not sure just how much longer since the report never finished!

Once the database and Java processing timing situation was explained and we think understood, the developer was still very visibly unsettled. He was taught that the database was there to simply retrieve, insert, or change mass amounts of data and all the complex business rule processing should be done in client-side Java code. It was obvious to both my friend and I that the benefits of set processing compared to row processing were not something the developer thought about and perhaps understood.

We gently explained that perhaps he could do some of the processing in the database using the benefits of set processing. After a short discussion about this, it was clear his solution was the way it "should be done" and that meant all the business processing should be done in the Java client.

There are two things he just didn't get. First, that set processing rocks! And second, row processing is not likely to scale as data volumes increase. He was encountering both of these truths.

While all Oracle DBAs have at one time or another been told about the benefits of set processing and the scalability issues with row processing, no one has ever pointed me to a real test. And I have never personally created a test to prove (at least in a test situation) the benefits of set processing.

This posting is the results of a simple experiment comparing row to set processing to specifically see for myself:

1. Does set processing truly bring increased performance, in terms of increased throughput?

2. Does set processing scale better then row processing, in relation to increased data processing volume?

If you want to know what happened... read on! (You can also skip to the Conclusion section below.)

Experimental Design


I wanted to keep the experimental design simple and give row processing as much advantage as possible.

View the complete analysis script here, below I pull just the key snippets I need to present.
View the raw experimental data results from the op_results table here.

The processing is pretty straightforward. I'm inserting rows into the op_interim table containing the number and amount of orders for each customer who has a status of '1'.

Below is the set processing query that runs on the database server.

insert into op_interim
  select c.id, c.status, c.name, count(o.amount), sum(o.amount)
  from   op_customer c,
         op_order o
  where  c.status = '1'
    and  c.id = o.cust_id (+)
  group by c.id, c.status, c.name;
commit;

Below is row processing code written in simple plsql. The plsql code was run on the database server, not from a separate machine.
order_sum_check_v := 0;

open cust_cur for
  select id, status, name
  from   op_customer
  order by id;
loop
  fetch cust_cur into
    cust_cur_id_v, cust_cur_status_v, cust_cur_name_v ;
  exit when cust_cur%NOTFOUND;

  order_count_v      := 0 ;
  order_sum_amount_v := 0 ;

  open order_cur for
    select id, cust_id, amount
    from   op_order
    order by cust_id, id;
  loop
    fetch order_cur into
      order_cur_id_v, order_cur_cust_id_v, order_cur_amount_v ;
    exit when order_cur%NOTFOUND;

    if (cust_cur_status_v = '1') AND (cust_cur_id_v = order_cur_cust_id_v) then
      order_count_v      := order_count_v + 1 ;
      order_sum_amount_v := order_sum_amount_v + order_cur_amount_v ;
      order_sum_check_v  := order_sum_check_v + order_cur_amount_v ;
    end if;
  end loop;
  close order_cur;

  insert into op_interim values
  (cust_cur_id_v, cust_cur_status_v, cust_cur_name_v, order_count_v,
     order_sum_amount_v);
end loop;
close cust_cur;
commit;

A big benefit to the row processing is I pull the data from Oracle in sorted order. This makes the row-by-row nested loop easy to code and "fast." To ensure both the row and set processing actually did the equivalent from a business processing perspective, I kept a counter, order_sum_check_v so I knew the total order amount processed. To my relief, comparing set to row processing, the counter always matched perfectly.

Just comparing the two above snippets of code makes me run towards set processing. The chances of buggy code is obviously far greater in the row processing. (...reminds me a lot of my first "MIS" class on COBOL programming! Run away!!!)

For my experiment I varied the number of rows processed from 100K to 1M in 10 sets. That is, 100K, 200K,..., 1000K. I took 12 samples for both row and set processing at each row set level. I would have liked to take more samples, but the row processing took a long time. The entire experiment ran for around 40 hours.

The detailed results were stored in the op_results table. I wrote a couple simple queries to make analyzing the data in Mathematica easier. I ran the test on (cat /proc/version): Linux version 2.6.32-300.3.1.el6uek.x86_64 (mockbuild@ca-build44.us.oracle.com) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Fri Dec 9 18:57:35 EST 2011. The Dell Precision T5500, 6 CPU core box has 24GB of ram, and Oracle 11.2.0.1.0.

Experimental Results


The experiment was designed to answer these two questions:

1. Does set processing truly bring increased performance, in terms of increased throughput?

2. Does set processing scale better then row processing, in relation to increased data processing volume?

Let's look at each of these in more detail (not that much detail though).

You can download the raw experimental data here.
You can download the throughput and scalability details in PDF or native Mathematica.
You can download the 100K row vs set processing significance test details in either PDF or native Mathematica.

1. Does set processing truly bring increased performance?

Take a quick look at the below chart. Each point represents 12 samples for a set number of rows to process. The red square points are the set processing results and the round blue points are the row processing results.

The chart above clearly shows set processing operates at a much higher throughput than row processing. In fact, with 100K rows to processing, set processing throughput was 3741 times greater than row processing throughput (1526476 rows/sec vs 408 rows/sec).

Our experiment showed from 100K to 1M rows processed, set based processing's throughput was much higher compared to row based processing.

2. Does set processing scale better then row processing, in relation to increased data processing volume?

Have you ever heard someone say, "I can't understand why it taking so much longer now. It took 10 minutes to process 10K rows but with 50K rows it still isn't finished after four hours!" What the person is not understanding is referred to is application scalability (among other things).

The misconception is believing if 10K rows are processing in 10 minutes, then 50K rows will be processed in 50 minutes and on and on. Set processing is much more likely to scale linearity than row processing because it processes in group or sets. In contrast, row processing just plows through the data one row at a time missing out on the natural grouping of rows that set processing provides.

The above chart shows the time related to process x number of rows. As in the first chart, the row processing points are the blue circles and the set processing are the red squares. The set based processing results is difficult to see because all the points on virtually on the x-axis! The set based points are the red square points. So obviously, set based processing is faster and stays faster as the number of rows increases. Don't ever let someone convince you row based processing is faster than set based processing unless they can clearly demonstrate using your real data.

For the statistically minded, the time difference is the smallest with the fewest number of rows processes, that is, 100000. Because I'm plotting the averages, we can't visually tell for sure if there is a statistically significant difference between row and set processing. For example, the variance could be massive so in reality, there really isn't a difference. So I performed a statistical significance test (download link at top of Experimental Results section) using the 12 elapsed time samples from both the row and the set based processing. The p-value is 0.00000 (really) indicating it is highly unlikely the two sample sets came from the same population. This means, they are different. OK... no surprise here.

Contrasting set and row processing scalability is very evident both visually and numerically. Looking at the above graph, you can see that difference between row and set processing increases as the rows to process increases.  Numerically, when creating a linear trend line, the slope of the row based processing is 0.00259 and the slope for the set processing data is 0.00000. (For set processing, the slope is actually 6.17x10-7.)

This means that as the number of rows to processes increases, the time difference between row and set processing continues to increase.

So not only is row based processing slower as the amount of work increases, the time difference continues to increase as the amount of work increases.

Conclusions


If you're a DBA you already knew in your gut what the conclusions would be. But now for both of us, we can stand in front of people and point to experimental evidence clearly showing:

1. Set based processing will likely be much faster than row based processing. Our experiment of processing 100K rows showed row based processing was 3700 times slower than set based processing. Not twice as slower or even 10 times slower... 3700 times slower!

2. Compared to set based processing, row based processing times degrade much quicker than set based processing. That is, row based processing does not scale nearly as well as set based processing. We saw this in that the linear trend line for row based processing was 0.00259 compared to 0.00000 for set based processing.

Thanks for reading!


Craig.


If you enjoyed this blog entry, I suspect you'll get a lot out of my courses; Oracle Performance Firefighting and Advanced Oracle Performance Analysis. I teach these classes around the world multiple times each year. I also occasionally teach my Oracle Forecasting and Predictive Analysis class and also a one day Performance Research Seminar. For the latest schedule, click here. I also offer on-site training and consulting services.

P.S. If you want me to respond with a comment or you have a question, please feel free to email me directly at craig @orapub .com.