It is quite common for warehouse developers to customise standard processes to add extra fields. Ideally this is done by modifying the appropriate mapping(s) to pass the field through, regardless of the technology used (Informatica, ODI etc). However, in some instances – generally due to the complexity of a particular mapping process – developers implement these updates using a Post Load Process (PLP). From a technical perspective this can be done in a number of ways, e.g. using a mapping, SQL statement updates and so on. The focus of this blog is the latter; specifically, one of the traps a developer can fall into when doing so.
First, let’s assume we have a simple mapping process which populates a table WC_DEMO_F. For demo purposes I can create that table as follows.
Create Table wc_demo_f As
Select level row_wid,
       'Row ' || To_Char(level,'FM999999') demo_value,
       'N' x_two_multiple
From Dual
Connect By Level <= 1000000;
The x_two_multiple column will be set via a PLP – we’ll default it to ‘N’ and then flag it later where the row_wid is a multiple of two. This is of course a ridiculous example, please excuse it – I’m merely trying to illustrate a concept. When we run the mapping, the last things that would happen are that indexes would be created and statistics would be gathered (providing this is configured, of course). So let’s do that now – I’ll also create an index on the x_two_multiple column.
Create Index wc_demo_f_idx On wc_demo_f(x_two_multiple);
Exec Dbms_Stats.Gather_Table_Stats(ownname=>user, tabname=>'WC_DEMO_F');
Now for my PLP – I’m going to flag all rows which are a multiple of 2.
Update wc_demo_f Set x_two_multiple = 'Y' Where Mod(row_wid, 2) = 0;
Commit;
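As a quick sanity check (not part of the original process), we can confirm how many rows were flagged – with row_wid running from 1 to 1,000,000, exactly half are multiples of two:

Select Count(*) From wc_demo_f Where x_two_multiple = 'Y';
-- 500000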
So… if I now try to query my table for x_two_multiple = 'Y', what would I expect?
Select * From wc_demo_f Where x_two_multiple = 'Y';
Well, I have just updated 500,000 rows – exactly half the rows in the table. By the normal rule of thumb, 50% selectivity must surely give a full table scan, mustn’t it? Let’s see what the optimizer is thinking.
Plan hash value: 267250319

-----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |               |     1 |    18 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| WC_DEMO_F     |     1 |    18 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | WC_DEMO_F_IDX |     1 |       |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("X_TWO_MULTIPLE"='Y')
It thinks that the query will return 1 row, and that an index range scan is best – for 50% of the data in my table! The key point here, of course, is that I haven’t re-gathered statistics on my table since updating it. Given that when statistics were gathered there was only a single value (‘N’) in the column, the optimizer considers the new value of ‘Y’ an out-of-range outlier.
Select column_name, low_value, high_value, density, num_nulls, num_distinct
From user_tab_cols
Where table_name = 'WC_DEMO_F'
Order By column_id;
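The LOW_VALUE and HIGH_VALUE columns are stored as RAW. For a VARCHAR2 column like ours, one way to decode them is with the standard UTL_RAW package – a sketch, assuming the column really holds character data:

Select column_name,
       Utl_Raw.Cast_To_Varchar2(low_value)  low_val,
       Utl_Raw.Cast_To_Varchar2(high_value) high_val
From   user_tab_cols
Where  table_name = 'WC_DEMO_F'
And    column_name = 'X_TWO_MULTIPLE';

With the stale statistics in place you’d expect both the low and high value to show ‘N’, which is exactly why the optimizer treats ‘Y’ as out of range.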
Executing that plan would mean reading half a million rowids via an index range scan and probing the table for each one – a massive overhead compared with a single full table scan.
So… if you are writing custom PLP processes which update flags etc on a table, then ensure you gather statistics afterwards!
Exec Dbms_Stats.Gather_Table_Stats(ownname=>user, tabname=>'WC_DEMO_F');

Select * From wc_demo_f Where x_two_multiple = 'Y';

Plan hash value: 4277550765

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |   500K|  8789K|   888   (2)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| WC_DEMO_F |   500K|  8789K|   888   (2)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("X_TWO_MULTIPLE"='Y')
That’s definitely more like it 🙂
I mentioned earlier that this might seem a stupid example, and it is; however, I have seen similar situations in production environments which have resulted in completely different execution plans (for the worse) for exactly this reason. Sometimes it’s a little too easy to knock up a simple fix-it script; if you’re going to do it, then make sure you do it right.
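As a pattern, a fix-it PLP can bundle the update and the statistics refresh into a single PL/SQL block so the second step can’t be forgotten. A minimal sketch using this demo’s table and column names:

Begin
  Update wc_demo_f
  Set    x_two_multiple = 'Y'
  Where  Mod(row_wid, 2) = 0;
  Commit;
  -- Re-gather statistics immediately so the optimizer sees the new column value
  Dbms_Stats.Gather_Table_Stats(ownname => user, tabname => 'WC_DEMO_F');
End;
/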