Relationship Issues
No, this isn’t the latest agony aunt column – I’m talking about data relationships! Specifically, relationships between columns within a single table. For example, let’s take a simple list of UK towns with their associated county and country – this was the first hit on Google when I searched for one. Now assume we also have a table of people and the town in which they live. We might then want to ask the very simple and reasonable question: get me all the people who live in a town which is in England and in the county of Essex.
So why would this cause a problem? Well first, we need to understand how Oracle retrieves information from the database and its strategy (plan) for doing that. This is a topic on which entire books have already been written, so I’m just going to give a very basic example with just two strategies.
- We could query the towns table to get a list of all the towns matching our predicate (the town is in Essex, England). Then for each town we find, we go to the people table and get the list of people in that town. This kind of approach is commonly known as Nested Loops.
- We query the list of towns matching our predicate. Then we query the list of people. Once we have those two lists, we match them up (how this is done can vary). This kind of approach is commonly known as a Hash Join, where rows are matched using a hash function. (Both strategies are sketched just below.)
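To make this concrete, here is a minimal sketch of how you could nudge Oracle towards each strategy using optimizer hints. The towns(town, county, country) and people(person, town) tables here are just the hypothetical ones described above, and the hints are purely illustrative; normally you would let the optimizer choose.

    -- Hypothetical tables: towns(town, county, country) and people(person, town)

    -- Nested Loops: for each matching town, probe the people table
    -- (ideally via an index on people(town))
    Select /*+ LEADING(t p) USE_NL(p) */ p.person
    From   towns t, people p
    Where  t.town    = p.town
    And    t.county  = 'Essex'
    And    t.country = 'England';

    -- Hash Join: build a hash table from the matching towns,
    -- then scan people and match rows against it
    Select /*+ LEADING(t p) USE_HASH(p) */ p.person
    From   towns t, people p
    Where  t.town    = p.town
    And    t.county  = 'Essex'
    And    t.country = 'England';

Both queries answer the same question; the only difference is how the rows from the two tables get matched up.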
So which approach is better? It depends. If we found only one town in the county of Essex, then clearly it would be quicker to read that one row and then retrieve just the people living in that town using an index access path. We are done. However, what if we found 10,000 towns in Essex? Fetching the list of people for a single town might be very quick, but doing it 10,000 times likely isn’t. No matter how quick something is, if we do it enough times the process as a whole becomes slow. In this case we would be quicker reading the two lists up front and then matching them (i.e. hash joining).
So how does Oracle know how many rows (the cardinality) it is going to process before it has actually processed them? Well… it doesn’t! It has to estimate this up front, at the time it parses the query, based upon the information it has available (statistics on the table and its columns). In simple cases this is very basic mathematics:
Number of rows in the table / Number of distinct values in the column
So if we have a table of 10,000 records and a column with 500 distinct values, then when we filter on one value of that column Oracle will expect to retrieve 10,000/500 = 20 rows.
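Incidentally, if you want to see the figures that feed this calculation for one of your own tables, they are visible in the data dictionary once statistics have been gathered. A quick sketch (MY_TABLE is just a placeholder name):

    -- Total row count the optimizer will work from
    Select num_rows
    From   user_tables
    Where  table_name = 'MY_TABLE';

    -- Distinct values per column (without histograms, density is roughly 1/num_distinct)
    Select column_name, num_distinct, density
    From   user_tab_col_statistics
    Where  table_name = 'MY_TABLE';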
This basic mathematics extends further when we filter upon multiple columns. We use standard probability theory, which states that the probability of two independent events both occurring is the product of their individual probabilities:
P(X and Y) = P(X) * P(Y)
The probability of retrieving any given row is simply the number of rows we expect to retrieve divided by the total number of rows. So again using the example above, say we have the same 10,000 records, with column A having 500 distinct values and column B having 20 distinct values. The probability of retrieving a row with single-value predicates on both A and B is:
(10,000/500)/10,000 * (10,000/20)/10,000 = 0.002 (0.2%) * 0.05 (5%) = 0.0001 (0.01%)

Against our 10,000 rows, that gives an estimated cardinality of 10,000 * 0.0001 = 1 row.
Most of the time this works perfectly fine. However… when we have relationships within our dataset, this breaks down somewhat. For example, if we pick the county of Essex, we naturally know that Essex doesn’t exist in any other country. Oracle doesn’t know that, though. So, considering our dataset, we can estimate the cardinality using the above rules without even loading it into Oracle, simply by looking at the data. So let’s do that.
By looking at our data in Excel we get the following:
| Column  | Distinct Values |
|---------|-----------------|
| Country | 4               |
| County  | 86              |
P(Country = England and County = Essex) = P(Country = England) * P(County = Essex) = (1718/4)/1718 * (1718/86)/1718 = 0.25 * 0.012 = 0.003

0.003 * 1718 [total rows] = 5.154 ≈ 5 rows (rounded)
And just to prove that, let’s do the import and generate a plan. 🙂
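I haven’t reproduced the load itself here, but a minimal sketch of the idea, assuming the spreadsheet has been imported into a towns table with town, county and country columns (for example via SQL*Loader, an external table or SQL Developer’s import wizard):

    -- Gather statistics so the optimizer has row counts and distinct values to work with
    Exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'TOWNS');

    Explain Plan For
    Select Count(*)
    From   towns t
    Where  t.country = 'England'
    And    t.county  = 'Essex';

    Select * From table(dbms_xplan.display);

With the statistics in place, the Rows column of the plan should show an estimate of around the 5 rows we calculated above.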
Now let’s look at how many rows we actually get:
    Select Count(*)
    From   towns t
    Where  t.country = 'England'
    And    t.county  = 'Essex';

      COUNT(*)
    ----------
            63

    SQL>
So clearly the estimates are way out! Now let’s create some data for people living in those towns to demonstrate our point above.
    Create Table ppl As
    Select ppl.person, t.town
    From   (Select Round(dbms_random.value(1,1718)) town_id,
                   'Person ' || level person
            From   dual d
            Connect By level <= 100000) ppl,
           (Select rownum town_id, town
            From   towns) t
    Where  t.town_id = ppl.town_id;

    Create Index ppl_town_idx On ppl(town);

    Exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'ppl');
Now let’s get all those people in Essex, England.
    Select p.person, t.town
    From   towns t, ppl p
    Where  t.town    = p.town
    And    t.country = 'England'
    And    t.county  = 'Essex';
We see a plan which looks like this:
So we think that we’re going to have to loop over 5 different towns, retrieving 60 index entries, which we’ll then loop over again to get 298 rows in total. However, look what actually happens:
Oh dear – we actually execute that first loop 63 times rather than 5, and get 3823 rows back rather than the 298 we expected. Thus we then loop 3823 times rather than the 298 times we expected.
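If you want to reproduce this comparison yourself, one way (a sketch; your figures will differ slightly as the people data is randomly generated) is the GATHER_PLAN_STATISTICS hint together with DBMS_XPLAN:

    Select /*+ GATHER_PLAN_STATISTICS */ p.person, t.town
    From   towns t, ppl p
    Where  t.town    = p.town
    And    t.country = 'England'
    And    t.county  = 'Essex';

    -- E-Rows is the optimizer's estimate, A-Rows is what actually came back
    Select * From table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));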
Here’s what we would have wanted to do ideally:
So we read both tables into memory and then hash join them. And ironically, if you’re using 12c, that’s likely what will actually happen, despite the initial plan! That’s all thanks to Adaptive Execution Plans, which I wrote about a while back. Unfortunately, prior to 12c, once the optimizer has made its choice it has to live (and die) by it.
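On 12c you can check whether that runtime switch has happened for yourself; a sketch, run in the same session straight after the query:

    -- The '+adaptive' format modifier shows the full adaptive plan, with the operations
    -- the optimizer abandoned at runtime marked as inactive, and the Note section
    -- states that this is an adaptive plan
    Select * From table(dbms_xplan.display_cursor(format => '+adaptive'));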
What is interesting is that by omitting the predicate on country, our cardinality estimates are actually a lot closer to the actuals. Clearly we can’t do that, as we could never guarantee there isn’t going to be another county called Essex in another country. It does show, though, that the traditional view of “the more selective the predicates, the better” doesn’t always hold true where inter-column relationships exist.
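For example (a quick sanity check against the same towns data), with only the county predicate the optimizer expects roughly 1718/86, i.e. about 20 rows, which is much closer to the 63 that actually come back:

    -- Estimate with the county predicate alone: 1718/86 = ~20 rows; actual: 63
    Select Count(*)
    From   towns t
    Where  t.county = 'Essex';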
Whilst our simple example is obviously pretty trivial, cardinality mis-estimates can be multiplied as they are passed through the plan. I’ve just seen an example in a production BI Applications environment where a number of mis-estimates contributed to two nested loop steps at the top level of the plan, each over 8 million rows, when Oracle estimated it would be in the low thousands. That’s the true reason why relationships can cause problems! 🙂
Now of course there are things we can do to overcome this (a topic for a future blog post); however, it’s not always an easy task, especially in a warehouse environment where code is being generated automatically and you are working with off-the-shelf data structures.