
My table has integer columns "a" and "b". There are only a few (<30) distinct values for "a", but for each "a" there is a huge number of distinct "b" values (>10**7). To speed things up, I created a composite index on (a, b). I observe that

select count(*) from tab where a=1; 

runs fast, also

select count(*) from tab where a=2;

runs fast, but

select count(*) from tab where a=1 or a=2;

runs (oh-my-god-so) slow, the same is true for

select count(*) from tab;

Running EXPLAIN showed that the fast queries use

->  Index Only Scan using idx on tab

but for the slow queries, sequential scan is used.

Why is this so? Why does PostgreSQL not use the same index for the latter two queries? Is it just an imperfection of the query planner, or are there more profound reasons why an Index Only Scan cannot be used?

Tomas Kulich
  • A wild guess: you don't have (valid) statistics on tab.a, or most of the tab.a values are 1 or 2. – joop Jun 18 '15 at 16:52
  • Try changing the second query in this way: `select (select count(*) from tab where a=1)+(select count(*) from tab where a=2);`, the optimizer should use the IndexOnlyScan for the two subqueries. The problem is that optimizers in general do not like `or` conditions. – Renzo Jun 18 '15 at 17:43
  • @joop Does it matter how many 1 and 2 values there are? (Approximately 1/10 of the rows have a=1 and another 1/10 have a=2.) With a composite index on (a,b), the query can be answered from the index alone no matter how many such records there are, so I'd expect Postgres to use the index. – Tomas Kulich Jun 19 '15 at 09:13
  • @Renzo As expected, your query runs fast. You're probably right about the 'or' being complicated; however, I'd expect at least the last count(*) query to use the index. Note that iterating over all possible "a" values and summing the partial count(*) results is much faster than a direct count(*). This surprises me, since it basically forces Postgres to iterate over the whole composite index; why doesn't Postgres do this by itself? I believe no "special knowledge" beyond the (approximate) total index size is needed for such a decision. – Tomas Kulich Jun 19 '15 at 09:23
  • @TomasKulich you are right for the count(*), but optimizers are strange beasts and try only a limited number of possible access plans. Maybe in Postgresql there is a rule that when a query does not have any condition then no index is considered... – Renzo Jun 19 '15 at 09:33
  • @TomasKulich: of course it matters. An index is only effective in a query if it is selective. Normally the limit is around 10%. IIRC postgres uses 10%, so you probably just exceeded this limit ... It is all heuristics. (and the optimal choice depends on total footprint, rowsize, rowsize of the index, disk seektime, cpu cost, available memory, available statistics and tuning) – joop Jun 19 '15 at 12:57
  • @joop Let me put it this way: I'd expect that if a query can be resolved just by looking at the index, then it should be resolved just by looking at the index (especially when the index is quite small; or can it happen that such a policy is not good?). My query obviously can be resolved just by looking at the index, so I'm curious why Postgres does not see that. I'd say that can-be-resolved-just-by-index is a property that can be easily determined. – Tomas Kulich Jun 19 '15 at 22:50
  • @renzo Thanks for the comment. I think I'll just keep in mind that I should always specify one concrete value of "a" when working with my dataset. – Tomas Kulich Jun 19 '15 at 22:55
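
The per-value iteration discussed in the comments above is known as a "loose index scan", which Postgres does not perform automatically but which can be emulated with a recursive CTE. A sketch, assuming the composite index exists as `idx` on `(a, b)` (table and column names taken from the question):

```sql
-- Emulated loose index scan: walk the distinct values of "a" via the index,
-- then count each group with a separate (index-only-scannable) probe.
WITH RECURSIVE distinct_a AS (
    SELECT min(a) AS a FROM tab
  UNION ALL
    SELECT (SELECT min(a) FROM tab WHERE a > d.a)
    FROM distinct_a d
    WHERE d.a IS NOT NULL
)
SELECT sum((SELECT count(*) FROM tab WHERE tab.a = d.a)) AS total
FROM distinct_a d
WHERE d.a IS NOT NULL;
```

With <30 distinct values of "a", this issues at most a few dozen cheap index probes instead of one full sequential scan, which matches the speedup the questioner observed when summing per-value counts by hand.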

1 Answer


Because for conditions like a=1 there is no need to recheck the condition for each row: the scan just traverses the b-tree and counts all leaf entries that have a=1, without fetching the actual table data. I'd try summing two separate queries with a=1 and a=2, or maybe a condition like a<=2 will work.
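
A sketch of the rewrites suggested above (assuming the composite index on `(a, b)` from the question; whether the planner picks an Index Only Scan still depends on statistics and visibility-map state):

```sql
-- Two independent counts, each of which the planner can satisfy
-- with its own index-only scan, summed together:
SELECT (SELECT count(*) FROM tab WHERE a = 1)
     + (SELECT count(*) FROM tab WHERE a = 2) AS total;

-- A single range predicate instead of OR (only equivalent if no
-- values of "a" below 1 exist in the table):
SELECT count(*) FROM tab WHERE a <= 2;
```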

onerror