
I am looking at a rather large database. Let's say I have an exported flag on the product records.
If I want an estimate of how many products have the flag set to false, I can do something like `Product.where(:exported => false).count`.

The problem I have is that even the count takes a long time, because the table of 1 million products is constantly being written to. More specifically, exports are happening, so the value I'm interested in counting is ever-changing.

So I'd like to do a dirty read on the table... not a dirty read always, and I 100% don't want all subsequent calls to the database on this connection to be dirty.
But for this one call, dirty is what I'd like.

Oh, I should mention: Ruby 1.9.3, Heroku, and PostgreSQL.

Now, if I'm missing another way to get the count, I'd be excited to try that.

OH SNOT, one last thing: this example is contrived.

baash05

2 Answers


PostgreSQL doesn't support dirty reads.

You might want to use triggers to maintain a materialized view of the count - but doing so will mean that only one transaction at a time can insert a product, because they'll contend for the lock on the product count in the summary table.
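The trigger approach could be sketched like this. Everything here is illustrative, not an existing schema: the `product_counts` summary table, the `maintain_product_count` function, and the helper that returns the DDL are all made-up names. The returned SQL is what you would run once against PostgreSQL (e.g. via `execute` in a migration).

```ruby
# Sketch: DDL for a trigger-maintained count of products per exported flag.
# All object names (product_counts, maintain_product_count, product_count_trg)
# are hypothetical. Rows in product_counts must be seeded once from a real
# count before the trigger's deltas are meaningful.
def product_count_trigger_sql
  <<-SQL
    CREATE TABLE IF NOT EXISTS product_counts (
      exported boolean PRIMARY KEY,
      n bigint NOT NULL DEFAULT 0
    );

    CREATE OR REPLACE FUNCTION maintain_product_count() RETURNS trigger AS $$
    BEGIN
      IF TG_OP = 'INSERT' THEN
        UPDATE product_counts SET n = n + 1 WHERE exported = NEW.exported;
      ELSIF TG_OP = 'DELETE' THEN
        UPDATE product_counts SET n = n - 1 WHERE exported = OLD.exported;
      ELSIF TG_OP = 'UPDATE' AND NEW.exported <> OLD.exported THEN
        UPDATE product_counts SET n = n - 1 WHERE exported = OLD.exported;
        UPDATE product_counts SET n = n + 1 WHERE exported = NEW.exported;
      END IF;
      RETURN NULL;  -- AFTER trigger; return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER product_count_trg
    AFTER INSERT OR UPDATE OR DELETE ON products
    FOR EACH ROW EXECUTE PROCEDURE maintain_product_count();
  SQL
end
```

This is exactly where the contention mentioned above comes from: every insert or delete updates the same `product_counts` row, so concurrent writers serialize on that row lock.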

Alternatively, use the system statistics to get a fast approximation.
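The statistics approach reads the planner's row estimate from `pg_class.reltuples`, which autovacuum/ANALYZE keep roughly current. A sketch (the helper name and the `products` table are assumptions; note this estimates the whole table's row count, not the subset with `exported = false`, so it only helps for total counts):

```ruby
# Returns SQL for PostgreSQL's statistics-based estimate of a table's row
# count. reltuples is maintained by VACUUM / ANALYZE, so the value is an
# approximation that lags behind recent writes.
def approx_count_sql(table)
  quoted = table.gsub("'", "''")  # naive quoting, enough for this example
  "SELECT reltuples::bigint FROM pg_class WHERE relname = '#{quoted}'"
end

# In a Rails console (assumes an ActiveRecord PostgreSQL connection):
#   ActiveRecord::Base.connection.select_value(approx_count_sql("products")).to_i
```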

Or, on PostgreSQL 9.2 and above, ensure there's a primary key (and thus a unique index) and make sure vacuum runs regularly. Then you should be able to do quite a fast count, as PostgreSQL should choose an index-only scan on the primary key.

Note that even if Pg did support dirty reads, the read would still not return perfectly up-to-date results, because rows would sometimes be inserted behind the read pointer in a sequential scan. The only way to get a perfectly up-to-date count is to prevent concurrent inserts: `LOCK TABLE thetable IN EXCLUSIVE MODE`.

Craig Ringer
  • I wasn't really looking for an up-to-date count... just something that gets updated every so often, for when my boss asks me for a count ("how many are changed, Dave?"). A dirty query would have been good enough. – baash05 Mar 19 '14 at 01:54
  • Well, in that case I don't see what was wrong with your original solution. If you need a moderately up-to-date answer but need it quickly, you can always schedule a periodic count into a history table, then just select from the history table when you're asked. – Craig Ringer Mar 19 '14 at 02:08
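The periodic-count idea from that comment might look like this in a scheduled job. `ProductCountHistory` and its columns are hypothetical names; the point is that the slow count runs off-line (e.g. from Heroku Scheduler), and answering the boss becomes a cheap single-row read.

```ruby
# Sketch of "schedule a periodic count into a history table".
# ProductCountHistory, :unexported_count and :counted_at are made-up names.
def record_unexported_count
  count = Product.where(:exported => false).count  # the slow query, run off-peak
  ProductCountHistory.create!(:unexported_count => count, :counted_at => Time.now)
end

# Later, when asked, read the most recent snapshot instead of re-counting:
#   ProductCountHistory.order(:counted_at).last.unexported_count
```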

As soon as a query begins to execute, it runs against a frozen, read-only snapshot: that's what MVCC is all about. The values are not changing within that snapshot, only in subsequent amendments to that state. It doesn't matter if your query takes an hour to run; it is operating on data that's locked in time.

If your queries are taking a very long time, it sounds like you need an index on your `exported` column, or whatever values you use in your conditions, as a COUNT against an indexed column is usually very fast.
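A sketch of adding that index. The index name is illustrative; `CREATE INDEX CONCURRENTLY` is PostgreSQL-specific and avoids locking the table against writes while the index builds, at the cost of being slower overall.

```ruby
# Returns DDL for an index on the exported column (index name is made up).
# CONCURRENTLY lets writes continue while the index is built, which matters
# on a table that is constantly being written to.
def exported_index_sql
  "CREATE INDEX CONCURRENTLY index_products_on_exported ON products (exported)"
end

# From Rails you could run it with:
#   ActiveRecord::Base.connection.execute(exported_index_sql)
# Note CONCURRENTLY cannot run inside a transaction block, so a migration's
# implicit DDL transaction would need to be avoided.
```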

tadman
    Note that only PostgreSQL 9.2 and above benefit from an index on an unqualified `count`, when they can use an index-only scan. – Craig Ringer Mar 19 '14 at 01:04
  • The field wasn't really that important.. It was a contrived example. But perhaps I can narrow down fields that I'd like to do this on, and use an index. I thought indexes were heavy on fields that change often. – baash05 Mar 19 '14 at 01:56
  • @baash05 They are. It's a performance trade-off between the added write load and potential improvements to I/O costs on read. – Craig Ringer Mar 19 '14 at 02:07